arXiv:1708.08744v1 [cs.CY] 17 Aug 2017

Machine Learning Based Student Grade Prediction: A Case Study

Zafar Iqbal*, Junaid Qadir**, Adnan Noor Mian*, and Faisal Kamiran*

*Department of Computer Science, **Department of Electrical Engineering,
Information Technology University, Lahore, Pakistan
{mscs13039, junaid.qadir, adnan.noor, faisal.kamiran}@itu.edu.pk

In higher educational institutes, many students have to struggle hard to complete different courses, since there is no dedicated support offered to students who need special attention in their registered courses. Machine learning techniques can be utilized to predict students' grades in different courses. Such techniques would help students to improve their performance based on predicted grades and would enable instructors to identify individuals who might need assistance in their courses. In this paper, we use Collaborative Filtering (CF), Matrix Factorization (MF), and Restricted Boltzmann Machine (RBM) techniques to systematically analyze real-world data collected from Information Technology University (ITU), Lahore, Pakistan. We evaluate the academic performance of ITU students admitted to the bachelor's degree program in ITU's Electrical Engineering department. The RBM technique is found to be better than the other techniques at predicting students' performance in a particular course.

1. INTRODUCTION

Since universities are prestigious places of higher education, students' retention in these universities is a matter of high concern (Aud et al., 2013). It has been found that most student drop-out from universities occurs during the first year and is due to a lack of proper support in undergraduate courses (Callender and Feldman, 2009; MacDonald, 1992). For this reason, the first year of undergraduate study is referred to as a "make or break" year. Without support in coping with the course domain and its complexity, a student may become demotivated and ultimately withdraw from the course. There is thus a great need to develop appropriate solutions to assist student retention at higher education institutions. Early grade prediction is one such solution: it makes it possible to monitor students' progress in their degree courses at the university and leads to improvements in the students' learning process based on predicted grades.

Using machine learning with Educational Data Mining (EDM) can improve the learning process of students. Different models can be developed to predict students' grades in enrolled courses, providing valuable information that facilitates students' retention in those courses. This information can be used to identify at-risk students early, based on which a system can suggest that instructors provide special attention to those students (Iraji et al., 2012). This information can also help in predicting students' grades in different courses, allowing their performance to be monitored in a way that enhances the universities' student retention rate.

Several research studies have been conducted to assess and predict students' performance in universities. In (Iqbal et al., 2016), we analyzed various existing international studies and examined the admission criteria of ITU to find which admission criterion factors can predict the GPA in the first semester at the undergraduate level. From the results, we found that Higher Secondary School Certificate (HSSC) performance and entry test performance are the most significant factors in predicting students' academic success in the first semester at university. In this study, we further extend this research and examine the performance of ITU students in enrolled courses using machine learning techniques.

In this study, we applied various techniques (CF, SVD, NMF, and RBM) to the real-world data of ITU students. CF techniques are among the most popular techniques for predicting students' performance (Sarwar et al., 1998); they work by discovering similar characteristics of users and items in the database. CF, however, does not provide accurate predictions for a sparse database. The SVD technique makes better predictions than CF algorithms for sparse databases by capturing the hidden latent features in the dataset while avoiding overfitting (Berry et al., 1995). The NMF technique allows more meaningful interpretations of the possible hidden features than other dimensionality reduction algorithms such as SVD (Golub and Van Loan, 2012). Finally, RBM can also be used for collaborative filtering and was used for this purpose during the Netflix competition (Salakhutdinov et al., 2007). (Toscher and Jahrer, 2010) applied RBM to the KDD Cup dataset and obtained promising results.

The contributions of this paper are:

1. We systematically reviewed the literature on grade/GPA prediction and comprehensively presented it.

2. We analyzed real-world data collected from 225 undergraduate students of the Electrical Engineering Department at ITU.

3. We evaluated state-of-the-art machine learning techniques (CF, SVD, NMF, and RBM) in predicting the performance of ITU students.

4. We proposed a feedback model to estimate a student's knowledge of a particular course domain and to provide feedback if the student needs to put more effort into that course based on the predicted GPA.

5. We proposed a fitting procedure for a hidden Markov model to determine student performance in a particular course by utilizing knowledge of the course domain.

The rest of the paper is organized as follows. In Section 2, we describe related work proposed in the literature. Different machine learning techniques that can be utilized to predict students' GPA are briefly outlined in Section 3. The methodology of the study and the performance of the ITU students in different courses are described in Section 4. We present the results and findings of our study in Section 5. We describe the insights that hold for our study in Section 6. We highlight some limitations of this study in Section 7. Finally, we conclude the paper in Section 8.


2. RELATED WORK

Numerous research studies have been conducted to predict students' academic performance, either to facilitate degree planning or to identify students at risk.

2.1. MATRIX FACTORIZATION

(Thai-Nghe et al., 2011) proposed matrix factorization models for predicting student performance in Algebra and Bridge to Algebra courses. Factorization techniques are useful in the case of sparse data and in the absence of knowledge about students' backgrounds and tasks. They split the data into a trainset and a testset. The data represents the log files of interactions between students and computer-aided tutoring systems. (Thai-Nghe et al., 2011) extended the research and used tensor-based factorization to predict student success. They formulated the problem of predicting student performance as a recommender system problem and proposed tensor-based factorization techniques to add the temporal effect of student performance. The system saves success/failure logs of students on exercises as they interact with the system.

2.2. PERSONALIZED MULTI-LINEAR REGRESSION MODELS (PLMR)

Grade prediction accuracy using the Matrix Factorization (MF) method degrades when dealing with small sample sizes. (Elbadrawy et al., 2016) investigated different recommender system techniques to accurately predict students' next-term course grades as well as within-class assessment performance at George Mason University (GMU), the University of Minnesota (UMN), and Stanford University (SU). Their study revealed that both Personalized Multi-Linear Regression (PLMR) models and advanced Matrix Factorization (MF) techniques could predict next-term grades with a lower error rate than traditional methods. PLMR was also useful for predicting grades on assessments within a traditional class or online course by incorporating features captured through students' interaction with LMS and MOOC server logs.

2.3. REGRESSION AND CLASSIFICATION MODELS

Final grade prediction based on limited initial data about students and courses is a challenging task because, at the beginning of undergraduate studies, most students are motivated and perform well in the first semester, but as time passes there may be a decrease in their motivation and performance. (Meier et al., 2016) proposed an algorithm to predict the final grade of an individual student once the expected accuracy of the prediction is sufficient. The algorithm can be used in both regression and classification settings to predict students' performance in a course and classify them into two groups (students who perform well and students who perform poorly). Their study showed that in-class exams were better predictors of the overall performance of a student than homework assignments. The study also demonstrated that timely prediction of the performance of each student would allow instructors to intervene accordingly. (Zimmermann et al., 2015) considered regression models in combination with a variable selection and variable aggregation approach to predict the performance of graduate students and their aggregates. They used a dataset of 171 students from Eidgenössische Technische Hochschule (ETH) Zürich, Switzerland. According to their findings, the undergraduate performance of the students could explain 54% of the variance in graduate-level performance. By analyzing the structure of the undergraduate program, they assessed a set of students' abilities. Their results can be used as a methodological basis for deriving principled guidelines for admissions committees.

2.4. MULTILAYER PERCEPTRON NEURAL NETWORK

Educational Data Mining utilizes data mining techniques to discover novel knowledge originating in educational settings (Baker and Yacef, 2009). EDM can be used for decision making in refining repetitive curricula and admission criteria of educational institutions (Calders and Pechenizkiy, 2012). (Saarela and Kärkkäinen, 2015) applied the EDM approach to analyze the effects of core Computer Science courses and provide novel information for refining repetitive curricula to enhance the success rate of the students. They utilized the historical log file of all the students of the Department of Mathematical Information Technology (DMIT) at the University of Jyväskylä in Finland. They analyzed patterns observed in the historical log file from the student database for enhanced profiling of the core courses and for indications of the study skills that support timely and successful graduation. They trained a multilayer perceptron neural network model with cross-validation to demonstrate the constructed nonlinear regression model. In their study, they found that general learning capabilities predict students' success better than specific IT skills.

2.5. FACTORIZATION MACHINES (FM)

Next-term grade prediction methods are developed to predict the grades that a student will obtain in the courses of the next term. (Sweeney et al., 2015) developed a system for predicting students' grades using simple baselines and MF-based methods on a dataset from George Mason University (GMU). Their study showed that the Factorization Machine (FM) model achieved the lowest prediction error and can be used to make both cold-start and non-cold-start predictions accurately. In subsequent studies, (Sweeney et al., 2016) explored a variety of methods that leverage content features. They used FM, Random Forests (RF), and Personalized Multi-Linear Regression (PMLR) models to learn patterns from historical transcript data of students along with additional information about the courses and the instructors teaching them. Their study showed that the hybrid FM-RF and PMLR models achieved the lowest prediction error and could be used to predict grades for both new and returning students.

2.6. DROPOUT EARLY WARNING SYSTEM (DEWS)

Dropout early warning systems help higher education institutions to identify students at risk and to identify interventions that may help to increase the institutes' student retention rate. (Knowles, 2015) utilized the Wisconsin DEWS approach to predict student dropout risk. They introduced a flexible series of DEWS software modules that can adapt to new data, new algorithms, and new outcome variables to predict dropout risk as well as impute key predictors.

2.7. HIDDEN MARKOV MODEL AND BAYESIAN KNOWLEDGE TRACING

Hidden Markov models have been used widely to model student learning. (Van De Sande, 2013) investigated solutions of the hidden Markov model and concluded that a maximum likelihood test should be the preferred method for finding parameter values for the hidden Markov model. In a separate study, (Hawkins et al., 2014) developed and analyzed a new fitting procedure for Bayesian Knowledge Tracing and concluded that empirical probabilities had predictive accuracy comparable to that of expectation maximization.

In Table 1, we systematically summarize the studies that are related to ours, presenting a comprehensive big picture of the literature. Our work is related to grade prediction systems, recommender systems, and early warning systems within the context of education. In our study, the approach is to use machine learning techniques to predict students' course grades. We used the state-of-the-art techniques that are described and implemented in this section to perform a comparative analysis of different techniques that can predict students' GPA in registered courses. We also develop a model that can be used in a tutoring system to indicate the weak students in a course to the instructor and to provide early warnings to students who need to work harder to complete the course.

Table 1: Systematic Literature Review. Each entry lists the study, its purpose, dataset, methods/techniques, and relevant findings.

(Thai-Nghe et al., 2011). Purpose: factorization approaches to predict student performance. Dataset: two real-world datasets from KDD Cup 2010. Methods: Matrix Factorization. Findings: the MF technique can take slip and guess factors into account to predict performance.

(Thai-Nghe et al., 2011). Purpose: matrix factorization models for predicting student performance. Dataset: two real-world datasets from KDD Cup 2010. Methods: Matrix Factorization and tensor-based factorization. Findings: MF techniques are useful for sparse data to predict performance.

(Hawkins et al., 2014). Purpose: analyze a new fitting procedure for Bayesian Knowledge Tracing. Dataset: 1,579 students working on 67 skill-builder problem sets. Methods: Bayesian Knowledge Tracing. Findings: empirical probabilities have accuracy comparable to expectation maximization.

(Zimmermann et al., 2015). Purpose: predict graduate performance using undergraduate performance. Dataset: 171 students' data from ETH Zürich. Methods: regression models. Findings: third-year undergraduate GPA can predict graduate performance.

(Saarela and Kärkkäinen, 2015). Purpose: analyze students' performance using a sparse dataset. Dataset: students' data of DMIT, 2009-2013. Methods: multilayer perceptron neural network. Findings: general learning capabilities can predict students' success.

(Sweeney et al., 2015). Purpose: predict students' course grades for the next enrollment term. Dataset: 33,000 GMU students' data of fall 2014. Methods: Factorization Machine. Findings: the FM model can predict performance with a lower prediction error.

(Knowles, 2015). Purpose: build a dropout early warning system. Dataset: 2006-07 grade 7 cohorts. Methods: Dropout Early Warning System (DEWS). Findings: DEWS can predict dropout risk as well as impute key predictors.

(Elbadrawy and Karypis, 2016). Purpose: investigate student and course academic features. Dataset: 1,700,000 grades from the University of Minnesota. Methods: Collaborative Filtering and Matrix Factorization. Findings: feature-based groups make better grade predictions.

(Elbadrawy et al., 2016). Purpose: predict next-term course grades and within-class assessment performance. Dataset: 30,754 GMU, 14,505 UMN, and 13,130 SU students' data. Methods: Personalized Multi-Linear Regression (PLMR) models. Findings: PLMR and MF can predict next-term grades with lower error.

(Sweeney et al., 2016). Purpose: predict students' grades in the courses they will enroll in the next term. Dataset: 33,000 GMU students' data. Methods: hybrid FM-RF and PMLR models. Findings: hybrid FM-RF and PMLR methods can predict students' grades.

(Meier et al., 2016). Purpose: predict grades of individual students in traditional classrooms. Dataset: 700 UCLA undergraduate students' data. Methods: regression and classification. Findings: in-class evaluations enable timely identification of weak students.

(Xu et al., 2017). Purpose: a machine learning method for predicting student performance. Dataset: 1,169 UCLA undergraduate students' data. Methods: latent factor method based on course clustering. Findings: the latent factor method performs better than benchmark approaches.


3. BACKGROUND

Machine learning with EDM has gained much attention in the last few years. Many machine learning techniques, such as collaborative filtering (Toscher and Jahrer, 2010), matrix factorization (Thai-Nghe et al., 2011), and artificial neural networks (Wang and Liao, 2011), are being used to predict students' GPAs or grades. In this section, we describe these machine learning techniques and how they are being used to predict students' GPA in registered courses within the context of education.

3.1. COLLABORATIVE FILTERING

Collaborative filtering (CF) is one of the most popular recommender system techniques to date. In the educational context, CF algorithms predict GPA by identifying similar students in the dataset. In this method, predictions are made by selecting and aggregating the grades of other students. In particular, there is a list of m students S = {s_1, s_2, ..., s_m} and a list of n courses C = {c_1, c_2, ..., c_n}. Each student s_i has a list of courses C_{s_i}, which records the student's GPA in each course. The task of the CF algorithm is to find students whose GPAs are similar to those of some other student. User-Based Collaborative Filtering (UBCF) is one type of collaborative filtering technique. To predict a student's GPA in a course, the UBCF algorithm considers similar students that have similar GPAs in the same courses. The main steps are:

1. The algorithm measures how similar each student in the database is to the active student by calculating the similarity matrix.

2. Identify the most similar students by using the k nearest neighbors.

3. Predict the GPA of the active user in the course by aggregating the GPAs obtained in that course by the most similar students. The aggregation can be a simple mean or a weighted average that takes the similarity between students into account.

The k nearest neighbour technique is used to select the neighbourhood N(a) ⊂ U for the active user a. The average rating of the neighbourhood users, calculated using Equation 1, becomes the predicted rating for the active user. Grade prediction becomes extremely challenging for a student who has attended only a few courses, which is a well-known drawback of the CF technique on sparse datasets.

r̂_{aj} = (1 / |N(a)|) Σ_{i ∈ N(a)} r_{ij}    (1)
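As a concrete illustration, the three UBCF steps and the mean aggregation of Equation 1 can be sketched in Python (a minimal sketch on a toy matrix; cosine similarity over co-taken courses is an assumed similarity measure, and the function name is ours, not from the paper):

```python
import numpy as np

def predict_gpa_ubcf(G, student, course, k=2):
    """Predict G[student, course] with user-based CF (Equation 1).

    G is a students-by-courses matrix with np.nan marking courses
    not yet taken. Similarity is cosine similarity computed over
    the courses both students have taken (an assumed choice)."""
    sims = []
    for other in range(G.shape[0]):
        if other == student or np.isnan(G[other, course]):
            continue  # a neighbour must have taken the target course
        both = ~np.isnan(G[student]) & ~np.isnan(G[other])
        if not both.any():
            continue
        a, b = G[student, both], G[other, both]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 0:
            sims.append((a @ b / denom, other))
    if not sims:
        return np.nan
    # keep the k most similar neighbours, then average their grades
    neighbours = [s for _, s in sorted(sims, reverse=True)[:k]]
    return float(np.mean([G[n, course] for n in neighbours]))

# toy students-courses matrix on a 0-4 GPA scale
G = np.array([[3.67, 4.00, np.nan],
              [4.00, 3.67, 3.33],
              [3.67, 4.00, 3.00]])
print(predict_gpa_ubcf(G, student=0, course=2, k=2))
```

Here the neighbours' grades are combined with a simple mean, matching Equation 1; a similarity-weighted average is the other aggregation mentioned in step 3 above.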

3.2. MATRIX FACTORIZATION

Matrix factorization is the decomposition of a matrix into two or more matrices. Matrix factorization techniques are used to discover hidden latent factors and to predict missing values of a matrix. In our study, we formulated the problem of predicting student performance as a recommender system problem and used the matrix factorization methods (SVD and NMF) that are among the most effective approaches in recommender systems.


3.2.1. Singular Value Decomposition

Singular Value Decomposition (SVD) is a matrix factorization technique that decomposes the students-courses matrix R into

R = U Σ V^T,    (2)

where:

• U is an m × r orthogonal matrix, where m represents the number of students and r represents the rank of the matrix R,

• Σ is an r × r diagonal matrix with the singular values along the main diagonal and zeros everywhere else,

• V is an n × r orthogonal matrix, where n represents the number of courses (so that V^T is r × n).

Figure 1: Decomposition of Matrix R by SVD

The graphical representation of SVD is shown in Figure 1. In the newly constructed matrices, r represents the rank of the matrix R. The values in the matrix Σ are known as the singular values σ_i, and they are stored in decreasing order of magnitude. Each singular value σ_i of the matrix Σ represents a hidden latent feature, and its weight accounts for part of the variance in the values of matrix R. The sum of the squared singular values represents the total variance of matrix R.

SVD is widely used to find the best k-rank approximation of the matrix R. The rank r can be reduced to k, where k < r, by keeping only the k largest singular values (the first k diagonal values of the matrix Σ) and reducing both U and V accordingly. The result is a k-rank approximation R_k = U_k Σ_k V_k^T of the matrix R, such that the Frobenius norm of R − R_k is minimized. The Frobenius norm ||R − R_k||_F is defined as the square root of the sum of squares of the elements of R − R_k (Deerwester et al., 1990). To make a prediction of the GPA in a course, SVD assumes that each student grade is composed of the sum of the student's preferences for the various latent factors of the courses. Predicting the grade of student i in course j is then as simple as taking the dot product of vector i in the student feature matrix and vector j in the course feature matrix.
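The truncation and the dot-product prediction can be sketched with numpy (a toy dense matrix for illustration; the real students-courses matrix is sparse, which motivates the SGD approach discussed next):

```python
import numpy as np

# toy dense students-courses GPA matrix (rows: students, cols: courses)
R = np.array([[3.67, 4.00, 3.33],
              [4.00, 3.67, 3.00],
              [2.00, 2.67, 2.33],
              [3.33, 3.00, 3.67]])

U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2  # keep only the two largest singular values
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# the "prediction" for student i, course j is the dot product of
# student i's and course j's latent-feature vectors, which equals
# the (i, j) entry of the rank-k approximation R_k
pred = (U[:, :k] * s[:k]) @ Vt[:k, :]
print(np.round(np.linalg.norm(R - R_k), 3))  # Frobenius reconstruction error
```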

The problem with SVD is that it is not effective on big and sparse datasets. Simon Funk proposed using a Stochastic Gradient Descent (SGD) algorithm to compute the best rank-k matrix approximation using only the known ratings of the original matrix (Funk, 2006). SGD is an optimization technique that finds the most accurate values of the two factor matrices obtained during the decomposition of the original matrix in the SVD method. SGD has the following steps:

1. Re-construct the target students-courses matrix by multiplying the two lower-ranked matrices.

2. Compute the difference between the target matrix and the generated matrix.

3. Adjust the values of the two lower-ranked matrices by distributing the difference to each matrix according to its contribution to the product.

This process is repeated until the difference falls below a preset threshold. By reducing the dimensionality of the students-courses matrix, the execution time is reduced and the accuracy of the prediction is increased, because only the courses that contribute to the reduced data are considered. Dimensionality reduction also leads to a reduction of noise and over-fitting. This method was also used in recommender systems for the Netflix challenge (Koren et al., 2009).
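The three SGD steps above can be sketched as follows (a minimal sketch; the learning rate, regularization, epoch count, and initialization are assumed values, not the paper's settings):

```python
import numpy as np

def funk_svd(G, k=2, epochs=2000, lr=0.005, reg=0.02, seed=0):
    """Sketch of Funk-style SGD factorization G ~ P @ Q.T using only
    the observed entries of G (np.nan marks a missing grade)."""
    rng = np.random.default_rng(seed)
    n, m = G.shape
    P = rng.random((n, k))   # student latent features
    Q = rng.random((m, k))   # course latent features
    observed = [(i, j) for i in range(n) for j in range(m)
                if not np.isnan(G[i, j])]
    for _ in range(epochs):
        for i, j in observed:
            err = G[i, j] - P[i] @ Q[j]             # step 2: difference
            P[i] += lr * (err * Q[j] - reg * P[i])  # step 3: redistribute
            Q[j] += lr * (err * P[i] - reg * Q[j])
    return P, Q

G = np.array([[3.67, 4.00, np.nan],
              [4.00, 3.67, 3.33],
              [2.00, 2.67, 2.33]])
P, Q = funk_svd(G)
print(round(float(P[0] @ Q[0]), 2))  # reconstruction of a known grade
print(round(float(P[0] @ Q[2]), 2))  # prediction for the missing grade
```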

3.2.2. Non-Negative Matrix Factorization

Non-negative matrix factorization (NMF) is a matrix factorization technique that decomposes a matrix V into two non-negative factor matrices W and H such that

V ≈ W H,    (3)

where:

• W is a u × k non-negative matrix,

• H is a k × v non-negative matrix.

Figure 2: Decomposition of Matrix V by NMF

The graphical representation of NMF is shown in Figure 2. NMF is a powerful technique that uncovers the latent hidden features in a dataset and provides a non-negative representation of the data (Koren et al., 2009). The difficulty with NMF is finding W and H when the dataset is large and sparse. A sequential coordinate-wise descent (SCD) algorithm can be used with NMF to impute the missing values (Franc et al., 2005). NMF imputation using SCD takes all entries into account when imputing a single missing entry.
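A rough sketch of NMF-based imputation (using simple masked multiplicative updates for brevity rather than the SCD algorithm itself):

```python
import numpy as np

def nmf_impute(V, k=2, iters=500, seed=0):
    """Factor V ~ W @ H with multiplicative updates restricted to the
    observed entries (np.nan marks a missing grade), then return the
    completed matrix W @ H. This is a masked-update sketch, not the
    SCD algorithm of (Franc et al., 2005)."""
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(V)
    Vf = np.where(mask, V, 0.0)
    u, v = V.shape
    W = rng.random((u, k)) + 0.1   # non-negative factors
    H = rng.random((k, v)) + 0.1
    eps = 1e-9                     # avoids division by zero
    for _ in range(iters):
        WH = np.where(mask, W @ H, 0.0)
        W *= (Vf @ H.T) / (WH @ H.T + eps)
        WH = np.where(mask, W @ H, 0.0)
        H *= (W.T @ Vf) / (W.T @ WH + eps)
    return W @ H

V = np.array([[3.67, 4.00, np.nan],
              [4.00, 3.67, 3.33],
              [2.00, 2.67, 2.33]])
completed = nmf_impute(V)
print(np.round(completed, 2))   # the missing cell is now imputed
```

Because the factors stay non-negative, every imputed value is non-negative as well, which suits a 0-4 GPA scale.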


3.3. RESTRICTED BOLTZMANN MACHINES

The Restricted Boltzmann Machine (RBM) is an unsupervised machine learning method. Unsupervised algorithms are used to find the structural patterns within a dataset. We have used RBM to predict the students' performance in different courses. An RBM takes the form of a bipartite graph with two layers of nodes. The first layer, called the visible layer, contains the input data (course grades). These nodes are connected through symmetrically weighted connections to the second layer, called the hidden layer. From Figure 3 we can see that the graph has five visible nodes (course grades) denoted by v_i and four hidden nodes denoted by h_j. The weights between pairs of nodes are w_ij. Here each visible node v_i represents the grade in course i for a particular student.

Figure 3: A Restricted Boltzmann Machine (RBM) with five courses and four hidden nodes for a specific student.

RBM is a form of Markov Random Field (MRF). An MRF is a type of probabilistic model that encodes the structure of the model as an undirected graph and whose energy function is linear in its free parameters. The energy function E(v, h) of an RBM can be calculated using Equation 4:

E(v, h) = -a^T v - b^T h - h^T W v    (4)

In the above equation, W represents the weights between the hidden and visible nodes, and a, b are the offsets of the visible and hidden layers, respectively. The joint probability distribution P(v, h) over the visible and hidden nodes can be calculated using Equation 5:

P(v, h) = (1/Z) e^{-E(v, h)}    (5)

where Z is a partition function that normalizes the distribution. To predict a student grade, one can include an additional visible node v_p, whose value is unknown but can be determined by using the energy function, as given in Equation 6:

P(v_p | v, h) ∝ (1/Z) e^{-E(v_p, v, h)}    (6)
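Equations 4 and 5 can be checked numerically on a tiny binary RBM (hypothetical random weights; a real grade model would use multi-valued visible units rather than binary ones):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_visible, n_hidden = 5, 4          # five courses, four hidden nodes
W = 0.1 * rng.standard_normal((n_hidden, n_visible))
a = np.zeros(n_visible)             # visible-layer offsets
b = np.zeros(n_hidden)              # hidden-layer offsets

def energy(v, h):
    """Equation 4: E(v, h) = -a^T v - b^T h - h^T W v."""
    return -(a @ v) - (b @ h) - (h @ W @ v)

# Equation 5: P(v, h) = exp(-E) / Z, with the partition function Z
# summed over all binary configurations (feasible only for tiny models)
states_v = list(product([0, 1], repeat=n_visible))
states_h = list(product([0, 1], repeat=n_hidden))
Z = sum(np.exp(-energy(np.array(v), np.array(h)))
        for v in states_v for h in states_h)

v = np.array([1, 0, 1, 1, 0])       # one grade configuration
h = np.array([1, 0, 0, 1])
p = np.exp(-energy(v, h)) / Z
print(float(p))
```

Exhaustive enumeration of Z is exponential in the number of nodes, which is why practical RBM training uses approximations such as contrastive divergence.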


4. METHODS

We used the CF (UBCF), MF (SVD and NMF), and RBM techniques to predict students' GPAs in their courses. A feedback model is developed based on the predicted GPA of the student in a course.

4.1. DATASET DESCRIPTION

Real-world student data was collected from the Electrical Engineering Department at ITU across the student batches of 2013, 2014, and 2015. The dataset contains data for 225 undergraduate students enrolled in the Electrical Engineering program. The data for each student contains the student's pre-university traits (secondary school percentage, high school percentage, entry test scores, and interview), the course credits, and the grades obtained in 24 different courses that the students take in different semesters. We consider only letter-graded courses, not failed courses. The information about the courses and their domains is shown in Table 2, which was obtained from the curriculum for Electrical Engineering designed for Pakistani universities (Higher Education Commission of Pakistan, 2012).

Table 2: Course Domain Table

Humanities: Communication Skills I, Communication Skills II, Islamic Studies

Management Sciences: Industrial Chemistry, Entrepreneurship, D Lab

Natural Sciences: Linear Algebra, Calculus and Analytical Geometry, Complex Variables and Transforms, Probability & Statistics

Computing: Object Oriented Programming, Computing Fundamentals and Programming

Electrical Engineering Foundation: Linear Circuit Analysis, Electricity and Magnetism, Electronics Workbench, Electronic Devices and Circuits, Digital Logic Design, Electrical Network Analysis, Electronic Circuit and Design, Signals & Systems

Electrical Engineering Core: Solid State Electronics, Microcontrollers and Interfacing, Electrical Machines, Power Electronics

4.2. PROBLEM FORMULATION

For this study, we would like to predict student GPA on the scale 0.0 to 4.0. The given data is a set of ⟨Student, Course, GPA⟩ triplets, and we need to predict the GPA of each student for the courses he/she will enroll in in the future. In general, we have n students and m courses, comprising an n × m sparse GPA matrix G, where {G_ij ∈ R | G_ij ≤ 4} is the grade student i earned in course j.
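Assembling the triplets into the sparse matrix G can be sketched as follows (the triplets below are hypothetical, loosely echoing the student ids and course abbreviations of Table 4):

```python
import numpy as np

# hypothetical (student, course, GPA) triplets
triplets = [("SB145", "ECD", 3.67), ("SB145", "DL", 4.00),
            ("SB161", "LCA", 4.00), ("SB161", "SS", 3.67),
            ("SB304", "ECD", 2.00), ("SB304", "IS", 2.67)]

students = sorted({s for s, _, _ in triplets})
courses = sorted({c for _, c, _ in triplets})
s_idx = {s: i for i, s in enumerate(students)}
c_idx = {c: j for j, c in enumerate(courses)}

# n x m sparse GPA matrix G; np.nan marks a grade to be predicted
G = np.full((len(students), len(courses)), np.nan)
for s, c, gpa in triplets:
    G[s_idx[s], c_idx[c]] = gpa

print(G.shape)
print(int(np.isnan(G).sum()), "missing entries to predict")
```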

For training machine learning models, students' letter grades need to be converted to GPA values. These grades are converted to numerical GPA values using the ITU grading policy on a 4-point GPA scale with respect to the letter grades: A+=4, A=4, A-=3.67, B+=3.33, B=3.0, B-=2.67, C+=2.33, C=2.0, C-=1.67, D+=1.33, D-=1.0, and F=0.0. Figure 4 shows the frequency distribution of the grades available in the dataset. We can see that most of the students have B or B- grades in the courses they have taken.

Figure 4: Distribution of students' grades received for the taken courses

As prediction algorithms work best with centered predictor variables, all the data were transformed by centering: the average GPA of a course is subtracted from all GPAs of that course.
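The grade-to-GPA conversion and the per-course centering can be sketched together (the grade matrix below is illustrative; the scale is the ITU policy listed above):

```python
import numpy as np

# ITU 4-point scale for letter grades
GPA_SCALE = {"A+": 4.00, "A": 4.00, "A-": 3.67, "B+": 3.33, "B": 3.00,
             "B-": 2.67, "C+": 2.33, "C": 2.00, "C-": 1.67,
             "D+": 1.33, "D-": 1.00, "F": 0.00}

def center_by_course(G):
    """Subtract each course's (column) mean GPA, ignoring missing
    entries, as done before fitting the prediction models."""
    course_means = np.nanmean(G, axis=0)
    return G - course_means, course_means

# illustrative letter grades; None marks a course not yet taken
grades = [["A-", "B", None],
          ["B+", None, "C"]]
G = np.array([[GPA_SCALE[g] if g else np.nan for g in row]
              for row in grades])
centered, means = center_by_course(G)
print(np.round(means, 2))      # per-course average GPA
print(np.round(centered, 2))   # centered residuals used for training
```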

The main characteristics of the dataset are shown in Table 3.

Table 3: Description of the ITU dataset used in this study

Characteristic                Number
Total students                225
Total courses                 24
Total cells                   5400
Elements (grades) available   1736
Elements (grades) missing     3664
Matrix density                32.14%

4.3. PREDICTION OF STUDENT GRADES

As our objective is to predict a student's GPA in the courses in which he/she will enroll in the future, we used the CF (UBCF), MF (SVD and NMF), and RBM techniques to predict students' course GPAs. We arrange the data into a matrix built from the ⟨Student, Course, GPA⟩ triplets. For illustration, we have taken a few students and courses to display their grades. In Table 4 we can see that the student with Id. SB145 has a GPA of 3.67 in the course Electronic Circuit and Design and a GPA of 4.0 in the D-Lab course, while this student still needs to enroll in Linear Circuit Analysis, Islamic Studies, and Signals and Systems. The student with Id. SB185 has a GPA in the Electronic Circuit and Design course similar to that of the student with Id. SB145, and this student needs to enroll in the Linear Circuit Analysis, Islamic Studies, Signals and Systems, and D-Lab courses.

Table 4: Students-courses matrix with students' GPA in particular courses

Student Id.   LCA    ECD    IS     SS     DL
SB145         -      3.67   -      -      4.0
SB161         4.0    3.67   -      -      -
SB185         -      3.67   -      -      -
SB229         -      -      -      -      -
SB304         2.0    2.67   -      -      -

Linear Circuit Analysis (LCA), Electronic Circuit and Design (ECD), Islamic Studies (IS), Signals & Systems (SS), D-Lab (DL). Empty cells denote courses the student has not yet taken.

Collaborative Filtering: We have used UBCF to predict the students' grades in courses. UBCF predicts the grade of a student s in a course c by identifying students with grades in the same courses as s. For the prediction of grades, the neighborhood students ns similar to student s are selected from those that have taken at least nc of the courses taken by student s. To apply the UBCF model, we first converted the students-courses matrix R into a real-valued rating matrix holding student GPAs from 0 to 4. To measure the accuracy of this model, we split the data into a 70% trainset and a 30% testset. In the UBCF model, the most similar students are selected using the k nearest neighbors.

Matrix Factorization: Matrix factorization is the decomposition of a matri x V into the

product of two mat ri ces W and H, i.e. V ≈ W H

T

(Koren et al., 2 009). In this study, we have

used SVD and NMF matrix factorization techniques to predict the student GPA. The main issue

of MF techniques is to ﬁnd out the optimized value of matrix cells for W and H.

In the SVD approach, the students' dataset is converted into a real-valued rating matrix holding student grades from 0 to 4. The dataset is split into 70% for training the model and 30% for testing its accuracy. We used Funk SVD to predict the GPA in the courses that the students shown in Table 4 have not yet taken. The largest ten singular values are 191.8012, 18.8545, 14.7946, 13.8048, 12.4328, 11.8258, 11.1058, 10.2583, 9.5020, and 9.1835.

It can be observed from Figure 5 that the distribution of the singular values of the students-courses matrix diminishes quite fast, suggesting that the matrix can be approximated by a low-rank matrix with high accuracy. This encourages the adoption of low-rank matrix completion methods for solving our grade/GPA prediction problem.

By applying Funk's proposed heuristic search technique, Stochastic Gradient Descent (SGD), to the matrix G, we obtained two matrices: the students' and the courses' feature dimensional spaces (with the number of hidden features set to two, so as to ease the task of visualizing the data). The stochastic gradient descent technique estimates the best approximation matrix of the problem using a greedy improvement approach (Pelánek and Jarušek, 2015).
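A Funk-style SGD factorization trained only on the observed cells can be sketched as below. The hyperparameters (rank, learning rate, regularization, epochs) and the toy matrix are illustrative placeholders, not the values used in the study.

```python
import numpy as np

def funk_svd(R, k=2, lr=0.01, reg=0.02, epochs=1000, seed=0):
    """SGD factorization R ≈ W @ H.T, updating only on observed cells
    (np.nan = missing), with L2 regularization on both factors."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    W = 0.1 * rng.standard_normal((n, k))
    H = 0.1 * rng.standard_normal((m, k))
    obs = [(i, j) for i in range(n) for j in range(m) if not np.isnan(R[i, j])]
    for _ in range(epochs):
        for i, j in obs:
            err = R[i, j] - W[i] @ H[j]          # residual on one known grade
            W[i] += lr * (err * H[j] - reg * W[i])
            H[j] += lr * (err * W[i] - reg * H[j])
    return W, H

R = np.array([[3.67, np.nan, 4.0],
              [4.0,  3.67,   np.nan],
              [np.nan, 2.0,  2.67]])
W, H = funk_svd(R, k=2)
pred = W @ H.T        # dense matrix: known cells reproduced, gaps filled in
print(np.round(pred, 2))
```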

Figure 5: Singular values distribution of the students-courses matrix

Table 5 represents the students' features dimensional space, and Table 6 represents the courses' features dimensional space. With the dot product of these feature dimensional spaces, we can predict the GPA in the courses in which the students shown in Table 4 need to enroll. Please note that we usually do not know the exact meaning of the values of this two-dimensional space; we are just interested in finding the correlation between the vectors in that space. For understanding, take the example of a movie recommender system. After matrix factorization, each user and each movie is represented in a two-dimensional space. The values of the dimensional space may represent the genre, the amount of action involved, the quality of the performers, or any other concept. Even if we do not know what these values represent, we can find the correlation between users and movies using the values of the dimensional space.

Table 5: Students' features dimensional space

| Name  | V1    | V2   |
|-------|-------|------|
| SB145 | 0.39  | 0.18 |
| SB161 | 0.45  | 0.20 |
| SB185 | 0.42  | 0.20 |
| SB229 | -0.31 | 0.02 |
| SB304 | 0.09  | 0.12 |

Table 6: Courses' features dimensional space

| Name                          | V1   | V2    |
|-------------------------------|------|-------|
| Linear Circuit Analysis       | 1.19 | -0.04 |
| Electronic Circuit and Design | 0.94 | 0.10  |
| Islamic Studies               | 1.77 | -0.03 |
| Signals and Systems           | 0.34 | 0.20  |
| D-Lab                         | 0.46 | 0.18  |
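A predicted (centered) score is simply the dot product of a row of Table 5 with a row of Table 6. The sketch below uses the published feature vectors; the per-course mean GPAs that are added back after centering are assumed placeholder values, since the paper does not list them.

```python
import numpy as np

# Feature vectors taken from Tables 5 and 6 (two hidden features each)
students = {"SB145": np.array([0.39, 0.18]), "SB229": np.array([-0.31, 0.02])}
courses = {"LCA": np.array([1.19, -0.04]), "SS": np.array([0.34, 0.20])}

# Course-mean GPAs subtracted during centering; these are assumed
# placeholder values, not figures from the paper
assumed_course_mean = {"LCA": 3.0, "SS": 2.8}

for s, sv in students.items():
    for c, cv in courses.items():
        centered = float(sv @ cv)            # dot product of the two rows
        print(s, c, round(centered + assumed_course_mean[c], 2))
```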

In the NMF approach, we have a u × v matrix V with non-negative entries of student grades from 0 to 4 that decomposes into two non-negative rank-k matrices W (u × k) and H (k × v) such that V ≈ WH. Before decomposing the matrix, we first need to choose a rank k for NMF that gives the smallest error for the grade predictions of the students-courses matrix. In our experiments with NMF, rank k = 2 gives the minimum Mean Squared Error (MSE), as shown in Figure 6. So, we have used two as the rank-k value and decomposed the matrix into W and H.
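The rank-selection scan can be sketched with plain multiplicative-update NMF (Lee–Seung style). This is an illustration on a small, fully observed toy matrix; the study's handling of missing grades and its actual dataset are not reproduced here.

```python
import numpy as np

def nmf(V, k, iters=500, seed=0):
    """Multiplicative-update NMF: V ≈ W @ H with non-negative factors."""
    rng = np.random.default_rng(seed)
    u, v = V.shape
    W = rng.uniform(0.1, 1.0, (u, k))
    H = rng.uniform(0.1, 1.0, (k, v))
    eps = 1e-9                                  # avoids division by zero
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)    # update H, keeping it >= 0
        W *= (V @ H.T) / (W @ H @ H.T + eps)    # update W, keeping it >= 0
    return W, H

V = np.array([[3.67, 3.0, 4.0],
              [4.0, 3.67, 3.33],
              [2.0, 2.33, 2.67],
              [3.0, 2.67, 3.33]])
for k in (1, 2, 3):
    W, H = nmf(V, k)
    mse = float(np.mean((V - W @ H) ** 2))
    print(k, round(mse, 4))   # reconstruction error shrinks as rank-k grows
```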


Figure 6: Rank-k using NMF

Restricted Boltzmann Machines: We have also used RBM, an unsupervised learning technique, to predict the student grades in different courses. RBM has been used to fill the missing data in a students-courses matrix. We have split the data into a 70% train set and a 30% test set. We have trained the RBM method with a learning rate of 0.1, a momentum constant of 0.9, a batch size of 180, and for 1000 epochs.
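An RBM trained with one-step contrastive divergence (CD-1) can be sketched as below. The learning rate, momentum, and epoch count mirror the values reported in the text; the binary visible encoding, the toy data, and the full-batch updates (rather than mini-batches of 180) are simplifying assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30.0, 30.0)))

def train_rbm(V, n_hidden=4, lr=0.1, momentum=0.9, epochs=1000):
    """Bernoulli RBM trained with CD-1 and momentum on the gradients."""
    n_vis = V.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    a = np.zeros(n_vis)                      # visible biases
    b = np.zeros(n_hidden)                   # hidden biases
    dW = da = db = 0.0
    for _ in range(epochs):
        ph = sigmoid(V @ W + b)              # positive phase
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T + a)            # one Gibbs step down...
        nh = sigmoid(pv @ W + b)             # ...and back up
        dW = momentum * dW + lr * (V.T @ ph - pv.T @ nh) / len(V)
        da = momentum * da + lr * (V - pv).mean(axis=0)
        db = momentum * db + lr * (ph - nh).mean(axis=0)
        W += dW; a += da; b += db
    return W, a, b

def reconstruct(v, W, a, b):
    """Fill a visible vector (missing entries set to 0) by one up-down pass."""
    return sigmoid(sigmoid(v @ W + b) @ W.T + a)

# Toy binary patterns standing in for encoded grade rows
V = np.array([[1, 1, 1, 0, 0, 0]] * 10 + [[0, 0, 0, 1, 1, 1]] * 10, float)
W, a, b = train_rbm(V)
print(np.round(reconstruct(V[:1], W, a, b), 2))
```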

4.4. FEEDBACK METHODOLOGY

Machine learning techniques can be utilized to identify the weak students who need appropriate counseling/advising in the courses by predicting the course grades early. The feedback model that we have developed calculates the student's knowledge in the particular course domain and, based on the results, gives feedback to the instructor about the courses in which the student is weak. The details of the feedback model are given below and represented in Figure 7.

Figure 7: Main steps of feedback model


1. Build Student Profile: In the first phase of the feedback model, we parse the students' and courses' data into ⟨Student, Course, GPA⟩ triplets to build the students' profiles. A students-courses matrix R is created that contains the students' performance in each course taken. In the matrix R, students are represented in rows and courses in columns. The value of each cell of the matrix R is R_ij, calculated using equation 7.

R_ij = { student i's mark in course j,  if the student enrolled in course j
       { empty,                         if the student did not enroll in course j    (7)

For the courses in which a student did not enroll, R_ij will be empty. For illustration, a small chunk of the dataset is presented in the matrix given below. This matrix holds the data of five different students and five different courses.

R_ij =
    [ –   3.67   –     –   4    ]
    [ –   4      –     –   3.67 ]
    [ –   3.67   –     –   –    ]
    [ –   –      –     –   –    ]
    [ 2   –      2.67  –   –    ]    (8)

where – denotes an empty cell.
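The matrix-building step can be sketched in a few lines. The triplets below reproduce the known grades of the matrix above; the student with no recorded grades (SB229) is omitted from the toy triplet list.

```python
import numpy as np

# ⟨Student, Course, GPA⟩ triplets parsed from the transcript data
triplets = [("SB145", "ECD", 3.67), ("SB145", "DL", 4.0),
            ("SB161", "ECD", 4.0),  ("SB161", "DL", 3.67),
            ("SB185", "ECD", 3.67), ("SB304", "LCA", 2.0),
            ("SB304", "IS", 2.67)]

students = sorted({t[0] for t in triplets})
courses = ["LCA", "ECD", "IS", "SS", "DL"]

# R[i, j] = GPA of student i in course j; np.nan marks "did not enroll"
R = np.full((len(students), len(courses)), np.nan)
for s, c, gpa in triplets:
    R[students.index(s), courses.index(c)] = gpa
print(R)
```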

2. Predict Course GPA: Now we have a matrix R for which we are interested in finding the unknown GPAs of the courses that the student has not taken yet. To find the predicted GPA we have used the CF (UBCF), MF (SVD and NMF), and RBM techniques. The detailed methodology for these techniques is described in section 4.

3. Students' Knowledge in Course Domain: In our feedback model, a student's knowledge in different course domains is calculated by taking the average of the GPAs of the courses the student has taken that fall into the same domain, using the course domain table (Table 2).

4. Knowledge Inference: The Hidden Markov Model (HMM) is a model used to predict students' performance based on their historical performance. According to the model, the probability of knowledge P(L_j) increases with every step j and can be calculated using equation 9.

P(L_j) = P(L_{j−1}) + P(T)(1 − P(L_{j−1})),    (9)

where:

• P(L_j) is the probability of knowledge at step j,
• P(L_{j−1}) is the probability of knowledge at the previous step,
• P(T) is the probability of learning,
• (1 − P(L_{j−1})) is the knowledge that is unknown.

Using equation 9, a student's knowledge is measured by inferring his knowledge in the course domain. The probability of knowledge at the previous step is taken as the predicted GPA of the student in the subject. To calculate the knowledge gain, the course domain average is converted into the range 0 to 1 and multiplied by the learning rate of 0.005.

5. Feedback: After computing the student's knowledge in the particular course domain and the knowledge inference, the feedback is generated. If the student's knowledge inference result in a course is less than a 2.67 GPA, the system generates a warning that the student needs effort in that course. In this way, the feedback results can inform the instructors that the student is weak in a particular course.
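Steps 4 and 5 together can be sketched as below. This is our reading of the text, stated as an assumption: P(L_{j−1}) is derived from the predicted GPA rescaled to [0, 1], P(T) is the domain average rescaled to [0, 1] times the 0.005 learning rate, and the result is mapped back to the 0–4 GPA scale before the 2.67 cut-off is applied.

```python
def knowledge_inference(domain_avg, predicted_gpa, p_t=0.005, max_gpa=4.0):
    """One update of equation 9 on the GPA scale (interpretation, see above)."""
    p_prev = predicted_gpa / max_gpa          # P(L_{j-1})
    p_learn = p_t * (domain_avg / max_gpa)    # P(T)
    p_j = p_prev + p_learn * (1.0 - p_prev)   # equation 9
    return p_j * max_gpa                      # back to the 0-4 GPA scale

def feedback(course, inferred_gpa, threshold=2.67):
    """Warn when inferred knowledge falls below the 2.67 GPA cut-off."""
    if inferred_gpa < threshold:
        return f"warning: effort needed in {course}"
    return f"{course}: on track"

# Placeholder numbers, not a row of Table 8
print(feedback("LCA", knowledge_inference(domain_avg=3.07, predicted_gpa=2.5)))
```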

5. RESULTS

5.1. CORRELATION ANALYSIS

To find the pre-admission factors (SSC, HSSC, entry test, and interview) that can predict student performance in the university, the Pearson correlation has been applied. The results show that there is a positive correlation between the entry test and the Cumulative Grade Point Average (CGPA) and also between the HSSC and the CGPA. The correlation coefficients (r) between the entry test and CGPA and between the HSSC and CGPA are very close (r = 0.29 and r = 0.28, respectively), indicating that both the entry test and the HSSC are equally important in predicting the CGPA of a student. Figure 8 shows the correlation between the students' entry test scores and their CGPA, and Figure 9 shows the correlation between the higher secondary school performance and the CGPA. These figures show that students with a higher entry test score and a higher HSSC percentage obtain a higher CGPA in the degree program.
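The correlation computation can be sketched as follows. The scores below are synthetic stand-ins, not the ITU records; only the r values of 0.29 and 0.28 quoted above come from the paper.

```python
import numpy as np

# Synthetic entry-test scores and final CGPAs (illustrative values only)
entry_test = np.array([55, 62, 68, 71, 74, 78, 60, 66], float)
cgpa       = np.array([2.1, 2.8, 2.6, 3.1, 3.4, 3.0, 2.4, 3.2])

# Pearson correlation coefficient between the two variables
r = np.corrcoef(entry_test, cgpa)[0, 1]
print(round(float(r), 2))
```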

Figure 8: Correlation between entry test and CGPA (student performance with a linear regression fit)

Figure 9: Correlation between HSSC and CGPA (student performance with a linear regression fit)


5.2. GRADE PREDICTION

For students, GPA prediction, stud ents-courses matrix G is constructed. The data were trans-

formed by centering the predictor variables by taking average GPA of a course and subtracted

it from all GPA s of that course. 70% of the dat aset is used for training the CF MF and RBM

models. Student GPAs for the courses has been predicted and di splayed in Table 7.
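The centering transformation can be sketched on a toy matrix (np.nan marks courses a student has not taken):

```python
import numpy as np

G = np.array([[3.67, np.nan, 4.0],
              [4.0,  3.67,   np.nan],
              [2.0,  2.67,   3.0]])

course_mean = np.nanmean(G, axis=0)    # per-course average GPA
G_centered = G - course_mean           # subtract it from every known grade
print(np.round(G_centered, 2))
# After prediction, the course mean is added back to return to the GPA scale.
```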

Table 7: Student GPA prediction in courses based on the CF, SVD, NMF, and RBM techniques

| Student Id. | Method | LCA   | ECD   | IS    | SS    | DL    |
|-------------|--------|-------|-------|-------|-------|-------|
| SB145       | RBM    | 2.67* | 3.67  | 2.33* | 3*    | 4     |
| SB145       | NMF    | 1.86* | 3.67  | 1.99* | 3.61* | 4     |
| SB145       | SVD    | 3.48* | 3.67  | 3.86* | 3.1*  | 4     |
| SB145       | UBCF   | 2.99* | 3.67  | 2.99* | 2.91* | 4     |
| SB161       | RBM    | 2.67* | 4     | 3*    | 2.67* | 3.67  |
| SB161       | NMF    | 2.77* | 4     | 2.88* | 3.44* | 3.67  |
| SB161       | SVD    | 2.99* | 4     | 3.86* | 2.63* | 3.67  |
| SB161       | UBCF   | 2.41* | 4     | 2.81* | 2.39* | 3.67  |
| SB185       | RBM    | 2.67* | 3.67  | 3*    | 3.33* | 3*    |
| SB185       | NMF    | 2.53* | 3.67  | 2.64* | 3.42* | 3.60* |
| SB185       | SVD    | 2.31* | 3.67  | 3.36* | 3.35* | 2.12* |
| SB185       | UBCF   | 1.84* | 3.67  | 2.51* | 2.93* | 2.12* |
| SB229       | RBM    | 2.33* | 2*    | 3.33* | 2*    | 1.33* |
| SB229       | NMF    | 2.03* | 0.98* | 2.04* | 0.63* | 1.03* |
| SB229       | SVD    | 2.09* | 1.79* | 2.12* | 1.27* | 2.25* |
| SB229       | UBCF   | 2.77* | 2.19* | 3.14* | 1.42* | 2.43* |
| SB304       | RBM    | 2     | 3*    | 2.67  | 2*    | 3*    |
| SB304       | NMF    | 2     | 3.32* | 2.67  | 3.33* | 3.43* |
| SB304       | SVD    | 2     | 2.36* | 2.67  | 1.61* | 2.57* |
| SB304       | UBCF   | 2     | 2.19* | 2.67  | 1.42* | 2.43* |

Predicted GPAs are marked with an asterisk (*). Linear Circuit Analysis (LCA), Electronic Circuit & Design (ECD), Islamic Studies (IS), Signals & Systems (SS), D-Lab (DL)

5.3. EVALUATION ON MODEL PERFORMANCE

There are several types of measures for evaluating the success of models. However, the eval-

uation of each model depends heavily on the domain and system’s goals. For our system, our

goal is to predict students’ GPA and make decisions if a student needs to work hard to com-

plete the course. These decisio n s work well wh en o ur predictions are accurate. To achieve it,

we have to compare t he predict ion GPA against the actual GPA for the students-courses p ai r.

Some of the most used metrics for evaluation of the models are the Root Mean Squared Error

(RMSE), Mean Squared Error (M SE) and Mean Absolute Error (MAE ). We evaluated m odel

predictions by repeated random su bsample cross-validation. We performed ten repeti tions. In

17

each run, we choose randomly 70% of st udents data into the train set and 30% of students data

into the test set. We have computed RMSE, MSE, and MAE for each model. From Figure 10

the results show that t he RBM model provides a clear improvement over the CF and MF models.

Please note we are not performing student-level cross-validation of predicted results on newly

registered students in this study but the currently enrolled stud ents.
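The evaluation loop can be sketched as below. The held-out "predictions" are synthetic stand-ins for any of the trained models (UBCF/SVD/NMF/RBM); only the three metric definitions and the ten-repetition 70/30 scheme follow the text.

```python
import numpy as np

def evaluate(actual, predicted):
    """RMSE, MSE and MAE over one set of held-out student-course GPAs."""
    err = np.asarray(actual, float) - np.asarray(predicted, float)
    mse = float(np.mean(err ** 2))
    return {"RMSE": mse ** 0.5, "MSE": mse, "MAE": float(np.mean(np.abs(err)))}

rng = np.random.default_rng(0)
scores = []
for _ in range(10):                          # ten random 70/30 repetitions
    n_test = 30                              # the 30% held-out pairs
    actual = rng.uniform(0.0, 4.0, n_test)   # stand-in for real GPAs
    predicted = np.clip(actual + rng.normal(0.0, 0.3, n_test), 0.0, 4.0)
    scores.append(evaluate(actual, predicted))
print({m: round(float(np.mean([s[m] for s in scores])), 3)
       for m in ("RMSE", "MSE", "MAE")})
```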

Figure 10: Evaluation of grade prediction models

5.4. FEEDBACK MODEL

The results of feedback model th at was discussed in detail in section 4 are shown in Table 8.

Here we put on e of the students (SB185) to demonstrate the results of feedback model. We can

see that the knowledge inference results of a student in Linear Circuit Analys is are less than

2.67, so th e system gives a warning that the effort is needed in this course. These results are

helpful for an instructor to identify weak students in a course by early predicting the grades and

inferring student knowledge i n the course domain.

Table 8: Feedback Model Result of Student (SB185)

| Course | Predicted Grade | Predicted GPA | Course Domain                     | Domain Average | Knowledge Inference | Effort Needed |
|--------|-----------------|---------------|-----------------------------------|----------------|---------------------|---------------|
| LCA    | B               | 3             | Electrical Engineering Foundation | 3.07           | 2.12                | YES           |
| IS     | B               | 3             | Humanities                        | 3.19           | 3.83                |               |
| SS     | B-              | 3.07          | Electrical Engineering Foundation | 3.4            | 2.91                |               |
| DL     | B+              | 3.33          | Management Sciences               | 3.12           | 3.44                |               |

Linear Circuit Analysis (LCA), Electronic Circuit and Design (ECD), Islamic Studies (IS), Signals & Systems (SS), D-Lab (DL)


6. INSIGHTS

In this study, we have used CF (UBCF), MF (SVD and NMF) and RBM techniques to predict

the students’ performance in the courses. CF is a popular method to predi ct the students’ per-

formance due t o its simplicity. In this technique, the students’ performance is analyzed by using

the previous data. It provides feedback to enhance the students’ learning process based on the

outcome of the analysis . However, this method h as several di sadvantages: since it depends upon

the historical data of users or items for predicting the results. It shows po o r performance when

there is too much sparsity in the data, due to which we are not able to predict the students’

performance accurately. Comparatively, in SVD techniqu e, the data mat ri x R is decomposed

into users-features space and items-features space. When SVD techniqu e is used with gradi-

ent descent algorithm to compute the b est rank-k matrix approximati on using only the known

ratings of R, the accuracy of predicting the stud ents’ performance enhances but it may contain

negative values which are hard to interpret. NMF technique enhances the m eaningful inter-

pretations of t h e possible hidden features that are obtained during matrix factorization. RBM

is an unsupervised machine learning technique th at is suitable for modeling tabular data. It

provides efﬁcient learning and inference better prediction accuracy than matrix factorization

techniques. The use of RBM in recommender systems and e-commerce have also shown good

results (Kanagal et al., 2012). From the above d iscussion, it is clear that the RBM technique out-

performs CF and MF techniques with less er chances of error. The overall result obtained in this

study also shows that RBM surpasses ot her techniques in predicting the student’s performance.

7. LIMITATIONS

We no te t hat the report ed ﬁndings of this study have been based on the dataset of th e perfor-

mance of the undergraduate students from ITU. The dataset used in the study is limited with

GPAs available for stud ents in the p articular courses. After using CF (UBCF), MF (SVD and

NMF) and RBM techniq ues on the dataset, w e can see that the RMSE for RBM technique is

lower compared to the RMSE of other techniques. RMSE can be estimated with more clear

results if more information of th e students’ GPAs is available. Student motivation during stud-

ies also plays a si gniﬁcant role in the predictio n of student success which can be considered in

future study related to the grade predictio n. Moreover, there is a need to improve the prediction

results by dealing with the cold-start problems. Also, models based on tensor factorization can

be investigated to take the temporal effect into account in the student performance prediction.

Despite thes e limitations, our research ﬁndings have important practical implications for the

universities and institutes in enhancing their students’ retention rate.

8. CONCLUSION

Early GPA p redi ct ions are a valuable source for determin ing student ’s performance in the uni-

versity. In this study, we discussed CF (UBCF), MF and RBM techniques for predicting stu-

dent’s GPA. We use RBM machine learning t echnique for predicting student’s performance in

the courses. Empirical validation on real-world dataset shows the effectiveness of the used RBM

technique. In a feedback model approach, we measure the students’ kn owledge in a particular

course domain, whi ch provides appropriate counselin g to them about different courses in a par-

ticular domain by estimating the performance of other students in that course. This feedback

19

model can be used as a component of an early warning system that will lead to students’ mot iva-

tion and provides them early warnings if they need to improve their knowledge in the courses. It

also helps the course instructor to determine weak students in th e class and to provide necessary

interventions to improve their performance. In this way rate of the students’ retention can be

increased.

REFERENCES

AUD, S., NACHAZEL, T., WILKINSON-FLICKER, S., AND DZIUBA, A. 2013. The condition of educa-

tion 2013. Government Printing Ofﬁce.

BAKER, R. S. AND YACEF, K. 2009. The state of educational data mining in 2009: A review and future

visions. JEDM-Journal of Educational Data Mining 1, 1, 3–17.

BERRY, M. W., DUMAIS, S. T., AND O’BRIEN, G. W. 1995. Using linear algebra for intelligent infor-

mation retrieval. SIAM review 37, 4, 573–595.

CALDERS, T. AND PECHENIZKIY, M. 2012. Introduction to the special section on educational data

mining. ACM SIGKDD Explorations Newsletter 13, 2, 3–6.

CALLENDER, C. AND FELDMAN, R. 2009. Part-time undergraduates in higher education: A literature

review. Prepared for H ECSU to inform Futuretrack: Part-time students. London, Birkbeck, University

of London.

DEERWESTER, S., DUMAIS, S. T., FURNAS, G. W., LANDAUER, T. K., AND HARSHMAN, R. 1990.

Indexing by latent semantic analysis. Journal of the American society for information science 41, 6,

391.

ELBADRAWY, A. AND KARYPIS, G. 2016. Domain-aware grade prediction and top-n course recommen-

dation. Boston, MA, Sep.

ELBADRAWY, A., POLYZOU, A., REN, Z., SWEENEY, M., KARYPIS, G. , AND RANGWALA, H. 2016.

Predicting student performance using personalized analytics. Computer 49, 4, 61–69.

FRANC, V., HLAVÁČ, V., AND NAVARA, M. 2005. Sequential coordinate-wise algorithm for the non-negative least squares problem. In Computer Analysis of Images and Patterns. Springer, 407–414.

FUNK, S. 2006. Netflix update: Try this at home. http://sifter.org/~simon/journal/20061211.html. Online; accessed 11 Jan 2017.

GOLUB, G. H. AND VAN LOAN, C. F. 2012. Matrix computations. Vol. 3. JHU Press.

HAWKINS, W. J., HEFFERNAN, N. T., AND BAKER, R. S. 2014. Learning bayesian knowledge tracing parameters with a knowledge heuristic and empirical probabilities. In International Conference on Intelligent Tutoring Systems. Springer, 150–155.

HIGHER EDUCATION COMMISSION OF PAKISTAN. 2012. Curriculum of electrical engineering b.sc./be/bs & m.sc./me/ms. http://hec.gov.pk/english/services/universities/RevisedCurricula/Documents/2011-2012/ElectricalEngineering-2011-12.pdf. Online; accessed 10 Feb 2017.

IQBAL, Z., QADIR, J., AND MIAN, A. N. 2016. Admission criteria in pakistani universities: A case

study. In 2016 International Conference on Frontiers of Information Technology (FIT). IEEE, 69–74.

IRAJI, M. S., ABOUTALEBI, M., SEYEDAGHAEE, N. R., AND TOSINIA, A. 2012. Students classiﬁ-

cation with adaptive neuro fuzzy. International Journal of Modern Education and Computer Sci-

ence 4, 7, 42.


KANAGAL, B., AHMED, A., PANDEY, S., JOSIFOVSKI, V., YUAN, J., AND GARCIA-PUEYO, L. 2012. Supercharging recommender systems using taxonomies for learning user purchase behavior. Proceedings of the VLDB Endowment 5, 10, 956–967.

KNOWLES, J. E. 2015. Of needles and haystacks: Building an accurate statewide dropout early warning system in wisconsin. JEDM-Journal of Educational Data Mining 7, 3, 18–67.

KOREN, Y., BELL, R., VOLINSKY, C., ET AL. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8, 30–37.

MACDONALD, I. 1992. Meeting the needs of non-traditional students: Challenge or opportunity for higher education. Scottish Journal of Adult Education 1, 2, 34–46.

MEIER, Y., XU, J., ATAN, O., AND VAN DER SCHAAR, M. 2016. Predicting grades. IEEE Transactions

on Signal Processing 64, 4, 959–972.

PELÁNEK, R. AND JARUŠEK, P. 2015. Student modeling based on problem solving times. International Journal of Artificial Intelligence in Education 25, 4, 493–519.

SAARELA, M. AND KÄRKKÄINEN, T. 2015. Analysing student performance using sparse data of core bachelor courses. JEDM-Journal of Educational Data Mining 7, 1, 3–32.

SALAKHUTDINOV, R., MNIH, A., AND HINT ON, G. 2007. Restricted boltzmann machines for collab-

orative ﬁltering. In Proceedings of the 24th international conference on Machine learning. ACM,

791–798.

SARWAR, B. M., KONSTAN, J. A., BORCHERS, A., HERLOCKER, J., MILLER, B., AND RIEDL, J.

1998. Using ﬁltering agents to improve prediction quality in the grouplens research collaborative

ﬁltering system. In Proceedings of the 1998 ACM conference on Computer supported cooperative

work. ACM, 345–354.

SWEENEY, M., LESTER, J., AND RANGWALA, H. 2015. Next-term student grade prediction. In Big

Data (Big Data), 2015 IEEE International Conference on. IEEE, 970–975.

SWEENEY, M., RANGWALA, H., LESTER, J., AND JOHRI, A. 2016. Next-term student performance

prediction: A recommender systems approach. arXiv preprint arXiv:1604.01840.

THAI-NGHE, N., DRUMOND, L., HORVÁTH, T., KROHN-GRIMBERGHE, A., NANOPOULOS, A., AND SCHMIDT-THIEME, L. 2011. Factorization techniques for predicting student performance. Educational Recommender Systems and Technologies: Practices and Challenges, 129–153.

THAI-NGHE, N., DRUMOND, L., HORVÁTH, T., NANOPOULOS, A., AND SCHMIDT-THIEME, L. 2011. Matrix and tensor factorization for predicting student performance. In CSEDU (1). Citeseer, 69–78.

THAI-NGHE, N., DRUMOND, L., HORVÁTH, T., SCHMIDT-THIEME, L., ET AL. 2011. Multi-relational factorization models for predicting student performance. In Proc. of the KDD Workshop on Knowledge Discovery in Educational Data. Citeseer, 27–40.

TOSCHER, A. AND JAHRER, M. 2010. Collaborative ﬁltering applied to educational data mining. KDD

cup.

VAN DE SANDE, B. 2013. Properties of the bayesian knowledge tracing model. JEDM-Journal of Educational Data Mining 5, 2, 1–10.

WANG, Y.-H. AND LIAO, H.-C. 2011. Data mining for adaptive learning in a tesl-based e-learning

system. Expert Systems with Applications 38, 6, 6480–6485.

XU, J., MOON, K. H., AND VAN DER SCHAAR, M. 2017. A machine learning approach for tracking

and predicting student performance in degree programs. IEEE Journal of Selected Topics in Signal

Processing.


ZIMMERMANN, J., BRODERSEN, K. H., HEINIMANN, H. R., AND BUHMANN, J. M. 2015. A model-based approach to predicting graduate-level performance using indicators of undergraduate-level performance. JEDM-Journal of Educational Data Mining 7, 3, 151–176.
