DATA QUANTITY AND DATA CHARACTERISTICS FOR MODELING APPROACHES IN EDUCATIONAL DATA MINING
Discipline
Education
Subject
Educational data mining
Knowledge tracing
Learning analytics
Machine learning
Predictive modeling
Abstract
The purpose of this dissertation was to determine how much data is needed to build reliable, generalizable, and replicable machine learning models for educational contexts. Algorithms are ubiquitous across educational settings, where they are used to detect or predict a growing number of student performance metrics, conceptualizations, and behaviors. Yet the amount of data used to construct and apply these algorithms is often decided by ‘rules of thumb’ rather than by empirically derived benchmarks. The first study explored the amount of data necessary to generate stable predictions of student knowledge using the Bayesian Knowledge Tracing (BKT) algorithm, while the second study compared how different algorithms performed at predicting student stopout behavior in real data from Massive Open Online Courses (MOOCs). In both studies, subsets of varying sizes were drawn from a larger overall dataset, and model performance on each subset was compared with the performance of a model trained on all available data. Findings from Study 1 showed that BKT can generate good predictions of student mastery at sample sizes as low as 25 and with assessments as short as three problems, while findings from Study 2 showed that sample sizes of around 500 are suitable for more complex prediction tasks such as stopout behavior. In discussing future avenues for research, this dissertation explained how particular characteristics of a dataset, such as its homogeneity or the quality of the features used in the analysis, could influence model performance beyond sample size alone. The conclusion also examined whether the sample size requirements identified in this work can be applied to under-studied populations and demographics, so that a dataset has sufficient representation of these populations when modeling and prediction tasks are undertaken.
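As a rough illustration of the kind of model Study 1 examines, the standard BKT update for a single skill can be sketched as below. This is a minimal sketch, not the dissertation's implementation: the function names `bkt_update` and `trace` are hypothetical, and the parameter values (initial knowledge, guess, slip, and learn probabilities) are assumptions chosen only for illustration.

```python
def bkt_update(p_known, correct, guess, slip, learn):
    """Return P(skill known) after one observed response (standard BKT update)."""
    if correct:
        # A correct answer comes from knowing and not slipping, or from guessing.
        evidence = p_known * (1 - slip)
        posterior = evidence / (evidence + (1 - p_known) * guess)
    else:
        # An incorrect answer comes from slipping, or from not knowing and not guessing.
        evidence = p_known * slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - guess))
    # The student may also learn the skill at this practice opportunity.
    return posterior + (1 - posterior) * learn


def trace(responses, p_init=0.5, guess=0.2, slip=0.1, learn=0.1):
    """Track the mastery estimate across a sequence of responses.

    All default parameter values are illustrative assumptions,
    not values reported in the dissertation.
    """
    p = p_init
    history = []
    for correct in responses:
        p = bkt_update(p, correct, guess, slip, learn)
        history.append(p)
    return history


if __name__ == "__main__":
    # A three-problem assessment in which every answer is correct.
    for step, p in enumerate(trace([True, True, True]), start=1):
        print(f"after problem {step}: P(known) = {p:.3f}")
```

In a sample-size study of the kind described above, these four parameters would be fitted on subsets of the data and the resulting mastery estimates compared against those from a model fitted on the full dataset.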