DATA QUANTITY AND DATA CHARACTERISTICS FOR MODELING APPROACHES IN EDUCATIONAL DATA MINING

Slater, Stefan

Files

Slater_upenngdas_0175C_16920.pdf (1.46 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Education

Discipline

Education

Subject

Data science
Educational data mining
Knowledge tracing
Learning analytics
Machine learning
Predictive modeling

Copyright date

2025

Permalink

https://repository.upenn.edu/handle/20.500.14332/61283

Author

Slater, Stefan

Abstract

The purpose of this dissertation was to determine the necessary amount of data to generate reliable, generalizable, and replicable machine learning models for educational contexts. Algorithms are ubiquitous across a range of educational settings and used to detect or predict an increasing number of student performance metrics, conceptualizations, and behaviors. But determining the correct amount of data to use for the construction and use of these algorithms often comes down to ‘rules of thumb’ rather than empirically generated benchmarks. The first study explored the amount of data necessary to generate stable predictions of student knowledge using the Bayesian Knowledge Tracing (BKT) algorithm, while the second study explored the differences in algorithm performance on predicting student stopout behavior in real data from Massive Open Online Courses (MOOCs). In both studies, subsets of data of varying sizes were taken from a larger overall dataset, and model performance on these subsets was compared to the performance of a model that used all available data. Findings from Study 1 showed that BKT is able to generate good predictions of student mastery at sample sizes as low as 25 and assessments as short as three problems, while findings from Study 2 showed that sample sizes of around 500 are suitable for more complex prediction tasks like stopout behaviors. In discussing future avenues for research, this dissertation explained how particular characteristics of a dataset, such as its homogeneity or the quality of features used for the analysis, could influence the performance of models beyond sample size alone. The conclusion also examined whether the sample size requirements in this work can be applied to under-studied populations and demographics, in order to ensure that a dataset has sufficient representation for these populations when modeling and prediction tasks are undertaken.

Advisor

Baker, Ryan S.

Date of degree

2025

Collection

Dissertations and Theses