Psychometric properties of U.S. state standardized achievement tests and implications for kids
Discipline
Education
Educational Assessment, Evaluation, and Research
Abstract
All U.S. public school students in grades 3-8 take annual standardized achievement tests. Depending on a student’s grade level and state of residence, test scores can inform highly consequential (“high stakes”) decisions about their academic lives, such as whether they are retained in grade. Students at the focal point of test-based policies typically earn low total test scores and occupy the left tail of the achievement score distribution, where the Item Response Theory (IRT) statistical models used to score standardized tests measure less precisely. This study explores the intersection of test psychometric properties with test-based policies and research. It is guided by three objectives: (1) document the practical problems with measurement at the low tail of the achievement distribution and how states and vendors address these problems in standardized achievement tests; (2) provide the first comprehensive review of the psychometric properties of standardized tests from every state and the District of Columbia; and (3) describe the nature and severity of state test churn. First, I find that precision inequality in IRT measurement models is a well-documented issue, most commonly addressed through computer adaptive test designs. Other issues, including narrow domain sampling and insufficient representation of low trait levels in calibration, are seldom addressed. Second, I construct and analyze a unique dataset composed of information from 51 technical reports for the grade 3-8 tests of Reading/English Language Arts and Mathematics for school year 2018-2019. I examine descriptive patterns in tests’ properties and identify areas where comparisons are not possible because of test developers’ different reporting conventions. Third, I construct a unique dataset documenting states’ standardized tests and vendors for every year since the federal government first mandated grade 3-8 achievement testing (2002-2022). Test churn is substantial in some states; on average, states have used three to four different test instruments since the passage of the No Child Left Behind Act (NCLB). Study findings have practical implications for researchers, commercial test developers, and state assessment directors and policymakers.
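The abstract's point that IRT scoring is less precise in the left tail of the achievement distribution can be illustrated with a minimal sketch. It assumes a two-parameter logistic (2PL) model and a hypothetical item bank whose difficulties cluster around the middle of the scale. Neither the model choice nor the item parameters come from the study or from any state's technical report. Under the 2PL model, the conditional standard error of measurement at trait level theta is the reciprocal square root of the summed item information.

```python
import math


def item_information(theta, a, b):
    """Fisher information of one 2PL item at trait level theta.

    a is the item discrimination and b the item difficulty.
    """
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))  # probability of a correct response
    return a * a * p * (1.0 - p)


def standard_error(theta, items):
    """Conditional standard error of measurement: 1 / sqrt(test information)."""
    info = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(info)


# Hypothetical item bank (illustrative values only): equal discrimination,
# difficulties evenly spaced from -1.5 to 1.5 around the center of the scale.
ITEMS = [(1.2, k / 4.0) for k in range(-6, 7)]

if __name__ == "__main__":
    for theta in (-3.0, -1.5, 0.0):
        print(f"theta={theta:+.1f}  SE={standard_error(theta, ITEMS):.3f}")
```

Because every item in this illustrative bank is most informative near its own difficulty, the summed information peaks near the center of the scale and the standard error grows as theta moves toward the low tail. A computer adaptive design, the remedy the abstract says is most common, would instead select items with difficulties closer to each examinee's estimated theta.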