

In my last post, I argued that the SAT essay might be inappropriately difficult for its intended purposes and that floor effects make it essentially worthless for distinguishing among lower-ability students. In this post, I'm going to argue that the SAT essay score appears similarly unhelpful for distinguishing among high-ability students. There are far fewer top scores than we would expect under reasonable assumptions about the distribution of student ability, and graders appear to disagree significantly when evaluating higher-quality work. Thus, although the essay score scale creates the illusion of providing reasonably fine distinctions among ability levels, at best it can distinguish adequate from inadequate essays.

My data for this analysis mostly comes from Maine's statistical report for its April 2017 school-day administration. This data has obvious limitations, as it comes from only one administration in one state and includes only juniors. The scores on this essay are also noticeably lower than the national mean, but the report does include nearly 13,000 graded essays and provides contingency tables breaking down how the different subscores relate to each other and how the two raters assigned scores. It's not adequate for a complete validity study, but there's enough here to do some exploratory analysis.

To see why I think there are too few top scores, consider an idealized situation involving a single subscore and see what distribution we would expect. In this scenario, we're going to assume that the graders make occasional mistakes but that they are unbiased. In other words, we assume that they do sometimes misclassify essays, but when they do so the direction of their error is random (if they can err in either direction).

Every essay is scored by two raters, each of whom evaluates three different dimensions of the essay on a 1-4 scale. For an explanation of the rubric, see here. You can think of each of these score levels as reflecting a particular level of performance. If the two scores are no more than one point apart, they are added to form a 2-8 score. If the scores are more than a point apart, the essay is regraded by a third reader and that third reader's score is doubled to give the final score. You can also get a 0 if the essay is off-topic or otherwise unscorable (illegible handwriting, written in another language, etc.). About 2.4% of the essays in the Maine data set were scored as a 0. Since the 0 essays don't get that score from a single cause, and some of the reasons for getting a 0 are unrelated to the construct we're trying to measure, I omit them from my analysis.

Suppose that there is a "true" score distribution (on the 1-4 scale). In other words, each essay has a real level of achievement that, if we had perfect knowledge and ability, we could determine and correctly categorize. (You can think of this underlying ability as continuous if you want, but here we're assuming that ability can be placed into discrete categories that have sharp boundaries.) Suppose further that all of our essay graders are of equal ability and can score the essay correctly p percent of the time. In the remaining (100 - p) percent of cases, we'll assume they're off by one point. For the minimum and maximum scores, the direction of the error will always be away from the extreme, toward the middle of the scale. For the middle scores, there's an equal chance that the error can occur in either direction.

(We might also posit that, since there's only one direction in which you can err at the extremes, the readers are more accurate there, but that assumption turns out to be an even poorer fit to reality than the one we're making here.)
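The idealized grader model described above can be written out as a short calculation. What follows is only a sketch under stated assumptions: the function names are mine, p is expressed as a fraction between 0 and 1 rather than as a percentage, the third reader is assumed to behave like the first two, and the uniform "true" distribution in the usage example is purely illustrative (it is not taken from the Maine report).

```python
from fractions import Fraction
from itertools import product


def rater_distribution(true_score, p, lo=1, hi=4):
    """Probability of each score a single rater assigns to an essay.

    The rater is correct with probability p and otherwise off by one point.
    At the ends of the scale the error can only go toward the middle.
    """
    miss = 1 - p
    if true_score == lo:
        return {lo: p, lo + 1: miss}
    if true_score == hi:
        return {hi: p, hi - 1: miss}
    return {true_score: p, true_score - 1: miss / 2, true_score + 1: miss / 2}


def combined_distribution(true_dist, p):
    """Distribution of the final 2-8 score under the two-rater rule.

    true_dist maps each true score (1-4) to its share of essays.
    Scores at most one point apart are added. Otherwise a third reader
    (assumed here to follow the same error model) scores the essay and
    that score is doubled.
    """
    out = {total: 0 for total in range(2, 9)}
    for true_score, weight in true_dist.items():
        single = rater_distribution(true_score, p)
        for (a, pa), (b, pb) in product(single.items(), repeat=2):
            if abs(a - b) <= 1:
                out[a + b] += weight * pa * pb
            else:
                for c, pc in single.items():
                    out[2 * c] += weight * pa * pb * pc
    return out


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    uniform = {s: Fraction(1, 4) for s in range(1, 5)}
    for total, prob in combined_distribution(uniform, Fraction(4, 5)).items():
        print(total, float(prob))
```

One consequence is visible directly from the model: an essay whose true score is 4 receives a combined 8 only when both raters are correct, which happens with probability p squared, so the top combined score is always rarer than the top true score whenever p is below 1.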
