Scoring Essays Automatically Using Surface Features
Randy M. Kaplan, Susanne Wolff, Jill C. Burstein,
Chi Lu, Don Rock, and Bruce Kaplan
GRE Board Report No. 94-21P
August 1998
This report presents the findings of a
research project funded by and carried
out under the auspices of the Graduate
Record Examinations Board.
Educational Testing Service, Princeton, NJ 08541
Researchers are encouraged to express freely their professional judgment. Therefore, points of view or opinions stated in Graduate Record Examinations Board Reports do not necessarily represent official Graduate Record Examinations Board position or policy.

The Graduate Record Examinations Board and Educational Testing Service are dedicated to the principle of equal opportunity, and their programs, services, and employment policies are guided by that principle.

ETS and GRE are registered trademarks of Educational Testing Service.

Copyright © 1998 by Educational Testing Service. All rights reserved.
Surface features of essays include nonlinguistic characteristics, such as total number of words per essay (essay length), sentence length, and word length, as well as linguistic characteristics, such as the total number of grammatical errors, the types of grammatical errors, or the kinds of grammatical constructions (e.g., passive or active) that appear in an essay. This study examines the feasibility of using linguistic and nonlinguistic surface characteristics in an automatic procedure for predicting essay scores. It builds on earlier work by R. M. Kaplan, Burstein, Lu, B. Kaplan, and Wolff (1995); Page (1966); and Page and Petersen (1995).
As testing programs move away from multiple-choice items, the incorporation of an increasing number of constructed-response items in tests becomes inevitable. Essay items, a constructed-response item type, are becoming more and more common in large-scale testing programs. Scoring essay items is time-consuming, and because it requires human judges, indeed multiple judges, it is quite costly. With cost such an important consideration in test design, the role of computers in the scoring process and their potential for cost reduction merit close attention.
Page (1966) claims that effective and accurate essay scoring can be done by computer. His method, analyses, and results are described in a series of papers referenced in Page and Petersen (1995). Page recommends an indirect, statistically based approach that relies on nonlinguistic surface characteristics of an essay, such as the total number of sentences, the number of words per essay, and the average length of words. A more linguistically based approach might attempt to score an essay by producing a structured representation of its syntactic and semantic characteristics and by analyzing this representation. By using surface features, Page circumvents the need to understand and represent the meaning (content) of an essay. His methodology nevertheless achieves scoring results that correlate .7 or better across multiple judges.
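The nonlinguistic features named above are trivial to compute. The sketch below is illustrative only: it uses naive tokenization and sentence splitting, not Page's actual PEG feature extractors, which are not published in detail.

```python
import re

def surface_features(essay: str) -> dict:
    """Compute simple nonlinguistic surface features of an essay.

    Generic proxies of the sort Page describes (word count, sentence
    count, average word length), not the actual PEG feature set.
    """
    words = re.findall(r"[A-Za-z']+", essay)
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    n_words = len(words)
    return {
        "n_words": n_words,
        "n_sentences": len(sentences),
        "avg_word_length": sum(len(w) for w in words) / max(n_words, 1),
        "fourth_root_length": n_words ** 0.25,
    }

feats = surface_features("This is a short essay. It has two sentences.")
```

Note that no inference about meaning is involved: every feature is read directly off the text, which is what makes such features cheap to extract at scale.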
Page makes only passing reference to the surface criteria, or proxes, employed by his system, and presents us essentially with a black box. Here we explore an approach similar to the one taken by Page, concentrating on surface characteristics of essay items and evaluating their use as proxes in our automated approach to large-scale scoring tasks. We scored the same 1,314 Praxis essays scored by the PEG system in the Page and Petersen (1995) study.¹ Our criteria, unlike Page's, can be evaluated as to whether or not they satisfy the general requirements of scoring systems employed by ETS.
Criteria for automatic scoring
An automated scoring system developed and used by ETS must satisfy several
requirements to ensure that the risks of employing it do not outweigh its benefits.
Criterion 1. Defensibility. Any scoring procedure must be educationally defensible. The score produced by a scoring procedure must be traceable to its source, namely, the rationale used to produce the score. It must be possible to explain rationally how the score was determined.
Criterion 2. Accuracy. Any scoring procedure (manual or automated) must maintain a specified level of accuracy. The accuracy of an automated scoring procedure is typically determined by comparing the scores it produces to the scores produced by human raters: the more accurate the scoring procedure, the more highly correlated the two sets of scores.²

¹ The rubric used to score the essays holistically can be found in the Appendix.
Criterion 3. Coachability. The scoring procedure should not be coachable. Suppose that a particular scoring procedure for essays bases its score on the number of words in the essay. It does not matter what is written or how it is written; all that matters is that the essay contains a specific number of words. A student can easily be coached with this information and write an essay consisting of nonsense words and nonsense sentences. Such an essay would receive the highest score possible. Although this example is extreme, it exemplifies an unacceptable situation. A scoring procedure should not be so transparent or otherwise simple that it can be discovered and coached to the test-taking population.
Criterion 4. Cost. The scoring procedure must be cost-effective. The purpose of an
automatic procedure for scoring essays is to reduce the cost of scoring and to improve
consistency. Therefore, the cost to score should not exceed an a priori determined
acceptable cost. A second important cost is the one incurred for setting up the scoring
procedure. Setup might include operations like creating a computer-based rubric for the
computer-based scoring process. Setup costs may significantly increase expenses for an
automated scoring process and should be taken into consideration when evaluating any
new scoring procedure.
We have dwelt on these criteria for an acceptable scoring procedure at some length because a procedure that seems acceptable on the surface may, on closer analysis, fail to meet one or more of these criteria and must then be rejected.
Scoring with surface features
Surface features of natural language are those characteristics that can be observed directly; that is, no inference is necessary for their identification. An example of a surface feature present in any written or spoken utterance is the number of words in a sentence. There are computer programs designed specifically to identify and extract surface characteristics of texts: grammar-checking programs.
A grammar-checking program automatically checks sentence construction, analyzes one
or several consecutive sentences, and determines whether any grammar rules of the
language have been violated. The idea is certainly worthwhile, but in practice the task is
very difficult. Many grammar-checking programs can correctly account for only 35% to
40% of the errors in any written passage. One program we have evaluated accounts for up
to 60% of the errors. Though grammar checkers do not have a high rate of error-detecting
accuracy, they have proven useful in assisting writers to identify potential problems.
A recent study by Kaplan et al. (1995) developed a decision model for scoring essays
written by nonnative English speakers. It evaluated the effectiveness of a variety of
grammar-checking programs for predicting essay scores and selected the best performing
² We assume that the grade assigned by a human grader is always acceptable and therefore considered correct in the context of comparing human and machine scoring.
grammar checkers to construct the decision model. A total of 936 essays were analyzed.
The essays were first scored by human experts using a set of criteria that corresponded
closely to criteria used in grammar-checking programs. Two hundred forty-two
characteristics across four grammar checkers were classified into categories of balance,
cohesion, concision, discourse, elegance, emphasis, grammar, logic, precision,
punctuation, relation, surface, transition, unity, and usage. Each essay was then
automatically analyzed by calculating the number of errors in each of the categories. A
logistic regression model was built based on these error category aggregates and then used
to predict scores for the 936 essays. There was a 30% agreement between computed and
the manual (human) scores.
One of the conclusions reached in this study was that the aggregation of raw errors into
categories may have been detrimental to the score prediction process. A linguistic analysis
confirmed that the raw error data may be better suited than the aggregate data for
predicting scores.
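The aggregation step the study questions can be sketched as follows. The category names come from the study's list; the individual error types, their category assignments, and the counts are invented for illustration.

```python
from collections import Counter

# Hypothetical raw error counts from a grammar checker for one essay.
raw_errors = Counter({
    "subject-verb agreement": 2,
    "comma splice": 1,
    "passive voice": 3,
    "wordy phrase": 4,
})

# Mapping of raw error types into the study's broader categories.
# The assignments here are illustrative, not the study's actual mapping.
CATEGORY_OF = {
    "subject-verb agreement": "grammar",
    "comma splice": "punctuation",
    "passive voice": "emphasis",
    "wordy phrase": "concision",
}

def aggregate(raw: Counter) -> Counter:
    """Collapse raw error counts into category totals.

    Distinct error types within a category become indistinguishable
    after this step, which is the information loss the study suggests
    hurt score prediction.
    """
    agg = Counter()
    for err, n in raw.items():
        agg[CATEGORY_OF[err]] += n
    return agg

category_counts = aggregate(raw_errors)
```

Feeding `category_counts` rather than `raw_errors` into a regression discards the within-category distinctions, which is why the follow-up study reverts to the raw counts.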
A Characteristics-based Scoring Procedure
We decided to analyze essays using an approach similar to that of Page and Petersen (1995) in conjunction with that developed in Kaplan et al. (1995). The results of the Kaplan study were to guide the scoring procedure. Specifically, the Kaplan study had determined that of the four grammar-checking programs, one was most accurate in predicting essay scores. This grammar checker, RightWriter, was chosen to analyze the essays in the current study. As discussed above, the raw error score data rather than the error score aggregates were taken to construct the predictive regression model. The analysis procedure and results are described in the two following sections.
As indicated, the following analysis builds on earlier work (Kaplan et al., 1995; Page and Petersen, 1995) dealing with the feasibility of computerized scoring of essays. We propose several scoring models that employ surface characteristics, among them the best of the computerized procedures using a grammar checker (RightWriter) from our earlier analysis. Page and Petersen state that the fourth root of the length of an essay was a relatively accurate predictor of the essay scores. Three of our models therefore include the fourth root of essay length as a scoring variable. The criterion for selecting the best of the five models is how well it predicts the average of the human ratings of the essays (a) in the validation sample of 1,014 student essays and (b) in a cross-validation sample of an additional 300 student essays. These are the two sample sets of the Page and Petersen study. The model variants were fitted on the validation sample and then replicated on the cross-validation sample. This validity model assumes that the average over six raters is a reasonably accurate criterion; it has considerable face validity.
The validation sample consisted of 1,014 essays drawn from the computer-based writing assessment part of the Praxis Series. This assessment measures the writing skills of beginning teachers through essays on general topics. The 1,014 human-scored essays had been analyzed previously by the Page procedure but not by the RightWriter model. Table 1 presents the multiple correlation squared on the validation sample for five alternative prediction models:
(1) The RightWriter model (Ml), which includes 18 predictors
(derived from the RightWriter grammar-checking program)
(2) a model that just includes the linear count of words (M2)
(3) a model that just includes the fourth root of the length of the
essay measured by number of words (which seems to be a
component of the PEG model) (M3)
(4) a modification of the Page model that includes both a linear and
fourth root of the length of the essay (M4)
(5) the combined model, i.e., using both the RightWriter predictors
and the linear and fourth root of the length of an essay (M5)
Table 1. Prediction Models on the Validation Sample

  Model  Predictors                                         # of predictors   R²    F        Prob
  M1     RightWriter                                        18
  M2     # of words (linear count)                           1               .46
  M3     # of words (fourth root)                            1               .49
  M4     # of words (linear count and fourth root)           2               .50   502.04   .000
  M5     RightWriter + # of words (linear and fourth root)  20               .61    77.18   .000
Table 1 shows that essay length has a relatively large impact on the prediction of the average of the human ratings. Both the simple count of words per essay and the fourth root of the number of words do significantly better than all 18 predictors from RightWriter in the validation sample. The fact that the fourth root does better than the simple count (.49 vs. .46) suggests that the functional relationship between the number of words in the essay and the average essay rating is not strictly linear.
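This nonlinearity can be checked directly by comparing the fit of a linear and a fourth-root predictor. A minimal sketch on synthetic data (constructed so that ratings grow with the fourth root of length; these are not the Praxis scores):

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Synthetic data in which the rating grows with the fourth root of
# essay length, the concave shape the validation results suggest.
lengths = [100, 200, 300, 400, 500, 600]
ratings = [length ** 0.25 for length in lengths]

r2_linear = r_squared(lengths, ratings)
r2_fourth = r_squared([l ** 0.25 for l in lengths], ratings)
```

When the underlying relationship is concave, the fourth-root predictor yields the higher R², mirroring the .49 vs. .46 gap reported above.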
Figure 1 is a plot of the average score by average essay length for 11 groups of essays
arranged by increasing score intervals along the word-length scale. Essays are assigned a
score on a scale ranging from 0 to 6 with 0 being the poorest score assigned and 6
representing an excellent score. The relationship between essay length and average essay
grades in Figure 1 shows a rather rapid acceleration for 100- to 400-word essays; the rate of
increase slows as the essays become longer.
Figure 1. Average Essay Length (in words) vs. Average Essay Score
The fact that there is a significant increase in R² when information from both the essay length and the RightWriter predictors is included in the equation suggests that, although there is considerable overlap between the two kinds of measures, grammar checkers and measures of essay length are also measuring somewhat different abilities. The total predictable variance on the validation sample, that is, the R² with both the RightWriter predictors and the essay-length predictors in the "full" model (M5), can be partitioned into variance uniquely due to (1) grammar skills, (2) essay length, and (3) shared variance:

Pred. Variance = Variance Unique to Essay Length + Variance Unique to Grammar + Shared Variance
           .61 = .21 + .11 + .29
          100% = 35% + 18% + 47%
Almost 35% of the total predictable variance is unique to the measures of essay length, and about half of that is unique to the grammar measures. Clearly, essay length is measuring a writing skill other than the knowledge and use of grammar assessed by the RightWriter computer program.
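This partition is a standard two-set commonality analysis over nested regressions. In the sketch below, the full-model and essay-length R² values (.61 and .50) come from the validation results; the RightWriter-only R² of .40 is implied by the reported partition rather than stated directly, so treat it as an inference.

```python
def commonality(r2_full, r2_a_only, r2_b_only):
    """Two-set commonality analysis.

    Splits the full-model R-squared into variance unique to predictor
    set A, variance unique to set B, and variance shared between them.
    """
    unique_a = r2_full - r2_b_only
    unique_b = r2_full - r2_a_only
    shared = r2_a_only + r2_b_only - r2_full
    return unique_a, unique_b, shared

# A = essay-length predictors, B = RightWriter grammar predictors.
# The .40 grammar-only R-squared is inferred, not read from Table 1.
u_len, u_gram, shared = commonality(0.61, 0.50, 0.40)
```

The same three-line computation reproduces the cross-validation partition when the cross-validated R² values are substituted.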
The above results, however, need to be replicated on the cross-validation sample to ensure that they do indeed generalize to an independent sample. This is particularly critical when there are many predictors, as in the RightWriter model, and hence greater potential for overfitting. Similarly, nonlinear functional relationships such as those described by essay length may also benefit unduly from overfitting.
Table 2 presents the cross-validated multiple correlations squared for the five models that
were fitted in the validation sample.
All of the models seem to have cross-validated. In fact, there were some gains in
predictive accuracy for the essay length prediction systems. An approximate partition of
variance carried out on the cross-validated results yields the following partition of predictable variance:

Pred. Variance = Variance Unique to Essay Length + Variance Unique to Grammar + Shared Variance
           .65 = .25 + .05 + .35
          100% = 38% + 8% + 54%
Clearly, essay length continues to measure something appearing to be a proxy for
component(s) of writing skills not directly measured by the grammar checkers.
Figures 2 through 6 present additional results on the relative accuracy of the predictions of
each of the models in the cross-validated sample.
Figure 2. Scoring results for M1
Figure 3. Scoring results for M2
Figure 4. Scoring results for M3
Figure 5. Scoring results for M4
Figure 6. Scoring results for M5
The pie charts show the number of scores that exactly match a human score as well as the
number and extent of over- and under-prediction. For example, Figure 2 shows that the RightWriter model (M1) gave exact predictions for 166 out of 300 essays in the cross-validation sample. Similarly, 126 essays were either over- or under-predicted by one scale point on the six-point rating scale. Seven essays were off by two points on the six-point
rating scale. When the RightWriter model (Ml in Figure 2) is compared with the simple
fourth root essay-length model (M3 in Figure 4), we see that the essay-length model
shows a significant improvement over the grammar model in the number of exact “hits”
and shows fewer of the more serious errors in prediction. These results are, of course,
consistent with the multiple correlation results. It would seem that essay length measures
some aspect of writing outside of simple grammar competence. The increased acceleration
in the functional relationship between the number of words and the ratings in the
neighborhood of 100 to 450 words suggests that essay length may be a prox for topic
knowledge, which in turn reflects a coherent writing presentation. Some knowledge of the
topic content would seem to be a necessary condition for a demonstration of writing skills
in this type of essay. In addition to showing content knowledge, it may take 300 to 400
words to present evidence of writing ability. Word counts beyond the 450 or 500 level
become superfluous for demonstrating writing ability. Reading exercises show similar
results where demonstrations of adult literacy are partly a function of how much prior
familiarity the reader has with the content presented.
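The pie-chart tallies reduce to counting how far each predicted score falls from the human score. A sketch of that tabulation (the score vectors here are invented, not the cross-validation data):

```python
from collections import Counter

def agreement_profile(predicted, actual):
    """Tally predictions by how far they fall from the human score.

    Returns a Counter keyed by absolute difference: 0 = exact match,
    1 = off by one scale point on the rating scale, and so on.
    """
    return Counter(abs(p - a) for p, a in zip(predicted, actual))

# Invented six-point-scale scores for illustration only.
human = [4, 3, 5, 2, 4, 3]
model = [4, 4, 5, 2, 3, 1]

profile = agreement_profile(model, human)
```

Applied to the real cross-validation data, `profile[0]` would give the count of exact hits and `profile[1]` the count of one-point over- or under-predictions shown in each pie chart.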
Note (see Appendix) that essay length is an implicit criterion in the rubric for human
scoring of Praxis essays: developing ideas with examples and supporting statements
goes hand in hand with essay length. This might explain in part the strong influence of
essay length as a predictor of holistic scores.
The generalizability of measures of essay length to more scientific topics might be of
interest. In science essays, the functional relationship might even be more nonlinear in the
sense that the point at which additional words become superfluous may occur earlier. This could be readily handled by measuring essay length, possibly taking the fifth or sixth root rather than the fourth root.
Our goal was to explore the possibility of predicting essay scores automatically. We
constructed various scoring models, including two based on data produced by an off-the-shelf grammar-checking program. Four of our models (M2-M5) predicted exact scores
60% or more of the time, and within +/-1 of the score over 90% of the time, within the
cross-validation sample.
These results seem to indicate an extremely viable scoring procedure. However, the results
should be examined with some caution and with regard to how well the scoring procedure
meets the four criteria for scoring procedures of Section 2. Accuracy is not at issue here:
we can achieve, in general, approximately 90% scoring accuracy. Another question
entirely is whether the procedure is defensible. In a commercial or public testing situation,
the question will eventually arise as to how we arrived at a certain score for a particular
item. In the case of our models, we would be hard put to provide a satisfying answer if the
largest contributing factor to a score were the number of words in the essay. Clearly, this
might not be acceptable for a short, but to-the-point essay. Here, the scoring procedure
would have to be augmented with some type of content analysis procedure. And this
would put us back where we started: searching for a scoring procedure that does automated content analysis.
Similarly, if our model were chiefly based on essay length, coachability would be another
problem. A coaching organization could readily prepare students to write essays tailored
to our scoring procedure. In its simplest form, this could be the instruction “to write the
longest essay more or less about the subject specified by the item.” The essay would not
necessarily have to make sense. Even a content analyzer might not be able to resolve the
problem of scoring the essay in a defensible way. Ideally, the scoring procedure would flag such an essay as a candidate for a human grader.
An automated scoring procedure relying on surface characteristics meets the "cost"
criterion. The preparation cost is minimal (no rubric needs to be prepared), and the scoring
procedure itself is fairly rapid.
An automated scoring procedure based solely on surface features of a writing passage,
although cost-effective and, in most cases, accurate, carries with it some notable problems.
First, such a procedure is not defensible unless we know precisely how the predictor(s)
account for the various aspects of writing skills as analyzed by human graders. The
apparent coachability of a procedure based solely on these features would not seem to
make it acceptable in a high-stakes testing program. On the other hand, it might be
acceptable in conjunction with other more complex manual or automated procedures. It
could be deployed to screen papers for subsequent scoring, to add information to a manual scoring process, or to augment a more complex content-based analysis.
Scoring systems based on surface characteristics could be used in conjunction with human scoring of essays as a "second rater." With sufficient diagnostic information in addition to the score computed by such a procedure, such systems may prove useful tools for scoring essays, provided the results are always interpreted by a human grader as well.
References

Kaplan, R. M., Burstein, J., Lu, C., Rock, D., Kaplan, B., & Wolff, S. (1995). Evaluating a prototype essay scoring procedure using off-the-shelf software (Research Report RR-95-21). Princeton, NJ: Educational Testing Service.

Page, E. B. (1966, January). The imminence of grading essays by computer. Phi Delta Kappan, 238-243.

Page, E. B., & Petersen, N. (1995, March). The computer moves into essay grading: Updating the ancient test. Phi Delta Kappan, 561-565.
Appendix

Readers will assign scores based on the following scoring guide. The essays must respond to the
assigned task, although parts of the assignment may be treated by implication.
6 A 6 essay demonstrates a high degree of competence in response to the assignment but
may have a few minor errors.
An essay in this category
- is well organized and coherently developed
- clearly explains or illustrates key ideas
- demonstrates syntactic variety
- clearly displays facility in the use of language
- is generally free from errors in mechanics, usage, and sentence structure
5 A 5 essay demonstrates clear competence in response to the assignment but may have
minor errors.
An essay in this category
- is generally well organized and coherently developed
- explains or illustrates key ideas
- demonstrates some syntactic variety
- displays facility in the use of language
- is generally free from errors in mechanics, usage, and sentence structure
4 A 4 essay demonstrates competence in response to the assignment.
An essay in this category
- is adequately organized and developed
- explains or illustrates some of the key ideas
- demonstrates adequate facility with language
- may display some errors in mechanics, usage, or sentence structure, but not a
consistent pattern of such errors
3 A 3 essay demonstrates some degree of competence in response to the assignment but is
noticeably flawed.
An essay in this category reveals one or more of the following weaknesses:
- inadequate organization or development
- inadequate explanation or illustration of key ideas
- a pattern or accumulation of errors in mechanics, usage, or sentence structure
- limited or inappropriate word choice
2 A 2 essay demonstrates only limited competence and is seriously flawed.
An essay in this category reveals one or more of the following weaknesses:
- weak organization or very little development
- little or no relevant detail
- serious errors in mechanics, usage, sentence structure, or word choice
1 A 1 essay demonstrates fundamental deficiencies in writing skills.
An essay in this category contains serious and persistent writing errors or is incoherent
or is undeveloped.
Copyright © 1984, 1987, 1994 by Educational Testing Service. All rights reserved. Unauthorized reproduction is prohibited.