Testing and Assessment: An Ecological Approach
Clifford Hill
Inaugural Lecture as Arthur I. Gates Professor in Language and Education
April 2, 1992
Teachers College, Columbia University
© 1992 Clifford Hill. All rights reserved.
I
would like to welcome friends, former students, and colleagues
from various schools, universities, and other institutions to Teachers
College. I extend a special welcome to those who have come some
distance to be here. The Hausa people in West Africa have a proverbial
saying, Zumunta a k'afa ta ke, which
translates literally as "Friendship is in the foot." Hausa people
use the words to remind each other that if they are to keep friendship
alive, they have to get up and go to where their friends are. I
welcome every friend's foot that has made it here today. I also
welcome those of you who have come out of a particular interest
in the subject that I will address.
Throughout
the 20th century Teachers College has played a major role in educational
testing and assessment; and the person we remember today—Arthur
Gates—was a major contributor. He was both a doer and a thinker.
Apart from the many reading tests that bear his name, he was actively
involved in research on reading. He was critical of certain parochial
tendencies in such research; as he once put it, "It is too limited
to the more obvious, the more practical problems; it does not show
sufficient activity in many promising lines now developing within
sociology, anthropology, experimental psychology, and other new
types of scientific approach" (1965:3). Somehow linguistics did
not make it onto his list, but I'd like to think he would have
welcomed the research that doctoral students and I have been doing
here at the College during the past few years.
I'd
like to open our deliberations with a little story that says a
good deal about the rather awkward relations between how we think
and act in the real world and how we perform on tests. A young
woman took her baby out to the park to get some fresh air. As she
was pushing him along in his stroller, an older man stopped to
admire him and exclaimed, "My, what a beautiful baby you have there." The
proud mother quickly replied, "Oh, this is nothing. You should
see his pictures."
As
we approach the end of the century, public debate has been spirited
around the role of assessment in achieving the educational goals
articulated in America 2000. A good deal of this
debate has centered on what we might call how-questions: Should
all students take the same tests and be held to the same standards?
If so, who should be responsible for setting the standards and
developing the tests? And what kinds of tests should they be? Are
paper-and-pencil tests appropriate? If they are, is the multiple-
choice format still serviceable, or does it need to be replaced
by a more discursive format, one that requires students to write
real words and numbers rather than merely shade in bubbles? If
we do use discursive tests, should they be supplemented by performance
tests that require students to do things: for example, carry out
a science experiment? And if we opt for performance tests, how
do we go about evaluating what students do?
Or
do we need to replace testing itself—or at least supplement it—with
methods of documenting student work over an extended period of
time? Such methods allow us to examine work of a more complex nature,
but how do we insure that an individual has not received too much
assistance? Or do we want to forsake the model of the self-sufficient
individual and encourage group work? Certainly those in the workplace
tell us that the capacity of an individual to work with others
is a quality that they highly value. Do we want to assess this
quality? If so, how do we go about doing it in a reliable way?
As we examine more extended work as well as work carried out with
others, we confront evaluation problems of increasing complexity.
This
focus on how-questions often obscures two other kinds of questions,
which we can call why-questions and what-questions. This
afternoon I would like to bring these other kinds of questions
into focus. As we address our reasons for doing assessment and
what actually goes on as students engage in it, we are led to think
in new ways about how we are going to do it. Indeed, if we don't
consider these why- and what-questions, we are likely to end up
debating how-questions in a simplistic way. Assessment is such
an inherently difficult enterprise, particularly when conducted
on a large scale, that we often settle for what is easy and convenient.
The very machinery of testing has a way of taking over and dictating
practices that we know are not all that good.
In
addressing the why- and what-questions of assessment, we can also
give meaning to the notion of ecology. I will be using this notion
in two ways: the first is related to why-questions and has to do
with the fit between assessment practices and fundamental goals
of education; the key question here is whether our approach to
assessment reinforces educational practices that help us to achieve
these goals. The second way of using ecology is related to what-questions
and has to do with the integrity of assessment practices; the key
question here is whether these practices have an appropriate relation
to real-world modes of thinking and doing.
Over
the years I have been addressing this question, with particular
attention to the assessment of literacy skills. I have used the
tools of discourse analysis to take a fairly close look at what
specific assessment tasks call for as well as how representative
students respond to them. One way of thinking about test demands
and student responses is to view each as constituting a set of
norms for interpreting text. In examining the test makers' and
test takers' interpretive norms, we are engaging in what the sociolinguist
Dell Hymes (1962) has described as "ethnography of communication." Another
way of thinking about this communication—or what is often miscommunication—is
to view the two sets of norms in relation to everyday ways of making
sense of text. In other words, how do interpretive norms used in
a testing situation relate to those used in real-world reading?
This is not an easy question to address, since the way we read
varies with what we read, and so the widely different texts that
we encounter elicit multiple ways of reading. Such multiplicity
does not proceed only from the text but from the reader as well.
Our varying ways of interpreting text are ultimately grounded in
distinctive patterns of ethnocultural language, thought, and experience.
It
is important to remind ourselves that the word ethnocultural applies
not just to the test takers but to the test makers as well; as
anthropologists have pointed out, school-based literacy—and tests
are a particularly vivid species of it—embodies an ethnocultural
view of the world. Its very quest for universality provides palpable
evidence of its particularistic origins within certain ethnic traditions
in Western Europe.
Before
I undertake discourse analysis of representative test material,
which is my way of addressing what-questions, I would first like
to discuss why-questions. Let me say, at the outset, that
the ultimate reason we engage in assessment is because it's our
nature to do it. Our everyday experience is shot through with little
testings: before we jump into the morning shower, we check to see
if the water's too hot or too cold, and we sip our coffee before
we gulp it. And so our day goes—a multitude of verifications, which
largely go unnoticed, is there to insure that our interaction with
the world will be safe and comfortable. Our social interaction,
too, is filled with similar testings; as sociolinguists have pointed
out, in many cultures people engage in elaborate greeting rituals
to get a sense of each other's mood and adjust their communication
accordingly. In effect, we test that we might know and act.
As
inhabitants of a technological world, our continual dependence
on testing has become more deliberate. We depend on well-defined
procedures to insure that planes fly, bridges stand, and computers
remember, as well as stay free from viruses. Proponents of educational
testing are drawn to analogies based on medical testing. Just as
a CAT-scan can uncover what is in our bodies, so an educational
test can uncover what is in our minds. In effect, a test can have
crucial diagnostic value—it can provide practitioners the information
they need to intervene effectively with individual students.
Another
major reason for testing has to do with discriminating among students:
selecting them for specialized courses of study, placing them at
appropriate levels, or certifying that they have acquired certain
knowledge and skills. Indeed, the origins of mass testing were
for reasons of selection. As early as the 2nd century B.C., China
instituted a national system of exams for determining who would
serve in government. The use of testing for purposes of selection
is prevalent in modern educational systems. In this country students
who graduate from high school take the SAT and those who graduate
from college take the GRE or exams for professional schools; and
even at lower levels of education, testing is used for selection
and placement purposes. Later on we'll take a brief look at test
material used in New York City to screen and place kindergarten
children.
Since
this is an election year, we are often reminded of another basic
reason for educational assessment. Given the massive resources
invested in public education, political candidates like to focus
on the question of accountability. They point out how we need accountability
at every level—teachers and administrators, particular schools,
school districts, state systems of education, and, in these days
of intense international competition, even our national system
of education. Here we can observe the why of assessment
driving the how. Once the goal of external monitoring has
been established, standardized testing becomes attractive: it is
efficient (designed to fit school timetables); it is inexpensive
(at least when compared to the labor-intensive methods now being
developed); and it yields numbers—numbers which can, in principle,
be used to evaluate educational performance at different levels.
Both the individual student and the entire nation can get a score.
As
we shift to yet another major reason for educational assessment—motivating
students to strive for excellence—the case for standardized testing
becomes less attractive. Researchers such as Grant Wiggins (1989)
point out that such testing is not sufficiently aligned with curriculum
and instruction to be an effective means of fostering a high level
of achievement in the classroom. Moreover, it has forced teachers,
particularly when the stakes are high, to spend too much time preparing
students for multiple-choice tasks that focus on low-level details.
These
problems lead us directly to what-questions, so let's now turn
to some test material and take a look at how it works. The following
material is taken from the Test of Adult Basic
Education (1976). This test, which goes by the familiar acronym
TABE, is the most widely used test in adult education in this country.
You might take a moment and do the task that follows the passage.
The
big store looked very far away. Al wondered if he would get there
in time. It was very hot and crowded on the sidewalks. He had to
squeeze between the people as he walked. At last he came to the
big glass doors. The doors swung open and he was soon inside the
cool building. He rushed down the stairs to the shoe department.
Right in front of him were the black shoes that he had come to
buy.
Task
20 (one of 10 tasks based on this passage):
Al was
in a hurry to get to the store because he was
excited
hungry
late
tired
I
would be curious to know your choices on task 20. If you happen
to be like other groups of adults, most of you chose late but
a number of you were attracted to excited. You may even
have found yourself suspended between the two choices, a not uncommon
experience on a multiple-choice test. If so, this suspension may
well reflect your capacity to see things from differing points
of view. We say that we value such flexibility in education, but
it often works against our interests when we take a multiple-choice
test that requires quick and clean decisions.
So
what's the case for the choice of late, which is the response
that the test makers have designated as correct? Well, to begin
with, this is the usual response of adults to this task when it
is removed altogether from the passage. To confirm this fact, we
gave the task in isolation, with all four choices removed, to 50
students here at the College. Even though the choice late was
no longer provided, it was still supplied by 41 of them. This choice
reflects the strength of an everyday schema in which we associate
people hurrying with being late; or to be more precise, with fearing
that they will be late. Whether they end up being late may well
depend on how effective their hurrying is. This kind of precision
is often blurred in multiple-choice questions, where the choices
have to approximate each other in surface form. The more precise
but cumbersome choice afraid that he would be late is
a luxury that test makers cannot afford. As we will see, they are
not so willing to tolerate what they take to be imprecision on
the part of test takers.
There
is a further reason for the choice late—and that's the presence
of the phrase in time that ends the second sentence: "Al
wondered if he would get there in time." When we asked test takers
to write down their reasons for choosing late, the majority
of them focused on this phrase. Some simply wrote in time, whereas
others aligned these words with late in some way (a few
even wrote out the equation late=in time). Other
test takers were more elaborate, specifying what Al was trying
to be in time for; these elaborations, as I will discuss shortly,
were largely built around the store closing. But no matter what
form their explanations took, nearly all of them were centrally
concerned with aligning the target response with a crucial detail
in the passage.
Despite
these reasons for choosing late, a number of adults, whether
experienced or inexperienced readers, are still attracted to the
choice of excited. It is, of course, not surprising news
that people end up choosing the wrong answer on a test question.
Indeed, if a particular question doesn't sufficiently discriminate
among people during its trial runs, it doesn't even get included
on the test. The need for test questions to pull their own weight
may well contribute to the rather awkward relations they often
bear to the passage; as Hill and Larsen (1983) and Fillmore and
Kay (1983) have shown, test passages are often deliberately restructured
so that they can accommodate questions that will discriminate among
test takers. Such restructuring often results in a violation of
discourse norms, which, in turn, leads to a greater divergence
of human responses.
So
why are certain adults attracted to excited? Three doctoral
students—Lauric Anderson, Yasuko Watt, Scratton Rayand—and I conducted
research with nearly 500 adult readers to find out the reasons
for their choices on task 20 (we worked with such a large number
because we included representative samples of non-native speakers,
who constitute about two-thirds of enrollment in adult literacy
programs in urban areas like New York City). We asked these adults
to do three tasks:
- to write down
their recall of the passage after an exposure of 45 seconds
- to answer
task 20 and then explain their answer
- to estimate Al's age and then explain their estimate
While
doing these tasks, the adults did not have access to the passage;
moreover, as they did each task, they did not have access to what
they had done on the preceding task.
To
give you an idea of how readers handled these tasks, I here include
the responses of an experienced reader who selected excited:
(1) recall of
the passage
The big store seemed very far away. Al had
to walk through a crowd to get there. It was hot & sticky out. At last he
came to the big glass doors and opened them to step into the cool building.
Then he went downstairs to the shoe department and saw in front of him the black
shoes he wanted to buy.
(2) response
to task 20 and accompanying explanation
* excited
* It seems as though he is anxious to reach the store and that getting the
shoes is his goal. Everything seems magnified like a kid sees the world when
he is excited.
(3) estimation of Al's age and accompanying explanation
* about 10
* Story details (e.g., the building is big) give the feeling of a child and
his perceptions.
Her
written recall of the passage includes certain details that I have
italicized: big store, very far away, big glass doors. These
details form what we might call a youth-gestalt. If you
look at this woman's response to the second task, you see that
she chose excited and then explained this choice by focusing
on how a magnified world suggests a child's point of view; and
if you look at her response to the third task, you see she estimated
Al to be about 10 years old and again explained her choice with
reference to the point of view. Despite her capacity for precise
recall and subtle interpretation, this woman would receive no credit
for her response to task 20. From the perspective of the machine
scoring the test, she would have marked the wrong bubble.
Our research uncovered
robust correlations between what I am calling the youth-gestalt details
and the choice of excited. Using a statistical technique known
as linear discriminant analysis, we were able to show that it was
the presence of these details in the written recalls that best predicted
the choice of excited. Apart from these content details, the
adults' recall of unusual phrasing in the original passage strongly
predicted the choice of excited as well. In the example above,
the woman maintains the sense of excited discovery found in the passage
when she carries over the frontshifting of the phrase in front of
him:
Reader response:
...saw in front of him the black shoes he wanted to buy.
Original text:
Right in front of him were the black shoes that he had come to buy.
By
way of contrast, neither this syntactic detail nor the content
details had any predictive value for the choice of late.
It was, of course, the recall of the phrase in time that
strongly predicted this choice.
When
we examine adult responses to the age-estimation task, we uncover
other robust correlations. To begin with, those who represented
Al as young tended to select excited on task 20. This age-estimation
task even uncovered a significant correlation between viewing Al
as an old man and selecting tired on task 20. It's as if
the point of view associated with the magnified world can be embodied
in an old man as well as a young boy. Either the young or the old
can be viewed as experiencing the physical world as overwhelming.
I
would like to describe one other small experiment that is particularly
revealing. As already mentioned, those selecting late not
only focus on the presence of in time but relate this detail
to Al reaching the store before it closed. Hence we decided to
ask 50 students to complete the second sentence, which is, after
all, elliptical. In effect, what was Al trying to be in time for?
When we correlated their responses to this question with their
responses to task 20, we discovered two firm patterns: (1) those
who focused on the store closing tended to select late and
(2) those who focused on getting to the shoes tended to select excited.
The
second pattern reflects a more active stance toward the passage,
for readers are using later information—Al's concern with buying
a particular pair of shoes—to resolve an elliptical structure—in
time for what?—that occurs earlier. Certain of these readers even
pointed out that Al's concern with time is not related to the store
closing, since this concern is further manifested even when he
is inside the store, as indicated by the sentence, "He rushed down
the stairs to the shoe department." Clearly these readers are attentive
to local detail, but they are concerned with integrating such detail
into a larger whole. As we well know, the use of later information
to resolve earlier indeterminacies is fundamental to real-world
reading, but in the case of a reading test this more active stance
toward reading often needs to be suppressed.
In
the passage above, if readers do not have access to later text,
they are forced to resolve the elliptical second sentence by focusing
on the store closing. We confirmed this fact by conducting another
small experiment in which readers had access to only the first
two sentences:
The big
store looked very far away. Al wondered if he would get there in time.
When
we asked the question "in time for what?" nearly all referred to
the store closing; at this point in the story they have nothing
to work with but the fact that there's a big store far away.
The
confusing nature of this test material can be better understood
if we trace where it comes from. In developing a test for adults,
the test makers have adapted material from the California Achievement Test (1970),
a test designed for children in elementary school (using material
written for children in teaching adults is not uncommon). In order
to facilitate your comparing the two versions of the test material,
I have placed the children's version on the left and the later
version for adults on the right:

In
examining the two versions of the passage, we see that they differ
only minimally (the five places changed in the original are italicized
and marked by an arrow). When we turn to the task, however, we
discover a difference of major consequence: the target response
for the children's version, as indicated by the italics, is excited rather
than late. How can the task reflect a change of such magnitude
when the passage changed only minimally? To try and answer this
question, I decided to stop off in Monterey, California, and talk
to the test makers. As I have often discovered, detective work
in our great test factories does not yield all that much, so I
have been forced to construct a scenario that might explain this
shift in target response.
The
Examiner's Manual for the adult test states that the guiding principle
in adapting the material designed for children was to preserve
it as much as possible. This quest for stability is not surprising
when one learns that the psychometric norming carried out on the
original material was simply transferred to the adapted material,
a practice which, of course, violates the standards of the testing
industry. The manual goes on to say that the material was occasionally
changed so that it might "reflect adult usage, experience, and
interests" (1976:3). It is for this reason, I suppose, that we
get black shoes rather than shiny skates. We also get a change
from two characters to one, presumably because the test makers
didn't want us to imagine two grown men out shopping together for
shoes.
Once
these changes were made, the test makers faced a problem. It's
one thing to imagine two boys excited about shiny skates but quite
another to imagine a grown man excited about black shoes. Clearly
the test makers had to do something, so they decided to go for late rather
than excited. After all, they were now working with an adult
world where people tend to run behind schedule and worry about
being late. They could thus solve their problem by inserting the
detail in time, which would lead a seasoned test
taker to select late. In effect, such a test taker would
know to draw on one of their own crucial interpretive norms: a
literal interpretation based on local detail takes precedence over
a more inferential interpretation based on holistic pattern. But
it is, of course, holistic inferencing that is more characteristic
of what we ordinarily do in real-world reading.