Testing and Assessment: An Ecological Approach

Clifford Hill

Inaugural Lecture as Arthur I. Gates Professor
in Language and Education
April 2, 1992

Teachers College, Columbia University
© 1992 Clifford Hill. All rights reserved.

I would like to welcome friends, former students, and colleagues from various schools, universities, and other institutions to Teachers College. I extend a special welcome to those who have come some distance to be here. The Hausa people in West Africa have a proverbial saying, Zumunta a k’afa ta ke, which translates literally as “Friendship is in the foot.” Hausa people use the words to remind each other that if they are to keep friendship alive, they have to get up and go to where their friends are. I welcome every friend’s foot that has made it here today. I also welcome those of you who have come out of a particular interest in the subject that I will address.

Throughout the 20th century Teachers College has played a major role in educational testing and assessment; and the person we remember today—Arthur Gates—was a major contributor. He was both a doer and a thinker. Apart from the many reading tests that bear his name, he was actively involved in research on reading. He was critical of certain parochial tendencies in such research; as he once put it, “It is too limited to the more obvious, the more practical problems; it does not show sufficient activity in many promising lines now developing within sociology, anthropology, experimental psychology, and other new types of scientific approach” (1965:3). Somehow linguistics did not make it onto his list, but I’d like to think he would have welcomed the research that doctoral students and I have been doing here at the College during the past few years.

I’d like to open our deliberations with a little story that says a good deal about the rather awkward relations between how we think and act in the real world and how we perform on tests. A young woman took her baby out to the park to get some fresh air. As she was pushing him along in his stroller, an older man stopped to admire him and exclaimed, “My, what a beautiful baby you have there.” The proud mother quickly replied, “Oh, this is nothing. You should see his pictures.”

As we approach the end of the century, public debate has been spirited around the role of assessment in achieving the educational goals articulated in America 2000. A good deal of this debate has centered on what we might call how-questions: Should all students take the same tests and be held to the same standards? If so, who should be responsible for setting the standards and developing the tests? And what kinds of tests should they be? Are paper-and-pencil tests appropriate? If they are, is the multiple-choice format still serviceable, or does it need to be replaced by a more discursive format, one that requires students to write real words and numbers rather than merely shade in bubbles? If we do use discursive tests, should they be supplemented by performance tests that require students to do things: for example, carry out a science experiment? And if we opt for performance tests, how do we go about evaluating what students do?

Or do we need to replace testing itself—or at least supplement it—with methods of documenting student work over an extended period of time? Such methods allow us to examine work of a more complex nature, but how do we insure that an individual has not received too much assistance? Or do we want to forsake the model of the self-sufficient individual and encourage group work? Certainly those in the workplace tell us that the capacity of an individual to work with others is a quality that they highly value. Do we want to assess this quality? If so, how do we go about doing it in a reliable way? As we examine more extended work as well as work carried out with others, we confront evaluation problems of increasing complexity.

This focus on how-questions often obscures two other kinds of questions, which we can call why-questions and what-questions. This afternoon I would like to bring these other kinds of questions into focus. As we address our reasons for doing assessment and what actually goes on as students engage in it, we are led to think in new ways about how we are going to do it. Indeed, if we don’t consider these why- and what-questions, we are likely to end up debating how-questions in a simplistic way. Assessment is such an inherently difficult enterprise, particularly when conducted on a large scale, that we often settle for what is easy and convenient. The very machinery of testing has a way of taking over and dictating practices that we know are not all that good.

In addressing the why- and what-questions of assessment, we can also give meaning to the notion of ecology. I will be using this notion in two ways: the first is related to why-questions and has to do with the fit between assessment practices and fundamental goals of education; the key question here is whether our approach to assessment reinforces educational practices that help us to achieve these goals. The second way of using ecology is related to what-questions and has to do with the integrity of assessment practices; the key question here is whether these practices have an appropriate relation to real-world modes of thinking and doing.

Over the years I have been addressing this question, with particular attention to the assessment of literacy skills. I have used the tools of discourse analysis to take a fairly close look at what specific assessment tasks call for as well as how representative students respond to them. One way of thinking about test demands and student responses is to view each as constituting a set of norms for interpreting text. In examining the test makers’ and test takers’ interpretive norms, we are engaging in what the sociolinguist Dell Hymes (1962) has described as “ethnography of communication.” Another way of thinking about this communication—or what is often miscommunication—is to view the two sets of norms in relation to everyday ways of making sense of text. In other words, how do interpretive norms used in a testing situation relate to those used in real-world reading? This is not an easy question to address, since the way we read varies with what we read, and so the widely different texts that we encounter elicit multiple ways of reading. Such multiplicity does not proceed only from the text but from the reader as well. Our varying ways of interpreting text are ultimately grounded in distinctive patterns of ethnocultural language, thought, and experience.

It is important to remind ourselves that the word ethnocultural applies not just to the test takers but to the test makers as well; as anthropologists have pointed out, school-based literacy—and tests are a particularly vivid species of it—embodies an ethnocultural view of the world. Its very quest for universality provides palpable evidence of its particularistic origins within certain ethnic traditions in Western Europe.

Before I undertake discourse analysis of representative test material, which is my way of addressing what-questions, I would first like to discuss why-questions. Let me say, at the outset, that the ultimate reason we engage in assessment is that it’s our nature to do it. Our everyday experience is shot through with little testings: before we jump into the morning shower, we check to see if the water’s too hot or too cold, and we sip our coffee before we gulp it. And so our day goes—a multitude of verifications, which largely go unnoticed, is there to insure that our interaction with the world will be safe and comfortable. Our social interaction, too, is filled with similar testings; as sociolinguists have pointed out, in many cultures people engage in elaborate greeting rituals to get a sense of each other’s mood and adjust their communication accordingly. In effect, we test that we might know and act.

As inhabitants of a technological world, our continual dependence on testing has become more deliberate. We depend on well-defined procedures to insure that planes fly, bridges stand, and computers remember, as well as stay free from viruses. Proponents of educational testing are drawn to analogies based on medical testing. Just as a CAT-scan can uncover what is in our bodies, so an educational test can uncover what is in our minds. In effect, a test can have crucial diagnostic value—it can provide practitioners the information they need to intervene effectively with individual students.

Another major reason for testing has to do with discriminating among students: selecting them for specialized courses of study, placing them at appropriate levels, or certifying that they have acquired certain knowledge and skills. Indeed, the origins of mass testing were for reasons of selection. As early as the 2nd century B.C., China instituted a national system of exams for determining who would serve in government. The use of testing for purposes of selection is prevalent in modern educational systems. In this country students who graduate from high school take the SAT and those who graduate from college take the GRE or exams for professional schools; and even at lower levels of education, testing is used for selection and placement purposes. Later on we’ll take a brief look at test material used in New York City to screen and place kindergarten children.

Since this is an election year, we are often reminded of another basic reason for educational assessment. Given the massive resources invested in public education, political candidates like to focus on the question of accountability. They point out how we need accountability at every level—teachers and administrators, particular schools, school districts, state systems of education, and, in these days of intense international competition, even our national system of education. Here we can observe the why of assessment driving the how. Once the goal of external monitoring has been established, standardized testing becomes attractive: it is efficient (designed to fit school timetables); it is inexpensive (at least when compared to the labor-intensive methods now being developed); and it yields numbers—numbers which can, in principle, be used to evaluate educational performance at different levels. Both the individual student and the entire nation can get a score.

As we shift to yet another major reason for educational assessment—motivating students to strive for excellence—the case for standardized testing becomes less attractive. Researchers such as Grant Wiggins (1989) point out that such testing is not sufficiently aligned with curriculum and instruction to be an effective means of fostering a high level of achievement in the classroom. Moreover, it has forced teachers, particularly when the stakes are high, to spend too much time preparing students for multiple-choice tasks that focus on low-level details.

These problems lead us directly to what-questions, so let’s now turn to some test material and take a look at how it works. The following material is taken from the Test of Adult Basic Education (1976). This test, which goes by the familiar acronym TABE, is the most widely used test in adult education in this country. You might take a moment and do the task that follows the passage.

The big store looked very far away. Al wondered if he would get there in time. It was very hot and crowded on the sidewalks. He had to squeeze between the people as he walked. At last he came to the big glass doors. The doors swung open and he was soon inside the cool building. He rushed down the stairs to the shoe department. Right in front of him were the black shoes that he had come to buy.

Task 20 (one of 10 tasks based on this passage):

Al was in a hurry to get to the store because he was
excited
hungry
late
tired

I would be curious to know your choices on task 20. If you happen to be like other groups of adults, most of you chose late but a number of you were attracted to excited. You may even have found yourself suspended between the two choices, a not uncommon experience on a multiple-choice test. If so, this suspension may well reflect your capacity to see things from differing points of view. We say that we value such flexibility in education, but it often works against our interests when we take a multiple-choice test that requires quick and clean decisions.

So what’s the case for the choice of late, which is the response that the test makers have designated as correct? Well, to begin with, this is the usual response of adults to this task when it is removed altogether from the passage. To confirm this fact, we gave the task in isolation, with all four choices removed, to 50 students here at the College. Even though the choice late was no longer provided, it was still selected by 41 of them. This choice reflects the strength of an everyday schema in which we associate people hurrying with being late; or to be more precise, with fearing that they will be late. Whether they end up being late may well depend on how effective their hurrying is. This kind of precision is often blurred in multiple-choice questions, where the choices have to approximate each other in surface form. The more precise but cumbersome choice afraid that he would be late is a luxury that test makers cannot afford. As we will see, they are not so willing to tolerate what they take to be imprecision on the part of test takers.

There is a further reason for the choice late—and that’s the presence of the phrase in time that ends the second sentence: “Al wondered if he would get there in time.” When we asked test takers to write down their reasons for choosing late, the majority of them focused on this phrase. Some simply wrote in time, whereas others aligned these words with late in some way (a few even wrote out the equation late=in time). Other test takers were more elaborate, specifying what Al was trying to be in time for; these elaborations, as I will discuss shortly, were largely built around the store closing. But no matter what form their explanations took, nearly all of them were centrally concerned with aligning the target response with a crucial detail in the passage.

Despite these reasons for choosing late, a number of adults, whether experienced or inexperienced readers, are still attracted to the choice of excited. It is, of course, not surprising news that people end up choosing the wrong answer on a test question. Indeed, if a particular question doesn’t sufficiently discriminate among people during its trial runs, it doesn’t even get included on the test. The need for test questions to pull their own weight may well contribute to the rather awkward relations they often bear to the passage; as Hill and Larsen (1983) and Fillmore and Kay (1983) have shown, test passages are often deliberately restructured so that they can accommodate questions that will discriminate among test takers. Such restructuring often results in a violation of discourse norms, which, in turn, leads to a greater divergence of human responses.

So why are certain adults attracted to excited? Three doctoral students—Laurie Anderson, Yasuko Watt, and Scratton Ray—and I conducted research with nearly 500 adult readers to find out the reasons for their choices on task 20 (we worked with such a large number because we included representative samples of non-native speakers, who constitute about two-thirds of enrollment in adult literacy programs in urban areas like New York City). We asked these adults to do three tasks:

to write down their recall of the passage after an exposure of 45 seconds
to answer task 20 and then explain their answer
to estimate Al’s age and then explain their estimate

While doing these tasks, the adults did not have access to the passage; moreover, as they did each task, they did not have access to what they had done on the preceding task.

To give you an idea of how readers handled these tasks, I here include the responses of an experienced reader who selected excited:

(1) recall of the passage
The big store seemed very far away. Al had to walk through a crowd to get there. It was hot & sticky out. At last he came to the big glass doors and opened them to step into the cool building. Then he went downstairs to the shoe department & saw in front of him the black shoes he wanted to buy.

(2) response to task 20 and accompanying explanation
* excited
* It seems as though he is anxious to reach the store and that getting the shoes is his goal. Everything seems magnified like a kid sees the world when he is excited.

(3) estimation of Al’s age and accompanying explanation
* about 10
* Story details (e.g., the building is big) give the feeling of a child and his perceptions.

Her written recall of the passage includes certain details that I have italicized: big store, very far away, big glass doors. These details form what we might call a youth-gestalt. If you look at this woman’s response to the second task, you see that she chose excited and then explained this choice by focusing on how a magnified world suggests a child’s point of view; and if you look at her response to the third task, you see she estimated Al to be about 10 years old and again explained her choice with reference to the point of view. Despite her capacity for precise recall and subtle interpretation, this woman would receive no credit for her response to task 20. From the perspective of the machine scoring the test, she would have marked the wrong bubble.

Our research uncovered robust correlations between what I am calling the youth-gestalt details and the choice of excited. Using a statistical technique known as linear discriminant analysis, we were able to show that it was the presence of these details in the written recalls that best predicted the choice of excited. Apart from these content details, the adults’ recall of unusual phrasing in the original passage strongly predicted the choice of excited as well. In the example above, the woman maintains the sense of excited discovery found in the passage when she carries over the frontshifting of the phrase in front of him:

Reader response:
…saw in front of him the black shoes he wanted to buy.

Original text:
Right in front of him were the black shoes that he had come to buy.

By way of contrast, neither this syntactic detail nor the content details had any predictive value for the choice of late. It was, of course, the recall of the phrase in time that strongly predicted this choice.

When we examine adult responses to the age-estimation task, we uncover other robust correlations. To begin with, those who represented Al as young tended to select excited on task 20. This age-estimation task even uncovered a significant correlation between viewing Al as an old man and selecting tired on task 20. It’s as if the point of view associated with the magnified world can be embodied in an old man as well as a young boy. Either the young or the old can be viewed as experiencing the physical world as overwhelming.

I would like to describe one other small experiment that is particularly revealing. As already mentioned, those selecting late not only focus on the presence of in time but relate this detail to Al reaching the store before it closed. Hence we decided to ask 50 students to complete the second sentence, which is, after all, elliptical. In effect, what was Al trying to be in time for? When we correlated their responses to this question with their responses to task 20, we discovered two firm patterns: (1) those who focused on the store closing tended to select late and (2) those who focused on getting to the shoes tended to select excited.

The second pattern reflects a more active stance toward the passage, for readers are using later information—Al’s concern with buying a particular pair of shoes—to resolve an elliptical structure—in time for what?—that occurs earlier. Certain of these readers even pointed out that Al’s concern with time is not related to the store closing, since this concern is further manifested even when he is inside the store, as indicated by the sentence, “He rushed down the stairs to the shoe department.” Clearly these readers are attentive to local detail, but they are concerned with integrating such detail into a larger whole. As we well know, the use of later information to resolve earlier indeterminacies is fundamental to real-world reading, but in the case of a reading test this more active stance toward reading often needs to be suppressed.

In the passage above, if readers do not have access to later text, they are forced to resolve the elliptical second sentence by focusing on the store closing. We confirmed this fact by conducting another small experiment in which readers had access to only the first two sentences:

The big store looked very far away. Al wondered if he would get there in time.

When we asked the question “in time for what?” nearly all referred to the store closing; at this point in the story they have nothing to work with but the fact that there’s a big store far away.

The confusing nature of this test material can be better understood if we trace where it comes from. In developing a test for adults, the test makers have adapted material from the California Achievement Test (1970), a test designed for children in elementary school (using material written for children in teaching adults is not uncommon). In order to facilitate your comparing the two versions of the test material, I have placed the children’s version on the left and the later version for adults on the right:

In examining the two versions of the passage, we see that they differ only minimally (the five places changed in the original are italicized and marked by an arrow). When we turn to the task, however, we discover a difference of major consequence: the target response for the children’s version, as indicated by the italics, is excited rather than late. How can the task reflect a change of such magnitude when the passage changed only minimally? To try and answer this question, I decided to stop off in Monterey, California, and talk to the test makers. As I have often discovered, detective work in our great test factories does not yield all that much, so I have been forced to construct a scenario that might explain this shift in target response.

The Examiner’s Manual for the adult test states that the guiding principle in adapting the material designed for children was to preserve it as much as possible. This quest for stability is not surprising when one learns that the psychometric norming carried out on the original material was simply transferred to the adapted material, a practice which, of course, violates the standards of the testing industry. The manual goes on to say that the material was occasionally changed so that it might “reflect adult usage, experience, and interests” (1976:3). It is for this reason, I suppose, that we get black shoes rather than shiny skates. We also get a change from two characters to one, presumably because the test makers didn’t want us to imagine two grown men out shopping together for shoes.

Once these changes were made, the test makers faced a problem. It’s one thing to imagine two boys excited about shiny skates but quite another to imagine a grown man excited about black shoes. Clearly the test makers had to do something, so they decided to go for late rather than excited. After all, they were now working with an adult world where people tend to run behind schedule and worry about being late. They could thus solve their problem by inserting the detail in time, which would lead a seasoned test taker to select late. In effect, such a test taker would know to draw on one of their own crucial interpretive norms: a literal interpretation based on local detail takes precedence over a more inferential interpretation based on holistic pattern. But it is, of course, holistic inferencing that is more characteristic of what we ordinarily do in real-world reading.