Incomplete Measures

In K-12 accountability, are we answering the wrong questions well and the right questions poorly? by JOHN R. TANNER

Large-scale test scores are the most visible representation of what happens in schools in this country. The prevailing notion is that they will tell us most of what we need to know about a school, including the quality of instruction and the effectiveness of the teaching and administrative staff.

Real estate agents use standardized test scores to convince prospective buyers of the quality of neighborhoods. These scores drive most school, district and state accountability systems, determine allocation of scarce resources and are the source for much of the negative rhetoric regarding “our failing schools” — all this from standardized tests primarily in reading and math.

John Tanner is executive director of Test Sense in San Antonio, Texas.

We put such credence in test scores even though most people would be hard-pressed to explain how tests do whatever it is they do, whether a given use is in line with the design and whether the resulting decisions are even valid. This lack of understanding leads many to treat tests and test scores in ways that were never intended, and the consequences, especially when tied to accountability, are huge.

Accountability’s Core
The primary methodology of standardized testing was developed years ago to measure achievement quickly and efficiently within a particular domain. Standardized tests did this rather ingeniously by taking a sample of items from the easiest-to-measure part of a domain and allowing researchers to make statistical inferences from that sample to the broader and certainly richer domain itself. In that way, inexpensive and easy-to-use instruments could serve as a fairly effective proxy for something larger and more complicated.

To produce valid results, it was essential to administer the tests without altering the tested environment, so the skew within the tested content wouldn’t pose a problem. In other words, as long as teachers stuck to the curriculum and didn’t adjust it in anticipation of a much narrower and limited test instrument, the test could do its job. The results could provide an indication of achievement among students, identify relative strengths and weaknesses that could be further explored by instruments designed to do more fine-grained work (such as reading diagnostics), or determine whether what was being taught in one school was in line with what was being taught elsewhere. Such information, in the hands of a thoughtful practitioner, was most useful.

Two questions are at the heart of any working system of accountability: (1) Did what was intended to happen actually happen? and (2) What is the external consequence or value of what happened? Translated to the education world, we could ask the questions this way: Did the internal process of teaching and learning occur as efficaciously as possible, and does what was taught have relevance in the broader world?

Both questions, whether in education or otherwise, must be answered in the same breath, but certainly not by the same instruments, as that assumes a simplicity or sameness to the answers that just doesn’t exist. Consider, for example, that people can do good work that just doesn’t matter; get away with poor work if it can’t be examined against the larger context; or answer one question and assume it answers both, possibly forcing an organization to adopt behaviors that are counterproductive but perceived as otherwise. (Teaching to the test is a notable example of the last.)

The standardized-test methodology was in fact well-suited to answering the relevance question through its ability to provide meaningful comparisons across and among populations of students, so its selection as an accountability instrument was not without reason. However, what has since become clear is that its selection was made without considering whether the instrumentation could support the same inferences in a different environment.

Selecting standardized tests as the methodology by which to answer both accountability questions has generated at least three straightforward problems:

•  The tests were designed to produce valid results about the larger tested domain under particular conditions that no longer exist once accountability places virtually all the attention on the tested content. 

•  If both answers are to be determined from the same instrument, at best you’ve answered one question effectively and been forced to accept another answer that is politically or otherwise expedient, but unlikely to be meaningful. In the case of education, it is clear we attempt to answer the internal efficacy question with a test designed to answer the external relevance question. 

•  Finally, we get a double whammy of sorts in that the first error presents us with a compromised tool that doesn’t do what we think it does, and then we compound the problem as we commit the second error and use the compromised tool to answer both accountability questions. The tool might answer one question (external relevance), but it was never designed to answer the other (internal efficacy), even in idealized conditions. Yet policymakers and much of the public apparently assume these tests provide all the answers they need.

Efficacy and Relevance
Our assessment and accountability systems based on testing seem so cumbersome and ineffective because they represent a square peg trying to force its way into a round hole. Still, current policy encourages us to commit the errors mentioned above and then assumes the problems are in the schools and not in the system designed to measure them and hold them accountable.

Now consider the consequences of two facts: state tests still largely skew toward the least-challenging, easiest-to-measure aspects of a domain, and the public and policymakers presume test scores are commensurate with the goals of education in terms of both efficacy and relevance.

First, because accountability is tied to traditional tests that are likely to alter the environment they are designed to assess, inferences about the larger domain are difficult to come by. Any inferences from that altered environment are now limited to the tested material as opposed to the standards or the broader domain, and any response or reaction that is based on the test will likely skew in that direction.

Second, by tying both classroom and school success to these tests (the efficacy and the relevance questions), we send a message as clear as it is erroneous: that test scores fully reflect what happens in classrooms and that rising test scores equate to real and meaningful classroom and school success.

But the tests at best were always and only capable of answering the external relevance question. Assuming they can answer the internal efficacy question also forces that question to be answered by a blunt instrument designed for a different purpose. If, as I’ve suggested, the relevance question is also answered poorly, schools are left to answer two important questions with data that are unlikely to be up to the task.

The frustrated response of teaching to the test is understandable, given that this is where the policy says success will be had. That policy continues to ignore the fact that the methodology it invokes fails the instant the policy is put in place, because the policy changes the conditions under which the measure was designed to be used. Teachers are left to pick up the pieces, but doing a better job of teaching to a test of easy-to-measure material and seeing test scores rise as a result can hardly be said to have anything to do with meaningful success.

Finally, because a single answer is given to both the internal efficacy and the external relevance questions, the tests have assumed a role for which they were even less prepared: that of a primary driver of curricular change and instructional practice. At best, educators can use the standardized-testing methodology as an indicator of areas to consider for change, but not as the driver of those changes, as it lacks the specificity necessary to determine what those changes should look like.

Having a system that forces an incomplete measure into an incomplete accountability system is a difficult problem to address in practice. Once understood, the situation demands either that policymakers make better policies or that teachers and administrators refuse to accept the erroneous educational assumptions that permeate our systems and do what is right for their students.

The good news here is that the system teeters just enough on a few sound principles that the ever-coveted rising test scores are just as likely to be the reward of sound practice as of the bad practices now so often engendered in response.

A Few Practices
Whatever the policy landscape looks like now or in the future, the two questions at the heart of real educational accountability (how efficacious were we in the classroom, and did what the student learned have relevance in a broader context) are more important than ever. These questions are worthy of a system designed to actually answer them, not one that pretends to do so and then asks us to act on the pretense. I sincerely hope policymakers will support such a system, with measures and data appropriate to each question.

In the meantime, I will take a pessimistic view and assume our policies aren’t going to change for the better. Teachers must teach, students must learn, and if test scores can improve as a result of doing both of those properly in the face of pressure to do otherwise, so much the better.

Consider just a few practices educators can implement now that treat tests for what they are and can help make the case for a better system of accountability:

•  Teachers can and should address the full richness of the standards in a curriculum designed to support that richness, ignoring the fact they will administer a test at the end of the year. Ironically, as they do that, the test methodology will more effectively do its job of indicating trends, providing quality comparative data and providing an overall sense of student success within the tested domain.

•  Administrators should be open about and publicly supportive of the fact that while teaching to a test may help a few extra students pass, quality instruction against the richness of the curriculum and the standards is the best way to create long-term sustainable increases in student achievement. 

•  Educators should adjust the curriculum only in light of data that address the full depth and breadth of the curriculum and the standards. If data skew toward one end, the decisions based on those data will be skewed as well. In addition, teachers should change their instruction based on data specifically geared to that purpose. 

•  Extensive test prep violates the very nature of what the instruments are designed to measure and is done at the expense of practices that are far more beneficial to students. Test preparation is a short-term fix for a long-term problem and fails to provide what students truly need to achieve. Test prep is not instruction, contributes almost nothing to a student’s long-term success and is done largely out of fear of the system rather than concern for the student.

•  Educators must be crystal clear about the purpose of a measure or data point and whether it answers the internal efficacy or the external relevance question.

Rich Reflection
Honest answers to some bold questions should prompt a rich reflection on these and other practices:

•  Where do we believe success in education should be reflected — on the test or in the classroom?

While this is intended as a rhetorical question, we must be honest about the message we send to teachers and administrators on a daily basis because all too often the inadvertent answer is “the test.” When the answer is “in the classroom,” most of our current accountability tests can serve as broad indicators of achievement. (Not that they shouldn’t be much improved, but they need not be treated as they so often are now.)

•  Have we asked teachers, even inadvertently, to choose between doing what is best for the system and doing what is best for their students?

If teaching to the test is a solution, the system is being privileged, quite probably at the expense of students it is supposed to serve.

•  What tools are we using to drive curricular changes?

Are we relying on large-scale state tests based on a design that precludes them from driving specific curricular change or on thoughtful conversations based on measures that encompass the full richness of our expectations? 

•  Do we understand that in spite of the policy in its current format (No Child Left Behind), we can still answer the accountability questions concerning efficacy and relevance appropriately?

This requires real leadership to treat the current tests as the external measure that they are and to generate data and information appropriate to the efficacy measure, but the effort will be worthwhile.

A Challenging Feat
The good news is that today’s school leaders have opportunities to treat the tests as they were designed to be treated, to help teachers make good decisions for all students and to ensure the curriculum and the test do not become synonymous, all within (and while we work to improve upon) the current accountability model.

Accomplishing these feats will be challenging when so many people, many in decision-making positions, believe test scores tell them all they need to know about our schools. Having a basic understanding of what tests were designed to do, how accountability systems work and how the two can be brought together to a positive end is critical.

And just because acting differently and rightly represents a bold challenge doesn’t mean it shouldn’t be done. It should. The most exciting thing is knowing that we need not wait for some unrealized future in order to act. The most challenging thing, of course, is to do so.

John Tanner is executive director at Test Sense in San Antonio, Texas. E-mail: johnt@testsense.com