Evaluating Educational Programs

Part of the Methodology of Educational Measurement and Assessment book series (MEMA)


This chapter was written by Samuel Ball and originally published in 1979 by Educational Testing Service and republished posthumously in 2011 as a research report in the ETS R&D Scientific and Policy Contributions Series. Ball was one of ETS’s most active program evaluators for 10 years and directed several pacesetting studies, including a large-scale evaluation of the educational effects of Sesame Street. The chapter documents the vigorous program of evaluation research conducted at ETS in the 1960s and 1970s, which helped lay the foundation for what was then a fledgling field. This work developed new viewpoints, techniques, and skills for systematically assessing educational programs and led to the creation of principles for program evaluation that still appear relevant today.

11.1 An Emerging Profession

Evaluating educational programs is an emerging profession, and Educational Testing Service (ETS) has played an active role in its development. The term program evaluation only came into wide use in the mid-1960s, when efforts at systematically assessing programs multiplied. The purpose of this kind of evaluation is to provide information to decision makers who have responsibility for existing or proposed educational programs. For instance, program evaluation may be used to help make decisions concerning whether to develop a program (needs assessment), how best to develop a program (formative evaluation), and whether to modify, or even continue, an existing program (summative evaluation).

Needs assessment is the process by which one identifies needs and decides upon priorities among them. Formative evaluation refers to the process involved when the evaluator helps the program developer—by pretesting program materials, for example. Summative evaluation is the evaluation of the program after it is in operation. Arguments are rife among program evaluators about what kinds of information should be provided in each of these forms of evaluation.

In general, the ETS posture has been to try to obtain the best—that is, the most relevant, valid, and reliable—information that can be obtained within the constraints of cost and time and the needs of the various audiences for the evaluation. Sometimes, this means a tight experimental design with a national sample; at other times, the best information might be obtained through an intensive case study of a single institution. ETS has carried out both traditional and innovative evaluations of both traditional and innovative programs, and staff members also have cooperated with other institutions in planning or executing some aspects of evaluation studies. Along the way, the work by ETS has helped to develop new viewpoints, techniques, and skills.

11.2 The Range of ETS Program Evaluation Activities

Program evaluation calls for a wide range of skills, and evaluators come from a variety of disciplines: educational psychology, developmental psychology, psychometrics, sociology, statistics, anthropology, educational administration, and a host of subject matter areas. As program evaluation began to emerge as a professional concern, ETS changed, both structurally and functionally, to accommodate it. The structural changes were not exclusively tuned to the needs of conducting program evaluations. Rather, program evaluation, like the teaching of English in a well-run high school, became to some degree the concern of virtually all the professional staff. Thus, new research groups were added, and they augmented the organization’s capability to conduct program evaluations.

The functional response was many-faceted. Two of the earliest evaluation studies conducted by ETS indicate the breadth of the range of interest. In 1965, collaborating with the Pennsylvania State Department of Education, Henry Dyer of ETS set out to establish a set of educational goals against which the performance of the state’s educational system could later be evaluated (Dyer 1965a, b). A unique aspect of this endeavor was Dyer’s insistence that the goal-setting process be opened up to strong participation by the state’s citizens and not left solely to a professional or political elite. (In fact, ETS program evaluation has been marked by a strong emphasis, when at all appropriate, on obtaining community participation.)

The other early evaluation study in which ETS was involved was the now famous Coleman report (Equality of Educational Opportunity), issued in 1966 (Coleman et al. 1966). ETS staff, under the direction of Albert E. Beaton, had major responsibility for analysis of the massive data generated (see Beaton and Barone, Chap. 8, this volume). Until then, studies of the effectiveness of the nation’s schools, especially with respect to programs’ educational impact on minorities, had been small-scale. So the collection and analysis of data concerning tens of thousands of students and hundreds of schools and their communities were new experiences for ETS and for the profession of program evaluation.

In the intervening years, the Coleman report (Coleman et al. 1966) and the Pennsylvania Goals Study (Dyer 1965a, b) have become classics of their kind, and from these two auspicious early efforts, ETS has become a center of major program evaluation. Areas of focus include computer-aided instruction, aesthetics and creativity in education, educational television, educational programs for prison inmates, reading programs, camping programs, career education, bilingual education, higher education, preschool programs, special education, and drug programs. (For brief descriptions of ETS work in these areas, as well as for studies that developed relevant measures, see the appendix.) ETS also has evaluated programs relating to year-round schooling, English as a second language, desegregation, performance contracting, women’s education, busing, Title I of the Elementary and Secondary Education Act (ESEA), accountability, and basic information systems.

One piece of work that must be mentioned is the Encyclopedia of Educational Evaluation, edited by Anderson et al. (1975). The encyclopedia contains articles by them and 36 other members of the ETS staff. Subtitled Concepts and Techniques for Evaluating Education and Training Programs, it contains 141 articles in all.

11.3 ETS Contributions to Program Evaluation

Given the innovativeness of many of the programs evaluated, the newness of the profession of program evaluation, and the level of expertise of the ETS staff who have directed these studies, it is not surprising that the evaluations themselves have been marked by innovations for the profession of program evaluation. At the same time, ETS has adopted several principles relative to each aspect of program evaluation. It will be useful to examine these innovations and principles in terms of the phases that a program evaluation usually attends to—goal setting, measurement selection, implementation in the field setting, analysis, and interpretation and presentation of evidence.

11.3.1 Making Goals Explicit

It would be a pleasure to report that virtually every educational program has a well-thought-through set of goals, but it is not so. It is, therefore, necessary at times for program evaluators to help verbalize and clarify the goals of a program to ensure that they are, at least, explicit. Further, the evaluator may even be given goal development as a primary task, as in the Pennsylvania Goals Study (Dyer 1965a, b). This need was seen again in a similar program, when Robert Feldmesser (1973) helped the New Jersey State Board of Education establish goals that underwrite conceptually that state’s “thorough and efficient” education program.

Work by ETS staff indicates there are four important principles with respect to program goal development and explication. The first of these principles is as follows: What program developers say their program goals are may bear only a passing resemblance to what the program in fact seems to be doing.

This principle—the occasional surrealistic quality of program goals—has been noted on a number of occasions: For example, assessment instruments developed for a program evaluation on the basis of the stated goals sometimes do not seem at all sensitive to the actual curriculum. As a result, ETS program evaluators seek, whenever possible, to cooperate with program developers to help fashion the goals statement. The evaluators also will attempt to describe the program in operation and relate that description to the stated goals, as in the case of the 1971 evaluation of the second year of Sesame Street for Children’s Television Workshop (Bogatz and Ball 1971). This comparison is an important part of the process and represents sometimes crucial information for decision makers concerned with developing or modifying a program.

The second principle is as follows: When program evaluators work cooperatively with developers in making program goals explicit, both the program and the evaluation seem to benefit.

The original Sesame Street evaluation (Ball and Bogatz 1970) exemplified the usefulness of this cooperation. At the earliest planning sessions for the program, before it had a name and before it was fully funded, the developers, aided by ETS, hammered out the program goals. Thus, ETS was able to learn at the outset what the program developers had in mind, ensuring sufficient time to provide adequately developed measurement instruments. If the evaluation team had had to wait until the program itself was developed, there would not have been sufficient time to develop the instruments; more important, the evaluators might not have had sufficient understanding of the intended goals, thereby making sensible evaluation unlikely.

The third principle is as follows: There is often a great deal of empirical research to be conducted before program goals can be specified.

Sometimes, even before goals can be established or a program developed, it is necessary, through empirical research, to indicate that there is a need for the program. An illustration is provided by the research of Ruth Ekstrom and Marlaine Lockheed (1976) into the competencies gained by women through volunteer work and homemaking. The ETS researchers argued that it is desirable for women to resume their education if they wish to after years of absence. But what competencies have they picked up in the interim that might be worthy of academic credit? By identifying, surveying, and interviewing women who wished to return to formal education, Ekstrom and Lockheed established that many women had indeed learned valuable skills and knowledge. Colleges were alerted and some have begun to give credit where credit is due.

Similarly, when the federal government decided to make a concerted attack on the reading problem as it affects the total population, one area of concern was adult reading. But there was little knowledge about it. Was there an adult literacy problem? Could adults read with sufficient understanding such items as newspaper employment advertisements, shopping and movie advertisements, and bus schedules? And in investigating adult literacy, what characterized the reading tasks that should be taken into account? Murphy, in a 1973 study (Murphy 1973a), considered these factors: the importance of a task (the need to be able to read the material if only once a year as with income tax forms and instructions), the intensity of the task (a person who wants to work in the shipping department will have to read the shipping schedule each day), and the extensivity of the task (70% of the adult population read a newspaper but it can usually be ignored without gross problems arising). Murphy and other ETS researchers conducted surveys of reading habits and abilities, and this assessment of needs provided the government with information needed to decide on goals and develop appropriate programs.

Still a different kind of needs assessment was conducted by ETS researchers with respect to a school for learning disabled students in 1976 (Ball and Goldman 1976). The school catered to children aged 5–18 and had four separate programs and sites. ETS first served as a catalyst, helping the school’s staff develop a listing of problems. Then ETS acted as an amicus curiae, drawing attention to those problems, making explicit and public what might have been unsaid for want of an appropriate forum. Solving these problems was the purpose of stating new institutional goals, goals that might never have been formally recognized if ETS had not worked with the school to make its needs explicit.

The fourth principle is as follows: The program evaluator should be conscious of and interested in the unintended outcomes of programs as well as the intended outcomes specified in the program’s goal statement.

In program evaluation, the importance of looking for side effects, especially negative ones, has to be considered against the need to put a major effort into assessing progress toward intended outcomes. Often, in this phase of evaluation, the varying interests of evaluators, developers, and funders intersect—and professional, financial, and political considerations are all at odds. At such times, program evaluation becomes as much an art form as an exercise in social science.

A number of articles were written about this problem by Samuel J. Messick, ETS vice president for research (e.g., Messick 1970, 1975). His viewpoint, the importance of the medical model, has been illustrated in various ETS evaluation studies. His major thesis was that the medical model of program evaluation explicitly recognizes that “…prescriptions for treatment and the evaluation of their effectiveness should take into account not only reported symptoms but other characteristics of the organism and its ecology as well” (Messick 1975, p. 245). As Messick went on to point out, this characterization was a call for a systems analysis approach to program evaluation, dealing empirically with the interrelatedness of all the factors and monitoring all outcomes, not just the intended ones.

When, for example, ETS evaluated the first 2 years of Sesame Street (Ball and Bogatz 1970), there was obviously pressure to ascertain whether the intended goals of that show were being attained. It was nonetheless possible to look for some of the more likely unintended outcomes: whether the show had negative effects on heavy viewers going off to kindergarten, and whether the show was achieving impacts in attitudinal areas.

In summative evaluations, to study unintended outcomes is bound to cost more money than to ignore them. It is often difficult to secure increased funding for this purpose. For educational programs with potential national applications, however, ETS strongly supports this more comprehensive approach.

11.3.2 Measuring Program Impact

The letters ETS have become almost synonymous in some circles with standardized testing of student achievement. In its program evaluations, ETS naturally uses such tests as appropriate, but frequently the standardized tests are not appropriate measures. In some evaluations, ETS uses both standardized and domain-referenced tests. An example may be seen in The Electric Company evaluations (Ball et al. 1974). This televised series, which was intended to teach reading skills to first through fourth graders, was evaluated in some 600 classrooms. One question that was asked during the process concerned the interaction of the student’s level of reading attainment and the effectiveness of viewing the series. Do good readers learn more from the series than poor readers? So standardized, norm-referenced reading tests were administered, and the students in each grade were divided into deciles on this basis, thereby yielding ten levels of reading attainment.

Data on the outcomes using the domain-referenced tests were subsequently analyzed for each decile ranking. Thus, ETS was able to specify for what level of reading attainment, in each grade, the series was working best. This kind of conclusion would not have been possible if a specially designed domain-referenced reading test with no external referent had been the only one used, nor if a standardized test, not sensitive to the program’s impact, had been the only one used.
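The two-test logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the rank-based decile rule, and the data are hypothetical and are not taken from The Electric Company reports. The norm-referenced score is used solely to place each student in a decile, and the domain-referenced gain is then averaged within each decile.

```python
from statistics import mean


def decile_ranks(scores):
    """Assign each score a decile from 1 (lowest) to 10 (highest) by rank order.

    Ties are broken by original position, which is an arbitrary choice.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for position, idx in enumerate(order):
        ranks[idx] = position * 10 // len(scores) + 1
    return ranks


def mean_gain_by_decile(norm_scores, pre, post):
    """Mean domain-referenced gain (post - pre) within each norm-referenced decile."""
    gains = {}
    for decile, before, after in zip(decile_ranks(norm_scores), pre, post):
        gains.setdefault(decile, []).append(after - before)
    return {decile: mean(values) for decile, values in sorted(gains.items())}


# Hypothetical data for 20 students in one grade (not from the evaluation).
norm_scores = list(range(20))              # standardized reading scores
pre = [10] * 20                            # domain-referenced pretest
post = [10 + i // 2 for i in range(20)]    # domain-referenced posttest
gain_profile = mean_gain_by_decile(norm_scores, pre, post)
```

Reading `gain_profile` from decile 1 to decile 10 shows at which level of reading attainment the hypothetical gains are largest, which is the kind of statement neither test could support on its own.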

Without denying the usefulness of previously designed and developed measures, ETS evaluators have frequently preferred to develop or adapt instruments that would be specifically sensitive to the tasks at hand. Sometimes this measurement effort is carried out in anticipation of the needs of program evaluators for a particular instrument, and sometimes because a current program evaluation requires immediate instrumentation.

An example of the former is a study of doctoral programs by Mary Jo Clark et al. (1976). Existing instruments had been based on surveys in which practitioners in a given discipline were asked to rate the quality of doctoral programs in that discipline. Instead of this reputational survey approach, the ETS team developed an array of criteria (e.g., faculty quality, student body quality, resources, academic offerings, alumni performance), all open to objective assessment. This assessment tool can be used to assess changes in the quality of the doctoral programs offered by major universities.

Similarly, the development by ETS of the Kit of Factor-Referenced Cognitive Tests (Ekstrom et al. 1976) also provided a tool, one that could be used when evaluating the cognitive abilities of teachers or students if these structures were of interest in a particular evaluation. A clearly useful application was in the California study of teaching performance by Frederick McDonald and Patricia Elias (1976). Teachers with certain kinds of cognitive structures were seen to have differential impacts on student achievement. In the Donald A. Trismen study of an aesthetics program (Trismen 1968), the factor kit was used to see whether cognitive structures interacted with aesthetic judgments.

Developing Special Instruments

Examples of the development of specific instrumentation for ETS program evaluations are numerous. Virtually every program evaluation involves, at the very least, some adapting of existing instruments. For example, a questionnaire or interview may be adapted from ones developed for earlier studies. Typically, however, new instruments, including goal-specific tests, are prepared. Some ingenious examples, based on the 1966 work of E. J. Webb, D. T. Campbell, R. D. Schwartz, and L. Sechrest, were suggested by Anderson (1968) for evaluating museum programs, and the title of her article gives a flavor of the unobtrusive measures illustrated: “Noseprints on the Glass.”

Another example of ingenuity is Trismen’s use of 35 mm slides as stimuli in the assessment battery of the Education through Vision program (Trismen 1968). Each slide presented an art masterpiece, and the response options were four abstract designs varying in color. The instruction to the student was to pick the design that best illustrated the masterpiece’s coloring.

Using Multiple Measures

When ETS evaluators have to assess a variable and the usual measures have rather high levels of error inherent in them, they usually resort to triangulation. That is, they use multiple measures of the same construct, knowing that each measure suffers from a specific weakness. Thus, in 1975, Donald E. Powers evaluated for the Philadelphia school system the impact of dual-audio television, in which a television show was telecast at the same time as a designated FM radio station provided an appropriate educational commentary. One problem in measurement was assessing the amount of contact the student had with the dual-audio television treatment (Powers 1975a). Powers used home telephone interviews, student questionnaires, and very simple knowledge tests of the characters in the shows to assess whether students had in fact been exposed to the treatment. Each of these three measures has problems associated with it, but the combination provided a useful assessment index.
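One simple way to combine several fallible measures of the same construct is to standardize each one and average the results. The Python sketch below illustrates that idea only. How Powers actually built his index is not described here, so the equal weighting, the function names, and the data are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of values; a constant measure contributes zeros."""
    center, spread = mean(values), pstdev(values)
    return [(v - center) / spread if spread else 0.0 for v in values]


def composite_index(*measures):
    """Average the standardized measures student by student (equal weights assumed)."""
    standardized = [zscores(m) for m in measures]
    return [mean(per_student) for per_student in zip(*standardized)]


# Hypothetical data for four students (not from the Philadelphia study).
interview = [0, 1, 1, 0]          # home telephone interview: viewing reported?
questionnaire = [1, 3, 4, 0]      # student questionnaire: episodes reported
character_test = [2, 5, 6, 1]     # knowledge test: characters recognized
exposure = composite_index(interview, questionnaire, character_test)
```

Because each measure is put on a common scale before averaging, no single instrument's quirks dominate the index, which is the practical point of triangulation.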

In some circumstances, ETS evaluators are able to develop measurement techniques that are an integral part of the treatment itself. This unobtrusiveness has clear benefits and is most readily attainable with computer-aided instructional (CAI) programs. Thus, for example, Donald L. Alderman, in the evaluation of TICCIT (a CAI program developed by the Mitre Corporation), obtained for each student such indices as the number of lessons passed, the time spent on line, the number of errors made, and the kinds of errors (Alderman 1978). And he did this simply by programming the computer to save this information over given periods of time.
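The kind of record keeping described above can be pictured with a small Python sketch. It is not a reconstruction of the TICCIT software: the class, field names, and error labels are hypothetical and serve only to show how the four indices (lessons passed, time on line, number of errors, kinds of errors) could accumulate as a by-product of instruction.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class StudentLog:
    """Running usage record for one student (all field names are hypothetical)."""

    lessons_passed: int = 0
    seconds_online: float = 0.0
    errors_by_kind: Counter = field(default_factory=Counter)

    def record_lesson(self, passed, seconds, errors=()):
        """Update the record after one lesson attempt."""
        self.seconds_online += seconds
        if passed:
            self.lessons_passed += 1
        self.errors_by_kind.update(errors)

    def summary(self):
        """Return the four indices named in the text."""
        return {
            "lessons_passed": self.lessons_passed,
            "minutes_online": self.seconds_online / 60,
            "total_errors": sum(self.errors_by_kind.values()),
            "errors_by_kind": dict(self.errors_by_kind),
        }


# Hypothetical usage: two lesson attempts by one student.
log = StudentLog()
log.record_lesson(passed=True, seconds=900, errors=["syntax", "syntax"])
log.record_lesson(passed=False, seconds=600, errors=["concept"])
report = log.summary()
```

The student never sees a separate test; the measures are simply saved while the instructional program runs.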

11.3.3 Working in Field Settings

Measurement problems cannot be addressed satisfactorily if the setting in which the measures are to be administered is ignored. One of the clear lessons learned in ETS program evaluation studies is that measurement in field settings (home, school, community) poses different problems from measurement conducted in a laboratory.

Program evaluation, either formative or summative, demands that its empirical elements usually be conducted in natural field settings rather than in more contrived settings, such as a laboratory. Nonetheless, the problems of working in field settings are rarely systematically discussed or researched. In an article in the Encyclopedia of Educational Evaluation, Bogatz (1975) detailed these major aspects:

  • Obtaining permission to collect data at a site

  • Selecting a field staff

  • Training the staff

  • Maintaining family/community support

Of course, all the aspects discussed by Bogatz interact with the measurement and design of the program evaluation. A great source of information concerning field operations is the ETS Head Start Longitudinal Study of Disadvantaged Children, directed by Virginia Shipman (1970). Although not primarily a program evaluation, it certainly has generated implications for early childhood programs. It was longitudinal, comprehensive in scope, and large in size, encompassing four sites and, initially, some 2000 preschoolers. It was clear from the outset that close community ties were essential if only for expediency—although, of course, more important ethical principles were involved. This close relationship with the communities in which the study was conducted involved using local residents as supervisors and testers, establishing local advisory committees, and thus ensuring free, two-way communication between the research team and the community.

The Sesame Street evaluation also adopted this approach (Ball and Bogatz 1970). In part because of time pressures and in part to ensure valid test results, the ETS evaluators developed the tests specifically so that community members with minimal educational attainments could be trained quickly to administer them with proper skill.

Establishing Community Rapport

In evaluations of street academies by Ronald L. Flaugher (1971), and of education programs in prisons by Flaugher and Samuel Barnett (1972), it was argued that one of the most important elements in successful field relationships is the time an evaluator spends getting to know the interests and concerns of various groups, and lowering barriers of suspicion that frequently separate the educated evaluator and the less-educated program participants. This point may not seem particularly sophisticated or complex, but many program evaluations have foundered because of an evaluator’s lack of regard for disadvantaged communities (Anderson 1970). Therefore, a firm principle underlying ETS program evaluation is to be concerned with the communities that provide the contexts for the programs being evaluated. Establishing two-way lines of communication with these communities and using community resources whenever possible help ensure a valid evaluation.

Even with the best possible community support, field settings cause problems for measurement. Raymond G. Wasdyke and Jerilee Grandy (1976) showed this idea to be true in an evaluation in which the field setting was literally that—a field setting. In studying the impact of a camping program on New York City grade school pupils, they recognized the need, common to most evaluations, to describe the treatment—in this case the camping experience. Therefore, ETS sent an observer to the campsite with the treatment groups. This person, who was herself skilled in camping, managed not to be an obtrusive participant by maintaining a relatively low profile.

Of course, the problems of the observer can be just as difficult in formal institutions as on the campground. In their 1974 evaluation of Open University materials, Hartnett and colleagues found, as have program evaluators in almost every situation, that there was some defensiveness in each of the institutions in which they worked (Hartnett et al. 1974). Both personal and professional contacts were used to allay suspicions. There also was emphasis on an evaluation design that took into account each institution’s values. That is, part of the evaluation was specific to the institution, but some common elements across institutions were retained. This strategy underscored the evaluators’ realization that each institution was different, but allowed ETS to study certain variables across all three participating institutions.

Breaking down the barriers in a field setting is one of the important elements of a successful evaluation, yet each situation demands somewhat different evaluator responses.

Involving Program Staff

Another way of ensuring that evaluation field staff are accepted by program staff is to make the program staff active participants in the evaluation process. While this integration is obviously a technique to be strongly recommended in formative evaluations, it can also be used in summative evaluations. In his evaluation of PLATO in junior colleges, Murphy (1977) could not afford to become the victim of a program developer’s fear of an insensitive evaluator. He overcame this potential problem by enlisting the active participation of the junior college and program development staffs. One of Murphy’s concerns was that there is no common course across colleges. Introduction to Psychology, for example, might be taught virtually everywhere, but the content can change remarkably, depending on such factors as who teaches the course, where it is taught, and what text is used. Murphy understood this variability and his evaluation of PLATO reflected his concern. It also necessitated considerable input and cooperation from program developers and college teachers working in concert, with Murphy acting as the conductor.

11.3.4 Analyzing the Data

Once the principles and strategies used by program evaluators in their field operations have succeeded and data have been obtained, there remains the important phase of data analysis. In practice, of course, the program evaluator thinks through the question of data analysis before entering the data collection phase. Plans for analysis help determine what measures to develop, what data to collect, and even, to some extent, how the field operation is to be conducted. Nonetheless, analysis plans drawn up early in the program evaluation cannot remain quite as immutable as the Mosaic Law. To illustrate the need for flexibility, it is useful to turn once again to the heuristic ETS evaluation of Sesame Street.

As initially planned, the design of the Sesame Street evaluation was a true experiment (Ball and Bogatz 1970). The analyses called for were multivariate analyses of covariance, using pretest scores as the covariate. At each site, a pool of eligible preschoolers was obtained by community census, and experimental and control groups were formed by random assignment from these pools. The evaluators were somewhat concerned that those designated to be the experimental (viewing) group might not view the show: it was a new show on public television, a loose network of TV stations not noted for high viewership. Some members of the Sesame Street national research advisory committee counseled ETS to consider paying the experimental group to view. The suggestion was resisted, however, because any efforts above mild and occasional verbal encouragement to view the show would compromise the results. If the experimental group members were paid, and if they then viewed extensively and outperformed the control group at posttest, would the improved performance be due to the viewing, the payment, or some interaction of payment and viewing? Of course, this nice argument proved to be not much more than an exercise in modern scholasticism. In fact, the problem lay not in the treatment group but in the uninformed and unencouraged-to-view control group. The members of that group, like preschoolers with access to public television throughout the nation, were viewing the show with considerable frequency, and not much less than the experimental group. Thus, the planned analysis involving differences in posttest attainments between the two groups was dealt a mortal blow.

Fortunately, other analyses were available, of which the ETS-refined age cohorts design provided a rational basis. This design is presented in the relevant report (Ball and Bogatz 1970). The need here is not to describe the design and analysis but to emphasize a point made practically by the poet Robert Burns some time ago and repeated here more prosaically: The best laid plans of evaluators can “gang aft agley,” too.

Clearing New Paths

Sometimes program evaluators find that the design and analysis they have in mind represent an untrodden path. This result is perhaps in part because many of the designs in the social sciences are built upon laboratory conditions and simply are not particularly relevant to what happens in educational institutions.

When ETS designed the summative evaluation of The Electric Company , it was able to set up a true experiment in the schools. Pairs of comparable classrooms within a school and within a grade were designated as the pool with which to work. One of each pair of classes was randomly assigned to view the series. Pretest scores were used as covariates on posttest scores, and in 1973 the first-year evaluation analysis was successfully carried out (Ball and Bogatz 1973). The evaluation was continued through a second year, however, and as is usual in schools, the classes did not remain intact.
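The first-year design described above has two ingredients: random assignment within matched pairs of classes, and a posttest comparison adjusted for pretest scores. The Python sketch below shows one textbook form of each, under stated assumptions. It uses a single covariate and a pooled within-group slope, the function names and data are hypothetical, and it is not a reconstruction of the analysis actually run by Ball and Bogatz.

```python
import random
from statistics import mean


def assign_within_pairs(pairs, rng):
    """For each pair of comparable classes, randomly pick one to view the series."""
    assignment = {}
    for class_a, class_b in pairs:
        viewer = rng.choice((class_a, class_b))
        assignment[class_a] = "E" if viewer == class_a else "C"
        assignment[class_b] = "E" if viewer == class_b else "C"
    return assignment


def cross_products(x, y):
    """Sum of cross-products of deviations from the group means."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))


def adjusted_difference(pre_e, post_e, pre_c, post_c):
    """Posttest difference (E minus C) adjusted for pretest, using the pooled slope."""
    slope = (cross_products(pre_e, post_e) + cross_products(pre_c, post_c)) / (
        cross_products(pre_e, pre_e) + cross_products(pre_c, pre_c)
    )
    raw = mean(post_e) - mean(post_c)
    return raw - slope * (mean(pre_e) - mean(pre_c))


# Hypothetical class pairs and scores (not from the evaluation).
conditions = assign_within_pairs([("1a", "1b"), ("2a", "2b")], random.Random(0))
pre_e, post_e = [1, 2, 3, 4], [7, 9, 11, 13]    # viewing classes
pre_c, post_c = [2, 3, 4, 5], [4, 6, 8, 10]     # non-viewing classes
effect = adjusted_difference(pre_e, post_e, pre_c, post_c)
```

In this made-up example the raw posttest difference is 3 points, but the viewing group started 1 point lower on the pretest, so the adjusted difference is larger than the raw one.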

From an initial 200 classes, the children had scattered through many more classrooms. Virtually none of the classes with subject children contained only experimental or only control children from the previous year. Donald B. Rubin, an ETS statistician, consulted with a variety of authorities and found that the design and analysis problem for the second year of the evaluation had not been addressed in previous work. To summarize the solution decided on, the new pool of classes was reassigned randomly to E (experimental) or C (control) conditions so that over the 2 years the design was portrayable as Fig. 11.1.

Fig. 11.1 The design for the new pool of classes. For Year II, EE represents children who were in E classrooms in Year I and again in Year II. That is, the first letter refers to status in Year I and the second to status in Year II
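The two-letter labels in the figure can be generated mechanically once the Year II classes have been reassigned. The Python sketch below is an illustration only: the even split of classes, the function names, and the roster are assumptions, since the text does not say how the random reassignment was carried out in detail.

```python
import random
from collections import Counter


def reassign_classes(class_ids, rng):
    """Randomly split the Year II classes into E and C halves (even split assumed)."""
    shuffled = list(class_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {cls: ("E" if i < half else "C") for i, cls in enumerate(shuffled)}


def two_year_status(children, year2_condition):
    """Label each child with Year I status followed by Year II status."""
    return {
        child: year1 + year2_condition[year2_class]
        for child, year1, year2_class in children
    }


# Hypothetical roster: (child, Year I status, Year II class).
year2_condition = reassign_classes(["k1", "k2", "k3", "k4", "k5", "k6"], random.Random(1))
children = [
    ("ann", "E", "k1"), ("ben", "C", "k1"), ("cal", "E", "k2"),
    ("dee", "C", "k3"), ("eli", "E", "k4"), ("fay", "C", "k6"),
]
status = two_year_status(children, year2_condition)
cell_counts = Counter(status.values())
```

Because children from E and C classes were mixed together in Year II, one class can contribute children to two different cells, which is why the design is described child by child rather than class by class.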

Further, the pretest scores of Year II were usable as new covariates when analyzing the results of the Year II posttest scores (Ball et al. 1974).

Tailoring to the Task

Unfortunately for those who prefer routine procedures, it has been shown across a wide range of ETS program evaluations that each design and analysis must be tailored to the occasion. Thus, Gary Marco (1972), as part of the statewide educational assessment in Michigan, evaluated ESEA Title I program performance. He assessed the amount of exposure students had to various clusters of Title I programs, and he included control schools in the analysis. He found that a regression-analysis model involving a correction for measurement error was an innovative approach that best fit his complex configuration of data.

Garlie Forehand, Marjorie Ragosta, and Donald A. Rock, in a national, correlational study of desegregation, obtained data on school characteristics and on student outcomes (Forehand et al. 1976). The purposes of the study included defining indicators of effective desegregation and discriminating between more and less effective school desegregation programs. The emphasis throughout the effort was on variables that were manipulable. That is, the idea was that evaluators would be able to offer practical advice on what schools can do to achieve a productive desegregation program. Initial investigations allowed a hypothesized set of causal relationships to be specified among the myriad variables, and path analysis made it possible to estimate the strength of those hypothesized relationships. On the basis of the initial correlation matrices, the path analyses, and the observations made during the study, an important product, a nontechnical handbook for use in schools, was developed.

Another large-scale ETS evaluation effort was directed by Trismen et al. (1976). They studied compensatory reading programs, initially surveying more than 700 schools across the country. Over a 4-year period ending in 1976, this evaluation interspersed data analysis with new data collection efforts. One purpose was to find schools that provided exceptionally positive or negative program results. These schools were visited blind and observed by ETS staff. Whereas the Forehand evaluation analysis (Forehand et al. 1976) was geared to obtaining practical applications, the equally extensive evaluation analysis of Trismen’s study was aimed at generating hypotheses to be tested in a series of smaller experiments.

As a further illustration of the complex interrelationship among evaluation purposes, design, analyses, and products, there is the 1977 evaluation of the use of PLATO in the elementary school by Spencer Swinton and Marianne Amarel (1978). They used a form of regression analysis—as did Forehand et al. (1976) and Trismen et al. (1976). But here the regression analyses were used differently in order to identify program effects unconfounded by teacher differences. In this regression analysis, teachers became fixed effects, and contrasts were fitted for each within-teacher pair (experimental versus control classroom teachers).
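
The idea of a within-pair contrast can be sketched as follows. This is a simplified illustration, not the Swinton and Amarel analysis: it assumes one experimental and one control classroom per pair and uses hypothetical class-mean outcomes. Under that balanced assumption, the least-squares program effect from a model with one fixed effect per pair equals the mean of the within-pair differences, because anything shared by the pair cancels in the subtraction.

```python
from statistics import mean


def within_pair_contrasts(pairs):
    """Return the E-minus-C outcome difference for each pair."""
    return {
        pair_id: scores["E"] - scores["C"]
        for pair_id, scores in pairs.items()
    }


def program_effect(pairs):
    """Average the within-pair contrasts to estimate the program effect."""
    return mean(list(within_pair_contrasts(pairs).values()))


# Hypothetical class-mean outcomes; pair identifiers are made up.
pairs = {
    "pair_1": {"E": 54.0, "C": 50.0},
    "pair_2": {"E": 61.0, "C": 60.0},
    "pair_3": {"E": 47.0, "C": 43.0},
}
effect = program_effect(pairs)
```

The pairs differ widely in overall level (43 to 61), yet the estimate depends only on the three differences.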

This design, in turn, provides a contrast to McDonald’s (1977) evaluation of West New York programs to teach English as a second language to adults. In this instance, the regression analysis was directed toward showing which teaching method related most to gains in adult students’ performance.

There is a school of thought within the evaluation profession that design and analysis in program evaluation can be made routine. At this point, the experience of ETS indicates that this would be unwise.

11.3.5 Interpreting the Results

Possibly the most important principle in program evaluation is that interpretations of the evaluation’s meaning—the conclusions to be drawn—are often open to various nuances. Another problem is that the evidence on which the interpretations are based may be inconsistent. The initial premise of this chapter was that the role of program evaluation is to provide evidence for decision-makers. Thus, one could argue that differences in interpretation, and inconsistencies in the evidence, are simply problems for the decision-maker and not for the evaluator.

But consider, for example, an evaluation by Powers of a year-round program in a school district in Virginia (Powers 1974, 1975b). (The long vacation was staggered around the year so that schools remained open in the summer.) The evidence presented by Powers indicated that the year-round school program provided a better utilization of physical plant and that student performance was not negatively affected. The school board considered this evidence as well as other conflicting evidence provided by Powers that the parents’ attitudes were decidedly negative. The board made up its mind, and (not surprisingly) scotched the program. Clearly, however, the decision was not up to Powers. His role was to collect the evidence and present it systematically.

Keeping the Process Open

In general, the ETS response to conflicting evidence or varieties of nuances in interpretation is to keep the evaluation process and its reporting as open as possible. In this way, the values of the evaluator, though necessarily present, are less likely to be a predominating influence on subsequent action.

Program evaluators do, at times, have the opportunity to influence decision-makers by showing them that there are kinds of evidence not typically considered. The Coleman Study, for example, showed at least some decision-makers that there is more to evaluating school programs than counting (or calculating) the numbers of books in libraries, the amount of classroom space per student, the student-teacher ratio, and the availability of audiovisual equipment (Coleman et al. 1966). Rather, the output of the schools in terms of student performance was shown to be generally superior as evidence of school program performance.

Through their work, evaluators are also able to educate decision makers to consider the important principle that educational treatments may have positive effects for some students and negative effects for others, so that an interaction of treatment with student should be looked for. As pointed out in the discussion of unintended outcomes, a systems-analysis approach to program evaluation, one that deals empirically with the interrelatedness of all the factors that may affect performance, is to be preferred. And this approach, as Messick emphasized, “properly takes into account those student-process-environment interactions that produce differential results” (Messick 1975, p. 246).

Selecting Appropriate Evidence

Finally, a consideration of the kinds of evidence and interpretations to be provided decision makers leads inexorably to the realization that different kinds of evidence are needed, depending on the decision-maker’s problems and the availability of resources. The most scientific evidence involving objective data on student performance can be brilliantly interpreted by an evaluator, but it might also be an abomination to a decision maker who really needs to know whether teachers’ attitudes are favorable.

ETS evaluations have provided a great variety of evidence. For a formative evaluation in Brevard County, Florida, Trismen (1970) provided evidence that students could make intelligent choices about courses. In the ungraded schools, students had considerable freedom of choice, but they and their counselors needed considerably more information than in traditional schools about the ingredients for success in each of the available courses. As another example, Gary Echternacht, George Temp, and Theodore Storlie helped state and local education authorities develop Title I reporting models that included evidence on impact, cost, and compliance with federal regulations (Echternacht et al. 1976). Forehand and McDonald (1972) had been working with New York City to develop an accountability model providing constructive kinds of evidence for the city’s school system. On the other hand, as part of an evaluation team, Amarel provided, for a small experimental school in Chicago, judgmental data as well as reports and documents based on the school’s own records and files (Amarel and The Evaluation Collective 1979). Finally, Michael Rosenfeld provided Montgomery Township, New Jersey, with student, teacher, and parent perceptions in his evaluation of the open classroom approach then being tried out (Rosenfeld 1973).

In short, just as tests are not valid or invalid (it is the ways tests are used that deserve such descriptions), so too, evidence is not good or bad until it is seen in relation to the purpose for which it is to be used, and in relation to its utility to decision-makers.

11.4 Postscript

For the most part, ETS’s involvement in program evaluation has been at the practical level. Without an accompanying concern for the theoretical and professional issues, however, practical involvement would be irresponsible. ETS staff members have therefore seen the need to integrate and systematize knowledge about program evaluation. Thus, Anderson obtained a contract with the Office of Naval Research to draw together the accumulated knowledge of professionals from inside and outside ETS on the topic of program evaluation. A number of products followed. These products included a survey of practices in program evaluation (Ball and Anderson 1975a), and a codification of program evaluation principles and issues (Ball and Anderson 1975b). Perhaps the most generally useful of the products is the aforementioned Encyclopedia of Educational Evaluation (Anderson et al. 1975).

From an uncoordinated, nonprescient beginning in the mid-1960s, ETS has acquired a great deal of experience in program evaluation. In one sense it remains uncoordinated because there is no specific “party line,” no dogma designed to ensure ritualized responses. It remains quite possible for different program evaluators at ETS to recommend differently designed evaluations for the same burgeoning or existing programs.

There is no sure knowledge where the profession of program evaluation is going. Perhaps, with zero-based budgeting, program evaluation will experience amazing growth over the next decade, growth that will dwarf its current status (which already dwarfs its status of a decade ago). Or perhaps there will be a revulsion against the use of social scientific techniques within the political, value-dominated arena of program development and justification. At ETS, the consensus is that continued growth is the more likely event. And with the staff’s variegated backgrounds and accumulating expertise, ETS hopes to continue making significant contributions to this emerging profession.


  • Alderman, D. L. (1978). Evaluation of the TICCIT computer-assisted instructional system in the community college. Princeton: Educational Testing Service.

  • Amarel, M., & The Evaluation Collective. (1979). Reform, response, renegotiation: Transitions in a school-change project. Unpublished manuscript.

  • Anastasio, E. J. (1972). Evaluation of the PLATO and TICCIT computer-based instructional systems—A preliminary plan (Program Report No. PR-72-19). Princeton: Educational Testing Service.

  • Anderson, S. B. (1968). Noseprints on the glass—Or how do we evaluate museum programs? In E. Larrabee (Ed.), Museums and education (pp. 115–126). Washington, DC: Smithsonian Institution Press.

  • Anderson, S. B. (1970). From textbooks to reality: Social researchers face the facts of life in the world of the disadvantaged. In J. Hellmuth (Ed.), Disadvantaged child: Vol. 3. Compensatory education: A national debate. New York: Brunner/Mazel.

  • Anderson, S. B., Ball, S., & Murphy, R. T. (Eds.). (1975). Encyclopedia of educational evaluation: Concepts and techniques for evaluating education and training programs. San Francisco: Jossey-Bass Publishers.

  • Ball, S. (1973, July). Evaluation of drug information programs—Report of the panel on the impact of information on drug use and misuse, phase 2. Washington, DC: National Research Council, National Academy of Sciences.

  • Ball, S., & Anderson, S. B. (1975a). Practices in program evaluation: A survey and some case studies. Princeton: Educational Testing Service.

  • Ball, S., & Anderson, S. B. (1975b). Professional issues in the evaluation of education/training programs. Princeton: Educational Testing Service.

  • Ball, S., & Bogatz, G. A. (1970). The first year of Sesame Street: An evaluation (Program Report No. PR-70-15). Princeton: Educational Testing Service.

  • Ball, S., & Bogatz, G. A. (1973). Reading with television: An evaluation of the Electric Company (Program Report No. PR-73-02). Princeton: Educational Testing Service.

  • Ball, S., & Goldman, K. S. (1976). The Adams School: An interim report. Princeton: Educational Testing Service.

  • Ball, S., & Kazarow, K. M. (1974). Evaluation of To Reach a Child. Princeton: Educational Testing Service.

  • Ball, S., Bogatz, G. A., Kazarow, K. M., & Rubin, D. B. (1974). Reading with television: A follow-up evaluation of The Electric Company (Program Report No. PR-74-15). Princeton: Educational Testing Service.

  • Ball, S., Bridgeman, B., & Beaton, A. E. (1976). A design for the evaluation of the parent-child development center replication project. Princeton: Educational Testing Service.

  • Bogatz, G. A. (1975). Field operations. In S. B. Anderson, S. Ball, & R. T. Murphy (Eds.), Encyclopedia of educational evaluation (pp. 169–175). San Francisco: Jossey-Bass Publishers.

  • Bogatz, G. A., & Ball, S. (1971). The second year of Sesame Street: A continuing evaluation (Program Report No. PR-71-21). Princeton: Educational Testing Service.

  • Boldt, R. F. (with Gitomer, N.). (1975). Editing and scaling of instrument packets for the clinical evaluation of narcotic antagonists (Program Report No. PR-75-12). Princeton: Educational Testing Service.

  • Bussis, A. M., Chittenden, E. A., & Amarel, M. (1976). Beyond surface curriculum. An interview study of teachers’ understandings. Boulder: Westview Press.

  • Campbell, P. B. (1976). Psychoeducational diagnostic services for learning disabled youths [Proposal submitted to Creighton Institute for Business Law and Social Research]. Princeton: Educational Testing Service.

  • Clark, M. J., Hartnett, R. T., & Baird, L. L. (1976). Assessing dimensions of quality in doctoral education (Program Report No. PR-76-27). Princeton: Educational Testing Service.

  • Coleman, J. S., Campbell, E. Q., Hobson, C. J., McPartland, J., Mood, A. M., Weinfeld, F. D., & York, R. L. (1966). Equality of educational opportunity. Washington, DC: U.S. Government Printing Office.

  • Corder, R. A. (1975). Final evaluation report of part C of the California career education program. Berkeley: Educational Testing Service.

  • Corder, R. A. (1976a). Calexico intercultural design. El Cid Title VII yearly final evaluation reports for grades 7–12 of program of bilingual education, 1970–1976. Berkeley: Educational Testing Service.

  • Corder, R. A. (1976b). External evaluator’s final report on the experience-based career education program. Berkeley: Educational Testing Service.

  • Corder, R. A., & Johnson, S. (1972). Final evaluation report, 1971–1972, MANO A MANO. Berkeley: Educational Testing Service.

  • Dyer, H. S. (1965a). A plan for evaluating the quality of educational programs in Pennsylvania (Vol. 1, pp. 1–4, 10–12). Harrisburg: State Board of Education.

  • Dyer, H. S. (1965b). A plan for evaluating the quality of educational programs in Pennsylvania (Vol. 2, pp. 158–161). Harrisburg: State Board of Education.

  • Echternacht, G., Temp, G., & Storlie, T. (1976). The operation of an ESEA Title I evaluation technical assistance center—Region 2 [Proposal submitted to DHEW/O]. Princeton: Educational Testing Service.

  • Ekstrom, R. B., & Lockheed, M. (1976). Giving women college credit where credit is due. Findings, 3(3), 1–5.

  • Ekstrom, R. B., French, J., & Harman, H. (with Dermen, D.). (1976). Kit of factor-referenced cognitive tests. Princeton: Educational Testing Service.

  • Elias, P., & Wheeler, P. (1972). Interim evaluation report: BUENO. Berkeley: Educational Testing Service.

  • Feldmesser, R. A. (1973). Educational goal indicators for New Jersey (Program Report No. PR-73-01). Princeton: Educational Testing Service.

  • Flaugher, R. L. (1971). Progress report on the activities of ETS for the postal academy program. Unpublished manuscript, Educational Testing Service, Princeton.

  • Flaugher, R., & Barnett, S. (1972). An evaluation of the prison educational network. Unpublished manuscript, Educational Testing Service, Princeton.

  • Flaugher, R., & Knapp, J. (1972). Report on evaluation activities of the Bread and Butterflies project. Princeton: Educational Testing Service.

  • Forehand, G. A., & McDonald, F. J. (1972). A design for an accountability system for the New York City school system. Princeton: Educational Testing Service.

  • Forehand, G. A., Ragosta, M., & Rock, D. A. (1976). Final report: Conditions and processes of effective school desegregation (Program Report No. PR-76-23). Princeton: Educational Testing Service.

  • Frederiksen, N., & Ward, W. C. (1975). Development of measures for the study of creativity (Research Bulletin No. RB-75-18). Princeton: Educational Testing Service.

  • Freeberg, N. E. (1970). Assessment of disadvantaged adolescents: A different approach to research and evaluation measures. Journal of Educational Psychology, 61, 229–240.

  • Hardy, R. A. (1975). CIRCO: The development of a Spanish language test battery for preschool children. Paper presented at the Florida Educational Research Association, Tampa, FL.

  • Hardy, R. (1977). Evaluation strategy for developmental projects in career education. Tallahassee: Florida Department of Education, Division of Vocational, Technical, and Adult Education.

  • Harsh, J. R. (1975). A bilingual/bicultural project. Azusa unified school district evaluation summary. Los Angeles: Educational Testing Service.

  • Hartnett, R. T., Clark, M. J., Feldmesser, R. A., Gieber, M. L., & Soss, N. M. (1974). The British Open University in the United States. Princeton: Educational Testing Service.

  • Harvey, P. R. (1974). National College of Education bilingual teacher education project. Evanston: Educational Testing Service.

  • Holland, P. W., Jamison, D. T., & Ragosta, M. (1976). Project report no. 1—Phase 1 final report research design. Princeton: Educational Testing Service.

  • Hood, D. E. (1972). Final audit report: Skyline career development center. Austin: Educational Testing Service.

  • Hood, D. E. (1974). Final audit report of the ESEA IV supplementary reading programs of the Dallas Independent School District. Bilingual education program. Austin: Educational Testing Service.

  • Hsia, J. (1976). Proposed formative evaluation of a WNET/13 pilot television program: The Speech Class [Proposal submitted to Educational Broadcasting Corporation]. Princeton: Educational Testing Service.

  • Marco, G. L. (1972). Impact of Michigan 1970–71 grade 3 title I reading programs (Program Report No. PR-72-05). Princeton: Educational Testing Service.

  • McDonald, F. J. (1977). The effects of classroom interaction patterns and student characteristics on the acquisition of proficiency in English as a second language (Program Report No. PR-77-05). Princeton: Educational Testing Service.

  • McDonald, F. J., & Elias, P. (1976). Beginning teacher evaluation study, Phase 2. The effects of teaching performance on pupil learning (Vol. 1, Program Report No. PR-76-06A). Princeton: Educational Testing Service.

  • Messick, S. (1970). The criterion problem in the evaluation of instruction: Assessing possible, not just intended outcomes. In M. Wittrock & D. Wiley (Eds.), The evaluation of instruction: Issues and problems (pp. 183–220). New York: Holt, Rinehart and Winston.

  • Messick, S. (1975). Medical model of evaluation. In S. B. Anderson, S. Ball, & R. T. Murphy (Eds.), Encyclopedia of educational evaluation (pp. 245–247). San Francisco: Jossey-Bass Publishers.

  • Murphy, R. T. (1973a). Adult functional reading study (Program Report No. PR-73-48). Princeton: Educational Testing Service.

  • Murphy, R. T. (1973b). Investigation of a creativity dimension (Research Bulletin No. RB-73-12). Princeton: Educational Testing Service.

  • Murphy, R. T. (1977). Evaluation of the PLATO 4 computer-based education system: Community college component. Princeton: Educational Testing Service.

  • Powers, D. E. (1973). An evaluation of the new approach method (Program Report No. PR-73-47). Princeton: Educational Testing Service.

  • Powers, D. E. (1974). The Virginia Beach extended school year program and its effects on student achievement and attitudes—First year report (Program Report No. PR-74-25). Princeton: Educational Testing Service.

  • Powers, D. E. (1975a). Dual audio television: An evaluation of a six-month public broadcast (Program Report No. PR-75-21). Princeton: Educational Testing Service.

  • Powers, D. E. (1975b). The second year of year-round education in Virginia Beach: A follow-up evaluation (Program Report No. PR-75-27). Princeton: Educational Testing Service.

  • Rosenfeld, M. (1973). An evaluation of the Orchard Road School open space program (Program Report No. PR-73-14). Princeton: Educational Testing Service.

  • Shipman, V. C. (1970). Disadvantaged children and their first school experiences (Vol. 1, Program Report No. PR-70-20). Princeton: Educational Testing Service.

  • Shipman, V. C. (1974). Evaluation of an industry-sponsored child care center. An internal ETS report prepared for Bell Telephone Laboratories. Murray Hill, NJ. Unpublished manuscript, Educational Testing Service, Princeton, NJ.

  • Sigel, I. E. (1976). Developing representational competence in preschool children: A preschool educational program. In Basic needs, special needs: Implications for kindergarten programs. Selected papers from the New England Kindergarten Conference, Boston. Cambridge, MA: The Lesley College Graduate School of Education.

  • Swinton, S., & Amarel, M. (1978). The PLATO elementary demonstration: Educational outcome evaluation (Program Report No. PR-78-11). Princeton: Educational Testing Service.

  • Thomas, I. J. (1970). A bilingual and bicultural model early childhood education program. Fountain Valley School District title VII bilingual project. Berkeley: Educational Testing Service.

  • Thomas, I. J. (1973). Mathematics aid for disadvantaged students. Los Angeles: Educational Testing Service.

  • Trismen, D. A. (1968). Evaluation of the Education through Vision curriculum—Phase 1. Princeton: Educational Testing Service.

  • Trismen, D. A. (with T. A. Barrows). (1970). Brevard County project: Final report to the Brevard County (Florida) school system (Program Report No. PR-70-06). Princeton: Educational Testing Service.

  • Trismen, D. A., Waller, M. I., & Wilder, G. (1976). A descriptive and analytic study of compensatory reading programs (Vols. 1 & 2, Program Report No. PR-76-03). Princeton: Educational Testing Service.

  • Vale, C. A. (1975). National needs assessment of educational media and materials for the handicapped [Proposal submitted to Office of Education]. Princeton: Educational Testing Service.

  • Ward, W. C., & Frederiksen, N. (1977). A study of the predictive validity of the tests of scientific thinking (Research Bulletin No. RB-77-06). Princeton: Educational Testing Service.

  • Wasdyke, R. G. (1976, August). An evaluation of the Maryland Career Information System [Oral report].

  • Wasdyke, R. G. (1977). Year 3—Third party annual evaluation report: Career education instructional system project. Newark School District. Newark, Delaware. Princeton: Educational Testing Service.

  • Wasdyke, R. G., & Grandy, J. (1976). Field evaluation of Manhattan Community School District #2 environmental education program. Princeton: Educational Testing Service.

  • Webb, E. J., Campbell, D. T., Schwartz, R. D., & Sechrest, L. (1966). Unobtrusive measures: Nonreactive research in the social sciences. Chicago: Rand McNally.

  • Woodford, P. E. (1975). Pilot project for oral proficiency interview tests of bilingual teachers and tentative determination of language proficiency criteria [Proposal submitted to Illinois State Department of Education]. Princeton: Educational Testing Service.

Corresponding author

Correspondence to Samuel Ball.

Appendix: Descriptions of ETS Evaluation and Some Related Studies in Some Key Categories

11.1.1 Aesthetics and Creativity in Education

For Bartlett Hayes III’s program of Education through Vision at Andover Academy, Donald A. Trismen developed a battery of evaluation instruments that assessed, inter alia, a variety of aesthetic judgments (Trismen 1968). Other ETS staff members working in this area have included Norman Frederiksen and William C. Ward, who have developed a variety of assessment techniques for tapping creativity and scientific creativity (Frederiksen and Ward 1975; Ward and Frederiksen 1977); Richard T. Murphy, who also has developed creativity-assessing techniques (Murphy 1973b, 1977); and Scarvia B. Anderson, who described a variety of ways to assess the effectiveness of aesthetic displays (Anderson 1968).

11.1.2 Bilingual Education

ETS staff have conducted and assisted in evaluations of numerous and varied programs of bilingual education. For example, Berkeley office staff (Reginald A. Corder, Patricia Elias, Patricia Wheeler) have evaluated programs in Calexico (Corder 1976a), Hacienda-La Puente (Elias and Wheeler 1972), and El Monte (Corder and Johnson 1972). For the Los Angeles office, J. Richard Harsh (1975) evaluated a bilingual program in Azusa, and Ivor Thomas (1970) evaluated one in Fountain Valley. Donald E. Hood (1974) of the Austin office evaluated the Dallas Bilingual Multicultural Program. These evaluations were variously formative and summative and covered bilingual programs that, in combination, served students from preschool (Fountain Valley) through 12th grade (Calexico).

11.1.3 Camping Programs

Those in charge of a school camping program in New York City felt that it was having unusual and positive effects on the students, especially in terms of motivation. ETS was asked to evaluate this program and did so, using an innovative design and measurement procedures developed by Raymond G. Wasdyke and Jerilee Grandy (1976).

11.1.4 Career Education

In a decade of heavy federal emphasis on career education, ETS was involved in the evaluation of numerous programs in that field. For instance, Raymond G. Wasdyke (1977) helped the Newark, Delaware, school system determine whether its career education goals and programs were properly meshed. In Dallas, Donald Hood (1972) of the ETS regional staff assisted in developing goal specifications and reviewing evaluation test items for the Skyline Project, a performance contract calling for the training of high school students in 12 career clusters. Norman E. Freeberg (1970) developed a test battery to be used in evaluating the Neighborhood Youth Corps. Ivor Thomas (1973) of the Los Angeles office provided formative evaluation services for the Azusa Unified School District’s 10th grade career training and performance program for disadvantaged students. Roy Hardy (1977) of the Atlanta office directed the third-party evaluation of Florida’s Comprehensive Program of Vocational Education for Career Development, and Wasdyke (1976) evaluated the Maryland Career Information System. Reginald A. Corder, Jr. (1975) of the Berkeley office assisted in the evaluation of the California Career Education program and subsequently directed the evaluation of the Experience-Based Career Education Models of a number of regional education laboratories (Corder 1976b).

11.1.5 Computer-Aided Instruction

Three major computer-aided instruction programs developed for use in schools and colleges have been evaluated by ETS. The most ambitious is PLATO from the University of Illinois. Initially, the ETS evaluation was directed by Ernest Anastasio (1972), but later the effort was divided between Richard T. Murphy, who focused on college-level programs in PLATO, and Spencer Swinton and Marianne Amarel (1978), who focused on elementary and secondary school programs. ETS also directed the evaluation of TICCIT, an instructional program for junior colleges that used small-computer technology; the study was conducted by Donald L. Alderman (1978). Marjorie Ragosta directed the evaluation of the first major in-school longitudinal demonstration of computer-aided instruction for low-income students (Holland et al. 1976).

11.1.6 Drug Programs

Robert F. Boldt (1975) served as a consultant on the National Academy of Sciences’ study assessing the effectiveness of drug antagonists (less harmful drugs that will “fight” the impact of illegal drugs). Samuel Ball (1973) served on a National Academy of Sciences panel that designed, for the National Institutes of Health, a means of evaluating media drug information programs and spot advertisements.

11.1.7 Educational Television

ETS was responsible for the national summative evaluation of the ETV series Sesame Street for preschoolers (Ball and Bogatz 1970), and The Electric Company for students in Grades 1 through 4 (Ball and Bogatz 1973); the principal evaluators were Samuel Ball, Gerry Ann Bogatz, and Donald B. Rubin. Additionally, Ronald Flaugher and Joan Knapp (1972) evaluated the series Bread and Butterflies to clarify career choice; Jayjia Hsia (1976) evaluated a series on the teaching of English for high school students and a series on parenting for adults.

11.1.8 Higher Education

Much ETS research in higher education focuses on evaluating students or teachers, rather than programs, mirroring the fact that systematic program evaluation is not common at this level. ETS has made, however, at least two major forays in program evaluation in higher education. In their Open University study, Rodney T. Hartnett and associates joined with three American universities (Houston, Maryland, and Rutgers) to see if the British Open University’s methods and materials were appropriate for American institutions (Hartnett et al. 1974). Mary Jo Clark, Leonard L. Baird, and Hartnett conducted a study of means of assessing quality in doctoral programs (Clark et al. 1976). They established an array of criteria for use in obtaining more precise descriptions and evaluations of doctoral programs than the prevailing technique, reputational surveys, provides. P. R. Harvey (1974) also evaluated the National College of Education Bilingual Teacher Education project, while Protase Woodford (1975) proposed a pilot project for oral proficiency interview tests of bilingual teachers and tentative determination of language proficiency criteria.

11.1.9 Preschool Programs

A number of preschool programs have been evaluated by ETS staff, including the ETV series Sesame Street (Ball and Bogatz 1970; Bogatz and Ball 1971). Irving Sigel (1976) conducted formative studies of developmental curriculum. Virginia Shipman (1974) helped the Bell Telephone Companies evaluate their day care centers; Samuel Ball, Brent Bridgeman, and Albert Beaton provided the U.S. Office of Child Development with a sophisticated design for the evaluation of Parent-Child Development Centers (Ball et al. 1976); and Ball and Kathryn Kazarow evaluated the To Reach a Child program (Ball and Kazarow 1974). Roy Hardy (1975) examined the development of CIRCO, a Spanish language test battery for preschool children.

11.1.10 Prison Programs

In New Jersey, ETS has been involved in the evaluation of educational programs for prisoners. Developed and administered by Mercer County Community College, the programs have been subject to ongoing study by Ronald L. Flaugher and Samuel Barnett (1972).

11.1.11 Reading Programs

ETS evaluators have been involved in various ways in a range of existing and proposed reading programs. For example, in an extensive, national evaluation, Donald A. Trismen et al. (1976) studied the effectiveness of reading instruction in compensatory programs. At the same time, Donald E. Powers (1973) conducted a small study of the impact of a local reading program in Trenton, New Jersey. Ann M. Bussis, Edward A. Chittenden, and Marianne Amarel reported the results of their study of primary school teachers’ perceptions of their own teaching behavior (Bussis et al. 1976). Earlier, Richard T. Murphy surveyed the reading competencies and needs of the adult population (Murphy 1973a).

11.1.12 Special Education

Samuel Ball and Karla Goldman (1976) conducted an evaluation of the largest private school for the learning disabled in New York City, and Carol Vale (1975) of the ETS office in Berkeley directed a national needs assessment concerning educational technology and special education. Paul Campbell (1976) directed a major study of an intervention program for learning disabled juvenile delinquents.

Rights and permissions

This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License, which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Copyright information

© 2017 Educational Testing Service

About this chapter

Cite this chapter

Ball, S. (2017). Evaluating Educational Programs. In: Bennett, R., von Davier, M. (eds) Advancing Human Assessment. Methodology of Educational Measurement and Assessment. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-58687-8

  • Online ISBN: 978-3-319-58689-2

  • eBook Packages: Education (R0)