Evaluation Domains/Week 3

This class has several sections:


 * First is a Q&A session presented in the general form of a Socratic Dialogue.

Socratic Dialogue

 * Student: If they haven't been collecting data, then as evaluators we go out there and collect data with nothing to compare it to: we just have what we gathered?
 * Scriven: No, we can dig into the archives.
 * Student: Well then, they would have collected that information.
 * Scriven: Yeah, but that doesn't mean that they corrupted it. If you dig into the archives, you often find things that they don't know are there: kinds of data like test scores from hundreds of thousands of students. They've probably looked at the bottom line of those test scores, but they haven't looked carefully at the [indistinguishable] analysis – and that may have secrets ready to unload on you. So there's a lot that you can pick up that isn't corruptible without really remarkable achievements in the way of deception and so on. But there is some that is likely to be corruptible, and that is likely to be part of your investigations.

That can be part of your double-check routine: you can interview the students, or the subjects, whoever they were; get hold of their medical records if it's the medical field; talk to the nurses who attended them; talk to teachers if it's education; and so on. Does what they say accord with what you see in the data?
 * Student: What about at-risk populations in the system who don't have these records on hand?
 * Scriven: If you don't have the first-line data, you must face that as a major problem for the evaluation, but there are a lot of options:


 * 1) There may be ways to find footprints.
 * 2) There may be jobs at stake; records from employers who saw things in their interviews; in their own tests.
 * 3) Ask students how they did, in their opinion, and compare that to the facts...
 * Ingenuity is one of the crucial things in the evaluator's toolkit. You construct from a dozen sources: when you get down to a particular case, you start thinking about sources. For example, one of our crucial values was housing, and in one case one of the houses was still standing; in other cases, how much food they were eating was long gone, but how they bought food from the company store, or wherever, may be on record. And in the memories of the elders of the tribe may be things like how well disposed people are towards the tribe: you just dig, dig, dig. But that takes time, and can be expensive if you're supporting a crew in the field: you have to take it into account in your plan for the evaluation.
 * And I can't say that you always find a good solution. We had a certain notation for indicating unreliable data: fuzzy lines, and so on; we would change the graphs a little bit. You can get a rough sense and shade that in instead of reading a line in a fuzzy area: you see a fuzzy line here, and you don't place much weight on that score. But in the fundamental ____ you're looking at questions where you have over 70 variables, and if 50 of them are behaving a certain way, give or take a little bit of judder, then you need not be too worried by a few where there's a lot of judder. You can more or less guess at the pre-test if there hadn't been one... things go together to a substantial degree: if the family is getting fitter, eating well and so on, then it's doing better at school, you can bet. But it's a typical problem that you have to face.
 * Putting people in the field is not like measuring the amount of rust forming on the Golden Gate Bridge year by year: it's like estimating the blood pressure of the people who crossed the Golden Gate Bridge year by year. It's tough. Guided sampling helps; even then it's tough.


 * Scriven: What have you chosen as your "god-child"?
 * Student: For me I was thinking of having as a project, one of these categories: Either Evaluation Approaches; Evaluation Subdivisions (Academic), or Evaluation Subdivisions (Professional) - a continuum going from least professional to most professional, with rough metrics like: "How much money is involved in that industry?"
 * Scriven: how far have you gone with that?
 * Student: It's a completely blank slate right now.
 * Scriven: Any thoughts for filling it in?
 * Student: I started looking at the list that you have: referees, diamond buyers, etc. - and I started categorizing. My initial categories were "Amateur Evaluation" and "Professional Evaluation" - possibly with multiple grades within that, depending on factors like the time of training they have to have (like an apprenticeship), how many accepted levels there are within the field, and what kind of measure those levels are – I would assume usually categorical, but if it's possible they get us up to (the third level) -

I'm also interested in what would count as evaluation big-picture: all the things that are talked about in bits and pieces, like in Shadish, Cook, & Leviton's "Foundations of Program Evaluation" - at the beginning of the chapter there's a little list of the other things that they would consider evaluation. I want to look at that and see if it expands, and especially look at management consulting within that, because it wasn't explicitly mentioned, but it seemed to do a lot of similar evaluation-type things for the business and government community.
 * Scriven: Yes, but it's considerably more general. For example: it may be mostly about leadership. It may be mostly about recruiting, or sales; these people all call themselves "management consultants" - and there's a little bit of eval in these, but not much: it's mostly about "how do you get something done" rather than "how well was it done." But still, you're fleshing out the amateur-to-professional spectrum with some areas where the details might be interesting, such as an accountant: that's a good way to go with your topic. Anything else that you're looking at or considering for your project as you want to represent it?
 * Student: Well, I'm interested in - with the management consulting I think of the big five: McKinsey, Booz, and so on - not necessarily the Accenture side (the more operational side) - and also the financial firms, PwC; so that's just another data point there.
 * Scriven: So that's your professional end of the spectrum?  (Yeah). Okay; the rest of you: What has he left out? This is a geographic question about the geography of evaluation.
 * Student: The entire field of product development, which has a ton of evaluation embedded in it?
 * Scriven: Yes, that's really important, but I'm thinking of a larger slice even than that.

Students: (silence)
 * Scriven: What's missing is the whole slew of professional, non-academic evaluators, for example: horse judges; referees; the dance and diving evaluators. These are professionals, they take serious training, and they are clearly evaluating all the time. It's probably fair to say that they're mostly Perceptual Evaluators - so you should have thought to yourself, "The academics are doing long-term inferences, but there's a group that don't do that: they only have a tiny window of time to make the call." And it doesn't mean they're amateurs: they're highly trained perceptualists.
 * This is part of what I mean when I say, "the Domain of Evaluation is colossal" - and part of what I mean when I say, "the Domain of Professional Evaluation is colossal." So we learn something from talking about that case.

Now, the subquestion you're dealing with is an interesting question: but you gave it too large a title.

Where does most of the evaluation go on in K-12 Schools?

 * Scriven: Okay: if you think about school, K-12 school, where do you see evaluation going on?
 * Student: I have a question that goes underneath that, in terms of: "What are the most basic activities of evaluation?" because even in the process used for a child to turn a door-handle, I imagine that there is an evaluative process going on in determining that there is a door handle: what it is for, and so on.
 * Scriven: Most of that is not evaluative: it's just recognition. Okay, back to the school:
 * Student: I think you'd want to look at learning outcomes, teacher performance, student performance...
 * Scriven: Okay, those are good examples. Give me another example.
 * Student: Safety of the physical environment for the children.
 * Scriven: Yes; "safe" and "unsafe" are evaluative terms.
 * Student: Do they have good equipment?
 * Student: Are you asking for evaluation in schools or evaluation of schools?
 * Scriven: In school: in your life, as a schoolchild, what evaluations did you see, or what went on around you that you didn't notice at the time, but now you think of it as evaluative?
 * Student: Interpersonal, socially speaking, the hierarchy of children: evaluation of cool / not-cool.
 * Scriven: Yes. Now, what I like about that is it's not the academic evaluation: it's part of the ebb and flow of evaluation going on in every phase of a kid's life at school. So give me some other examples.
 * Student: The choice and timing of breaks, or lunch; simple decisions, made primarily by the teachers.
 * Scriven: Forget the teachers: everybody knows that one. Where else is evaluation going on?
 * Student: Each student is evaluating themselves: how to prioritize their effort and time.
 * Scriven: That's big: how important is it that I do this homework, rather than risk that homework -
 * Student: Self-Evaluation
 * Scriven: Very good. Very important type. Not done enough by kids. What else?
 * Scriven: I'm driving you away from everything you normally think of about school: I want you thinking about the life of the schoolkid, when you normally think about the task of the schoolkid in the eyes of the teachers, his mother, the principal. That's all very well, but it ain't the life of the schoolkid that you are pulling samples out of.
 * Student: I think evaluation is tied to cool/not cool, but it goes beyond people to things like lunchboxes and other objects.
 * Scriven: Yes, how about what's in the lunchboxes?
 * Student: Whether it's nutritious or not?
 * Scriven: Screw nutrition: taste is a pretty big one. You don't have to get into the highly academic evaluative issue of nutrition. So there's the food; your friends, your enemies; the teachers, the principal, the administrators, the punishments; the difficulty of learning subjects; the importance of subjects; the "being good at games" - there was one mention of that early on in this discussion, but you obviously weren't thinking of it in the sense that when you read the local newspaper, page after page is about how this group did well in competition with another school, and so on. Very important to the kids and parents and school managers.
 * Student: (That might be mostly in the US. . . I didn't have much of that experience as a grade-schooler in Dubai)

Summary of criteria for Academic Eval
(Derived from the list above)

 * Cool or Not Cool
 * How to prioritize effort & time (values, subconscious and conscious)
 * Self-Evaluation ("I am smart; I am not smart")
 * Lunch / Food Evaluation
 * Cost/Benefit Evaluation (buying food)
 * Risk Evaluation (bullies)
 * Social: skill evaluation of peers; Social Desirability Evaluation
 * (J: I think the main evaluative criterion I had was, "Is this fun?")

Analytic Evaluation
___ You do analysis before doing appraisal, and the analysis markedly affects the appraisal; e.g., for a skilled glass cutter: "How many pieces can I make out of this that will maximize the value of the pieces?" - Things to consider: flaws.

 * Literal Break-Down: Recognize the standard components.
 * Metaphorical Break-Down: You may break it down metaphorically speaking, and there are two ways you can do that: 1) break it out into components; 2) break it out into dimensions.

Component 14:13:58 - You can evaluate from the artifact: no odd noises; well repaired and tarted up again so that it looks closer to new. That's all "Component Evaluation" - because it's holistic in terms of the look you get to begin with, but the moment you look under the hood, or underneath...

Dimensional 14:16:29

Conceptual Categories: e.g., evaluating employees, you'll evaluate along a string of dimensions: you won't evaluate their arms, their hands, their eyes... that's too fine a structure for anyone but a makeup artist. If you're a business person evaluating a candidate for a job, you'll break it down into 1) DUTIES that must be done (things we need them to do to meet our needs); 2) the performance of this individual in those categories (e.g., must be able to meet and greet...; needs Spanish and English -> test them on their colloquial Spanish). That case is Analytic Evaluation - and this analytic process is driven by those duties: they are the categories we use to break up the overall life-work record of this individual.

Evaluate a Cruise Taken
You come back, they want to hear how it was: Evaluate the cruise briefly for them:


 * J:
 * 1) Initial holistic evaluation: "it was good" or "It was awesome!" or "oh boy!" (an indication of something not-good, or some big story to tell). (This is also culturally based: other cultures might be quite happy saying "it was bad.")
 * 2) then breakdowns
 * A: I would tell them about the food, the lodging, ...
 * Scriven: You're doing analytic, and then the component sub-dimension of analytic. Now, imagine that we were supposed to write up for a cruise guidebook: what would you look for as the next step in doing that?
 * 1) __Develop a common list of analytic headings__ - a common checklist, important for comparability. Then you'd want to be sure about that list:
 * 2) It has to be comprehensive.
 * 3) Categories cannot overlap - why? Suppose the pool is 5/5 stars: what happens? We've come to Pool: 5 stars; next one, Facilities... it can't get less than 1 or 2 stars because of the shadow from the pool rating. You blur and misrepresent: you make it impossible for the facilities to get a really bad rating. And that's not helpful, because people are expecting separation.
 * 4) It should be no more detailed than necessary, so people don't have to do work twice.

There are a number of rules for constructing a checklist: You can see the result, which I’ve never had time to improve - if you want to help try to improve it; it’s online at http://www.wmich.edu/evalctr/checklists/

. ..

The one by me has one mistake in it, which I discovered 5 years ago: see if you can find it before I correct it.

(J: I COULDN'T FIND THIS DOCUMENT AT http://www.wmich.edu/evalctr/checklists/evaluation-checklists/ - ooh, I found it at: http://www.wmich.edu/evalctr/checklists/about-checklists/)

The Elders

 * 1) Logic
 * 2) Ethics
 * 3) Aesthetics
 * 4) Medicine

The Establishment

Holistic Evaluation
Okay, we've gone through ___ and ___;

Holistic Evaluation: The Quick Look (Heuristic?): we see something and take a look at it - a large slice of our everyday evaluation is done in the blink of an eye, like that.

It may be that the brain is breaking this up into components: we have various people who think that’s true;

J: Would this be like / map to “Heuristic”? S: No; a Heuristic is an __aid to a response__ - a heuristic could be an easily-memorized acronym;

2. Break it up in some way:


 * 1) Break it up in the mind: that's Component Eval, subspecies Virtual.
 * 2) Break it up literally: that's Component Eval, subspecies ___.

Usually circumstances constrain you as to which method to take.

Each of them can be done professionally, or amateurishly.

Complex, inferential evaluation is something else: the evaluation you do every time you cross the road. It takes evaluation of several risks: multiple concurrent threats; a kid in between two cars, can't see me, sees dad across the road, runs into the road. (J: modeling hidden information; modeling risks in a situation; modeling risks of the unknown in a situation)

Disaster Evaluation (Under Phenomena Evaluation)

 * 1) Preparation (pre-identification of values / the checklist you will be working from during the concurrent phase)
 * 2) Concurrent (What you do when the disaster is actually happening)
 * 3) Immediate Repair (Reparative)
 * 4) Rebuild

Now, 12 years ago nobody had thought of that. It really wasn't until the great Indonesian tsunami that we began to get a systematic inferential evaluation approach going for disasters. That's when I decided we should start doing something serious about it. Some people were doing something specific; some agencies were doing partial checklists. At an AEA meeting there was a panel to start the process; the next step was, "Let's have a TIG."


 * 1) What sound should be made?
 * 2) What action should be taken in response to the sound? - i.e., head for high ground, etc. It turns out it is possible to do that. (J: Do what?)

== Produce Evaluation (Under Product Evaluation)

New disciplines:

 * 1) Interdisciplinary
 * 2) Meta-Evaluation
 * 3) Crowdsourced
 * 4) (Ben: Semi-Quantitative)

One of those things doesn't fit: I'll tell you in a minute which it is; it's under #9, technical tools. Now you know most of what's in there: any questions?

= Socratic Dialogue #2 =

I was recently called in for the Summer Olympics synchronized swimming competition to assist them in developing better standards for the identification of people in five categories: they felt they hadn't had enough training to get the degree of reliability they needed to have.

In terms of the standard names of categories of evaluation, what sub-type would this be?


 * 1) Ben: Perceptual: Performance - Depending on the context it might also be personnel
 * J: Jurisprudential method? Or would that be just for laws, judges, courts, the legal system?
 * Scriven: The jury is kind of a flier on that spectrum: they get a little bit of training from the judge after they've been chosen for this decision, so they're not quite the person off the street. They've also been trained a bit by the two counsels arguing about what is and is not relevant. So they're a semi-trained group: that's a lot better than an untrained group. So I think we can say that it's a half-way house between the man or woman on the street and the fully-trained expert. [Quasi-Expert] or [Trainee-Expert] or something like that would be a good name.

On the world stage, we start talking about personnel: the best diver in the world off that platform - from a dive, to a set of dives, to an inference about the diver. If the set of dives was by and large good, but there were two indications of a bad fault, that would be ___ to the diver, but it wouldn't kill (his/her performance) at the Olympics.

Other questions on that one? They will be more difficult than that one, at least in part.

Now, second: Here’s a point that I want to develop before we get to the test.

What degree of training does an evaluator need in order to be counted as a professional?
This is obviously not something that people have agreed about in country-wide associations of eval. The general point is that it's a long process getting highly trained; it's completely unrealistic to think that in order to be counted as an official evaluator you need to be trained to the level of competence in all relevant evaluative disciplines. FACTCHECK!* That's never going to happen; we're going to have to add statisticians for tough quant. cases, because most teams won't have a person at that level of quality, as we can see from the Duke University disaster last year, which killed two patients because of the stats on the trials. *FACTCHECK!

It helps considerably to have subject-matter knowledge, but it is not absolutely essential as long as you hire consultants who DO have good subject-matter knowledge, and use them whenever appropriate. When to use them is crucial, of course: having them but not using them doesn't count.

But what does count? Not easy to say, but you'll have to err on the safe side consistently there. Which means: have a primary consultant on subject matter, and a secondary consultant who judges the primary consultant's judgments - who only comes in occasionally, but comes in at least twice during the course of the evaluation.

Now, that’s not such a large chore as you might think: many evaluators have a considerable range of subject knowledge, partly because they acquire it on the job. Doing a lot of work in Transylvania … you don’t have to be the world’s leading expert on the subject matter, you just have to be good at it.

It must be treated very seriously, and very early, so that you can get the people you need pinned down for the times when you can use them best.

Now, another thing that's serious when we move to the point of taking charge of an evaluation - which you can do here, courtesy of people like Rebecca Eddy, who was on faculty here until 1.5 years ago and retired in order to set up her own evaluation shop, which she keeps busy, mostly in educational eval. She's a good quantitative evaluator. She hires quite a few students from here to help with each project, and you pick up a lot of field competence from working in that sort of context.

Tarek Azam has a lot of projects running, where he hires students to do it, and you may yourselves dig up opportunities. Getting paid in learning is a way of employers exploiting you, but you don’t have to be exploited.

If you are dead set on working with psychiatric kids, then you may want to be willing to put in some time as an unpaid supervised intern, because that’s not very hard to come by. And the people that come by it are people who can spare someone to do mentoring - which isn’t everybody on every project.

But we’re building up a pretty good network these days so that we can meet those requests.

However, I think the second most important thing to have in your repertoire is an outside interest. Alan was an experienced gay activist, so he had the experience, and the motivational push, to be able to run his own projects and experiments.

You want to be looking at your interests outside of eval, if you don’t already have key interests within eval: that’s where you want to pick up key skills that are salable, as well as the Eval skills. Anything you can acquire there is also very valuable.

Any more questions on that sort of stuff?

Three New Sub-Areas for The Future Of Evaluation
Each of these is monstrous: a huge, century-long effort. It's important not to think that this is going to happen fast; centuries are more like it:

α - Alpha Discipline
Under Roles for Evaluation - of course, the main role for evaluation is "doing evaluation" - but formative isn't the same as summative. It will involve many of the same moves, but its purpose is different. Summative can be delayed formative; the one thing that can't happen is formative as premature summative... but formative can be done like a summative, pretending that the program is at this point completed. How good is it then? It's quite helpful to say to the people who are doing it: half of it is as good as you need to have it, but the other half is a long way from home. One half would be an A; the other half, a C. That would be a way to convey the reality of the situation. These are the big three that I want to see us come to.

The time scale: think of a thermometer with 100º; the Alpha role is at about 25º - and it only began about 9 years ago. On the age scale... well, I wrote a long study on what interdisciplinarity is in '91, so it's now 22 years old. Not too bad! So what's happened:

Criteria for Alpha Discipline: 
 * 1) Something that is itself an essential tool in the operation of several, perhaps many other disciplines
 * 2) And it is an independent discipline in its own right.
 * 3) Intellectually important enough to be worth studying and developing on its own grounds.
 * 4)  Evaluation is not just a help to many other disciplines:
 * 5) It is a help to every discipline, outside as well as inside science
 * 6) Without its approval, there are no other disciplines. (J: But don't we do formal, but non-theoretical, evaluation right now to assign disciplines, based in large part on politics, felt needs, and history? So to do this, we have to convince people of the worth of Evaluation... and to do that, we need "contingent theories of evaluation" with "meta-theoretical nomenclature to compare and contrast the relative strengths and weaknesses..." (Shadish, 1999, 'Evaluation is Who We Are', pp. 7-8))
 * 7) We now discover that Medicine has gotten very sloppy, and Physics has gotten even worse than that: its foundations of excellence do not meet standards of minimum agreement between equally competent judges. Repeatability, always quoted by the second string or better as the foundation of their quality control: they don't meet repeatability standards. (J: Don't we need a theory of change if we want to march forward on this issue?)
 * 8) - There’s the work by Coryn and Allies, including me, which looks at way that nations that do their funding of research through committees was doing; they were doing rather variably; New Zealand got 98/100, France at bottom got 7/100; now that Switzerland, Russia, Canada, NZ all asked Coryn to go in with a crew in order to set up better standards for their own committees, this has gone very well: the latest issue of the Swiss publication “review of research efforts” has a long piece about how they have improved the system. Similarly in Canada and so on.
 * 9) At the applied level, improving practice on evaluation principles is showing payoff from the alpha-discipline candidacy. These are all people with PhDs and world-class reputations in some scientific field, and we have to convince them that it's time to improve it further.
 * 10) So: 22% - not numerically of course, but the first icebreaker through is the really important one: we're at that point with the Alpha discipline procedure.

Q: Do you have a theory of change?

Scriven: Every time the change gets made, you have doubled the number of places that have been changed, if it operates on a one-on-one basis; that would be a maximum for it – but maybe not... because the great speeches by the great scientists often went to hundreds and affected half of them. Breakthroughs are often slow-starting; that's typical, tangential to the ____; and then they begin to pick up as the numbers affected get bigger, and each is divided into two or more. So I think we've made some progress in seeing what the progress is, and that's why I feel this is not too bad an estimate. But ahead of us are the vast number of people who still secretly hold dear the positivist doctrine, explicitly or implicitly, and have got to have their ideas changed. At the end of that war by the wolf-pack of evaluators, we obtain equality, at least, from the masses. Acknowledgement of alpha status will be spotty to begin with; although slow behind the recognition of the importance of the wolf-pack, it will...

Q: (J:) What are the metrics by which evaluation is successful, that we can tell other people?

Exemplar Discipline

 * 1) X Discipline / X Role: "Exemplar"
 * 2) The Exemplar Role for evaluation is the following: the role of evaluation done well being treated as the model for doing virtually all applied science. For that I'm setting up a 100-year cycle, and I think we're at about .05º - because its age is roughly half a year. Omega I'll save, and tell you all about when next we meet.

Omega Discipline

 * 1) Omega Discipline - Application of Evaluation to Ethics.

Interdisciplinary
My use of the term, which was done without knowledge of the Marxist / Feminist stuff -