Introduction added 2019
The following is an analysis I wrote regarding a teacher evaluation system that was recently being used in our district. The school district was using the iReady scores of the students in teachers' classes to determine how those teachers were to be rated. The problem was that every step of the process was mathematically invalid from a statistical analysis approach. They were not analyzing what they thought they were analyzing, and teachers were paying the price. A lack of probabilistic and statistical knowledge is endemic to the field of education. K-12 educators are, as a group, woefully ignorant of how to reason properly in these mathematical branches. The problem with the iReady situation was that not only was the product not built to generate the data the school district needed to make a valid analysis of an individual teacher's performance, but the data it did give was then analyzed in an invalid way. Thus, problems compounded on top of problems. But once again, this is par for the course in the field of education.
I wrote the below analysis of the problems of the final step of the teacher evaluation as an attempt to explain to fellow educators where our district went wrong. I intended to go back later and write about the problems with the iReady software itself in terms of its statistical validity, but found it was too monumental a task to explain the underlying statistics in a way that didn't result in an entire book needing to be written. And I knew no one was going to spend the time to read something that long. I also knew that educators jump from program to program with wild abandon anyway, so it was only a matter of time before iReady teacher evaluation was dumped. Sure enough, iReady was abandoned this last year. And so I never went on to write the other parts of this analysis. But I leave this one part here for posterity as a snapshot into one of the most glaring issues in education today.
Our Growth Rating Formula
To determine the validity of a math formula in a given situation we must examine it first in an imaginary perfect world containing our situation. By perfect we do not mean better or good. We mean an imaginary state where we have absolute control of all variables. If our formula can’t even work in a state of perfect knowledge and data validity then we certainly don’t want to use it in the real world where tons of extraneous factors come into play. This is how all of math works. There is no such thing as a perfect 2 inches in the real world. The difference may be microscopic but there will be a difference. So, if we say 2 inches plus 2 inches equals 4 inches, we are actually saying “in a perfect world where I could control the size of an object down to a quark (or if they exist, a string) the math would work that 2 inches plus 2 inches would equal 4 inches. Therefore, if I need to make a cut on this board in the real world I’m okay with the fact that these board pieces will not be exactly 2 inches. I will just work to make it as close to the perfect world result as possible because I know the math itself is valid.” The math works in this case. It’s the real world that may fudge things a bit.
Let’s look at imaginary student growth in two imaginary classes and see if we can determine which teacher might be the better one without even breaking out a calculator. To be clear, we are imagining that these numbers reflect the exact growth each student made. We are not saying these are the growth numbers shown in iReady using the method the district is promoting; we will examine that issue later. We are just saying, for the sake of argument, that somehow we omnisciently know what growth each student really made. We will use the iReady growth-points requirement for 2nd grade (39) to indicate a year’s worth of growth in this example. I am not making any statement as to the validity of the iReady targets themselves right now either; I am just using the number for this example. We will look at the validity of those targets later. Finally, we are also saying in this example that there are no external influences affecting these students’ performances in class. We will say that there is one internal factor at play: students intrinsically learn at different rates. Thus, even with a set skill of content delivery to all types of learners, a teacher may have a class with student growth that is not identical between students. But we will add the stipulation that there are no ESE students. We will also add the additional stipulation that the average learning capability of the entire class is exactly an average student’s year’s growth in a year’s time. So, this is a perfectly constructed imaginary scenario where every number really means what we want it to mean. The teacher really did meet the needs of each learner in the way that learner’s score reflects.
Teacher #1: 38, 38, 35, 37, 35, 38, 48, 45, 55, 50
Teacher #2: 0, 3, 2, 5, 40, 39, 40, 39, 40, 41
Now I am sure that without using a calculator any educator could tell that Teacher #1 is obviously the better teacher. Thirty percent of Teacher #1’s class grew more than the best student in Teacher #2’s class, and her lowest-performing students got very close to a year’s growth. We can all hopefully see that Teacher #2 needs some serious help, as it appears that she is only teaching part of her class.
If we want to generate a final single score that encapsulates these teachers’ effectiveness, we will need to do some math. As we increase the number of students it will get harder and harder to simply eyeball the classes, and as the numbers become less extreme it will get harder still. But the question is, which formula should we use to do this? Whatever formula we use should produce an answer showing that Teacher #1 is better than Teacher #2. If we plug the numbers into a formula and we don’t get that result, and somehow it says that Teacher #2 is the best, then we can of course say that the formula is invalid for use in teacher evaluation. Maybe the formula would work in some other completely different situation, but it certainly doesn’t work in this one.
Let’s now examine two different formulas in our perfect world and see which one works. We will examine a straight average formula and the district’s formula.
The first formula we will try is the straight average. We simply add up all the scores and divide by how many scores there are. Here’s our results with that formula:
Teacher #1: 41.9
Teacher #2: 24.9
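The straight-average step above can be sketched in a few lines of Python. The class lists and the two results come straight from the example; the function name is mine:

```python
# The two imaginary classes of exact growth scores from the example above.
teacher_1 = [38, 38, 35, 37, 35, 38, 48, 45, 55, 50]
teacher_2 = [0, 3, 2, 5, 40, 39, 40, 39, 40, 41]

def straight_average(scores):
    """Add up all the growth scores and divide by the class size."""
    return sum(scores) / len(scores)

print(straight_average(teacher_1))  # 41.9
print(straight_average(teacher_2))  # 24.9
```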
Looking at those results, we can say simplistically that it looks like a straight average worked. It told us Teacher #1 is better than Teacher #2. Now that we have a working formula, we can see if it works in any perfect-world scenario we could come up with. Are there perfect-world scenarios where it patently doesn’t work at all? Well... Will the formula work when the student growth scores are less extreme than what we imagined for Teacher #2? Yes, it will. It will still catch the situation where Teacher #1’s scores cluster right around the 39, with most slightly below it, and Teacher #2 has some of her students exactly at 39 and the other half clustered around, say, 20. In the real world we might come up with some plausible reasons based on external factors to account for such a performance. But in our perfect world we just want to make sure our formula works when those external factors don’t exist. And as we move our student growth scores together for these imaginary teachers, the formula still works. If Teacher #2 managed to make two years of growth with half the class and the rest made basically no growth, then the average would work out that she would tie Teacher #1 in a straight average. As we split the scores apart further and further, however, it becomes a more and more unlikely scenario. Even the given split is highly unlikely. I just made it that big to make the problem easily viewable without doing math at first. But a smaller and smaller split between top and bottom scores is more and more likely, and thankfully the straight-average method keeps on working in those situations.
Teacher #1 in this case made just slightly more than a year’s growth on average with her students, which would have been a 39. How would you rate her? Well, being teachers, we would probably rate her on a percentage and say that if full mastery of the assignment of getting a year’s worth of growth out of her class equals 100%, then Teacher #1 receives 107% (41.9 divided by 39). Pretty good. She gets an A+. All her scores were fairly clustered, with a few outliers as might be expected, so we can be fairly confident in our assessment of her. Perhaps, though, we might say that getting a year’s growth on average is merely C work. An average teacher should be able to do that. Earning a B or an A would then require a higher percentage of a year’s growth. So maybe to make a B you must have 110% of a year’s growth, and to make an A you need 120%. In the real world we would have to do some research to determine those percentages if we were going to raise them like that. You can argue about which of those percentage requirements is better, but either way, the formula itself worked.
Now Teacher #2 ended up with a 24.9 average. If we rate her in the straightforward way we rated Teacher #1, we end up with a 64% score when we divide 24.9 by the year’s-growth total of 39. She receives a low D in the first percentage breakdown. She would receive an F in the second one. She obviously has some abilities based on some of her students’ scores, but her inability to reach 40% of her class in a perfect world with no external factors causes us concern. Something weird is going on in that class, and our formula and rating system caught the problem and flagged it for us. Well done, formula! We are ready to take it into the real world now and start modifying it to include all those messy external factors that exist out there. Right off the bat, one thing we probably want to add to our formula is a measurement of the variance between the scores of students, in order to really emphasize a situation like the one happening in Teacher #2’s classroom. Such a measurement would really help flag the unlikely situation where the scores drift further apart. But for now we will stop here with this formula and say: so far, so good.
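That variance add-on could look something like the following sketch. The 15-point cutoff is purely an illustrative number I chose to make the flag fire for Teacher #2; a real system would need research to pick it:

```python
import statistics

# The two imaginary classes of exact growth scores from the example above.
teacher_1 = [38, 38, 35, 37, 35, 38, 48, 45, 55, 50]
teacher_2 = [0, 3, 2, 5, 40, 39, 40, 39, 40, 41]

def growth_report(scores, spread_flag=15):
    """Return the class average, the spread of scores, and a flag for
    unusually spread-out classes. The 15-point cutoff is illustrative only."""
    avg = statistics.mean(scores)
    spread = statistics.stdev(scores)  # sample standard deviation
    return avg, spread, spread > spread_flag

print(growth_report(teacher_1))  # modest spread, no flag
print(growth_report(teacher_2))  # huge split between the two halves, flagged
```

Teacher #1's scores sit within a handful of points of each other, so her spread stays under the cutoff; Teacher #2's half-and-half split produces a standard deviation near 19 points and trips the flag.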
The District’s Formula
Now let’s turn to another possible formula and see what happens. The formula we will now use is the district’s formula. In the district’s formula, a student’s score is turned into a 1 if they made at least 39 points and a 0 if they did not. In other words, the gradient of possible growth scores from 0 to 55 (in the case of our example) is completely collapsed into a binary set (1 or 0). Then the scores are added up and divided by the number of students in the class. Teachers are then rated based on the final number. It sounds rather like the first straight-average formula, and if we were just glancing at it in passing as we were plugging our numbers into a worksheet, we might think that it is basically the same thing as an average and so it will probably work too. We might even think, well, if anything, it makes the numbers easier to mentally compute. But let’s take the time to see what really happens…
Collapsing the growth scores into 1's and 0's and then averaging them gives:
Teacher #1: 0,0,0,0,0,0,1,1,1,1 = .4
Teacher #2: 0,0,0,0,1,1,1,1,1,1 = .6
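As a sketch, the district's collapse-and-average step looks like this in code. The 39-point target and the class lists come from the example; the function name is mine:

```python
TARGET = 39  # the iReady 2nd-grade growth target used throughout the example

# The two imaginary classes of exact growth scores from the example above.
teacher_1 = [38, 38, 35, 37, 35, 38, 48, 45, 55, 50]
teacher_2 = [0, 3, 2, 5, 40, 39, 40, 39, 40, 41]

def district_score(scores, target=TARGET):
    """Collapse each growth score to 1 (met target) or 0 (did not), then average."""
    binary = [1 if score >= target else 0 for score in scores]
    return sum(binary) / len(binary)

print(district_score(teacher_1))  # 0.4
print(district_score(teacher_2))  # 0.6
```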
What???? Teacher #2 is better than Teacher #1? Of the two teachers, Teacher #1 had the higher growth among her top-tier students. She had thirty percent of her students make more growth than the highest-growth student of Teacher #2. She also didn’t leave 40% of her students back at the bus stop, as Teacher #2 seems to have done. Every student in Teacher #1’s classroom progressed to at least within spitting distance of a year’s growth. And yet somehow, according to this formula, Teacher #2 is the better teacher… 20 percentage points better!
This is what we would call a Type 2 error: an error in which a false claim is mistakenly accepted as true. If we say that we are starting with the base assumption that most teachers are going to be OK teachers, then we can call that assumption the “null hypothesis.” In other words, our null hypothesis is that Teacher #1 and Teacher #2 are competent teachers. We could compare this favorably with our country’s legal system, wherein everyone is presumed innocent until proven guilty. However, life being what it is, sometimes your test of the evidence will be wrong. In a Type 2 error within the legal system, a guilty person is declared innocent. That’s not good. However, it’s not as bad as a Type 1 error. If a Type 1 error occurs within our legal system, then an innocent person is declared guilty and punished. Most people would agree that that is far worse than the Type 2 error. So, people setting up a formula to analyze data (who know what they are doing) will always trade some increase in Type 2 errors for a decrease in Type 1 errors. In this case, we know something is wrong with Teacher #2, but the formula made her look much better than Teacher #1, who we probably think was an okay teacher. The null hypothesis is obviously false for Teacher #2, since she is not an okay teacher, but this formula says it is true.
This becomes more obvious when we apply the letter grade based on the district’s scale to these teachers. In this case Teacher #1 gets a D, because the district says that 32% to 40% is a D. Teacher #2 gets a B on this rating scale, because 54% to 61% is a B. She nearly got an A. The fact that the district formula did that is concerning. By the way, do you feel Teacher #1 should really be given a D? Is she that bad in your mind? If you think a D is unfair, then you are saying you think a Type 1 error is also being made. But perhaps that is arguable… The important thing is that the straight-average formula we tried first did not produce a Type 2 error unless we increased the variance between Teacher #2’s scores beyond two years’ growth. So as of right now we could stop and say we are going to go with the straight average, because that one produces fewer errors.
But let’s continue examining the District’s method and see if it produces any of the dreaded type 1 errors in a way no one in their right mind would argue with. Does it condemn innocent people in our perfect world where we know the evidence we are analyzing is absolutely correct? If this formula were to condemn an OK teacher even when we are sure there are no external factors to mess things up, then we know this formula has no way to redeem itself. There is no way to fix the formula if it can’t even avoid type 1 errors in a perfectly controlled world.
Let’s say we change teacher #1’s scores to the following:
38, 38, 38, 38, 38, 38, 38, 38, 38, 38
In other words, every student was one point away from a year’s growth. Every one of them made 97% of a year’s growth. I want everyone to be able to see the data trend crystal clear without using math at this point. Also, we have now removed one of our original stipulations of the perfect world: now all students learn at exactly the same rate, which happens to be exactly a year’s growth in a year’s time if the teacher does her job exactly right. Now, without doing the math, what does your gut tell you this teacher’s effectiveness should be rated? If you think a year’s worth of student growth is worth a C for a teacher, then perhaps you might think, “This teacher missed the mark by as close as it is possible to get, so I’d give her at least a low C.” If you think a year’s worth of student growth is worth a 100% A for a teacher, then you might think, “She’s ridiculously close, so I’ll give her an A.” But would any of you say this teacher has an effectiveness rating of an F? Think about the weird Teacher #2 who managed to make a B under this formula. Is there any way you could fail this new version of Teacher #1 with a clear conscience after seeing that happen with Teacher #2? I guess there’s probably some stone-cold individual out there who would say, “Yep, super close doesn’t even begin to cut it. Flunk her!” In my imagination I see a Samuel L. Jackson character delivering this line with some over-the-top use of expletives thrown in, just to emphasize how very stone-cold he is.
Anyway, under the district’s formula, when you do the math, this teacher gets a score of 0% because a 38 translates to a 0 when you squash the gradient. She didn’t just fail, she didn’t even show up to school. However, if you agree with me then you would call this a serious Type 1 error. An OK teacher has just been deemed a complete and absolute failure by the formula itself. Not by external factors, not by the capricious whims of the universe, not by measurement errors, not by anything else was this done, but only by the formula. Therefore, the formula simply is not acceptable to use as a basis in the situation of evaluating a teacher. The formula can’t even get it right in a perfect world.
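To see both formulas side by side on this near-miss class, here is a quick sketch (the function names are mine, taken from the earlier sketches):

```python
near_miss_class = [38] * 10  # every student one point short of the 39 target

def straight_average(scores):
    """Sum all growth scores and divide by class size."""
    return sum(scores) / len(scores)

def district_score(scores, target=39):
    """Collapse each score to 1 (met target) or 0, then average."""
    return sum(1 for s in scores if s >= target) / len(scores)

# Straight average: 38 out of 39, about 97% of a year's growth.
print(straight_average(near_miss_class) / 39)
# District formula: every 38 collapses to 0, so the teacher scores 0%.
print(district_score(near_miss_class))
```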
Where would the district formula work?
You may be wondering if there is a situation in which the district formula might work. Well, there is. If you had a task that someone needed to do in which there was no gradient of accomplishment involved, then you could grade someone on a pass/fail basis (a 1 or a 0). For instance, some sports act this way. If your ball doesn’t go in the basket, then you don’t get partial points. If you don’t cross home plate, you don’t get ¾ of a point for making it to third base. Other sports act on a gradient, however. In swimming, or track and field, or long-distance running you are graded on a gradient of performance, which is your time or distance traveled. This gradient score is then compared to other competitors’ gradient scores to determine who is better. You see, in basketball there is no humanly possible way to grade a shooter on a gradient. How in the world are you going to calculate exactly how close each shot was to going in the basket in the middle of a game in order to hand out a gradient of points? The answer is, you can’t. So, you are stuck with doing pass/fail in basketball. In football and baseball you might be better able to accomplish a gradient scoring system, but the math involved would get intense, and you’d probably start making the sports fans upset when you keep flashing percentages and formulas up on the scoreboard when all they want to do is be entertained for a while. But hopefully everyone can see that basketball is not the correct situation to compare with growing students. There is a definite gradient in student performance, and it is easy to calculate based on that gradient. So if you choose not to use a gradient in such a situation, you will end up with the disturbing Type 2 and Type 1 errors we see when we take the time to examine things. Type 1 and Type 2 errors do occur all the time in those sports that use a pass/fail system, by the way. It’s a constant. They occur everywhere, to varying extents, actually.
However, the binary set system increases the amount of those errors. But nobody really cares in sports if that happens (besides maybe the players and coaches if they fully understood what was happening). But for everyone else it’s entertainment… People even prefer a system where sometimes the less talented team trounces the more talented one if it's for entertainment.
So to sum it up, pass/fail (binary set) formulas are acceptable to use when there is no obvious gradient and no way to realistically calculate a gradient. They are also acceptable to use when no one is making any big decisions based on the outcome. If you're just using it for entertainment (like in sports), then who cares? In that case pass/fail is easier, so why not use it?
iReady Weighs In
Now you may be wondering if the idea to use a pass/fail formula instead of a gradient average came from the iReady people. I don’t know where the district got the formula, but it wasn’t from iReady. The mathematicians at iReady disagree with the district on the proper formula to use when analyzing iReady class scores. Allow me to quote from the official iReady document on this subject, “Using iReady Diagnostic as a Student Growth Measure”:
“To calculate growth for a class or group of students, simply calculate the gain for each student (last diagnostic test minus first diagnostic test). Then calculate the average (adjusted average) of all these gains.”
They have a footnote after the words “adjusted average” where they explain what they mean.
“For students who show a decline, assign them a growth score of zero before calculating the average. Negative gains are translated to zero because it is highly likely that these results are due to measurement error rather than a student truly regressing or losing skills.”
In other words, the iReady people agree with using a straight average to compute classroom growth with the additional stipulation that negative trends be calculated as zeros and not negative numbers. After all, the straight averaging method does not produce egregious type 1 or type 2 errors. In our imaginary perfect world, it actually worked as far as we could see.
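In code, the adjusted average iReady describes would look roughly like the following sketch. The function and variable names are mine, and the diagnostic scores are made up purely for illustration:

```python
def adjusted_average(first_tests, last_tests):
    """iReady-style adjusted average: each student's gain is the last
    diagnostic minus the first, negative gains are clamped to zero
    (treated as measurement error), and the gains are averaged."""
    gains = [max(0, last - first)
             for first, last in zip(first_tests, last_tests)]
    return sum(gains) / len(gains)

# Made-up diagnostic scores for four hypothetical students:
first = [400, 410, 395, 420]
last = [440, 405, 430, 460]
print(adjusted_average(first, last))  # gains of 40, 0, 35, 40 average to 28.75
```

Note that the second student's apparent decline (410 to 405) counts as a zero, not a -5, exactly per the footnote.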
By the way, that last sentence put down in the tiny footnote at the bottom of the iReady document page provides us with a foreshadowing of problems we will have to deal with later… There are other statements in the document that are of interest to us in our quest as well. We’ve only just begun.