Wednesday, December 10, 2014

The 2014 PGR: Confidence Intervals and Graphics

As you have probably heard, the 2014 edition of the PGR went live earlier this week. The results had been extensively previewed, so there wasn't anything terribly surprising that I could see. One thing that was a little surprising was that the confidence intervals we were promised are not going to fly after all, and that three types of graphical representations of the data appear instead. (Of course, this was announced a little ahead of time, too, so it wasn't exactly a surprise, either.)

Why no confidence intervals? A couple of reasons, according to Leiter. (A) Given the design of the survey, in which not all evaluators evaluate all departments, there are several ways to calculate them, and they "did not want the precise method chosen to become a matter of pointless controversy." And (B) properly informative confidence intervals should be reported to two decimal places, and this creates an accuracy-related mismatch with the PGR's long-standing practice of rounding to the tenths place, which is done in order to discourage "invidious comparisons."

I guess I kind of accept point (B), except that I don't see what the big deal would be about posting the more precisely rounded means, along with the accompanying confidence intervals, off to the side or on a separate chart, while retaining the customary tenths-place averages for the main rankings. I don't see how this would encourage invidious comparisons. You'd have numbers rounded to the hundredths, but you'd also have the confidence intervals right there.

Point (A) seems to me to be a non-issue. If there's more than one reliable way to do the calculation, pick one of the reliable ways—whichever one you want, as long as it really is reliable enough—and tell whoever doesn't like it to go fuck themselves. If it's reliable then it's reliable, and it's not like we're measuring the critical mass for weapons-grade plutonium. One method is probably about as good as the next, and I imagine the bootstrapping procedure Healy used on the 2006 data would be totally fine here. (Of course, maybe I'm wrong about all of this, and if I am I hope one of y'all Smokers who knows more about this than me will set me straight.)
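(For what it's worth, here's a rough sketch of the kind of bootstrap calculation I have in mind. The scores below are made up, since we don't have the raw evaluator data, and Healy's actual procedure may differ in its details; the point is just that the calculation itself is not exotic.)

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical raw scores from the evaluators who rated one department.
    # The PGR uses a 0-5 scale; these particular numbers are invented.
    scores = np.array([4.0, 3.5, 4.5, 4.0, 3.0, 4.5, 3.5, 4.0, 4.0, 3.5])

    def bootstrap_ci(data, n_resamples=10_000, alpha=0.05):
        # Percentile bootstrap: resample the scores with replacement,
        # recompute the mean each time, and take the middle (1 - alpha)
        # chunk of the resulting distribution of means.
        means = np.array([
            rng.choice(data, size=len(data), replace=True).mean()
            for _ in range(n_resamples)
        ])
        lower, upper = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return lower, upper

    low, high = bootstrap_ci(scores)
    print(f"mean = {scores.mean():.2f}, 95% CI = ({low:.2f}, {high:.2f})")

A percentile bootstrap isn't the only option, but any of the standard approaches would do; the choice of method isn't where the action is.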

Furthermore, I think the survey-design issue that Leiter says gives rise to point (A) actually underscores the need for confidence intervals. It's just not possible to understand or properly interpret the Report without them. Not all evaluators evaluate all departments. Some evaluators evaluate all or almost all of them, but some evaluate only a few. And, as Healy points out, "higher-ranking departments do not just have higher scores on average, they are also rated more often. This is because respondents may choose to only vote for a few departments, and when they do this they usually choose to evaluate the higher-ranking departments." (His 2006 analysis found approximately the same thing.) That means that, generally speaking, more evaluators evaluated the top departments than the rest of the field, which explains why the confidence intervals for those top-rated departments tended to be narrower than those for everyone else. That is, the size of the confidence intervals is not constant throughout the Report, and so a difference of 0.1 might be meaningful when it involves a top-ten department like Yale but not meaningful when it involves a top-30 department like Virginia.
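(To put some toy numbers on the sample-size point: the width of a 95% confidence interval for a mean shrinks roughly with the square root of the number of evaluators. The standard deviation below, 0.6 rating points, is invented; I have no idea what the real spread looks like.)

    import math

    # Rough half-width of a 95% confidence interval for a mean:
    # about 1.96 * sd / sqrt(n). The sd of 0.6 points is made up.
    sd = 0.6
    for n in (60, 30, 15):  # lots of evaluators vs. only a few
        half_width = 1.96 * sd / math.sqrt(n)
        print(f"{n:3d} evaluators -> roughly +/- {half_width:.2f} points")

On numbers like those, it's roughly plus-or-minus 0.15 for a well-sampled department versus roughly plus-or-minus 0.3 for a thinly sampled one, which is exactly the difference between a 0.1 gap that might mean something and one that almost certainly doesn't.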

Now, I realize that I'm on record as being basically okay with looking to the confidence intervals for the 2006 Report and extrapolating/guessing about what they suggest about this year's edition. But i) I don't think doing that is close to ideal, and I was really looking forward to Healy's analysis of the 2014 data; ii) I think that it's okay to do that only if there's no more recent data available; iii) I realize that the 2006 intervals are only indirectly relevant to the 2014 edition, and don't have any direct implications in any specific case in 2014—just general trends, and then only suggestion, and definitely not anything close to proof; and iv) I'd really, really much rather just have confidence intervals calculated on this year's data—so then, you know, we'd know. (In retrospect, I think I could have been more clear about some of this in my post from last week, and I apologize for any confusion that might have caused.)

I do like that Brogaard, Healy, and Leiter have included these new graphical figures. I think that the histograms and kernel density plots are interesting. I do feel like they help me understand the ratings better. I do. But I don't agree with Leiter's claim that "these visualizations convey the necessary information in a detailed and accessible way." On the contrary. If you are trying to figure out what to make of the fact that (e.g.) UConn's score increased by a margin of 0.4 while MIT fell by 0.3 (which is a slightly smaller margin but takes place much higher up in the rankings), these visualizations are insufficient, and do not convey the necessary information. In order to understand what's going on there, you need confidence intervals calculated on 2014 survey data for each department, because sample sizes differ from department to department and tend to get smaller as you go down the rankings.
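(Again with invented numbers: what you'd want to check is whether a year-over-year change is large compared to the uncertainty in each year's mean. Something like this, where every figure is made up for illustration and the two "departments" are hypothetical:)

    # All of these figures are invented for illustration only.
    # change: year-over-year movement in the mean score
    # half_width: rough half-width of that department's confidence interval
    cases = {
        "well-sampled dept, fell 0.3":   {"change": -0.3, "half_width": 0.10},
        "thinly sampled dept, rose 0.4": {"change": +0.4, "half_width": 0.30},
    }

    for label, c in cases.items():
        # Crude check: is the movement bigger than the combined uncertainty
        # of the two years' estimates (roughly sqrt(2) * half_width)?
        noise = 2 ** 0.5 * c["half_width"]
        verdict = "probably real" if abs(c["change"]) > noise else "could be noise"
        print(f"{label}: change {c['change']:+.1f}, noise ~{noise:.2f} -> {verdict}")

On those toy numbers, the smaller drop at the well-sampled department is the more trustworthy of the two movements, which is the whole point: without the intervals you can't tell which movements are signal and which are noise.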

And so, while I appreciate why they don't want to invite "invidious comparisons" by posting rounded mean scores that are too fine-grained, I think that ultimately this is a misguided reason against calculating confidence intervals or including them in the Report. It seems to me that you need the confidence intervals in order to know which comparisons are invidious. And if past analysis is any guide, there's reason to suspect that differences of one tenth of a point are sometimes at least potentially invidious, and that this margin is more likely to be invidious the further down in the rankings one goes.

In closing, I continue to think that confidence intervals are a vital tool whose absence greatly impairs the PGR's usefulness, and I don't see any good reason not to include them.

Ok. I'm sorry about this. People have been asking in comments for a new thread, and I realize that this was not what you wanted. Last post about the PGR for a while. Promise. Soon I'll put together one of the "interview questions" posts we do every year.

--Mr. Zero


Anonymous said...

Could we also have a new job market post? The other one is at 230 comments.

Anonymous said...

Did it cross anybody else's mind that maybe the reason BL didn't post the confidence intervals was that Zero was right? If they didn't show what Zero predicted they would, posting them would have been a nice way to make him look stupid. It's not like BL to pass up an opportunity like that.

Anonymous said...

Does anyone know what precipitated Leiter's attack on McAffee this morning? Seems weirdly out of the blue. And what reason is there for thinking that chrisclare is a pseudonym for McAffee?

Anonymous said...

An observation: The person who thinks and acts the way Leiter interprets McAffee as thinking and acting isn't McAffee. It's Leiter.

As with all things in Leiter's universe, Leiter's post "about" McAffee is really a post about Leiter. (I.e., he really is the malicious infant he accuses others of being.)

Anonymous said...

Noelle, do you really think that helps?

Anonymous said...

It doesn't matter who said it. And it doesn't matter whether it 'helps.' What matters is whether or not it's true.

Strikes me as true.

Anonymous said...

I love reading BL's twitter. Anytime he calls someone oblivious or self-deceived, or attributes ulterior motives to them, it's pretty much guaranteed that he's projecting. It makes me miss the Leiter Reports parody twitter.

Anonymous said...

Yes. Leiter's brutality is obvious. And I think the vast majority of people in the profession recognize it as such.

God only knows what's going on with those who don't.