Thursday, December 4, 2014

Did The "Big Movers" of the 2014 PGR Actually Move? (No)

After the most recent PGR survey closed, Leiter posted some data about which departments improved most in the rankings; that is, which departments' ordinal rank increased most in comparison with the next-most-recent ranking, from 2011. But because the "data" is presented only in terms of ordinal rank, the apparent size of these moves is highly misleading: almost all of them rest on trivial differences in mean numerical scores. You can see this when you compare the mean scores for the 2014 survey (reported here for the top 20 and here for the rest of the top 50) with the mean scores as reported in the 2011 version of the Report.

According to Leiter, the biggest movers of 2014 are the following, along with their numerical scores from both the 2011 and 2014 versions of the Report (I omit Saint Louis University, which was not evaluated in 2011):

Yale University (from #7 to #5, occupying that spot by itself)
Yale 2011 mean score: 4.0
Yale 2014 mean score: 4.1 
University of Southern California (from #11 to #8, tied with Stanford)
USC 2011 mean score: 3.8
USC 2014 mean score: 3.9 
University of California at Berkeley (from #14 to #10, tied with others)
Berkeley 2011 mean score: 3.7
Berkeley 2014 mean score: 3.8 
University of California at Irvine (from #29 to #24, tied with others)
UCI 2011 mean score: 3.0
UCI 2014 mean score: 3.0 
Washington University in St. Louis (from #31 to #24, tied with others)
Wash U 2011 mean score: 2.9
Wash U 2014 mean score: 3.0 
University of Virginia (from #37 to #31, tied with others)
UVA 2011 mean score: 2.7
UVA 2014 mean score: 2.8 
University of Connecticut, Storrs (from #50 to #37, tied with others)
UConn 2011 mean score: 2.3
UConn 2014 mean score: 2.7 
Of the "big movers" that were included in the 2011 survey, only UConn's mean score has significantly improved. All of the others improved by a trivial margin of 0.1, except the University of California at Irvine, whose mean score stayed exactly the same.

The bulk of the rankings are densely packed and ties are common, which means that apparently substantial jumps in ordinal rank can be caused by disproportionately negligible changes in mean evaluator score, or, in the case of UC Irvine, by no change whatsoever. In the case of UCI, what actually happened was this: Indiana and Duke fell from 3.1 to 3.0, UMass and Ohio State fell from 3.1 to 2.9, and Colorado fell from 3.1 to 2.8. None of these departments changed by very much—two by 0.1, two by 0.2, and one by 0.3 (Leiter suggests that differences of 0.4 or less are unimportant)—but it was enough to cause UCI to jump five spots and create the illusion of a substantial improvement.
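
To make the mechanism concrete, here is a minimal sketch in Python of how an ordinal rank can improve even though a department's own score never changes. The rounded means are the ones quoted above, but only these six departments are included, so the printed rank numbers are illustrative rather than the Report's actual positions.

    def competition_ranks(scores):
        # "1224"-style competition ranking: tied scores share a rank,
        # a common convention for ranked lists with ties.
        ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        ranks, prev_score, prev_rank = {}, None, 0
        for i, (dept, score) in enumerate(ordered, start=1):
            rank = prev_rank if score == prev_score else i
            ranks[dept] = rank
            prev_score, prev_rank = score, rank
        return ranks

    means_2011 = {"Indiana": 3.1, "Duke": 3.1, "UMass": 3.1,
                  "Ohio State": 3.1, "Colorado": 3.1, "UC Irvine": 3.0}
    means_2014 = {"Indiana": 3.0, "Duke": 3.0, "UMass": 2.9,
                  "Ohio State": 2.9, "Colorado": 2.8, "UC Irvine": 3.0}

    print(competition_ranks(means_2011)["UC Irvine"])  # 6 -- below five departments at 3.1
    print(competition_ranks(means_2014)["UC Irvine"])  # 1 -- tied at the top of this toy group

UC Irvine's score is identical in both dictionaries; its rank changes only because the entries above it slipped.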

Kieran Healy's analysis of the 2006 PGR data showed that "in many cases" differences of 0.1 were "probably not all that meaningful." As far as I'm aware, this is the only time this kind of analysis has ever been performed on Leiter's data, and although Leiter says Healy will be calculating confidence intervals for the 2014 edition, those calculations are unfortunately not yet available. But on the assumption that the 2014 numbers behave like their counterparts from 2006, there is reason to doubt whether these differences of 0.1 or less represent actual differences, which means that in all but one of the cases Leiter singled out as "big movers," the 2014 survey didn't measure any movement at all.
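
For readers who want a sense of what such a check involves, here is a minimal sketch. The evaluator count and standard deviation below are hypothetical placeholders (the Report publishes only rounded means, so the real values would have to come from Healy's raw data); nothing here is a claim about the actual 2014 numbers.

    import math

    def ci_halfwidth(sd, n, z=1.96):
        # Approximate 95% confidence-interval half-width for a mean rating.
        return z * sd / math.sqrt(n)

    n, sd = 100, 0.8                       # assumed values, not taken from the Report
    print(round(ci_halfwidth(sd, n), 2))   # ~0.16 under these assumptions

Under those assumptions a department's mean carries an uncertainty of roughly plus or minus 0.16, so two rounded means that differ by 0.1 have heavily overlapping intervals, and the survey hasn't shown that the underlying evaluations differ. Whether the 2014 data actually look like this is exactly what Healy's forthcoming calculations should tell us.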

And so, as I have said before, there is a general problem with this kind of ordinal scale: it fails to accurately represent the differences between ranked departments. As another example, the most recent data has NYU as the best-ranked department with a mean score of 4.8, better than #6-ranked Harvard and Pittsburgh by a margin of 0.8. That same interval of 0.8 also separates the #6 departments from UC San Diego, which comes in at #23. I, for one, find it impossible to look at the PGR and see these differences accurately. To my eye, the way the information is presented significantly understates the difference between NYU and Harvard/Pitt, and dramatically overstates the difference between Harvard/Pitt and UCSD.

Finally, I should say that I was glad to read that Kieran Healy will be calculating confidence intervals this time around. I think that information would be helpful. However, I bristle a little bit at the attribution of this idea to a session at the 2013 Central Division APA meeting; I raised this idea in 2009.

 --Mr. Zero

45 comments:

Anonymous said...

What, exactly, about the PGR does this analysis reveal to be misleading? You're right that often the difference between 5 or even 10 ordinal spots might be only a negligible difference in mean score, but nobody is suggesting otherwise. Leiter himself, as you point out, says that small differences in mean score aren't significant, and he puts the mean scores right there on the list for anyone to see.

So, anyone who spends even a tiny amount of time looking at how the rankings are generated will recognize that there's not much of a difference between (for example) #15 and #20.

Given that any non-stupid person can recognize this, I don't see why it constitutes an objection to ranking in general.


Mr. Zero said...

What, exactly, about the PGR does this analysis reveal to be misleading?

The ordinal rankings.

You're right that often the difference between 5 or even 10 ordinal spots might be only a negligible difference in mean score, but nobody is suggesting otherwise.

Of the seven departments that Leiter cites as big movers for which a comparison with the 2011 edition of the report is possible, six did not exhibit a measurable movement.

Of course, it's possible that Healy's confidence intervals will show that a 0.1 difference is generally meaningful this time around. If that happens, I'll post a revision/retraction. But in light of the fact that the only such analysis that has ever been done on any edition of the PGR shows that a 0.1 difference is not normally meaningful, it seems pretty misleading to hold up a group of departments that mostly only moved by 0.1 as "big movers." Based on the available information, departments that have moved by 0.1 are not movers at all.

Given that any non-stupid person can recognize this...

I didn't say that it was hard to recognize. But the fact that the non-movement of the six non-moving big movers is easy to recognize makes it more egregious, not less.

I don't see why it constitutes an objection to ranking in general.

I also didn't say I was objecting to ranking in general. I am objecting to an unnecessarily (and obviously) misleading way of presenting and organizing the information.

Anonymous said...

What is misleading are:
(1) talk about "big movers"
(2) talk about "top 10" or "top 20" departments

It is misleading because this talk conversationally implies that the relevant categories track significant differences of quality, whereas they really track only extremely trivial numerical differences.

Anonymous said...

But everything good about the PGR is so obvious! And all the haters are so obviously wrong!

Anonymous said...

You also have to look at median and mode, not just the rounded mean, and there were more significant changes there.

DH said...

"You also have to look at median and mode, not just the rounded mean, and there were more significant changes there."

Except the "big movers" were determined by the change in their ordinal rank, which was determined by their mean scores. So this is a bit beside the point of the post.

Anonymous said...

Healy said a 0.1 difference is "probably not all that meaningful."
But you say “0.1 difference is not normally meaningful.” And you also call this difference “not measurable.”
What you are saying is obviously not what Healy said. Can you explain the reasoning you used to come to your conclusion that the difference is not measurable and not meaningful?

Anonymous said...

I am puzzled about why it is inaccurate to say these schools had the largest changes in ordinal rank (what Leiter obviously meant by "biggest movers"). Surely this is true, and surely ordinal Leiter rank, if it measures anything, measures perceptions of relative quality. So a program's mean could decrease and yet others decrease more, so that it would not be the least bit misleading to say the program's rank increased. And that would be interesting, insofar as you cared about relative ranking.

Anonymous said...

I am no fan of the overall rankings, but I think that you are overlooking something. A score of 3.0 on one iteration of the report does not mean the same as a score of 3.0 on another report. After all, there are different evaluators in 2014 than 2011. In fact, one might think that the only way that one can determine whether a sustained 3.0 score is an improvement or not is by comparing it to the variations in other scores in the ballpark between the two surveys.

That said, the point about ordinal rankings is well stated. Whereas an increase of .3 in Princeton's score would mean no change to its ordinal ranking (besides breaking a tie), an increase of .3 improves Georgetown's ordinal ranking by 13 places. Focusing on the ordinal rankings can, hence, be misleading (or at least, not the whole story).

David said...

Hang on. Healy's discussing the meaningfulness of differences *within* a given ranking exercise, not *between* ranking exercises.

(It doesn't sound implausible that they're fairly similar, but that's based on the guess that the mark distributions in 2011 and 2014 look about the same - which might or might not be true.)

Anonymous said...

You've misrepresented Healy, as others noted, so on what basis have you decided that a .1 increase is not significant? And doesn't it really depend on the distribution of other scores, which may have been different in 2014 than in 2011?

Anonymous said...

I, too, was glad to see that Kieran Healy will be calculating confidence intervals this time around. But a) why didn't he think to calculate confidence intervals before? The report is 25 years old and has been using approximately its current, survey-based methodology for 15. Why haven't confidence intervals been a part of the report for a long time? b) why doesn't he just calculate his own confidence intervals? If he's competent to be the lead researcher on this project, then he's competent to calculate confidence intervals. (Or, to put it another way, if he's incapable of calculating his own confidence intervals, then he's not competent to be the lead researcher on this project.) c) relatedly, if he's not capable of calculating confidence intervals, he's not qualified to have worthwhile opinions as to the methodological soundness of the PGR, on a couple of levels. i) if he lacks the technical expertise to perform the calculations, he most likely also lacks the technical expertise to evaluate the methodology--this is Dunning-Kruger territory. ii) if he hasn't calculated the confidence intervals, he doesn't know how seriously to take his results--that's what a confidence interval is for.

Mr. Zero said...

Can you explain the reasoning you used to come to your conclusion that the difference is not measurable and not meaningful?

I was just thinking about what confidence intervals are. If the nominal difference is smaller than the margin of error implied by the confidence interval, then the survey didn't measure a difference.

I am puzzled about why it is inaccurate to say these schools had the largest changes in ordinal rank (what Leiter obviously meant by "biggest movers"). Surely this is true, and surely ordinal Leiter rank, if it measures anything, measures perceptions of relative quality.

Not if the numerical scale that the ordinal ranking is based on is too fine-grained. If the numerical scale outpaces the confidence intervals, then nominal differences will not necessarily track actual differences.

A score of 3.0 on one iteration of the report does not mean the same as a score of 3.0 on another report.

No, the scale is the same on both editions of the survey. As far as I know, Leiter has used the same scale on every edition of the PGR that has utilized a survey-based procedure, and '3' means "strong."

Healy's discussing the meaningfulness of differences *within* a given ranking exercise, not *between* ranking exercises.

Yeah, but the scale of measurement and other details of the survey procedures are the same between this and prior editions of the report.

Anonymous said...

Ahh, the narcissistic obsession with the rankings themselves rather than their content. Who cares? Can 13 people adequately assess the complexity of metaethics itself? I mean really. Now we're worried about what the increase in ordinal ranking really reveals, yet we don't systematically question how silly it is that in 1989 a law student applying to graduate school thought that a pedigree ranking should exist for philosophy as it does for law schools...

Wow is all I can say.

Anonymous said...

I think there's a domain problem here. If the PGR were a methodologically sound survey, then we should absolutely want to see confidence intervals and robustness analyses and all the other statistical bells and whistles to help us make sense of the data.

Problem is, that's not the PGR. It is another beast entirely, only loosely dressed up to look like a legit survey. In which case, asking for confidence intervals is largely beside the point. We're not in the domain of sound statistical methods here.

Nevertheless, even on its own terms (which--to paraphrase the first comment--no non-stupid person should accept, but we can still have some fun with it) Mr. Zero is right to be troubled by this seemingly insignificant numerical difference between "top" schools and "other" schools and between "big movers" and "non-movers".

In other words, even if you ignore the fact that this is a flawed survey based on an obviously biased sample of the profession, it is still hard to defend a mean difference of 0.1 as tracking an important difference in program quality.

So here's a good homework assignment: Work out what it would have taken for Princeton to steal the #1 spot from NYU. They needed just 0.4 more in their average and NYU just 0.1 less (or choose whichever numerical scenario you like). Since we know that there are no well-defined criteria provided to evaluators, we are basically unrestricted in the plausible explanations for any change in scores. Maybe a small proportion of raters change their scores for those schools by a lot, or a large set all change them by a little, or some combination of the two. Some of them were having a bad day and decided "Screw NYU!" the day they turned in their evaluations and knocked off a point. Who knows?

Fundamentally the point is that it's all up in the air with the PGR. It asks a small set of people in philosophy about their perceptions. It is helped by the fact that it more-or-less tracks/re-affirms what a larger set of people in philosophy already think. This does not make it a sound quasi-experiment.

Anonymous said...

8:51,

The only way to know if you are winning is to check the scoreboard. When no scoreboard exists, you invent one.

It will remain important because those who are winning want to keep winning. And those who are losing want to know where they stand so they can improve.

Anonymous said...

If you have time to worry about this bullshit, then your life is lacking something.

Mr. Zero said...

Leiter's attempt to attribute the changes in the rankings to changes in departmental composition misses the point. The point is not that the changes are inexplicable given the available information. The point is that, given the results of the only attempt that has ever been made to study the sensitivity of his survey, and bracketing the question of whether the survey is sufficiently well-designed as to qualify as measuring anything, the survey does not seem to have measured any movement in most of the cases he cites as "big moves."

Anonymous said...

On what basis do you say going up .1 is not a big change? How hard or easy is it to do that? Did any departments go up .1 that hadn't added senior faculty? The PGR says students should not choose among departments that are not far apart (.3? .4?), but that's a different point, isn't it?

Anonymous said...

If you have time to worry about this bullshit, then your life is lacking something.

Whether its creator intended this side effect or not, this "bullshit" largely determines hiring practices, and sets the community's understanding of the (pedigree) value of our degrees. If you have the luxury of not caring about this "bullshit", then you must have a pretty great life. Those of us outside the ranked programs don't have the luxury of sharing your apathy.

Anonymous said...

the survey does not seem to have measured any movement in most of the cases he cites as "big moves."

Once again you claim the differences are not measurable. You defended this above by saying "I was just thinking about what confidence intervals are." But you don't know what the confidence intervals are. So I don't think you really have any defense.


Anonymous said...

"this "bullshit" largely determines hiring practices"

what evidence do you have for this claim?

Mr. Zero said...

But you don't know what the confidence intervals are. So I don't think you really have any defense.

It's inconclusive, but the PGR has been studied before. It's not like in 2009, when I complained that all of the statistical data you'd need to interpret the Report was totally unavailable. We don't have the 2014 confidence intervals, but of course we've never had confidence intervals for a contemporary edition of the report. And in the absence of current data, we can look to past data to get a sense for what kinds of changes are likely to be meaningful. Subject, of course, to change when current data becomes available.

Or, look at it like this: given that past analysis has shown that nominal differences of 0.1 have often not been meaningful, what evidence is there that such nominal differences will turn out to be generally meaningful in 2014?

Anonymous said...

What past analysis showed a .1 difference wasn't meaningful? Can you cite it please?

Mr. Zero said...

What past analysis showed a .1 difference wasn't meaningful? Can you cite it please?

I did cite it. The Healy analysis I link to in the original post shows that differences of 0.1 are often not meaningful, and to my knowledge is the only such analysis that has ever been conducted.

Anonymous said...

That's not what Healy said--you need to read the sequence of posts. And the claim was specific to 2006, not 2009 or 2011 or 2014. So you are just making things up.

Mr. Zero said...

That's not what Healy said

It's what the chart containing the confidence intervals mostly reveals. Not in every case, of course, and there's more certainty at the top end of the rankings. But nevertheless.

--you need to read the sequence of posts.

I read them, but I must have missed the passage you're thinking of. Where else do you have him addressing the sensitivity of the overall rounded mean scores?

And the claim was specific to 2006, not 2009 or 2011 or 2014.

I didn't claim otherwise. But on your view, what was the point of analyzing the 2006 data in 2012?

So you are just making things up.

I don't agree. I'm looking to the only data that has ever been generated about the precision of the PGR's rounded mean scores, in order to see if it supports Leiter's claims about it. It mostly doesn't.

As I've said several times, I don't regard this as definitive, and I am looking forward to the 2014 data, and I will revise/retract the claims of this post if that's necessary. We'll see, I guess.

Anonymous said...

@9:42, @6:37: Carolyn Dicey Jennings says this in a comment on her post about the data-gathering she was able to do: "As to the point about prestige, I have been looking at 1) the correlation between total placed graduates and PGR mean rating and rank, 2) the correlation between placement rate and PGR mean rating and rank, and 3) the correlation between tenure-track placement rate and PGR mean rating and rank. So far it looks as though correlation decreases at each of these steps. It is very strong for 1 (.75 and .71 respectively), strong for 2 (.70 and .58), and moderate to strong for 3 (.66 and .52). I will do a posting about this at some point, hopefully when I have more complete data. Departments differ as to whether graduates achieve a balance of postdoctoral and tenure-track positions, or more of one or the other. This information will be provided in the Excel spreadsheet that I will make available in the coming weeks. This is about as close as I come to being able to answer your question right now." http://www.newappsblog.com/2014/06/job-placement-2011-2014-overall-placement-rate.html#more

Anonymous said...

Or, look at it like this: given that past analysis has shown that nominal differences of 0.1 have often not been meaningful, what evidence is there that such nominal differences will turn out to be generally meaningful in 2014?

I don't know.
So, the responsible thing to say is that we don't know whether the changes the 'movers' showed are within the confidence interval for the scores.

Mr. Zero said...

So, the responsible thing to say is that we don't know whether the changes the 'movers' showed are within the confidence interval for the scores.

I think that's a fair point.

Anonymous said...

Wait, so, is he doing confidence intervals now?

Mr. Zero said...

Yeah, I'm not sure how to interpret that post. And I'm not sure I see what the problem would be with posting the unrounded raw scores along with Healy's confidence intervals. To me it seems like that would be the thing to do if you wanted to convey accurate and useful information.

Anonymous said...

Well, that assumes that BL's intention is merely to provide useful information. But people stopped believing that a while ago.

Anonymous said...

I don't understand his reasoning. The first issue seems to be that there is more than one reliable way to do it, so they're not doing it. The second issue seems to be that it is inaccurate to perform the calculations and then round them off, and this conflicts with their long-standing commitment to rounding off scores to one decimal point.

Can someone explain wtf he's talking about?

Anonymous said...

Can we talk about the fact that the evaluators for the specialty rankings frequently include lots of people who don't even work in those areas?

Anonymous said...

1:36,

I noticed that too.

Anonymous said...

Example?

Anonymous said...

I'm surprised at how few reviewers there are in some of the specialties--esp. feminism, race, pragmatism. Not sure if this speaks to specialists' unwillingness to serve or an attempt to control the narrative.

Anonymous said...

"Whether it's creator intended this side effect or not, this "bullshit" largely determines hiring practices, and sets the community understanding of the (pedigree) value of our degrees."

Yes--and isn't a problem that such a methodologically-flawed instrument has such power and influence?

Anonymous said...

Example: Michael Tye in philosophical logic.

Anonymous said...

Ok, I get that Brian Leiter has been trying to 'control the message' about the PGR for the last few months. But his constant posting of various anonymous people thanking him for his rankings and his services, along with his constant ruminations about the 'smear campaign' against him, are really starting to look pretty pathetic. I get that he is having a hard time letting go. But, c'mon!

Anonymous said...

For better or worse, it is clear that the most powerful and influential in our profession remain committed to the PGR. What they are no longer committed to is Brian Leiter--and for him this is a fate worse than death.

Anonymous said...

I'm no fan of Leiter or the PGR. But the evidence is consistent with Leiter's hypothesis that changes to departmental composition are reflected in the rankings. Note that a couple of data points consistent with Leiter's hypothesis are omitted by Leiter himself.

MIT dropped from 7 to 13 (on the US Ranking), and this reflects some severe staff losses (consistent with Stalnaker's phased retirement and the loss of Langton).

ANU dropped (on the English Speaking World (ESW) ranking) from around 15 to 24 (consistent with the loss of Chalmers).

Edinburgh moved (on the ESW ranking) from the mid-40s to shared 29. Similarly, 8 to 4 on the UK ranking (consistent with their hiring spree).

The main pattern is that moves are few and far between. But faculty composition does appear to play a major role in the visible moves in both directions.

Of course, it is not the only factor: E.g., Boulder (US Ranking) dropped from 24 to 31 (attributable to climate rather than faculty changes).

Anonymous said...

9.21

"It will remain important because those who are winning want to keep winning."

And luckily those who are "winning" get to define what constitutes winning.

Great work by the discipline as always.
