Many of you may be aware that a leading psychology journal is set to publish an article that allegedly supports the thesis that ESP exists.

It has set off a debate in the scientific community about data analysis. See the link below.

The main issue is that some statisticians have long been critical of a commonly accepted methodology that they think is too permissive when determining statistical significance with small sample sizes. They are now rejoicing that this ESP paper is going through, because they think it will shed light on what they have long regarded as a widely accepted but bad methodology.

I think this has some bearing on the recent experimental philosophy (xphi) movement. I’ve expressed concerns here about some very popular xphi papers that purport to show that cultural background influences intuitions about philosophical thought experiments. My main concern was the incredibly small sample sizes and whether the reported differences were genuinely statistically significant. I was assured by some active xphi philosophers that these sample sizes, plus the standards for determining statistical significance, are widely accepted, and so I took them at their word.

Note, however, that these statisticians are criticizing precisely this fact. They grant that the method is widely accepted and claim that it is bad that this is the case. Also note that they have as one of their targets certain kinds of social science studies with an xphi flavor – namely, studies of what sorts of factors might bias behavior.

My question is: if much of current xphi is modeling itself after these social science methods, is xphi as currently practiced in trouble?

Update: Here’s one place where I raised concern about the sample size of a popular xPhi paper

12 Responses to “ESP and Xphi”

  1. KT

    Good question–I wish I knew the answer. This might be worth bringing up on the x-phi blog to see what folks over there say.

  2. Jeff Maynes

    There are some important issues that this raises. The ICCI blog had a post the other day on some related issues, and the New Yorker had a piece as well. In a sentence: there is a concern that many really exciting results turn out not to be replicable. One possible explanation is that the incidence of results caused by chance is still too high with a 0.05 p-value, and these “sexy” results are more likely to get published than failed attempts at replication.
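    To see how those two factors interact, here is a quick simulation with made-up numbers (nothing from the actual studies): if many labs test effects that are in fact null at a .05 threshold, about 5% will come up “significant” by chance, and a publication process that favors positive results selects for exactly those.

```python
import math
import random

random.seed(1)

def null_experiment(n=20):
    """Simulate one study comparing two groups drawn from the SAME
    distribution, i.e. there is no real effect to find."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    # z-test on the difference of means (sd known to be 1, so se = sqrt(2/n))
    z = (sum(a) / n - sum(b) / n) / math.sqrt(2 / n)
    # two-sided p-value from the standard normal
    return math.erfc(abs(z) / math.sqrt(2))

trials = 10_000
false_positives = sum(null_experiment() < 0.05 for _ in range(trials))
print(f"fraction of null studies reaching p < .05: {false_positives / trials:.3f}")  # ≈ 0.05
```

    If only that significant 5% reaches print, the literature looks far more exciting than the underlying reality.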

    There is also some concern from other corners of psychology about the samples used; there was quite a splash with Henrich’s “WEIRD” paper last year in BBS about the types of people being sampled in social science and psychology (middle-class, white, American college students).

    All that said, these are important issues that ought to be addressed, and they do matter for the X-Phi results we’ve got, and work in the future. Indeed, that’s precisely the result that an X-Phi advocate *wants*. The guiding insight (or reputed insight for those less sympathetic) is that traditional philosophy is making empirical claims which are best studied using the best empirical methods we have available – those developed by the relevant sciences. Good X-Phi work is going to be that which rigorously adheres to those standards, and if the standards change, then what counts as good X-Phi practice should change with them.

  3. Dennis Whitcomb

    I have an idea! Let’s do a survey to see what the folk think counts as a significant sample. Then, based on what we find, we can resolve the issue.

    (OK, that was cheap. So, even cheaper: let’s just consult our own intuitions about what counts as significant!)

  4. James Beebe

    Andy, I don’t think this development points to anything too significant for the prospects of X-Phi. It is recognized in the social sciences that statistical significance does not equal theoretical significance. Some X-Philes have been too quick to see theoretical significance where they have not done enough to establish that they have anything beyond mere statistical significance.

    For example, in the Weinberg et al. article you allude to, they found a statistically significant difference between the responses of high and low socioeconomic status participants. Is that theoretically significant? Weinberg, Nichols and Stich no longer think so, although they did initially. However, it was pointed out that low SES subjects responded in a way that was not distinguishably different from blind guessing, as if they had not properly understood the task. Once it was pointed out that Weinberg et al. had done nothing to rule out this alternative explanation of their data, they immediately stopped endorsing it.

    I think something similar is true of their east-west differences. Was there a statistically significant difference between eastern and western responses? Yes. Is this theoretically significant? I don’t think theoretical significance has been established because, again, readily available alternative explanations have not been addressed or ruled out. For example, east Asians answered in a way consistent with their proclivity for helping others to save face. There are cultural norms governing what one can and cannot say in a social situation (and filling out a questionnaire has been shown to be a genuine social situation/interaction). Just because east Asians say different things publicly doesn’t mean they have fundamentally different intuitions or conceptions of knowledge.

    So, I don’t think the statistics used in X-Phi need to be changed. Rather, more attention needs to be paid to the issue of whether a statistically significant difference really counts as theoretically significant.

  5. Josh May

    Ditto what James said. I was going to say something similar…

    But I’ll add that the issues with the early x-phi studies you allude to seem quite different from the problems sparked by the ESP study. Granted, some of the same issues apply to some x-phi (often using only a p-value of .05 for significance, tending to publish only positive results, not always paying enough attention to replication, etc.). But these are not always or inherently problematic, and many of the issues flagged by this rising debate in the scientific community are ones that empirical researchers have been well aware of for some time. The idea is just that there is more explicit focus on them now.

    Moreover, much of the popular coverage of the issue seems to downplay this. It doesn’t highlight that researchers have been aware of the issues and often try to avoid them. Sometimes, for example, referees demand a lower p-value than 0.05; and over time people come to value replication and discount results that haven’t been replicated (and so on).

    So, to echo some of what James and Jeff said: these are interesting issues, but the researchers themselves have been aware of them and do what they can to avoid the problems. And as we learn to be better at avoiding the problems, x-phi will benefit. However, in Andy’s defense, he might say: sure, in the future it might benefit, but I’m worried that we now have evidence that current and past work is bunk! Again, I think the right response is: some of it is, some of it isn’t. But we knew that before.

  6. Andrew Cullison

    Thanks for all these comments everyone.

    @James…I am not worried about the prospects for Xphi per se; I’m worried that Xphi as it is now actually practiced by a lot of xPhiles might be guilty of what some of these guys are trying to call attention to.

    Here’s a way to refine my question. Here are some positions you could take with respect to Xphi as currently practiced.

    1. Much of current xPhi is in the clear because the standards in xPhi are way more stringent than the standards of the social sciences that the guy in the article is targeting. But that’s not true, is it?

    2. Current xPhi uses the standards this guy is critical of and is doing something bad because this guy is right that it’s bad to use these standards.

    3. Current xPhi uses the standards this guy is critical of and is NOT necessarily doing something bad because this guy is wrong. It’s not bad (or always bad) to use these standards.

    “standards” (in all three options) = using the .05 significance measure on whatever sample sizes this guy thinks are way too small

    Those don’t exhaust the logical space, but they seem to be the main three options I could take…which one am I supposed to take with respect to past and much of current xPhi?

    (I’ll grant that future xPhi can avoid this by simply going with stringent standards all the time)

  7. Josh May

    Clearly x-phiers primarily use the methods of the social sciences, so they do often use a .05 threshold for significance. But I didn’t think the ESP study was raising issues in the scientific community just about that. The NYT article you cite does focus on the .05 significance threshold (and testing using significance methods at all), but a focus on that seems rather puzzling to me for several reasons:

    (a) Many people’s tests of significance yield something like p<.0000001 (or whatever). The fact that they are using a .05 threshold just means that they *would* count a result as significant if the p-value *were* .04 (e.g.). But the fact that their p=.0001 (or less) means that it’s in a much better position.

    And oftentimes people pay attention to exact p-values. Jonathan Schaffer and Joshua Knobe, for example, express a qualm about one of our (May et al. 2010) results because it “managed to just squeak past the conventional threshold of statistical significance”.

    Now the people quoted in the NYT article seem to think this entire methodology is bunk because it’s not Bayesian. That may be true, I guess, but I don’t see why. The example given seems to illustrate that higher levels of significance using the current testing methods are less worrisome. But I don’t know exactly what this Bayesian alternative would look like (I’m just currently ignorant of the details). In any case, my sense is that the call to move away from standard tests of significance entirely isn’t an idea most experts would seriously entertain because of this ESP study. Instead, this leads me to the next point…

    (b) There are other more serious worries about methodology, such as insufficient focus on replication over time, publishing only positive sexy results, etc.

    X-phi is just as subject to these worries, of course, but my contention is that these worries need to be discussed (i.e., they are not always obviously bad) and only some studies are subject to them.

    So I think the right position is something like your (3), but “standards” should be widened, and the claim about the guy’s views should be weakened to something like:

    4. Current x-phi uses the standards this guy is critical of but x-phiers are NOT necessarily doing something bad because these standards are not always bad (well maybe they are, but that takes argument, so let’s have that discussion, etc.).

  8. jonathan weinberg

    A couple of quick thoughts:
    (i) I’m not sure that any of these issues really have much to do with _sample sizes_. It’s more about things like testing how one’s results _comparatively_ change the probabilities of different competitor hypotheses; and, along with that, what the prior probabilities on those hypotheses were in the first place. That last part is super-relevant to ESP, the priors on whose existence are of course _very_ low. I don’t think there’s any x-phi result that flies in the face of any entrenched empirical result in anything remotely like the way that the ESP result does. (It’s mostly part of the x-phi point that we are addressing questions for which there’s frankly not very much antecedent empirical reason to expect them to turn out one way or the other — they pertain to claims that we think _need_ to be checked.)

    (ii) I, too, mostly agree with JB’s comment, though I think it’s important to keep in mind that “theoretical significance” is also a function of what your theoretical goals are. So, e.g., in most of my work, I take myself to be trying to give reasons for us to worry about using intuitions as evidence for philosophical claims. As such, pretty much the _last_ thing I want to do is use results like those from “Normativity & Epistemic Intuitions” to claim that different groups really have different “conceptions of knowledge”. I think that most intuitions about weird cases are just noise. For my purposes, then, there are various sorts of explanations of the findings that are unproblematic, which nonetheless would be a problem for people with cultural-relativistic goals.
    (iii) fwiw, I don’t think the “face saving” hypothesis is a very good one here, for our ethnicity results. It doesn’t really fit our data well at all, since it seems (if I understand it correctly) to predict a general bias towards “really knows” in our EA subjects. And there isn’t any, at least not in our data. (Which is not to say that there might not be yet other deflationary explanations that should be considered.)
    (iv) I think I agree with everything in JM’s last comment.

  9. Andrew Cullison

    JM and JW – This is very helpful. I’m going to try and extract a candidate position for xPhiers on a question that is now bugging me and Weinberg has gestured at an answer.

    So it seems like JM and JW agree that the position for xPhiers to adopt is to say something like the following:

    4. Current x-phi uses the standards this guy is critical of but x-phiers are NOT necessarily doing something bad because these standards are not always bad (well maybe they are, but that takes argument, so let’s have that discussion, etc.)

    And now it looks like we’ll have a fun philosophical puzzle for xPhiers. We’re in the classic situation where sometimes X is good and sometimes X is not, and there is some pressure to start articulating a distinction.

    I’ll frame the question in terms of when the standards do (and do not) yield supporting evidence. For example, I suspect a lot of people will want to say that these standards did not yield supporting evidence for ESP. But they did yield supporting evidence for important Xphi conclusions.

    So if we’re going with (4), the new question is: Under what conditions do the standards (that xPhi generally adheres to) yield supporting evidence?

    Interestingly, JW seems to already be thinking about this question and is gesturing at an answer that is cashed out in terms of prior probabilities.

    We can probably get away with trying to articulate just a necessary condition for the present purposes.

    (E) The standards S (that xPhi uses) provide supporting evidence for P only if the prior probability of P isn’t low.

    Then we can say that (E) will at least help us explain why we don’t get evidence for ESP without having to say that xPhi is in trouble.
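    To put rough numbers on (E), here is a quick Bayes-rule sketch. The prior, power, and alpha values are purely illustrative, not taken from any of the studies under discussion:

```python
def posterior(prior, power=0.8, alpha=0.05):
    """P(hypothesis true | significant result), by Bayes' rule.
    power = P(significant | true); alpha = P(significant | false)."""
    return (power * prior) / (power * prior + alpha * (1 - prior))

# ESP-style hypothesis: the prior is so low that a lone significant
# result barely moves us
print(f"{posterior(0.001):.3f}")  # ≈ 0.016
# typical x-phi hypothesis: nothing antecedently implausible
print(f"{posterior(0.5):.3f}")    # ≈ 0.941
```

    On these toy numbers, the very same “significant” result leaves the ESP hypothesis almost certainly false while making the mundane hypothesis very probable, which is roughly what (E) predicts.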

    I’ll have to think more about some of this. I’ll probably come back with more thoughts.

  10. jonathan weinberg

    I’m not sure that this is really the best way to ask the question: “Under what conditions do the standards (that xPhi generally adheres to) yield supporting evidence?” It strikes me as rather too binary. Better, I think, for us all to try to understand what are the various risks of error that different methods are exposed to, and under what circumstances and in what ways those risks can be mitigated (or exacerbated).

    So, in this case, one can say that traditional hypothesis-testing methods are ones that are susceptible to some extent to Type I errors (as of course any statistical reasoning is); and that this risk is something that, with these methods, is exacerbated in situations where the priors of the null hypothesis being false are very, very low; and can be mitigated somewhat by setting a lower threshold level for what will count as a significant p-value.
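    The mitigation point can be made concrete with a toy calculation (illustrative numbers only; power is assumed fixed across thresholds, which is generous): holding a very low prior fixed, tightening the threshold substantially raises the probability that a result clearing it reflects a real effect.

```python
def post_given_sig(prior, alpha, power=0.8):
    """P(effect is real | result clears threshold alpha), by Bayes' rule."""
    return power * prior / (power * prior + alpha * (1 - prior))

# a hypothesis with a very low prior, tested at successively stricter thresholds
for alpha in (0.05, 0.01, 0.001):
    print(f"alpha = {alpha}: P(real | significant) ≈ {post_given_sig(0.001, alpha):.2f}")
```

    The stricter the threshold, the less often noise alone clears it, so surviving results carry more evidential weight even against an unfavorable prior.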

  11. Jonathan Livengood


    You are misrepresenting the issue(s) surrounding the ESP paper. The ESP paper does not make use of small samples. In fact, it has very large samples (several hundred subjects in most cases).

    The main problems with the ESP paper are (1) that for most of us, our priors for ESP being real are so low that getting significant results probably shouldn’t make us very likely to believe that ESP is real, and (2) that the *effect sizes* being reported are very small, so it is unclear whether there is really an effect at all. For the second point, I refer you to the excellent discussion by Andrew Gelman.

    As a general rule, when effect sizes are pretty large, classical significance tests are just fine. Almost all interesting x-phi work reports modest to large effect sizes. Moreover, most interesting x-phi work has been extensively reproduced. And as Jonathan Weinberg points out, the priors on results in x-phi are typically not as small as in the case of ESP.
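    The contrast between effect size and bare significance can be sketched with a normal-approximation calculation (the effect sizes and sample sizes here are hypothetical, chosen only for illustration):

```python
import math

def two_sided_p(d, n_per_group):
    """Approximate two-sided p-value for a standardized mean difference d
    between two equal-sized groups (normal approximation, known variance)."""
    z = d * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))

# tiny effect, huge sample: statistically significant, arguably not meaningful
print(f"{two_sided_p(0.05, 5000):.3f}")  # ≈ 0.012
# large effect, modest sample: significant and substantial
print(f"{two_sided_p(0.8, 30):.3f}")     # ≈ 0.002
```

    With enough subjects, even a trivially small difference crosses the .05 line, which is why the reported effect size, and not the p-value alone, has to carry the argument.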

    Also, I should note that there are several ways of incorporating prior information into a statistical analysis. Bayesian updating is just one of them. In several x-phi papers, you will find a power analysis, which is a classical alternative for making use of prior commitments.
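    For instance, here is what a simple classical power analysis looks like under a normal approximation (the effect sizes are illustrative, not drawn from any particular x-phi study): you commit in advance to an expected effect size and solve for the sample needed to detect it.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Per-group sample size for a two-group comparison with standardized
    effect size d (normal approximation to the two-sample test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for two-sided alpha
    z_power = z.inv_cdf(power)           # value securing the desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

print(n_per_group(0.8))  # large effect: 25 per group
print(n_per_group(0.5))  # medium effect: 63 per group
print(n_per_group(0.1))  # tiny, ESP-sized effect: 1570 per group
```

    The prior commitment enters through d: positing a large effect licenses a small study, while a tiny hypothesized effect demands an enormous one.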

  12. Jonathan Livengood

    Sorry, that first parenthetical should be “more than one hundred subjects” — still very large samples as far as these things go.
