Friday, March 12, 2010

On the Improbability of One-tailed Hypothesis Tests

One-tailed hypothesis tests are fairly popular in ecology and evolution. For instance, an article by Lombardi & Hurlbert (2009) reported that 13% and 24% of "articles with data susceptible to one-tailed tests" used such tests in two recent journal years. A similar review by Ruxton & Neuhäuser (2010) found that 5% of all articles published in 2008 in the journal Ecology used at least one one-tailed test, although they did not assess "susceptibility" (i.e., many of the articles not using a one-tailed test may not have had data appropriate to such a test).

One-tailed hypothesis tests are popular in large part because they provide increased power to reject the null hypothesis when it is false. The lower panel of the figure at right shows the mean absolute value of t expected for a real (but small) difference in means between populations A and B, for various equal sample sizes of A and B. What it reveals is that the sample size required to reject a two-tailed (rather than a one-tailed) null hypothesis is, on average, about 50% larger, which could be expensive and time-consuming if data are difficult to obtain.
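
To make the power advantage concrete, here is a minimal simulation sketch (not the analysis behind the figure) comparing rejection rates of one- and two-tailed two-sample t-tests. It assumes normally distributed data, a true mean difference of 0.5 standard deviations, and alpha = 0.05; all values are illustrative, and the alternative argument requires scipy >= 1.6.

    # Minimal sketch: power of one- vs. two-tailed two-sample t-tests by simulation.
    # All settings (effect size, alpha, sample sizes) are illustrative assumptions.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    effect, alpha, nsim = 0.5, 0.05, 2000

    for n in (20, 40, 60, 80):                 # equal sample sizes for A and B
        reject_one = reject_two = 0
        for _ in range(nsim):
            a = rng.normal(0.0, 1.0, n)        # population A
            b = rng.normal(effect, 1.0, n)     # population B, shifted by the true effect
            reject_two += stats.ttest_ind(b, a).pvalue < alpha
            reject_one += stats.ttest_ind(b, a, alternative='greater').pvalue < alpha
        print(f"n = {n}: one-tailed power = {reject_one/nsim:.2f}, "
              f"two-tailed power = {reject_two/nsim:.2f}")

At any given sample size the one-tailed test rejects the (false) null more often; to reach comparable power, the two-tailed test needs a larger sample, which is the pattern shown in the figure.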

However, a number of articles have questioned the general appropriateness of one-tailed tests. For instance, Lombardi & Hurlbert (2009) conclude that "all uses of one-tailed tests in the journals surveyed seemed invalid." Ruxton & Neuhäuser (2010) were a little more generous, but they concluded that of the 17 papers using a one-tailed test, only one gave appropriate justification for doing so.

The problem arises from an apparently widespread belief among ecologists and evolutionary biologists that any a priori hypothesis regarding the direction of the outcome of our statistical test is sufficient grounds to justify a one-tailed null hypothesis. This is not true, but Lombardi & Hurlbert (2009) conclude that the misperception is understandable, documenting bad or confusing advice regarding the application of one-tailed hypothesis tests in 40 of 52 popular statistics texts (Lombardi & Hurlbert 2009, Supplement).

In fact, a one-tailed hypothesis test is only appropriate if a large effect in the opposite direction of our a priori prediction is exactly as interesting, and will result in exactly the same action, as a small, non-significant result in the predicted direction. Both articles point out some very restrictive circumstances in which this might be true. (For instance, in an FDA trial of a new headache drug, no positive effect and a large negative effect on the pain of test subjects both lead to the same action: no approval for the drug.) However, in ecology and evolution it is quite hard to imagine circumstances in which a large, significant result in the opposite direction of that predicted by theory could easily be ignored.

Of course, there are many statistical tests (lots of them common among evolutionary biologists) to which the concept of "tailedness" doesn't really apply. For instance, we are not usually interested in whether our data fit our a priori model better than expected in a goodness-of-fit test (although perhaps we should be).
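
If one did want to ask whether the data fit the model better than expected by chance, the lower tail of the goodness-of-fit statistic is the place to look. Here is a minimal sketch with hypothetical counts, assuming a simple chi-square goodness-of-fit test (not an analysis from either of the papers discussed here):

    # Hypothetical counts under a model of equal expected frequencies (illustration only).
    from scipy import stats

    observed = [23, 27, 26, 24]
    expected = [25, 25, 25, 25]

    chi2, p_upper = stats.chisquare(observed, f_exp=expected)  # usual upper-tail p-value
    df = len(observed) - 1
    p_lower = stats.chi2.cdf(chi2, df)   # probability of a fit this good or better by chance

    print(f"chi-square = {chi2:.2f}, upper-tail p = {p_upper:.3f}, lower-tail p = {p_lower:.3f}")

A very small lower-tail probability would flag a fit that is suspiciously good.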

For statistical tests in which the concept of tailedness does apply, one-tailed tests are generally ill-advised. Thus, their use should require substantial justification. Ruxton & Neuhäuser (2010) give two very simple grounds on which they feel a one-tailed test must be justified. First, an author using a one-tailed test should clearly explain why a result in a particular direction is expected, and why it is fundamentally more interesting than a result in the opposite direction. Second, and importantly, the author should also explain why a large result in the unexpected direction should be treated no differently from a non-significant result in the expected direction (Ruxton & Neuhäuser 2010). These conditions may be rare (or, in fact, nonexistent: Lombardi & Hurlbert 2009) in our field.

4 comments:

Glor said...

My favorite thing is when authors present both one- and two-tailed results as an indication of the degree to which their hypothesis is supported. Hopefully all those folks have moved on to model-fitting...

Anonymous said...

I think that at least part of what is going on here is a consequence of the ridiculousness of using alpha = 0.05 as a magic cut-off point. Even devious scientists would only switch to one-tailed tests for p-values between 0.05 and 0.1.

I think p-values are informative. Even if someone presents a one-tailed value I don't have much trouble figuring out what the two-tailed value would be, and how much to (mentally) weigh the evidence in favor of their views.

This discussion feels, to me, like more "heat" than "light."

Unknown said...

So Liam, does this apply equally to parametric and nonparametric tests?

Also, I wanted to note that some questions seem to me to be naturally one- or two-tailed, and some methods may only produce a one- or two-tailed result. What do you do in these situations?

Lastly, as you are likely aware, two-tailed methods have also been criticized in many situations because one is likely to find a difference between groups given large enough sample sizes.

Perhaps this is just reinforcing what Luke is saying.

Liam Revell said...

I think the argument applies equally to nonparametric tests. That is, the authors would argue that if we cannot justify a priori why a strong result in the opposite direction of our prediction is fundamentally and wholly uninteresting, a two-tailed test should be used.

I agree that some tests are by construction one-tailed. For instance, we are not usually interested in whether our F statistic in an ANOVA is lower than expected by chance (although it could in theory be: http://upload.wikimedia.org/wikipedia/commons/f/f7/F_distributionPDF.png).
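
(As a minimal illustration, with a purely hypothetical F statistic and degrees of freedom, the lower-tail probability can be read directly off the F distribution:)

    # Hypothetical ANOVA F statistic and degrees of freedom, for illustration only.
    from scipy import stats

    F, df_between, df_within = 0.12, 3, 36

    p_upper = stats.f.sf(F, df_between, df_within)   # usual ANOVA (upper-tail) p-value
    p_lower = stats.f.cdf(F, df_between, df_within)  # probability of an F this small or smaller

    print(f"upper-tail p = {p_upper:.3f}, lower-tail p = {p_lower:.3f}")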

I'm not sure how one would be more "likely to find a difference between groups given large enough sample sizes" in a two-tailed test than in a one-tailed test. (In fact, the lower panel of my figure suggests that the opposite will be true.)