Sunday, March 29, 2009

Speciation with gene flow?

Liam’s post on nested clade analysis (below) reminded me of this very recent paper by Becquet and Przeworski (Evolution, online accepted). Many of us are interested in understanding whether speciation typically entails an abrupt cessation of gene flow between populations, or whether the process is “leaky”, with continued exchange of migrants after the initial population split. These are typically referred to as the “isolation” and “isolation-migration” models, and several computational tools have been developed to estimate parameters relevant to the isolation-migration model (IM, MIMAR).

Becquet and Przeworski look at both of these programs (and MIMAR is one of their own) and ask the simple question of how they perform when some of their underlying assumptions are violated. Their answer: not as well as one might like. When ancestral populations are structured, both IM and MIMAR can lead to rejections of a true allopatry (isolation only) model, even when there is no post-separation gene-flow. Moreover, parameter estimates – including divergence time and ancestral population sizes – become unreliable when there is ancestral population structure or when gene-flow rates change through time. Real population histories likely involve both time-varying levels of gene flow and complex current and historical population structure and it is not clear how to develop methods that are robust to a broad range of potential splitting scenarios. Approximate Bayesian methods, anyone?


Liam Revell said...

This article appears to contain two main points:
1) The correct model won't be selected if it is not included in the set of tested models. This is a truism, and thus not especially interesting; and
2) Wholly incorrect biological conclusions can result due to (1).
An obvious solution would be to compare a set of alternative, biologically realistic models. This is not suggested by the authors. Why not? It's possible that different processes of divergence might result in similar genetic patterns, leaving us unable to distinguish them from genetic data. Even if true, we would still avoid (2) by adding more realistic models to the arsenal.

Dan Rabosky said...

I agree. Although I will add that the correct model is *often* not included in the set of tested models - and here I'm speaking broadly of all science that involves model-based inference. Yet all too often, researchers assume that the very fact that you have a "best fit" model implies that this same model is good in some sort of absolute sense. But models are rarely evaluated a posteriori through parametric simulation etc, and sometimes (all too often, in my experience) you find that the supposed 'best' model fails miserably to recover any of the relevant features of the real data.

Liam Revell said...

Some might say that "all models are wrong, but some are useful", eh?

Glor said...

Excellent points gentlemen. In spite of it's limitations, I hope this paper will serve as a wake up call those using the IM model.

I believe the answer to the question of why various alternatives aren't fit is that fitting of just one relatively simple model is hard enough. This conclusion follows from what is being said in articles that apply the IM model to large datasets for relatively young and well-understood populations like humans:

"While no model can accurately capture all of the processes that affect biological populations, the IM model represents an improvement over traditional models that rely heavily upon the assumptions that populations either do not have shared history apart from gene flow (e.g., Wright’s island model), or that populations diverge in isolation. Such generality comes at the expense of the model being parameter rich and thus inference with MCMC techniques can be computationally prohibitive and challenging. prohibitive and challenging" - Garrigan et al.

Susan Perkins said...

I love that quote, Liam. I think I might just have to incorporate that into my tenure talk (as a response to a very predictable question about why I might use ML (as opposed to equally weighted parsimony analyses).

Celine said...


I am really glad that people are reading it and finding it interesting. I indeed was trying to warn the community on the danger of over interpretation: the method consider a divergence model, with many assumptions (likely violated by data). Concluding on "speciation" models from this is way too risky given the limited powers of the methods given available data sets.

However, I wanted to let you know that all the results on ancestral structure (the one different from the descendant populations) are wrong...
the manuscript is on hold for the "correct" simulated data to be analyzed.

The revised manuscript will follow (once IM is finally done!!).

I had a similar idea than Liam "to compare a set of alternative". I tried the goodness-of-fit test that i describe, but I plugged the estimates sampled from the posterior distribution (i.e. im model) into different alternative models:
-isolation model (m=0)
- structure in the ancestral populations (the hat(m) is in the ancestral pop. and I set the starting point of the structure at hat(T)*.5)
- Isolation followed by secondary contact (hat(m) is only starting hat(T)/.5 generations ago)
- early gene flow (hat(m) from hat(T) to hat(T)/.5 generations ago)

I obtain the Pr(D| posterior and a specific model) by taking the products of say, P(sum(Ss*) "inferior" sum(Ss)) where Ss* is the observed value in the data set and Ss is the distributon of the statistics given the "estimated parameters and a given model"

I compare these likelihood of the data given different models, look how often they were higher than for the im model ....
and I confirm that fact that the statistics i used for the gof (s1, S2, ss, Sf, Pi in pop1 and 2, D in pop. 1 and 2, and Fst) do not capture enough information about the different model.
I am sad of this failure as I was hopping that people could you this to help interpret their results until I am done with my current project...

I wanted to let you know that I am working on a model with changing gene flow rate over time (i.e., 2 gene flow rate model with a time at which the rate change). However, it is a very hard problem: I am still at the process of finding summary statistics of the data that capture differences between the models (i.e., Constant gene flow, vs early gene flow only, vs. late gene flow only).

Anyway Thanks again for receiving my contribution so well.


Glor said...

Sounds like some cool stuff Celine! Did your model-fitting analysis reveal that the alternatives were not strongly differentiated, or did more basic problems with model-fitting stop you before you even got to that point? Please keep us posted as things progress.

Celine said...

Fist a likelihood ratio test is definitelly not powerfull enough to help.
Now just looking at how often is L(Ha) larger than L(H0), H0 being the im model, I found the following:

- if the true model underlying the data is the im model, the im and secondary contact model have large likelihood.

- if there is no gene flow since the split in the data (recall that in this case the estimates of the gene flow rate are very small) the likelihoods are up for all models (makes sense as if yo remove gene flow from any of the alternative I propose then it's like isolation model anyway.

- the same thing happens if data are generated under a model of early gene flow. but here again in most cases gene flow is not detected, so we are back to the isolation case (and it should the little use of the statistics from my goodness of fit in capturing what happens a long time ago).
Now I got the idea today giving a talk about that and now that am writing that down, that I ought to look only at the analyzes that DID detect gene flow to see what is going on.
I am going to look at it right now.