In traditional maximum likelihood phylogenetic tree inference, only the mutational process is considered when explaining the variation seen in the sequences, and the history of each analyzed gene is assumed to reflect the history of the species.
However, other mechanisms also contribute to genetic variation between species. The most influential of them is the coalescent process, which explains how sequence variation can be retained in a population and why a gene tree does not necessarily reflect the history of the species.
As next-generation sequencing becomes less expensive, there will be a massive influx of sequence data in the near future, and with multi-gene datasets it becomes more important to take the coalescent process into consideration when estimating the species tree. In a set of sequenced transcriptomes, genes are sampled randomly, so any given gene will frequently be missing for many of the sampled species.
Here I present a simulation study on the effects of missing data on estimating the species tree from a set of gene trees when taking the coalescent process into consideration. I examine how species tree estimation is affected by sampling several lineages per species, by different degrees and patterns of missing data, and by recent versus older speciation events.