Phylogenetics and dating with
By using sampling dates in conjunction with sequence data, it is possible to estimate the rate of evolution, and hence generate phylogenetic trees calibrated in calendar time. These ‘time-trees’ are more straightforward to interpret in terms of the time to the most recent common ancestor and changes in effective population size, which can then be linked to external epidemiological information, as in the case of the spread of hepatitis C virus in Egypt during antischistosomiasis injection campaigns (Pybus et al. 2015). Such pathogens have been dubbed ‘measurably evolving’ (Drummond et al. 2003), as sequences typically accumulate mutations over epidemiological timescales of years or even months.

The latest version of the least squares dating (LSD) software also includes parametric bootstrap (PB) routines (To et al.). In addition to running on multiple bootstrapped phylogenies, Monte Carlo simulation and PB approaches offer a highly flexible and parallelizable way to estimate uncertainty in substitution rates and node dates (Efron and Tibshirani 1994). The PB approach is implemented so that substitutions on each branch follow a negative binomial (NB) distribution, as in equation 1. We estimate confidence intervals for rates, dates, and tip dates using parametric and non-parametric bootstrap approaches.
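To make the parametric bootstrap step more concrete, the following minimal sketch simulates substitution counts on a single branch. It assumes that the negative binomial arises as a Gamma-Poisson mixture (a Gamma-distributed branch rate followed by a Poisson number of substitutions); equation 1 is not reproduced here, so this parameterization, the function names, and the numerical values are illustrative assumptions rather than the implementation described in the text.

```python
import math
import random


def poisson_sample(mean, rng):
    """Draw a Poisson variate with Knuth's multiplication method.

    Adequate for the modest means used here; exp(-mean) underflows
    for very large means, so this is not a general-purpose sampler.
    """
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_branch_substitutions(duration, mean_rate, rate_sd, seq_length, rng):
    """Simulate the number of substitutions on one branch (hypothetical helper).

    The branch rate is Gamma-distributed with the given mean and standard
    deviation; conditional on that rate the count is Poisson, which makes
    the marginal count negative binomial.
    """
    shape = (mean_rate / rate_sd) ** 2
    scale = rate_sd ** 2 / mean_rate
    branch_rate = rng.gammavariate(shape, scale)  # substitutions per site per year
    return poisson_sample(branch_rate * duration * seq_length, rng)


rng = random.Random(2024)
# One bootstrap replicate of a 2-year branch for a 1000-site alignment.
count = simulate_branch_substitutions(2.0, 0.001, 0.0005, 1000, rng)
simulated_branch_length = count / 1000  # back to substitutions per site
```

Repeating this for every branch of the fitted time-tree would give one simulated phylogeny, which could then be re-dated to obtain one bootstrap replicate of the rate and node dates.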
Sometimes the exact sampling time is not known; it may be missing from the annotations, or recorded only to a particular precision. Because the model is optimized heuristically from an initial guess of tip dates, it is challenging to apply standard likelihood-based approaches such as profiling to estimate confidence intervals.
We assume that the length of the sequence alignment and the position of the root of the phylogeny are known.
For now, let us assume that the data take the form of a bifurcating rooted phylogeny with branch lengths in units of substitutions per site and that all tip dates are known.
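As a rough illustration of how known tip dates and branch lengths in substitutions per site together identify a rate, the sketch below fits an ordinary least-squares root-to-tip regression. This is a simpler stand-in technique, not the dating method developed in this paper; the function name and the toy numbers are hypothetical.

```python
def root_to_tip_regression(sample_dates, root_to_tip_distances):
    """Least-squares fit of root-to-tip distance against sampling date.

    Returns (rate, tmrca): the slope in substitutions per site per year,
    and the date at which the fitted line reaches zero distance.
    """
    n = len(sample_dates)
    if n < 2 or n != len(root_to_tip_distances):
        raise ValueError("need at least two tips with matching dates and distances")
    mean_t = sum(sample_dates) / n
    mean_d = sum(root_to_tip_distances) / n
    sxx = sum((t - mean_t) ** 2 for t in sample_dates)
    if sxx == 0:
        raise ValueError("all tips share one sampling date; the rate is not identifiable")
    sxy = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(sample_dates, root_to_tip_distances))
    rate = sxy / sxx
    if rate == 0:
        raise ValueError("zero slope; the root date is not identifiable")
    intercept = mean_d - rate * mean_t
    return rate, -intercept / rate


# Toy data: four tips sampled two years apart, evolving at a constant rate.
dates = [2000.0, 2002.0, 2004.0, 2006.0]
distances = [0.010, 0.012, 0.014, 0.016]
rate, tmrca = root_to_tip_regression(dates, distances)
```

The regression treats tips as independent observations, which they are not, so it is best read as a quick diagnostic of temporal signal rather than as an estimator with valid confidence intervals.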
2. The sampling distribution of estimated rates and times to the most recent common ancestor (TMRCAs) is asymptotically normal, and the SD of the sampling distribution is well approximated by the SD of the PB distribution of estimated rates and TMRCAs.
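Under this normality assumption, a confidence interval can be formed from the point estimate and the SD of the bootstrap replicates. The sketch below shows that calculation; the function name, the default critical value of 1.96 (an approximate 95% interval), and the replicate values are illustrative assumptions.

```python
import statistics


def normal_bootstrap_ci(point_estimate, bootstrap_estimates, z=1.96):
    """Normal-approximation interval: estimate +/- z * SD of the replicates."""
    if len(bootstrap_estimates) < 2:
        raise ValueError("need at least two bootstrap replicates")
    sd = statistics.stdev(bootstrap_estimates)  # sample SD of the PB distribution
    return point_estimate - z * sd, point_estimate + z * sd


# Hypothetical PB replicates of a substitution rate (substitutions/site/year).
replicates = [0.0011, 0.0009, 0.0010, 0.0012, 0.0008]
lower, upper = normal_bootstrap_ci(0.0010, replicates)
```

If the replicate distribution is visibly skewed, percentile intervals taken directly from the sorted replicates are a common alternative that does not rely on normality.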
The algorithm provides several statistics associated with each sampled lineage that can be useful for identifying outlier lineages; these may represent sequencing error or samples that are poorly described by the fitted substitution model.
Algorithm 1 can be repeated for every good candidate root position and the dated tree with the highest likelihood is returned.
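The root search described above can be sketched as a simple loop over candidate roots. Algorithm 1 itself is not reproduced here, so it is represented by a hypothetical callback, `fit_dated_tree`, that is assumed to return a log-likelihood and a dated tree for a given root position; the toy likelihood values are invented for illustration.

```python
def best_rooted_dating(candidate_roots, fit_dated_tree):
    """Fit a dated tree at each candidate root and keep the most likely one.

    fit_dated_tree(root) is a stand-in for the dating algorithm and must
    return a (log_likelihood, dated_tree) pair.
    Returns (best_root, best_log_likelihood, best_dated_tree).
    """
    best = None
    for root in candidate_roots:
        log_likelihood, dated_tree = fit_dated_tree(root)
        if best is None or log_likelihood > best[1]:
            best = (root, log_likelihood, dated_tree)
    if best is None:
        raise ValueError("no candidate root positions supplied")
    return best


# Toy stand-in for the dating step: fixed log-likelihoods per root branch.
toy_log_likelihoods = {"branch_a": -120.5, "branch_b": -98.2, "branch_c": -101.0}
best_root, best_ll, best_tree = best_rooted_dating(
    toy_log_likelihoods,
    lambda root: (toy_log_likelihoods[root], "dated tree rooted on " + root),
)
```

Because each candidate root is fitted independently, the loop parallelizes trivially; how the set of good candidate roots is chosen is left outside this sketch.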