Simulate Phylogenetic Trees and Sequences

The notebook in this directory demonstrates the "Ten Rules for Reproducible Research in Jupyter Notebooks". Throughout the notebook, we point out some of the rules we applied.

For example, this notebook demonstrates:


Rule 1: Tell a Story for an Audience. This notebook was developed for computational biologists to simulate phylogenetic trees and sequence evolution.

Rule 3: Use Divisions to Make Steps Clear. We broke the workflow into separate notebooks and use this top-level notebook to explain and organize the workflow.

Rule 7: Build a Pipeline. This notebook describes the entire workflow, including model selection, tree simulation, sequence simulation, and analytics.


Introduction

When modeling the evolution of sequences, it is common to imagine the generative process as two steps:

  • Modeling the tree evolution (tree topology and branch lengths)
  • Modeling the sequence evolution down the given tree (root sequence, nucleotide frequencies, and transition rates)

We can sample the probability distributions defined by models of tree and sequence evolution in order to generate simulated datasets.
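The two-step process above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the workflow used in the linked notebooks: the tree step uses a pure-birth (Yule) model as a simpler stand-in for the dual-birth model, branch lengths are treated directly as expected substitutions per site, and all function names are hypothetical.

```python
import math
import random


def simulate_yule_tree(n_leaves, birth_rate, rng):
    """Step 1: sample a topology and branch lengths (pure-birth stand-in)."""
    parent = {0: None}   # node id -> parent id (node 0 is the root)
    born = {0: 0.0}      # node id -> time at which its branch starts
    length = {}          # node id -> length of the branch leading to it
    tips = [0]
    now, next_id = 0.0, 1
    while len(tips) < n_leaves:
        # With k lineages, the wait until the next split is Exponential(k * rate).
        now += rng.expovariate(birth_rate * len(tips))
        node = tips.pop(rng.randrange(len(tips)))
        length[node] = now - born[node]
        for _ in range(2):  # the chosen lineage splits into two children
            parent[next_id], born[next_id] = node, now
            tips.append(next_id)
            next_id += 1
    # Assumption: let the final tips grow for one more waiting time.
    now += rng.expovariate(birth_rate * len(tips))
    for tip in tips:
        length[tip] = now - born[tip]
    return parent, length, tips


def evolve_jc69(seq, t, rng):
    """Step 2: evolve a sequence along one branch of length t under JC69."""
    # Probability that a site ends in a different base after t substitutions/site.
    p_change = 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)


def simulate_dataset(n_leaves=8, seq_len=50, birth_rate=1.0, seed=0):
    """Sample a tree, then evolve a random root sequence down it."""
    rng = random.Random(seed)  # fixed seed keeps the simulation reproducible
    parent, length, tips = simulate_yule_tree(n_leaves, birth_rate, rng)
    seqs = {0: "".join(rng.choice("ACGT") for _ in range(seq_len))}
    for node in sorted(parent):  # parents always have smaller ids than children
        if parent[node] is not None:
            seqs[node] = evolve_jc69(seqs[parent[node]], length[node], rng)
    return {tip: seqs[tip] for tip in tips}
```

Calling `simulate_dataset(seed=1)` twice returns identical leaf sequences, which is the property a reproducible simulation record relies on.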

Goal

This notebook demonstrates how to create a reproducible record of simulating and analyzing a phylogenetic tree and sequences. We simulate a phylogenetic tree under the dual-birth model (Moshiri & Mirarab, 2017), and we subsequently simulate a random DNA sequence down the phylogenetic tree under the Jukes-Cantor (JC69) model (Jukes & Cantor, 1969).

Run the following notebooks and explore how we applied the Ten Simple Rules.

1. Simulate the Tree

First, we simulate a phylogenetic tree under the dual-birth model (Moshiri & Mirarab, 2017), and we perform some basic analyses of the tree.

Run the following notebook to simulate the tree and perform these analyses.

This notebook saves the simulated tree in the file ./intermediate_data/dualbirth.tre.
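As a quick sanity check on the saved file, the sketch below counts leaves and sums branch lengths. It assumes the file holds a single Newick string with unquoted labels (no commas or colons inside names). The helper names are hypothetical, and the linked notebook may perform its analyses differently.

```python
import re
from pathlib import Path


def summarize_newick(newick):
    """Return (leaf count, total branch length) for one Newick string."""
    # In unquoted Newick, every branch length follows a colon.
    lengths = [float(x) for x in re.findall(r":([0-9.eE+-]+)", newick)]
    # A tree with n leaves contains exactly n - 1 commas, whatever its shape.
    n_leaves = newick.count(",") + 1
    return n_leaves, sum(lengths)


def summarize_tree_file(path="./intermediate_data/dualbirth.tre"):
    """Apply summarize_newick to the tree file written by the notebook."""
    return summarize_newick(Path(path).read_text().strip())
```

For example, `summarize_newick("((A:1.0,B:2.0):0.5,C:3.5);")` reports 3 leaves and a total branch length of 7.0.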

2. Simulate the Sequences

Next, we simulate sequences down the previously simulated phylogenetic tree under the Jukes-Cantor (JC69) model (Jukes & Cantor, 1969), and we perform basic comparisons between the simulated sequences and the simulated tree.

Run the following notebook to simulate the sequences and perform these comparisons.

This notebook saves the simulated sequences in the file ./intermediate_data/sequences.fas.
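One way to compare simulated sequences with the tree is to estimate pairwise JC69 distances and check them against path lengths in the tree. The sketch below assumes the file is in FASTA format and that all sequences have equal length. It is a minimal illustration with hypothetical helper names, not necessarily the comparison made in the linked notebook.

```python
import math


def parse_fasta(lines):
    """Parse an iterable of FASTA lines into {name: sequence}."""
    chunks, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)  # sequences may wrap over several lines
    return {key: "".join(parts) for key, parts in chunks.items()}


def read_fasta(path="./intermediate_data/sequences.fas"):
    """Read the sequence file written by the notebook."""
    with open(path) as handle:
        return parse_fasta(handle)


def jc69_distance(a, b):
    """Estimate expected substitutions per site between two aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    # p is the observed proportion of differing sites.
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    if p >= 0.75:
        return math.inf  # saturated: the JC69 correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The corrected distance is always at least as large as the raw proportion `p`, because JC69 accounts for multiple substitutions at the same site.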

Version and Hardware Information


Rule 5: Record Dependencies. Here we use the watermark extension to print software, operating system, and hardware version information.


In [1]:
%load_ext watermark
%watermark -v -m -p matplotlib,pyvolve,seaborn,treesap,treeswift
CPython 3.6.3
IPython 6.3.1

matplotlib 2.2.2
pyvolve 0.8.8
seaborn 0.9.0
treesap n
treeswift n

compiler   : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 17.5.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit
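If the watermark extension is not available, similar information can be collected with the standard library. The sketch below is an alternative, not what this notebook ran: it assumes Python 3.8 or later (for `importlib.metadata`, which is newer than the interpreter recorded above), and the function name is hypothetical.

```python
import platform
from importlib import metadata


def record_dependencies(packages):
    """Return a dict of interpreter, platform, and package versions."""
    report = {
        "python": platform.python_version(),
        "system": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly rather than failing.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    names = ["matplotlib", "pyvolve", "seaborn", "treesap", "treeswift"]
    for key, value in record_dependencies(names).items():
        print(f"{key:12s}: {value}")
```

Saving this report next to the intermediate data files keeps the environment record together with the results it describes.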

Author: Niema Moshiri, UC San Diego, October 2, 2018