5 Making stuff up!

The dispRity package also offers some advanced data simulation features to allow to test hypothesis, explore ordinate-spaces or metrics properties or simply playing around with data! All the following functions are based on the same modular architecture of the package and therefore can be used with most of the functions of the package.

5.1 Simulating discrete morphological data

The function sim.morpho allows to simulate discrete morphological data matrices (sometimes referred to as “cladistic” matrices). It allows to evolve multiple discrete characters on a given phylogenetic trees, given different models, rates, and states. It even allows to include “proper” inapplicable data to make datasets as messy as in real life!

In brief, the function sim.morpho takes a phylogenetic tree, the number of required characters, the evolutionary model, and a function from which to draw the rates. The package also contains a function for quickly checking the matrix’s phylogenetic signal (as defined in systematics not phylogenetic comparative methods) using parsimony. The methods are described in details below

set.seed(3)
## Simulating a starting tree with 15 taxa as a random coalescent tree
my_tree <- rcoal(15)

## Generating a matrix with 100 characters (85% binary and 15% three state) and
## an equal rates model with a gamma rate distribution (0.5, 1) with no 
## invariant characters.
my_matrix <- sim.morpho(tree = my_tree, characters = 100, states = c(0.85,
    0.15), rates = c(rgamma, 0.5, 1), invariant = FALSE)

## The first few lines of the matrix
my_matrix[1:5, 1:10]
##     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## t10 "1"  "0"  "1"  "0"  "1"  "0"  "0"  "1"  "0"  "0"  
## t1  "0"  "0"  "1"  "0"  "0"  "0"  "0"  "1"  "0"  "0"  
## t9  "0"  "0"  "1"  "0"  "0"  "0"  "0"  "1"  "0"  "0"  
## t14 "1"  "0"  "1"  "0"  "0"  "0"  "0"  "1"  "0"  "0"  
## t13 "1"  "0"  "1"  "0"  "0"  "0"  "0"  "1"  "0"  "0"
## Checking the matrix properties with a quick Maximum Parsimony tree search
check.morpho(my_matrix, my_tree)
##                                     
## Maximum parsimony        144.0000000
## Consistency index          0.7430556
## Retention index            0.9160998
## Robinson-Foulds distance   2.0000000

Note that this example produces a tree with a great consistency index and an identical topology to the random coalescent tree! Nearly too good to be true…

5.1.1 A more detailed description

The protocol implemented here to generate discrete morphological matrices is based on the ones developed in (Guillerme and Cooper 2016; O’Reilly et al. 2016; Puttick et al. 2017; E. et al., n.d.).

  • The first tree argument will be the tree on which to “evolve” the characters and therefore requires branch length. You can generate quick and easy random Yule trees using ape::rtree(number_of_taxa) but I would advise to use more realistic trees for more realistic simulations based on more realistic models (really realistic then) using the function tree.bd from the diversitree package (FitzJohn 2012).
  • The second argument, character is the number of characters. Pretty straight forward.
  • The third, states is the proportion of characters states above two (yes, the minimum number of states is two). This argument intakes the proportion of n-states characters, for example states = c(0.5,0.3,0.2) will generate 50% of binary-state characters, 30% of three-state characters and 20% of four-state characters. There is no limit in the number of state characters proportion as long as the total makes up 100%.
  • The forth, model is the evolutionary model for generating the character(s). More about this below.
  • The fifth and sixth, rates and substitution are the model parameters described below as well.
  • Finally, the two logical arguments, are self explanatory: invariant whether to allow invariant characters (i.e. characters that don’t change) and verbose whether to print the simulation progress on your console.

5.1.1.1 Available evolutionary models

There are currently three evolutionary models implemented in sim.morpho but more will come in the future. Note also that they allow fine tuning parameters making them pretty plastic!

  • "ER": this model allows any number of character states and is based on the Mk model (Lewis 2001). It assumes a unique overall evolutionary rate equal substitution rate between character states. This model is based on the ape::rTraitDisc function.
  • "HKY": this is binary state character model based on the molecular HKY model (Hasegawa, Kishino, and Yano 1985). It uses the four molecular states (A,C,G,T) with a unique overall evolutionary rate and a biased substitution rate towards transitions (A <-> G or C <-> T) against transvertions (A <-> C and G <-> T). After evolving the nucleotide, this model transforms them into binary states by converting the purines (A and G) into state 0 and the pyrimidines (C and T) into state 1. This method is based on the phyclust::seq.gen.HKY function and was first proposed by O’Reilly et al. (2016).
  • "MIXED": this model uses a random (uniform) mix between both the "ER" and the "HKY" models.

The models can take the following parameters: (1) rates is the evolutionary rate (i.e. the rate of changes along a branch: the evolutionary speed) and (2) substitution is the frequency of changes between one state or another. For example if a character can have high probability of changing (the evolutionary rate) with, each time a change occurs a probability of changing from state X to state Y (the substitution rate). Note that in the "ER" model, the substitution rate is ignore because… by definition this (substitution) rate is equal!

The parameters arguments rates and substitution takes a distributions from which to draw the parameters values for each character. For example, if you want an "HKY" model with an evolutionary rate (i.e. speed) drawn from a uniform distribution bounded between 0.001 and 0.005, you can define it as rates = c(runif, min = 0.001, max = 0.005), runif being the function for random draws from a uniform distribution and max and min being the distribution parameters. These distributions should always be passed in the format c(random_distribution_function, distribution_parameters) with the names of the distribution parameters arguments.

5.1.1.2 Checking the results

An additional function, check.morpho runs a quick Maximum Parsimony tree search using the phangorn parsimony algorithm. It quickly calculates the parsimony score, the consistency and retention indices and, if a tree is provided (e.g. the tree used to generate the matrix) it calculates the Robinson-Foulds distance between the most parsimonious tree and the provided tree to determine how different they are.

5.1.1.3 Adding inapplicable characters

Once a matrix is generated, it is possible to apply inapplicable characters to it for increasing realism! Inapplicable characters are commonly designated as NA or simply -. They differ from missing characters ? in their nature by being inapplicable rather than unknown(see Brazeau, Guillerme, and Smith 2018 for more details). For example, considering a binary character defined as “colour of the tail” with the following states “blue” and “red”; on a taxa with no tail, the character should be coded as inapplicable (“-”) since the state of the character “colour of tail” is known: it’s neither “blue” or “red”, it’s just not there! It contrasts with coding it as missing (“?” - also called as ambiguous) where the state is unknown, for example, the taxon of interest is a fossil where the tail has no colour preserved or is not present at all due to bad conservation!

This type of characters can be added to the simulated matrices using the apply.NA function/ It takes, as arguments, the matrix, the source of inapplicability (NAs - more below), the tree used to generate the matrix and the two same invariant and verbose arguments as defined above. The NAs argument allows two types of sources of inapplicability:

  • "character" where the inapplicability is due to the character (e.g. coding a character tail for species with no tail). In practice, the algorithm chooses a character X as the underlying character (e.g. “presence and absence of tail”), arbitrarily chooses one of the states as “absent” (e.g. 0 = absent) and changes in the next character Y any state next to character X state 0 into an inapplicable token (“-”). This simulates the inapplicability induced by coding the characters (i.e. not always biological).
  • "clade" where the inapplicability is due to evolutionary history (e.g. a clade loosing its tail). In practice, the algorithm chooses a random clade in the tree and a random character Z and replaces the state of the taxa present in the clade by the inapplicable token (“-”). This simulates the inapplicability induced by evolutionary biology (e.g. the lose of a feature in a clade).

To apply these sources of inapplicability, simply repeat the number of inapplicable sources for the desired number of characters with inapplicable data.

## Generating 5 "character" NAs and 10 "clade" NAs
my_matrix_NA <- apply.NA(my_matrix, tree = my_tree,
                         NAs = c(rep("character", 5),
                                 rep("clade", 10)))

## The first few lines of the resulting matrix
my_matrix_NA[1:10, 90:100]
##     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## t10 "-"  "1"  "1"  "2"  "1"  "0"  "0"  "0"  "1"  "0"   "0"  
## t1  "-"  "1"  "0"  "0"  "1"  "0"  "0"  "0"  "-"  "0"   "0"  
## t9  "-"  "1"  "1"  "0"  "1"  "0"  "0"  "0"  "-"  "0"   "0"  
## t14 "-"  "1"  "0"  "0"  "1"  "0"  "0"  "0"  "-"  "0"   "0"  
## t13 "-"  "1"  "0"  "0"  "1"  "0"  "0"  "0"  "-"  "0"   "0"  
## t5  "-"  "1"  "0"  "0"  "1"  "0"  "0"  "0"  "-"  "0"   "0"  
## t2  "1"  "1"  "0"  "0"  "1"  "0"  "0"  "0"  "0"  "0"   "0"  
## t8  "2"  "1"  "0"  "0"  "1"  "0"  "0"  "0"  "0"  "0"   "0"  
## t6  "-"  "1"  "1"  "0"  "0"  "1"  "1"  "2"  "0"  "1"   "1"  
## t15 "-"  "1"  "1"  "0"  "0"  "1"  "1"  "2"  "0"  "1"   "1"

5.1.2 Parameters for a realistic(ish) matrix

There are many parameters that can create a “realistic” matrix (i.e. not too different from the input tree with a consistency and retention index close to what is seen in the literature) but because of the randomness of the matrix generation not all parameters combination end up creating “good” matrices. The following parameters however, seem to generate fairly “realist” matrices with a starting coalescent tree, equal rates model with 0.85 binary characters and 0.15 three state characters, a gamma distribution with a shape parameter (\(\alpha\)) of 5 and no scaling (\(\beta\) = 1) with a rate of 100.

set.seed(0)
## tree
my_tree <- rcoal(15)
## matrix
morpho_mat <- sim.morpho(my_tree,
                         characters = 100,
                         model = "ER",
                         rates = c(rgamma, rate = 100, shape = 5),
                         invariant = FALSE)
check.morpho(morpho_mat, my_tree)
##                                     
## Maximum parsimony        103.0000000
## Consistency index          0.9708738
## Retention index            0.9919571
## Robinson-Foulds distance   4.0000000

5.2 Simulating multidimensional spaces

Another way to simulate data is to directly simulate an ordinated space with the space.maker function. This function allows users to simulate multidimensional spaces with a certain number of properties. For example, it is possible to design a multidimensional space with a specific distribution on each axis, a correlation between the axes and a specific cumulative variance per axis. This can be useful for creating ordinated spaces for null hypothesis, for example if you’re using the function null.test (Dı́az et al. 2016).

This function takes as arguments the number of elements (data points - elements argument) and dimensions (dimensions argument) to create the space and the distribution functions to be used for each axis. The distributions are passed through the distribution argument as… modular functions! You can either pass a single distribution function for all the axes (for example distribution = runif for all the axis being uniform) or a specific distribution function for each specific axis (for example distribution = c(runif, rnorm, rgamma)) for the first axis being uniform, the second normal and the third gamma). You can of course use your very own functions or use the ones implemented in dispRity for more complex ones (see below). Specific optional arguments for each of these distributions can be passed as a list via the arguments argument.

Furthermore, it is possible to add a correlation matrix to add a correlation between the axis via the cor.matrix argument or even a vector of proportion of variance to be bear by each axis via the scree argument to simulate realistic ordinated spaces.

Here is a simple two dimensional example:

## Graphical options
op <- par(bty = "n")

## A square space
square_space <- space.maker(100, 2, runif)

## The resulting 2D matrix
head(square_space)
##           [,1]       [,2]
## [1,] 0.2878797 0.82110157
## [2,] 0.5989886 0.72890558
## [3,] 0.8401571 0.53042419
## [4,] 0.3663870 0.75545936
## [5,] 0.2122375 0.98768804
## [6,] 0.9612441 0.07285561
## Visualising the space
plot(square_space, pch = 20, xlab = "", ylab = "",
     main = "Uniform 2D space")

Of course, more complex spaces can be created by changing the distributions, their arguments or adding a correlation matrix or a cumulative variance vector:

## A plane space: uniform with one dimensions equal to 0
plane_space <- space.maker(2500, 3, c(runif, runif, runif),
                           arguments = list(list(min = 0, max = 0),
                           NULL, NULL))

## Correlation matrix for a 3D space
(cor_matrix <- matrix(cbind(1, 0.8, 0.2, 0.8, 1, 0.7, 0.2, 0.7, 1), nrow = 3))
##      [,1] [,2] [,3]
## [1,]  1.0  0.8  0.2
## [2,]  0.8  1.0  0.7
## [3,]  0.2  0.7  1.0
## An ellipsoid space (normal space with correlation)
ellipse_space <- space.maker(2500, 3, rnorm,
                             cor.matrix = cor_matrix)

## A cylindrical space with decreasing axes variance
cylindrical_space <- space.maker(2500, 3, c(rnorm, rnorm, runif),
                                 scree = c(0.7, 0.2, 0.1))

5.2.1 Personalised dimensions distributions

Following the modular architecture of the package, it is of course possible to pass home made distribution functions to the distribution argument. For example, the random.circle function is a personalised one implemented in dispRity. This function allows to create circles based on basic trigonometry allowing to axis to covary to produce circle coordinates. By default, this function generates two sets of coordinates with a distribution argument and a minimum and maximum boundary (inner and outer respectively) to create nice sharp edges to the circle. The maximum boundary is equivalent to the radius of the circle (it removes coordinates beyond the circle radius) and the minimum is equivalent to the radius of a smaller circle with no data (it removes coordinates below this inner circle radius).

## Graphical options
op <- par(bty = "n")

## Generating coordinates for a normal circle with a upper boundary of 1
circle <- random.circle(1000, rnorm, inner = 0, outer = 1)

## Plotting the circle
plot(circle, xlab = "x", ylab = "y", main = "A normal circle")

## Creating doughnut space (a spherical space with a hole)
doughnut_space <- space.maker(5000, 3, c(rnorm, random.circle),
     arguments = list(list(mean = 0),
                      list(runif, inner = 0.5, outer = 1)))

5.2.2 Visualising the space

I suggest using the excellent scatterplot3d package to play around and visualise the simulated spaces:

## Graphical options
op <- par(mfrow = (c(2, 2)), bty = "n")
## Visualising 3D spaces
require(scatterplot3d)
## Loading required package: scatterplot3d
## The plane space
scatterplot3d(plane_space, pch = 20, xlab = "", ylab = "", zlab = "",
              xlim = c(-0.5, 0.5), main = "Plane space")

## The ellipsoid space
scatterplot3d(ellipse_space, pch = 20, xlab = "", ylab = "", zlab = "",
              main = "Normal ellipsoid space")

## A cylindrical space with a decreasing variance per axis
scatterplot3d(cylindrical_space, pch = 20, xlab = "", ylab = "", zlab = "",
              main = "Normal cylindrical space")
## Axes have different orders of magnitude

## Plotting the doughnut space
scatterplot3d(doughnut_space[,c(2,1,3)], pch = 20, xlab = "", ylab = "",
              zlab = "", main = "Doughnut space")

par(op)

5.2.3 Generating realistic spaces

It is possible to generate “realistic” spaces by simply extracting the parameters of an existing space and scaling it up to the simulated space. For example, we can extract the parameters of the BeckLee_mat50 ordinated space and simulate a similar space.

## Loading the data
data(BeckLee_mat50)

## Number of dimensions
obs_dim <- ncol(BeckLee_mat50)

## Observed correlation between the dimensions
obs_correlations <- cor(BeckLee_mat50)

## Observed mean and standard deviation per axis
obs_mu_sd_axis <- mapply(function(x,y) list("mean" = x, "sd" = y),
                         as.list(apply(BeckLee_mat50, 2, mean)),
                         as.list(apply(BeckLee_mat50, 2, sd)), SIMPLIFY = FALSE)

## Observed overall mean and standard deviation
obs_mu_sd_glob <- list("mean" = mean(BeckLee_mat50), "sd" = sd(BeckLee_mat50))

## Scaled observed variance per axis (scree plot)
obs_scree <- variances(BeckLee_mat50)/sum(variances(BeckLee_mat50))

## Generating our simulated space
simulated_space <- space.maker(1000, dimensions = obs_dim, 
                               distribution = rep(list(rnorm), obs_dim),
                               arguments = obs_mu_sd_axis,
                               cor.matrix = obs_correlations)

## Visualising the fit of our data in the space (in the two first dimensions)
plot(simulated_space[,1:2], xlab = "PC1", ylab = "PC2")
points(BeckLee_mat50[,1:2], col = "red", pch = 20)
legend("topleft", legend = c("observed", "simulated"),
        pch = c(20,21), col = c("red", "black"))

It is now possible to simulate a space using these observed arguments to test several hypothesis:

  • Is the space uniform or normal?
  • If the space is normal, is the mean and variance global or specific for each axis?
## Measuring disparity as the sum of variance
observed_disp <- dispRity(BeckLee_mat50, metric = c(median, centroids))

## Is the space uniform?
test_unif <- null.test(observed_disp, null.distrib = runif)

## Is the space normal with a mean of 0 and a sd of 1?
test_norm1 <- null.test(observed_disp, null.distrib = rnorm)

## Is the space normal with the observed mean and sd and cumulative variance
test_norm2 <- null.test(observed_disp, null.distrib = rep(list(rnorm), obs_dim),
                        null.args = rep(list(obs_mu_sd_glob), obs_dim),
                        null.scree = obs_scree)

## Is the space multiple normal with multiple means and sds and a correlation?
test_norm3 <- null.test(observed_disp, null.distrib = rep(list(rnorm), obs_dim),
                        null.args = obs_mu_sd_axis, null.cor = obs_correlations)

## Graphical options
op <- par(mfrow = (c(2, 2)), bty = "n")
## Plotting the results
plot(test_unif, main = "Uniform (0,1)")
plot(test_norm1, main = "Normal (0,1)")
plot(test_norm2, main = paste0("Normal (", round(obs_mu_sd_glob[[1]], digit = 3),
                              ",", round(obs_mu_sd_glob[[2]], digit = 3), ")"))
plot(test_norm3, main = "Normal (variable + correlation)")

If we measure disparity as the median distance from the morphospace centroid, we can explain the distribution of the data as normal with the variable observed mean and standard deviation and with a correlation between the dimensions.

References

Brazeau, Martin D, Thomas Guillerme, and Martin R Smith. 2018. “An algorithm for Morphological Phylogenetic Analysis with Inapplicable Data.” Systematic Biology 68 (4): 619–31. https://doi.org/10.1093/sysbio/syy083.

Dı́az, Sandra, Jens Kattge, Johannes HC Cornelissen, Ian J Wright, Sandra Lavorel, Stéphane Dray, Björn Reu, et al. 2016. “The Global Spectrum of Plant Form and Function.” Nature 529 (7585): 167. http://dx.doi.org/10.1038/nature16489.

E., O’Reilly Joseph, Puttick Mark N., Pisani Davide, and Donoghue Philip C. J. n.d. “Probabilistic Methods Surpass Parsimony When Assessing Clade Support in Phylogenetic Analyses of Discrete Morphological Data.” Palaeontology 61 (1): 105–18. https://doi.org/10.1111/pala.12330.

FitzJohn, Richard G. 2012. “Diversitree: Comparative Phylogenetic Analyses of Diversification in R.” Methods in Ecology and Evolution 3 (6): 1084–92. https://doi.org/10.1111/j.2041-210X.2012.00234.x.

Guillerme, Thomas, and Natalie Cooper. 2016. “Effects of Missing Data on Topological Inference Using a Total Evidence Approach.” Molecular Phylogenetics and Evolution 94, Part A: 146–58. https://doi.org/http://dx.doi.org/10.1016/j.ympev.2015.08.023.

Hasegawa, M., H. Kishino, and T. A. Yano. 1985. “Dating of the Human Ape Splitting by a Molecular Clock of Mitochondrial-DNA.” Journal of Molecular Evolution 22 (2): 160–74.

Lewis, P. 2001. “A Likelihood Approach to Estimating Phylogeny from Discrete Morphological Character Data.” Systematic Biology 50 (6): 913–25. https://doi.org/10.1080/106351501753462876.

O’Reilly, Joseph E., Mark N. Puttick, Luke Parry, Alastair R. Tanner, James E. Tarver, James Fleming, Davide Pisani, and Philip C. J. Donoghue. 2016. “Bayesian Methods Outperform Parsimony but at the Expense of Precision in the Estimation of Phylogeny from Discrete Morphological Data.” Biology Letters 12 (4). https://doi.org/10.1098/rsbl.2016.0081.

Puttick, Mark N, Joseph E O’Reilly, Alastair R Tanner, James F Fleming, James Clark, Lucy Holloway, Jesus Lozano-Fernandez, et al. 2017. “Uncertain-Tree: Discriminating Among Competing Approaches to the Phylogenetic Analysis of Phenotype Data.” Proceedings of the Royal Society B 284 (1846): 20162290. http://dx.doi.org/10.1098/rspb.2016.2290.