Containers and Datatypes

We want to store genetic data in a memory-efficient way that allows the fastest possible access. Memory access is typically the dominant performance factor for search algorithms, so compact storage makes the most difference.

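boost::genetics::dna_string behaves much like std::string, but stores the bases as two bits per character. This is optimal for very large databases such as the human genome, which contains around 3.2 billion bases: at two bits per base the sequence occupies roughly 800 MB, rather than the 3.2 GB needed at one byte per base.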
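
To make the two-bit idea concrete, here is a minimal sketch of a packed base container. It is not the dna_string implementation: the class name packed_dna, its members, the A/C/G/T to 0/1/2/3 coding and the choice of 64-bit words are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical illustration only; not the boost::genetics::dna_string code.
// Packs 32 bases into each 64-bit word, two bits per base.
class packed_dna {
public:
    explicit packed_dna(const std::string& bases) {
        for (char c : bases) push_back(c);
    }

    void push_back(char base) {
        if (size_ % 32 == 0) words_.push_back(0);  // start a new word
        words_.back() |= encode(base) << (2 * (size_ % 32));
        ++size_;
    }

    // Decode the base at index i back to a character.
    char operator[](std::size_t i) const {
        assert(i < size_);
        std::uint64_t code = (words_[i / 32] >> (2 * (i % 32))) & 3u;
        return "ACGT"[code];
    }

    std::size_t size() const { return size_; }
    std::size_t bytes_used() const { return words_.size() * sizeof(std::uint64_t); }

private:
    // Assumed coding: A=0, C=1, G=2, T=3.
    static std::uint64_t encode(char base) {
        switch (base) {
            case 'A': case 'a': return 0;
            case 'C': case 'c': return 1;
            case 'G': case 'g': return 2;
            case 'T': case 't': return 3;
            default: return 0;  // assumption: anything else collapses to 'A'
        }
    }

    std::vector<std::uint64_t> words_;
    std::size_t size_ = 0;
};
```

In this sketch, packed_dna("ACGTTGCA") holds eight bases in a single 8-byte word, and indexing decodes a base with one shift and one mask.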

boost::genetics::augmented_string stores the bases as two bits each but also allows long runs of 'N' and other characters. It is intended to store a genome as a single, very long string. It should not be used for normal text data; std::string is the best choice for that.
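
One way to combine two-bit storage with runs of other characters is to keep a small sorted side table of exceptions. The sketch below assumes that layout; the class name run_augmented_dna, the run record and the look-up strategy are hypothetical and may differ from what augmented_string actually does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical illustration only; not the augmented_string implementation.
// Bases are packed four per byte; runs of any other character are kept in
// a side table sorted by start position.
class run_augmented_dna {
public:
    explicit run_augmented_dna(const std::string& text) : size_(text.size()) {
        packed_.assign((text.size() + 3) / 4, 0);
        for (std::size_t i = 0; i < text.size(); ++i) {
            int code = encode(text[i]);
            if (code < 0) {
                // Not A/C/G/T: extend the current run or open a new one.
                if (!runs_.empty() && runs_.back().symbol == text[i] &&
                    runs_.back().start + runs_.back().length == i) {
                    ++runs_.back().length;
                } else {
                    runs_.push_back({i, 1, text[i]});
                }
                code = 0;  // placeholder bits under the run
            }
            packed_[i / 4] |= static_cast<std::uint8_t>(code << (2 * (i % 4)));
        }
    }

    char operator[](std::size_t i) const {
        assert(i < size_);
        // Find the last run that starts at or before i.
        auto it = std::upper_bound(
            runs_.begin(), runs_.end(), i,
            [](std::size_t pos, const run& r) { return pos < r.start; });
        if (it != runs_.begin()) {
            const run& r = *(it - 1);
            if (i < r.start + r.length) return r.symbol;
        }
        return "ACGT"[(packed_[i / 4] >> (2 * (i % 4))) & 3];
    }

    std::size_t size() const { return size_; }
    std::size_t run_count() const { return runs_.size(); }

private:
    struct run {
        std::size_t start;
        std::size_t length;
        char symbol;
    };

    static int encode(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: return -1;  // stored as part of a run instead
        }
    }

    std::vector<std::uint8_t> packed_;
    std::vector<run> runs_;
    std::size_t size_;
};
```

Because real genomes contain relatively few, very long runs of 'N', a side table like this stays small, and a read costs one binary search over the runs plus one packed look-up.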

boost::genetics::two_stage_index has a first index based on content and a second index based on position. After an initial content look-up, we can then search on position without having to touch the reference data. This is important because random memory access is very slow.
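
The sketch below shows one plausible reading of this design, not the library's actual layout. It assumes the first stage is a table addressed by the value of a fixed-length k-mer, the second stage is an array of reference positions grouped by k-mer, and the reference contains only A, C, G and T. The class name kmer_position_index and all of its members are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration only; not the two_stage_index implementation.
class kmer_position_index {
public:
    using iter = std::vector<std::uint32_t>::const_iterator;

    kmer_position_index(const std::string& ref, std::size_t k) : k_(k) {
        assert(k >= 1 && k <= 12);
        std::size_t buckets = std::size_t(1) << (2 * k);
        starts_.assign(buckets + 1, 0);
        if (ref.size() < k) return;
        std::size_t n = ref.size() - k + 1;
        // Count k-mers, then prefix-sum into bucket start offsets.
        for (std::size_t i = 0; i < n; ++i) ++starts_[code(ref, i) + 1];
        for (std::size_t b = 0; b < buckets; ++b) starts_[b + 1] += starts_[b];
        // Fill in increasing i so each bucket is sorted by position.
        positions_.resize(n);
        std::vector<std::uint32_t> next(starts_.begin(), starts_.end() - 1);
        for (std::size_t i = 0; i < n; ++i)
            positions_[next[code(ref, i)]++] = static_cast<std::uint32_t>(i);
    }

    // Exact matches of pattern, found without reading the reference.
    std::vector<std::uint32_t> find(const std::string& pattern) const {
        std::vector<std::uint32_t> hits;
        if (pattern.size() < k_) return hits;
        // Content look-up: candidates from the first k-mer.
        auto first = bucket(code(pattern, 0));
        hits.assign(first.first, first.second);
        // Position look-ups: every later seed must occur at candidate + offset.
        for (std::size_t off = k_; !hits.empty() && off < pattern.size(); off += k_) {
            std::size_t o = std::min(off, pattern.size() - k_);
            auto b = bucket(code(pattern, o));
            hits.erase(std::remove_if(hits.begin(), hits.end(),
                                      [&](std::uint32_t p) {
                                          return !std::binary_search(
                                              b.first, b.second,
                                              static_cast<std::uint32_t>(p + o));
                                      }),
                       hits.end());
        }
        return hits;
    }

private:
    static std::size_t base_code(char c) {
        switch (c) {
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: return 0;  // 'A'; assumption: input is A/C/G/T only
        }
    }

    std::size_t code(const std::string& s, std::size_t pos) const {
        std::size_t c = 0;
        for (std::size_t j = 0; j < k_; ++j) c = (c << 2) | base_code(s[pos + j]);
        return c;
    }

    std::pair<iter, iter> bucket(std::size_t c) const {
        return {positions_.begin() + starts_[c], positions_.begin() + starts_[c + 1]};
    }

    std::size_t k_;
    std::vector<std::uint32_t> starts_;     // first stage: k-mer value -> bucket
    std::vector<std::uint32_t> positions_;  // second stage: sorted positions
};
```

With the reference "ACGTACGTTTGACGTA" and k = 4, a search for "ACGTT" starts from the three positions of "ACGT" and keeps only position 4, because that is the only candidate where "CGTT" is recorded one base later. The seeds cover every character of the pattern, so in this sketch the surviving positions are exact matches.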

boost::genetics::fasta uses an augmented_string, a two_stage_index and a vector of chromosome data to represent an entire genome.
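
When a whole genome is held as one long string, a hit position has to be mapped back to a chromosome. The sketch below assumes the chromosome vector records where each chromosome starts and ends in the concatenated string; the chromosome_info struct and the locate function are hypothetical names, not part of the fasta class.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration only; not the boost::genetics::fasta code.
struct chromosome_info {
    std::string name;
    std::size_t start;  // offset of the first base in the concatenated genome
    std::size_t end;    // one past the last base
};

// Returns the chromosome containing pos, or nullptr if there is none.
// Assumes chrs is sorted by start offset.
inline const chromosome_info* locate(const std::vector<chromosome_info>& chrs,
                                     std::size_t pos) {
    auto it = std::upper_bound(
        chrs.begin(), chrs.end(), pos,
        [](std::size_t p, const chromosome_info& c) { return p < c.start; });
    if (it == chrs.begin()) return nullptr;
    --it;  // last chromosome starting at or before pos
    return pos < it->end ? &*it : nullptr;
}
```

Subtracting the start field of the returned record from the global position gives the offset within that chromosome.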