We want to store the genetic data in a memory-efficient way that allows the fastest access. Memory access is the dominant performance factor for search algorithms, so compact storage makes the most difference.
boost::genetics::dna_string
behaves much like std::string, but stores each base in two bits.
This is well suited to very large databases such as the human genome, which
contains around 3.2 billion bases.
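
To make the two-bit idea concrete, here is a minimal sketch of packing bases into 64-bit words. It is not the dna_string implementation: the class name packed_dna, its members, and the A=0, C=1, G=2, T=3 encoding are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical illustration only; not boost::genetics::dna_string.
class packed_dna {
public:
    explicit packed_dna(const std::string& bases) {
        for (char c : bases) push_back(c);
    }

    // Append one base, starting a new 64-bit word every 32 bases.
    void push_back(char base) {
        if (size_ % 32 == 0) words_.push_back(0);
        words_.back() |= encode(base) << (2 * (size_ % 32));
        ++size_;
    }

    // Decode the base at index i from its two bits.
    char operator[](std::size_t i) const {
        assert(i < size_);
        const std::uint64_t code = (words_[i / 32] >> (2 * (i % 32))) & 3u;
        return "ACGT"[code];
    }

    std::size_t size() const { return size_; }
    std::size_t bytes() const { return words_.size() * sizeof(std::uint64_t); }

private:
    // Assumed encoding: A=0, C=1, G=2, T=3. Any other character is
    // stored as T in this sketch, so it is lossy for non-ACGT input.
    static std::uint64_t encode(char base) {
        switch (base) {
            case 'A': case 'a': return 0;
            case 'C': case 'c': return 1;
            case 'G': case 'g': return 2;
            default:            return 3;
        }
    }

    std::vector<std::uint64_t> words_;
    std::size_t size_ = 0;
};
```

At two bits per base, 3.2 billion bases occupy about 800 MB, compared with about 3.2 GB at one byte per character.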
boost::genetics::augmented_string
stores the bases as two bits each but also allows long runs of 'N' and other
characters. This is intended to store genomes as a single, very long string.
It should not be used for normal text data; std::string is the best choice
for that.
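
One way to allow runs of 'N' next to packed bases is to keep a sorted side table of exception runs and consult it on access. The sketch below assumes that design; it is not taken from augmented_string, and the names exception_run and augmented_sketch are hypothetical. To keep it short, the bases are held in a plain std::string rather than packed into two bits.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration only; not boost::genetics::augmented_string.
struct exception_run {
    std::size_t start;   // index of the first character in the run
    std::size_t length;  // number of repeated characters
    char value;          // the non-ACGT character, e.g. 'N'
};

class augmented_sketch {
public:
    explicit augmented_sketch(const std::string& text) {
        for (std::size_t i = 0; i < text.size(); ++i) {
            const char c = text[i];
            if (c == 'A' || c == 'C' || c == 'G' || c == 'T') {
                bases_.push_back(c);
                continue;
            }
            bases_.push_back('A');  // placeholder; the run table overrides it
            if (!runs_.empty() && runs_.back().value == c &&
                runs_.back().start + runs_.back().length == i) {
                ++runs_.back().length;        // extend the current run
            } else {
                runs_.push_back({i, 1, c});   // start a new run
            }
        }
    }

    char operator[](std::size_t i) const {
        // Find the first run that starts after i; the run before it, if
        // any, is the only one that can cover i.
        auto it = std::upper_bound(
            runs_.begin(), runs_.end(), i,
            [](std::size_t pos, const exception_run& r) { return pos < r.start; });
        if (it != runs_.begin()) {
            const exception_run& r = *(it - 1);
            if (i < r.start + r.length) return r.value;
        }
        return bases_[i];
    }

    std::size_t size() const { return bases_.size(); }
    std::size_t run_count() const { return runs_.size(); }

private:
    std::string bases_;                // would be two-bit packed in practice
    std::vector<exception_run> runs_;  // sorted by start
};
```

A long stretch of 'N' costs a single table entry, which is why this layout suits genomes but not ordinary text, where almost every character would become an exception.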
boost::genetics::two_stage_index
has a first index based on content and a second index based on position. After
an initial content look-up, we can search on position without having to
touch the reference data. This is important because random memory access is
very slow.
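
The sketch below shows one possible reading of this two-stage layout; the actual two_stage_index may be organised differently. It assumes the first stage is a table keyed by the value of a k-base substring, and the second stage is an array of positions grouped by that key and sorted within each group. The names two_stage_sketch, find and find_pair are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical illustration only; not boost::genetics::two_stage_index.
class two_stage_sketch {
public:
    // k is the substring length; keep it small so the first stage
    // (4^k + 1 entries) stays modest.
    two_stage_sketch(const std::string& ref, std::size_t k)
        : k_(k), starts_((std::size_t(1) << (2 * k)) + 1, 0) {
        assert(k >= 1 && k <= 12);
        if (ref.size() < k) return;
        const std::size_t n = ref.size() - k + 1;
        // Count occurrences of each key, then prefix-sum into start offsets.
        for (std::size_t i = 0; i < n; ++i) ++starts_[key(ref, i) + 1];
        for (std::size_t j = 1; j < starts_.size(); ++j) starts_[j] += starts_[j - 1];
        // Fill the second stage; ascending i keeps each group sorted.
        std::vector<std::uint32_t> next(starts_.begin(), starts_.end() - 1);
        positions_.resize(n);
        for (std::size_t i = 0; i < n; ++i)
            positions_[next[key(ref, i)]++] = static_cast<std::uint32_t>(i);
    }

    // Content look-up: all positions where the k-base string occurs.
    std::vector<std::uint32_t> find(const std::string& kmer) const {
        assert(kmer.size() == k_);
        const std::size_t a = key(kmer, 0);
        return std::vector<std::uint32_t>(positions_.begin() + starts_[a],
                                          positions_.begin() + starts_[a + 1]);
    }

    // Position search: match a query of length 2k using only the index.
    // A hit at p requires the second half of the query to occur at p + k.
    std::vector<std::uint32_t> find_pair(const std::string& query) const {
        assert(query.size() == 2 * k_);
        const std::size_t a = key(query, 0);
        const std::size_t b = key(query, k_);
        std::vector<std::uint32_t> hits;
        for (std::uint32_t i = starts_[a]; i < starts_[a + 1]; ++i) {
            const std::uint32_t wanted = positions_[i] + static_cast<std::uint32_t>(k_);
            if (std::binary_search(positions_.begin() + starts_[b],
                                   positions_.begin() + starts_[b + 1], wanted))
                hits.push_back(positions_[i]);
        }
        return hits;
    }

private:
    // Pack k bases starting at pos into an integer, two bits per base.
    std::size_t key(const std::string& s, std::size_t pos) const {
        std::size_t value = 0;
        for (std::size_t i = 0; i < k_; ++i) value = value * 4 + code(s[pos + i]);
        return value;
    }

    // Assumed encoding; characters other than A, C, G map to T's code.
    static std::size_t code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            default:  return 3;
        }
    }

    std::size_t k_;
    std::vector<std::uint32_t> starts_;     // first stage: key -> offset
    std::vector<std::uint32_t> positions_;  // second stage: sorted positions
};
```

In this sketch find_pair never reads the reference string after construction: it compares position lists only, which keeps the memory accesses inside the two index arrays.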
boost::genetics::fasta
uses an augmented_string, a two_stage_index and a vector of chromosome data
to represent an entire genome.
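
When a whole genome is stored as one long string, a position in that string has to be mapped back to a chromosome. The sketch below assumes the chromosome data holds a name and a half-open range of offsets into the string; the chromosome_sketch struct and the locate function are hypothetical and not part of the library.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration only; not the library's chromosome record.
struct chromosome_sketch {
    std::string name;
    std::size_t start;  // offset of the first base in the genome string
    std::size_t end;    // one past the last base
};

// Return the chromosome containing pos, or nullptr if there is none.
// The vector must be sorted by start with non-overlapping ranges.
inline const chromosome_sketch* locate(
        const std::vector<chromosome_sketch>& chromosomes, std::size_t pos) {
    auto it = std::upper_bound(
        chromosomes.begin(), chromosomes.end(), pos,
        [](std::size_t p, const chromosome_sketch& c) { return p < c.start; });
    if (it == chromosomes.begin()) return nullptr;
    --it;
    return pos < it->end ? &*it : nullptr;
}
```

The offset within the chromosome is then pos minus the start of the chromosome that locate returns.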