Home | Libraries | People | FAQ | More |
dna_string
is a specialization
of boost::genetics::basic_dna_string
.
boost::genetics::basic_dna_string::find_inexact
is available for the search.
There are other functions in the class boost::genetics::basic_dna_string
too boost::genetics::basic_dna_string::append
,
boost::genetics::basic_dna_string::compare
,
boost::genetics::basic_dna_string::find
,
boost::genetics::basic_dna_string::compare_inexact
.
Our genomes are more or less identical with only the occasional tiny difference between them. When we have a sequence to search against a reference, we often allow a few errors.
For RNA-SEQ searches, for example, matching gene translation to a reference such as ENSEMBL, we will possibly allow two errors but out of 100. For CRISPR searches, for gene editing, this can be more like 6 out of 20. These are two very different search criteria and different algorithms are used for different mixes.
For the 6/20 case, at present we use a brute force search, checking 6.4 billion subsequences against the search string.
void inexact_find() { using namespace boost::genetics; // Make a DNA string. Only A, C, G or T allowed. dna_string str("ACAGAAACAGTACGTAGGATACAGGTACA"); // GAAACA GATACA // Search the string for a substring with up to one error. dna_string GATACA = "GATACA"; size_t gaaaca_pos = str.find_inexact(GATACA, 0, (size_t)-1, 1); if (gaaaca_pos != dna_string::npos) { // The first search finds GAAACA (distance one from GATACA) std::cout << str.substr(gaaaca_pos, 6) << " found at location " << gaaaca_pos << "\n"; size_t gataca_pos = str.find_inexact(GATACA, gaaaca_pos+1, (size_t)-1, 1); if (gataca_pos != dna_string::npos) { // The second search finds GATACA itself. std::cout << str.substr(gataca_pos, 6) << " found at location " << gataca_pos << "\n"; } } } // void inexact_find()
The full code for these is at misc_dna_examples.cpp.