Boost C++ Libraries Home Libraries People FAQ More

PrevUpHomeNext

Inexact searches (or distance searches)

dna_string is a specialization of boost::genetics::basic_dna_string.

boost::genetics::basic_dna_string::find_inexact is available for the search.

There are other functions in the class boost::genetics::basic_dna_string too boost::genetics::basic_dna_string::append, boost::genetics::basic_dna_string::compare, boost::genetics::basic_dna_string::find, boost::genetics::basic_dna_string::compare_inexact.

Our genomes are more or less identical with only the occasional tiny difference between them. When we have a sequence to search against a reference, we often allow a few errors.

For RNA-SEQ searches, for example, matching gene translation to a reference such as ENSEMBL, we will possibly allow two errors but out of 100. For CRISPR searches, for gene editing, this can be more like 6 out of 20. These are two very different search criteria and different algorithms are used for different mixes.

For the 6/20 case, at present we use a brute force search, checking 6.4 billion subsequences against the search string.

void inexact_find()
{
    using namespace boost::genetics;

    // Make a DNA string. Only A, C, G or T allowed.
    dna_string str("ACAGAAACAGTACGTAGGATACAGGTACA");
    //                 GAAACA        GATACA

    // Search the string for a substring with up to one error.
    dna_string GATACA = "GATACA";
    size_t gaaaca_pos = str.find_inexact(GATACA, 0, (size_t)-1, 1);
    if (gaaaca_pos != dna_string::npos) {
        // The first search finds GAAACA (distance one from GATACA)
        std::cout << str.substr(gaaaca_pos, 6) << " found at location " << gaaaca_pos << "\n";
        size_t gataca_pos = str.find_inexact(GATACA, gaaaca_pos+1, (size_t)-1, 1);
        if (gataca_pos != dna_string::npos) {
            // The second search finds GATACA itself.
            std::cout << str.substr(gataca_pos, 6) << " found at location " << gataca_pos << "\n";
        }
    }
} // void inexact_find()

The full code for these is at misc_dna_examples.cpp.


PrevUpHomeNext