Hi everyone,

First post here so feel free to school me if my etiquette needs it.

I'm reading a very long string of data (genome stuff AT TT GT AA AG AA ... etc) and comparing it using hamming distance with a similar string. (Naively checking for similarity entry by entry).

I have quite a few questions to put to you but lets begin with one that will probably demonstrate my level of C++.

My code reads as follows:

int main () {
    ifstream indata;
    string str[2][899701];
    int j,count;    
    indata.open("file"); 

    if(!indata) { 
        cerr << "Error: file could not be opened" << endl;
        exit(1);
    }

    for (j=0; j<899701; j++){
        indata >> str[0][j];        //read in base line
    }
    while(indata){
        count=0;
        for (j=0; j<899701; j++){
                indata >> str[1][j];       //read in line to be compared
                count=count+(str[0][j]==str[1][j]);   //hamming distance
        }
    }

    indata.close();
    return 0;
}

My first question: Can I replace

    indata >> str[1][j];       //read in line to be compared
    count=count+(str[0][j]==str[1][j]);   //hamming distance

with something like

    count=count+(str[0][j]==indata.get);  

which would combine the two steps. Also, if this is possible would it significantly speed things up. I have >5000 strings to compare and each has ~800000 pairs of letters.

I am also open to any suggestions about different approaches to take here. Is C++ a good choice for this kind of problem?

Any help/suggestions/scathing criticism is much appreciated.
Peace.

A one-off calculation of the hamming distance between two strings would take O(N) time. This code would not be noticeably faster - but it is perhaps cleaner.

#include <string>

std::size_t hamming_distance( const std::string& a, const std::string& b )
{
    if( a.size() != b.size() ) return -1 ;
    std::size_t hd = 0 ;
    for( std::size_t i = 0 ; i < a.size() ; ++i ) hd += ( a[i] == b[i] ) ;
    return hd ;
}

#include <fstream>

int main()
{
    std::ifstream file( "file" ) ;
    std::string a, b ;
    std::getline( file, a ) ;
    while( std::getline( file, b ) )
    {
        std::size_t distance = hamming_distance( a, b ) ;
        // use distance
    }
}

If you are repeatedly calculating the hamming distance with a constant set of strings in the file, optimizations may be possible. For instance, encode the genome strings into a bit sequence and then calculate using bit-wise operations (with a std::bitset<>)

#include <unordered_map>

std::string create_bit_sequence( const std::string& str )
{
    static std::unordered_map<char,std::string> m =
                   { { 'A', "00" }, { 'C', "01" }, { 'T', "10" }, { 'G', "11" } } ;
    std::string result ;
    for( char c : str ) result += m[c] ;
    return result ;
}

Many thanks. You've given me a lot to think about. i will certainly try using bit-wise operations.

I'm having a number of problems implementing the bit sequence function. For instance

 for( char c : str ) result += m[c] ;

gives the error: "a function definition is not allowed here before the ':' token". I should mention I'm using XCode on Mac OSX so I've had to make a couple of changes.

#include <tr1/unordered_map>

std::string create_bit_sequence( const std::string& str )
{
    static std::tr1::unordered_map<char,std::string> m;
    m['A']="00";
    m['C']="01";
    m['G']="10";
    m['T']="11";
    std::string result ;
    for( char c : str ) result += m[c] ;
    return result ;
}

I had to change the declaration of m because I was getting the error:'m' must be initialised by the constructor, not by '{...}'.

Thanks.

I'm using XCode on Mac OSX

If it's a recent version of GCC or clang, you should be able to compile the code with
-std=c++11 compiler option.

Otherwise:

std::string create_bit_sequence( const std::string& str )
{
    std::string result ;

    for( std::string::size_type i = 0 ; i < str.size() ; ++i  )
    {
        switch( str[i] )
        {
            case 'A' : result += "00" ; break ;
            case 'C' : result += "01" ; break ;
            case 'G' : result += "10" ; break ;
            case 'T' : result += "10" ; break ;
            default : { /* unexpected */ }
        }
    }

    return result ;
}
 for( char c : str ) result += m[c] ;

It's a Java-like implementation of itration over a string, when char c will be at each iteration a character from the string.
Here's what my compiler says:

range-based-for loops are not allowed in C++98 mode

Also, perhaps this answer exaplins better:

This file requires compiler and library support for the upcoming ISO C++ standard, C++0x. This support is currently experimental, and must be enabled with the -std=c++0x or -std=gnu++0x compiler options.

That is, if you have an "older" compiler, such as me. :)

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.