Parse or Tokenize String

Question

Akilah712 0 Newbie Poster

17 Years Ago

I am working on a project and I need to process a text file.

I have read in the text file.
What I want to do is break the textfile up. The textfile looks like this:

>Name 1
ABCDEF
GHIJKLM
>Name2
GHIJKLM

What I want is to store each name and the sequence that follows it separately. For instance, Name[0] = Name1 and Name[1] = Name2; Letters[0] = ABCDEF and Letters[1] = GHIJKLM.

I have done this in Java, where I used the StringTokenizer, but from what I have read, there is no tokenizer in C++.

So far I have read the entire contents of the text file into a buffer. From there I have split the file into name+sequence sets. Now I need to separate each name from its set of letters. Here is what I have so far:

void processFile (){
	string contents;
	string fileName;

	cout << "Enter the file name: ";
	getline (cin,fileName);
	
	//Open file
	ifstream file(fileName.c_str());    // might want to add binary mode here
	
	//Read contents of file into a string
	stringstream buffer;
	buffer << file.rdbuf();	
	string str(buffer.str()); 
	contents = str.c_str();//entire file
	
       //close the file	
	file.close();
	
	
	//Use the Tokenize function to get the name+sequence sets;
	//this vector will store one token per name+sequence set
	vector<string> sets;

	//get the sets - name+sequence
	Tokenize (contents, sets, ">");
	
	//stores the split names and sequences
	vector<string>dna;
	//split the sets	
	for (int x = 0; x < sets.size(); x++){
		Tokenize (sets[x], dna, "\n");		
			
	}		
	

	//store the names
	for (int i = 0; i<dna.size();){
	
		names.push_back(dna[i]);
		i = i + 2;
		
	}
	//store each sequence
	for (int j = 1; j<dna.size();){
	
		sequences.push_back(dna[j]);
		j = j + 2;
	}

}//End processFile


void Tokenize(const string& str,vector<string>& tokens, string del)
{
    string delimiters = del;
	
    // Skip delimiters at the beginning: lastPos is the start of the first token.
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find the first delimiter after that token starts: pos is the token's end.
    string::size_type pos     = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters to reach the start of the next token.  Note the "not_of"
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find the delimiter that ends the next token.
        pos = str.find_first_of(delimiters, lastPos);
    }
}

The problem is with getting the names and the sets of letters. Basically, I split the string I made at each occurrence of ">". After that it does not work well.

c++


Bench 212 Posting Pro

17 Years Ago

Why are you copying the file contents out of a filestream, into a stringstream, then, into a string, and then, into another string?

you can extract each line of the file one by one straight into your vector, without all that fuss...

#include <iostream>
#include <vector>
#include <fstream>
#include <string>

using namespace std;

int main()
{
    ifstream fs("test.txt");
    string input;
    vector<string> sets;
    while( getline(fs, input) )
        sets.push_back(input);
}

All that remains is to work out which elements in your vector are names (the ones which start with '>').

vijayan121 1,152 Posting Virtuoso

17 Years Ago

#include <fstream>
#include <string>
#include <vector>
#include <cassert>
using namespace std ;

int main()
{
  const char DELIM = '>' ;
  const char* const file_name = "whatever" ;
  ifstream file( file_name ) ; assert(file) ;
  vector<string> names, sequences ;
  string line ; 

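  // skip lines till we get one starting with DELIM
  bool found = false ;
  while( !found && getline(file, line) )
    found = !line.empty() && line[0]==DELIM ;
  if( !found ) return 0 ; // no name line in the file, so nothing to store

  names.push_back( line.substr(1) ) ; // drop the leading DELIM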
  string charseq ;
  while( getline(file, line) )
  {
    if( !line.empty() && line[0] == DELIM )
    {
      sequences.push_back(charseq) ;
      charseq.clear() ;
      names.push_back( line.substr(1) ) ;
    }
    else
     charseq += line + '\n' ;
  }
  sequences.push_back(charseq) ;
}

iamthwee commented: short n sweet +11


Akilah712 0 Newbie Poster · Answer 1 · 2007-07-19T09:03:26+00:00

I tried that. However it only made things more difficult.

It split the contents of the file into individual lines therefore splitting up the information that I need.

Example.
>Name
ABCDEFG
HIJKLMNO

I have to extract the name, and then the letters must be stored together????

Jessehk 20 Newbie Poster · Answer 2 · 2007-07-19T10:40:30+00:00

I don't exactly understand the question, but have you considered using something like boost::tokenizer?

http://boost.org/libs/tokenizer/index.html

Bench 212 Posting Pro · Answer 3 · 2007-07-19T18:30:38+00:00

That's easy enough. Identify and isolate the vector elements which contain names, then concatenate the others together.

There are loads of different ways of doing this. If this doesn't work for you, then you need to tell us how you intend to store the individual data 'records'. You may or may not be able to do it all in a single step.

Akilah712 0 Newbie Poster · Answer 4 · 2007-07-20T08:28:17+00:00

No luck here.

I can isolate the names and that's it.

I want an array of names and an array of sequences.

Names[0] = name1

Names[1] = name2

Sequences[0] = ABC...
Sequences[1] = ABC....

I just want to parse the text file at the ">" symbol.

Akilah712 0 Newbie Poster · Answer 5 · 2007-07-21T05:31:49+00:00

Thanks.

It worked, but I had to change charseq.clear() to charseq.erase(0, charseq.length());

My compiler gave me an error for clear. Said it's not part of the basic string library.

Thanks again!!!