Just to introduce myself again: my name is Shaun and I'm a 21-year-old engineering student with about... six weeks of C++ experience.
My main concern isn't my ability to get the job done. I have enough Matlab experience and logical thinking to accomplish this task. My worry is that with all the amazing libraries available for C++, there's probably a MUCH more efficient way of doing this.
I have a text file. The data is arranged in the following format.
<spaces> "Column 1 Data" <spaces> "Column 2 Data" <spaces> "Column 3 Data" <new line>
The number of spaces is not consistent. I need to extract the data from column three and do n-gram analysis on it. If anyone is interested as to what exactly n-grams are and how they're used, I'll be happy to explain. However, for the sake of brevity, I'll just provide an example; this should be sufficient.
For the string "Shaun" I would need to produce
S
h
a
u
n
Sh
ha
au
un
Sha
hau
aun
Shau
haun
Shaun
I should point out that I did NOT stop there because I had reached the length of the word. No matter how long the string is, I will only break it up into pieces of at most 5 characters.
So, using a column-based approach, I was able to accomplish this using a combination of Matlab and Excel. However, I'd like to do it in Visual Studio C++ 7.1.
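To make the rule concrete, here is a rough sketch of the splitting as two nested loops over substring length and start position. The function name ngrams and the max_len parameter are just names I made up for illustration, and this is only one plain way to do it, not necessarily the most efficient.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns every substring of `word` whose length is
// between 1 and max_len, shortest pieces first, left to right.
std::vector<std::string> ngrams(const std::string& word, std::size_t max_len = 5)
{
    std::vector<std::string> out;
    // Outer loop: piece length, capped by both max_len and the word length.
    for (std::size_t len = 1; len <= max_len && len <= word.size(); ++len) {
        // Inner loop: every start position where a piece of this length fits.
        for (std::size_t start = 0; start + len <= word.size(); ++start) {
            out.push_back(word.substr(start, len));
        }
    }
    return out;
}
```

Calling ngrams("Shaun") yields the 15 pieces listed above, in the same order. For a 6-letter string the result stops at the 5-letter pieces and never includes the whole word.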
My idea is to first use regular expressions to look for every run of one or more consecutive spaces. I'd replace every match with a comma, thus giving me a file delimited by commas and not a varying number of spaces.
Next, I can use the ifstream::get() function to break up the columns, discarding the first and second columns and writing the characters in the third column to an object str of the class string, while looking for a \n to stop on.
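As a sketch of that replacement step: the version below uses std::regex from the C++11 standard library, which is an assumption on my part since Visual C++ 7.1 predates it. Boost.Regex provides boost::regex and boost::regex_replace, which can be called the same way. The function name spaces_to_commas is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces each run of one or more spaces with one comma.
// Note that leading spaces also become a comma, so the first field is empty.
std::string spaces_to_commas(const std::string& line)
{
    static const std::regex spaces(" +");  // one or more consecutive spaces
    return std::regex_replace(line, spaces, ",");
}
```

Because each line starts with spaces, the output starts with a comma, so the data I want would end up in the fourth comma-separated field rather than the third.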
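For comparison, here is a sketch of a different approach that skips the regex step entirely: the stream extraction operator >> already skips any amount of leading whitespace. This swaps in std::getline and std::istringstream for ifstream::get(), and it assumes that no column contains spaces inside its data. The function name third_column is made up for illustration.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads whitespace-separated lines from `in` and
// collects the third field of each line. Assumes fields contain no spaces.
std::vector<std::string> third_column(std::istream& in)
{
    std::vector<std::string> result;
    std::string line;
    while (std::getline(in, line)) {          // one line at a time, up to '\n'
        std::istringstream fields(line);
        std::string col1, col2, col3;
        // operator>> skips runs of whitespace, however long they are.
        if (fields >> col1 >> col2 >> col3) { // lines with < 3 fields are skipped
            result.push_back(col3);
        }
    }
    return result;
}
```

Any input stream works here, so an std::ifstream opened on the text file can be passed in directly.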
Once I have str, I can break it up using... some function. This is the part I really need your help on.
Once I have broken it up, I'll store the pieces somewhere (I can do this part later, it's more complicated and is my task for next week) and then loop through again, discarding columns 1 and 2 from the next line and so on.
That's where I stand. I'm installing Boost right now and reading up on its regular expression capabilities.
Thanks!