Here is my code:

#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <fstream>
#include <stdlib.h>//(for atoi to work)

using namespace std;

void usage()
{
	cout << "Usage: <input1> <input2> <output>\n";
	cout << "\n see README for more details.\n";
	exit(1);
}

int main(int argc, char *argv[])
{
	cout << "\nshmoosh - concatenates and uniques wordlists into one\n";

	if(argc!=4)
		usage();

	vector<string> vec_wordlist_compilation;

///////////////////////////input1//////////////////////////////

	ifstream wordlistfile(argv[1]);
	if(!wordlistfile.is_open())
	{
		cout<<"\nError opening file \'"<<argv[1]<<"\'\n";
		exit(1);
	}
	int x=0;
	string word;
	while(getline(wordlistfile,word)){
		vec_wordlist_compilation.push_back(word);
		x++;
	}
	cout << x << " words loaded from file \'"<<argv[1]<<"\'\n";

	wordlistfile.close();

///////////////////////////input2//////////////////////////////

	ifstream wordlistfiletwo(argv[2]);
	if(!wordlistfiletwo.is_open())
	{
		cout<<"\nError opening file \'"<<argv[2]<<"\'\n";
		exit(1);
	}
	int v=0;
	while(getline(wordlistfiletwo,word)){
		vec_wordlist_compilation.push_back(word);
		v++;
	}
	cout << v << " words loaded from file \'"<<argv[2]<<"\'\n";

	wordlistfiletwo.close();

////////////////////////////sort//////////////////////////////
	cout << "\nsorting " << v+x << " words, removing duplicates...\n";

	//sort vector (least to greatest)...
	sort(vec_wordlist_compilation.begin(),vec_wordlist_compilation.end());
	//remove duplicates...
	

	vec_wordlist_compilation.resize((unique(vec_wordlist_compilation.begin(),vec_wordlist_compilation.end()))-vec_wordlist_compilation.begin());

	/*for(unsigned int c=0;c<vec_wordlist_compilation.size();c++)
		cout << vec_wordlist_compilation[c] << "\n";*/
	cout << vec_wordlist_compilation.size() << " unique words remain.\n";

////////////////////////////output//////////////////////////////

	ofstream output(argv[3]);
	for(unsigned int c=0;c<vec_wordlist_compilation.size();c++)
		output << vec_wordlist_compilation[c] << "\n";

	return 0;
}

I wrote it to merge two wordlists and remove duplicates. The problem is that it crashes on large wordlists. I cannot say exactly how many words it can take, but at somewhere around 200 MB it crashes while loading the wordlists. I have 8 GB of RAM, so I know it's not running out of space. Is there a limit (in MB) on how much a C++ vector can hold? Is there a way around this? If not, does anybody know of a library that will let me do this?

I thought about making a version that writes a temporary file to the hard drive and scans that file for every new word to make sure it is not already in there, but I figured this would be waaaaay too slow.

Can anybody help?

A std::vector is guaranteed to keep its data in one contiguous block -- meaning it needs a single unbroken stretch of address space, which gets hard to find for really large data.

Use a std::deque instead. It looks much the same, but the data need not be stored contiguously -- meaning it can handle a great deal more data (because it can work with the OS/compiler's memory management more flexibly).
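A minimal sketch of that swap, assuming the rest of the program stays as posted. The helper name merge_unique is made up for illustration and is not part of the original code. std::sort and std::unique still work because a deque provides random-access iterators.

```cpp
#include <algorithm>
#include <deque>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying the helper without files
#include <string>

// Hypothetical helper: read every line from two streams into one deque,
// then sort it and drop adjacent duplicates.
std::deque<std::string> merge_unique(std::istream& a, std::istream& b)
{
    std::deque<std::string> words;
    std::string word;
    while (std::getline(a, word)) words.push_back(word);
    while (std::getline(b, word)) words.push_back(word);

    // Sorting puts equal words next to each other so unique() can find them.
    std::sort(words.begin(), words.end());
    // unique() only moves duplicates to the tail; erase() actually removes them.
    words.erase(std::unique(words.begin(), words.end()), words.end());
    return words;
}
```

In the posted program, the two ifstream objects could be passed straight in, since an ifstream is a std::istream. This does not get around any per-process memory limit; it only removes the need for one contiguous block.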

BTW, you shouldn't be using atoi(). Use a stringstream instead...

#include <sstream>
#include <stdexcept>
#include <string>

int myatoi( const std::string& s )
  {
  int result;
  std::istringstream ss( s );
  // Fail if nothing could be parsed, or if anything is left over after the number
  if (!(ss >> result) || !ss.eof()) throw std::runtime_error( "not an integer" );
  return result;
  }

Untested!

Hope this helps.

>>I have 8 gigs of ram, so I know it's not running out of space

32-bit programs cannot access all that memory at one time. Each 32-bit program is limited to about 2 GB of address space, no matter how much RAM is installed.
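One quick way to see which kind of build you actually have is to look at the pointer size. This is only a sketch, and pointer_bits is a made-up name used here for illustration.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: number of bits in a pointer for the current build.
// On common desktop targets this is 32 for a 32-bit build and 64 for a 64-bit one.
std::size_t pointer_bits()
{
    return sizeof(void*) * CHAR_BIT;
}
```

Printing pointer_bits() from main shows what the compiler produced. A result of 32 means the program is a 32-bit build and is bound by the 32-bit address space limit even on a machine with 8 GB of RAM.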

I created a 1.2 GB text file containing randomly generated 10-character words, then tried to read it into a std::list. The program crashed after reading just over 24 million words. I changed the program to use deque instead of list, and it read even fewer words before crashing. (My computer runs Vista Home, has 5 GB of RAM, and I used the VC++ 2008 Express compiler/IDE.)

Of course, it would have been easier to check by calling the list's max_size() method. For deque it reports:

maxsize = 134217727
Press any key to continue . . .
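The check itself is short. The sketch below is not the exact program used above, and reported_max_size is a name invented for this example; it simply asks a default-constructed container for its max_size().

```cpp
#include <cstddef>
#include <deque>
#include <list>
#include <string>
#include <vector>

// Hypothetical helper: the largest element count the given container type
// claims it could ever hold on this build.
template <typename Container>
std::size_t reported_max_size()
{
    return Container().max_size();
}
```

For example, reported_max_size<std::deque<std::string> >() gives the deque figure. Treat it as a theoretical ceiling only: the value depends on the compiler and element type, and allocation can fail long before it is reached, as the out-of-memory run in this post shows.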

Changed the program to use try/catch and got this:

23020000
23030000
Out of memory

// final size of the deque
size = 23031567
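The try/catch version looked roughly like the sketch below. This is a reconstruction under assumptions, not the original test program, and load_words is a hypothetical name. Containers report allocation failure by throwing std::bad_alloc, so catching it lets the program stop cleanly instead of crashing.

```cpp
#include <deque>
#include <istream>
#include <new>      // std::bad_alloc
#include <sstream>  // std::istringstream, handy for trying the helper without files
#include <string>

// Hypothetical helper: append every line from `in` to `words`.
// Returns true if the whole stream was read, false if memory ran out first.
bool load_words(std::istream& in, std::deque<std::string>& words)
{
    std::string word;
    try {
        while (std::getline(in, word))
            words.push_back(word);
    } catch (const std::bad_alloc&) {
        // Whatever was loaded before the failure is still in `words`.
        return false;
    }
    return true;
}
```

When load_words returns false, the caller can print "Out of memory" along with words.size() to see how far it got.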

Curses! I forgot about the 2GB 32-bit limitation. I don't suppose there is any simple way to reconfigure my compiler (Microsoft Visual C++ 2008 Express Edition) to compile this code in 64-bit mode and thus enable it to access the necessary RAM? Would I have to re-write the code?

anybody? Compile this code in 64-bit?

I will start another more appropriately labeled thread.

>>I don't suppose there is any simple way to reconfigure my compiler (Microsoft Visual C++ 2008 Express Edition) to compile this code in 64-bit mode and thus enable it to access the necessary RAM? Would I have to re-write the code?

You cannot configure the Express edition to do that. You will have to buy a Pro edition (or maybe Standard). You might also want to check out GNU g++, because I think it will compile 64-bit programs.
