I need to build an app that deals with email address management, among other things.
Say the user has an Excel file with 2 million rows (email addresses). I read it the OleDb way, and the first mistake I made was putting ~500k rows into a DataGridView. Bad, bad mistake: the tiny app ended up occupying ~700 MB of RAM.
I've ditched the DataGridView for now (I'll later reintroduce it with virtualization and on-demand pages). With only the DataSet, memory goes up to about 170 MB and then settles at around 100 MB.
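Roughly what the virtual-mode idea can look like, as a minimal sketch; it assumes the addresses are already in an in-memory list called emailRows, and the list and column names are just placeholders:

// Minimal DataGridView virtual-mode sketch (WinForms).
// The grid only asks for the cells that are currently visible, so the full
// list never has to be materialized as grid rows.
dataGridView1.VirtualMode = true;
dataGridView1.Columns.Add("email", "Email");
dataGridView1.RowCount = emailRows.Count;

dataGridView1.CellValueNeeded += (sender, e) =>
{
    if (e.ColumnIndex == 0 && e.RowIndex < emailRows.Count)
        e.Value = emailRows[e.RowIndex];
};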
I would really appreciate some advice on the best way to deal with this kind of file (Excel, text, CSV, all with about 2 million rows), keeping in mind that I need to validate each row against a regex, remove duplicates, and export to Excel, CSV, or text files.
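For the regex part, a minimal sketch of a per-row check; the pattern is deliberately loose and only illustrative, so tighten it to whatever rules actually apply:

using System.Text.RegularExpressions;

// Compile the pattern once; millions of IsMatch calls against a compiled
// regex are cheap. The pattern below is intentionally simple and illustrative.
static readonly Regex EmailPattern =
    new Regex(@"^[^@\s]+@[^@\s]+\.[^@\s]+$", RegexOptions.Compiled);

static bool IsValidEmail(string address)
{
    return !string.IsNullOrWhiteSpace(address) && EmailPattern.IsMatch(address.Trim());
}

A HashSet<string> of addresses already written then takes care of the duplicates; the chunked sketch further down shows both pieces in context.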

Are you stuck with this technology choice? Because from your description, this project has outgrown Excel as a useful backing data store. I'd consider a database-oriented solution like MySQL (which does incorporate regular expressions).

I think that, if possible, the data could be read in chunks of, say, 2K rows. Each time a chunk is read, first look for duplicates inside the chunk and then compare it against the other 2M - 2K rows. If a row is present both in the 2K chunk and in the remaining 1,998,000 rows, discard it from the chunk. Supposing N duplicates were discarded, write the remaining 2K - N rows to a text file formatted as CSV, for example. Continue with the next chunk of 2K rows in the same manner, appending its unique emails to the text file started with the first chunk. This requires 1,000 chunks to be processed and checked against the regex (1,000 x 2,000 = 2M).
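A sketch along those lines, with one simplification: instead of re-scanning the remaining rows for every chunk, a single HashSet of addresses already written stands in for the comparison against everything processed so far. The ACE OLEDB connection string, the sheet name and the chunk size are assumptions; adjust them to the actual file. IsValidEmail is the regex check sketched earlier.

using System;
using System.Collections.Generic;
using System.Data.OleDb;
using System.IO;

// Streams the worksheet with a data reader (no DataSet), processes it in
// chunks of 2,000 rows and appends the unique, regex-valid addresses to a CSV.
static void ExportUniqueEmails(string xlsxPath, string csvPath)
{
    const int chunkSize = 2000;                        // ~1,000 chunks for 2M rows
    var connStr = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + xlsxPath +
                  ";Extended Properties=\"Excel 12.0 Xml;HDR=YES\"";
    var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    using (var conn = new OleDbConnection(connStr))
    using (var writer = new StreamWriter(csvPath))
    {
        conn.Open();
        var cmd = new OleDbCommand("SELECT * FROM [Sheet1$]", conn);  // sheet name is an assumption
        using (var reader = cmd.ExecuteReader())
        {
            var chunk = new List<string>(chunkSize);
            while (reader.Read())
            {
                chunk.Add(Convert.ToString(reader[0]));   // first column holds the address
                if (chunk.Count == chunkSize)
                {
                    FlushChunk(chunk, seen, writer);
                    chunk.Clear();
                }
            }
            FlushChunk(chunk, seen, writer);              // last partial chunk
        }
    }
}

static void FlushChunk(List<string> chunk, HashSet<string> seen, StreamWriter writer)
{
    foreach (var email in chunk)
    {
        if (string.IsNullOrWhiteSpace(email)) continue;
        var trimmed = email.Trim();
        if (!IsValidEmail(trimmed)) continue;   // regex check from the earlier sketch
        if (seen.Add(trimmed))                  // Add returns false for a duplicate
            writer.WriteLine(trimmed);
    }
}

The HashSet keeps one copy of every unique address (a few tens of MB for 2 million short strings), which is the price for never having to re-read the file.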

A spreadsheet is not the application to use for something with two million rows. For that you should be using a database. There are queries you can run to detect duplicates, or you can make the email address the primary key and detect the duplicate on insert with a try/catch. Exporting database records to Excel is trivial.
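A sketch of the primary-key/try-catch idea, assuming SQL Server and a one-column table such as CREATE TABLE Emails (Email NVARCHAR(320) PRIMARY KEY); the table and parameter names are placeholders:

using System.Data.SqlClient;

// Returns true if the address was new, false if the primary key already existed.
static bool TryInsertEmail(SqlConnection conn, string email)
{
    using (var cmd = new SqlCommand("INSERT INTO Emails (Email) VALUES (@e)", conn))
    {
        cmd.Parameters.AddWithValue("@e", email);
        try
        {
            cmd.ExecuteNonQuery();
            return true;                              // new address
        }
        catch (SqlException ex) when (ex.Number == 2627)
        {
            return false;                             // primary key violation = duplicate
        }
    }
}

One round-trip per row is slow for 2 million inserts, so in practice a bulk import into a staging table plus the duplicate query below scales better.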

I'm not making the Excel file; the client is. I don't think that can be helped. I appreciate all your answers.

What you could do is create a database table and import the Excel data. Then you could use that table to do your computations (dup check, etc.). Of course, you could always go to whoever created that spreadsheet originally and convince them of the error of their ways.
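A sketch of that import step, assuming SQL Server as the target and the same ACE OLEDB connection string as for reading the sheet directly; the sheet and table names are placeholders:

using System.Data.OleDb;
using System.Data.SqlClient;

// Pipes the worksheet straight into a database table without building a DataSet.
static void ImportSheetToSqlServer(string excelConnStr, string sqlConnStr)
{
    using (var source = new OleDbConnection(excelConnStr))
    using (var target = new SqlConnection(sqlConnStr))
    {
        source.Open();
        target.Open();

        var cmd = new OleDbCommand("SELECT * FROM [Sheet1$]", source);  // sheet name is an assumption
        using (var reader = cmd.ExecuteReader())
        using (var bulk = new SqlBulkCopy(target))
        {
            bulk.DestinationTableName = "dbo.ImportedEmails";           // placeholder table
            bulk.BatchSize = 10000;
            bulk.WriteToServer(reader);                                 // streams the rows across
        }
    }
}

Once the data is in a table, the dup check becomes a plain query and exporting back out is straightforward.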

Yeah. Right. Good luck with that ;-P

What I would do in this situation is build the database with a simple GUI to maintain it, then give it to the user as a replacement for the spreadsheet. I would also try very hard to avoid using the word "moron" in that conversation.

You can determine which emails occur more than once by

SELECT email, COUNT(email) AS NumOcc
  FROM myTable
 GROUP BY email
HAVING (COUNT(email) > 1)

and you can select all the unique records by

SELECT DISTINCT *
  FROM myTable

Funny and helpful, thank you. I need to deliver this project Friday and I have to do a lot more than this. I managed to optimize it a lot (in my opinion). I also found out that Excel has a limit of 1,048,576 rows per sheet. When I load 1 million rows in the app it goes to 300 MB and then settles at about 200 MB.
Eliminating dupes and checking the syntax is pretty fast (a matter of seconds), and exporting to txt takes about 2 seconds. Most users will work with 100k rows at most; I was testing the absolute limit.
Update 1.1 will come with databases and more. :)
I wanted to implement a progress bar for the Excel file loading, but that's a bit complicated for me (I have almost zero knowledge of threading); see the sketch at the end of this post.
Now I wonder what happens if the user has a custom server and tries to send 1 million emails.

  • to 1 million recipients, not 1 million emails, sorry.
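On the progress-bar question above: a minimal BackgroundWorker sketch, assuming a WinForms form with a ProgressBar named progressBar1, a known row count totalRows, and a ProcessRow placeholder for the real per-row work:

using System.ComponentModel;

var worker = new BackgroundWorker { WorkerReportsProgress = true };

worker.DoWork += (s, e) =>
{
    // Runs on a background thread: do the loading/processing here and
    // never touch controls directly from this handler.
    for (int i = 0; i < totalRows; i++)
    {
        ProcessRow(i);                                   // placeholder for the real work
        if (i % 1000 == 0)
            worker.ReportProgress(i * 100 / totalRows);  // percentage done so far
    }
};

// These two events fire back on the UI thread, so updating controls is safe.
worker.ProgressChanged += (s, e) => progressBar1.Value = e.ProgressPercentage;
worker.RunWorkerCompleted += (s, e) => MessageBox.Show("Done.");

worker.RunWorkerAsync();

The form stays responsive while the file loads, and ReportProgress marshals the percentage back to the UI thread for the progress bar.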