I recently decided that an app I'm working on should accumulate data in batches AND THEN insert it into the database with one statement, rather than adding 1,000 records through one thousand separate inserts.
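For illustration, here is a minimal sketch of that idea using Python's built-in sqlite3 module; the items table and app.db file are made up for the example:

    import sqlite3

    conn = sqlite3.connect("app.db")                 # hypothetical database file
    conn.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER, payload TEXT)")
    rows = [(i, f"value-{i}") for i in range(1000)]  # the accumulated batch

    # One multi-row INSERT instead of 1,000 single-row statements. Watch the
    # engine's bound-parameter limit and chunk the batch if you would exceed it.
    placeholders = ", ".join(["(?, ?)"] * len(rows))
    params = [value for row in rows for value in row]
    with conn:                                       # runs as a single transaction
        conn.execute(f"INSERT INTO items (id, payload) VALUES {placeholders}", params)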

Is this better done through a file or in memory? If I write the segments of the bulk-insert statement to a file and then execute from the file, the security people will complain and say I'm writing self-modifying code.

Yet, if I try to assemble the individual segments in memory, I run the risk that one of my end users will try to feed it too much data at once and that the thing will choke.

What's the lesser of two evils here? Am I better off in the long term equipping the software to detect big batches, break them up, and execute each piece from memory?


If your batch size is, say, 1K and your user decides to spam you with, say, 10K+ records, what does it matter to you? Just do the following (a rough sketch follows the list):

  • flush your existing batch queue
  • loop through the large request 1K at a time and immediately commit to the database using some batch update
  • add the remainder of the request to your batch queue.
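In Python terms it could look something like this; bulk_insert is a made-up stand-in for whatever batch update your database layer provides, and BATCH_SIZE is the 1K from above:

    from collections import deque

    BATCH_SIZE = 1000          # the 1K batch size from the example above
    queue = deque()            # the existing in-memory batch queue

    def bulk_insert(rows):
        """Stand-in for whatever batch update the database layer provides."""
        ...

    def handle_request(incoming):
        # 1. Flush whatever is already queued.
        if queue:
            bulk_insert(list(queue))
            queue.clear()
        # 2. Commit the oversized request one full batch at a time.
        full_batches, remainder = divmod(len(incoming), BATCH_SIZE)
        for i in range(full_batches):
            bulk_insert(incoming[i * BATCH_SIZE:(i + 1) * BATCH_SIZE])
        # 3. Queue the leftover rows for the next regular flush.
        if remainder:
            queue.extend(incoming[-remainder:])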

I suppose that will partly depend on whether the inserted records are an "all or nothing" transaction. When I had to bulk-insert 8,000+ records at a time, it was important that either all the records went in or none did. Recovering from a partial insert would have taken hours and would have interfered with the subsequent inserts (which ran every five minutes) as well as the downstream processes.
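If it has to be all or nothing, wrapping the whole batch in one explicit transaction is the usual answer. A minimal sketch with Python's sqlite3 (the readings table is made up for the example):

    import sqlite3

    def insert_all_or_nothing(conn, rows):
        # Using the connection as a context manager commits on success and
        # rolls back automatically if any row fails, so a partial insert
        # never reaches the table.
        with conn:
            conn.executemany(
                "INSERT INTO readings (sensor_id, value) VALUES (?, ?)", rows
            )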

What we do here at DaniWeb is write to Redis. Redis is a distributed, in-memory key-value store that supports several data structures (lists, sets, hashes, streams, and so on) and can be configured to be strictly in-memory or to persist to disk.
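As a rough sketch of that pattern with the redis-py client (the key name and record shape are made up): each incoming record is pushed onto a Redis list, and a worker periodically drains up to a batch's worth and hands them to the bulk insert.

    import json
    import redis                  # assumes the redis-py client package

    r = redis.Redis()             # assumes a local Redis on the default port

    def enqueue(record):
        # Producer side: append the record to the pending list.
        r.rpush("pending_rows", json.dumps(record))

    def drain(batch_size=1000):
        # Consumer side: atomically read and remove up to batch_size records.
        pipe = r.pipeline()                           # MULTI/EXEC by default
        pipe.lrange("pending_rows", 0, batch_size - 1)
        pipe.ltrim("pending_rows", batch_size, -1)
        raw, _ = pipe.execute()
        return [json.loads(item) for item in raw]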

No RDBMS or ETL app has a magic bullet. Getting data in is the big challenge. Some RDBMS have bulk input for CSV or the like.

One pitfall is that a big insert may take out many long-lived row/page/table locks. One middle path is to bulk-import into a temporary staging table and then use row-count limits to apply just N rows at a time. Or you can simply limit the batch size before any import. Locks may not concern a reporting or batch database, but they are critical in a transactional database.
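A sketch of that staging-table middle path in Python/SQLite terms (staging and live are made-up table names with matching columns): each pass moves at most N rows in its own short transaction, so locks are held only briefly.

    import sqlite3

    def apply_from_staging(conn, n=1000):
        while True:
            ids = [row[0] for row in
                   conn.execute("SELECT rowid FROM staging LIMIT ?", (n,))]
            if not ids:
                break
            marks = ", ".join("?" * len(ids))
            with conn:  # one short transaction per pass keeps lock time down
                conn.execute(
                    f"INSERT INTO live SELECT * FROM staging WHERE rowid IN ({marks})", ids)
                conn.execute(
                    f"DELETE FROM staging WHERE rowid IN ({marks})", ids)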

Once I compacted a huge table with clustered indices after a 90% delete this way, but I slept after each mini-batch for as long as the batch took to apply, so my database load stayed under 50% and the interactive users never noticed a thing. It took a few days to compact the whole table: copying to a batch table, deleting, reinserting, deleting the batch. The related indices compacted as well! I recall setting N to the RDBMS page size in bytes, 2048, so each pass worked in whole pages.
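The throttling part of that trick is easy to reproduce; a minimal sketch, where apply_batch is whatever does the copy/delete work for one mini-batch:

    import time

    def apply_throttled(batches, apply_batch):
        for batch in batches:
            start = time.monotonic()
            apply_batch(batch)
            elapsed = time.monotonic() - start
            # Rest at least as long as the work took, so the database spends
            # no more than roughly half its time on this job.
            time.sleep(elapsed)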

On one project, I opened a flat file in a UDF so that, with a stored procedure supplying empty rows, I could call related UDFs to fetch fields in an insert-select. On another project, we used Java in Oracle to expose data like a table in a query for insert-select, which in Oracle can insert into many tables. My one insert-select for five tables was 4K long and terrified all the other developers!
