Hey Gang!

I'm hitting a point with my (Python/MySQL/Linux) app for processing large volumes of network records where I need to make a design decision.
I definitely want the app to have high performance. Optimization as a skill set is rare, so there's no reason not to employ it if you have it. No one can copy you, because innovation is not what most tech start-ups do; they grew up coding on pay-per-flop architectures.
My methodology is to get the data into the database and let the database do database things. I got very good advice here at DaniWeb to switch to MySQL, and it has been a tremendous time saver!
And yet there are some things I can rely on Python and Python alone to do.
I am trying to minimize those things. For example, I plan to hand-write warehousing drivers in ANSI C to get the data into the database without Python.
And yet not everything can be accomplished in C.
Do any of you have any general advice about Python optimization? I have tried all of the obvious things, like optimizing the interpreter, things that have always worked with Perl or Ruby. Python has been less than cooperative.

I wish I could help, but I do MySQL and PHP. I definitely agree with you that it makes sense to get data into the database and then process and query it as much as you can from in there. However, it's been my experience that just because you can do something from within MySQL doesn't mean you should. MySQL is not always the best tool for the job, the same way Python or PHP are not always the best tools for the job. Sometimes it's more efficient to do things, especially manipulating records, from outside the database.

I can't speak on optimization in general without seeing the code. In my previous life I was a Windows sysadmin/DBA as well as a digital plumber. I wrote many apps that had to move large quantities of data from place to place, for example, importing 16,000+ records into a SQL database every five minutes. I did all this with VBScript (today I would choose Python). The trick to processing that many records quickly was using VBScript to format the records, then using BULK INSERT to load all of the records in one transaction. This drastically reduced the processing time by not having to submit each insert separately.

The import load later grew to 16,000 records on the hour and another 16,000 at five past, on top of the regular 16,000 every five minutes. Scripting easily handled the load. You could easily write the massaging code in C/C++ and compare it to the equivalent in Python. Considering the overhead for file I/O would be the same in both, I'd be surprised if the difference were significant.
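For the OP's Python/MySQL stack, the same pattern might look something like this. A minimal sketch, assuming the mysql-connector-python driver and a hypothetical network_records table; the point is to batch rows through executemany() rather than submitting one INSERT per record:

    import mysql.connector  # assumed driver; PyMySQL exposes the same API

    # Hypothetical connection and schema, for illustration only.
    conn = mysql.connector.connect(user="app", password="secret", database="netdata")
    cur = conn.cursor()

    def bulk_insert(rows, batch_size=10000):
        """Load pre-formatted (src_ip, dst_ip, bytes) tuples in large batches."""
        sql = "INSERT INTO network_records (src_ip, dst_ip, bytes) VALUES (%s, %s, %s)"
        for i in range(0, len(rows), batch_size):
            cur.executemany(sql, rows[i:i + batch_size])
        conn.commit()  # one commit for the whole load, not one per row

For very large loads, writing the formatted records to a file and using MySQL's LOAD DATA LOCAL INFILE is usually faster still; it's the closest MySQL analogue to SQL Server's BULK INSERT.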

commented: Thank you. I have decided that I shall go for the bulk insert. +1
  1. Get something working in Python if that's your comfort zone.
  2. Profile it to find out where the bottlenecks are (do not make guesses, find evidence); see the sketch below.
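For instance, the standard library's cProfile will tell you where the time actually goes. A minimal sketch; process_records() is a stand-in for whatever your main entry point is:

    import cProfile
    import pstats

    # Run the workload under the profiler and save the raw stats.
    cProfile.run("process_records()", "profile.out")

    # Show the 20 functions with the highest cumulative time.
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)

You can get the same report without touching the code by running python -m cProfile -s cumulative yourscript.py.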

For example, if the program is spending a lot of time waiting for I/O (aka the disk), then writing chunks of code in C isn't going to buy much (it will just wait more efficiently).

If it's spending a lot of time talking to MySQL, then research alternative methods of driving MySQL. Driving MySQL the same way from C won't fix anything.

How much faster could it be?
Say, for example, all the input conversion were zero cost. Time how long it takes to dump, say, 10K records into the MySQL database.
If that time is only marginally shorter than the full run, nothing will be gained by rewriting chunks in C.
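One way to get that number; a minimal sketch, assuming rows holds 10K pre-formatted records and a bulk_insert() helper like the one sketched earlier in the thread:

    import time

    start = time.perf_counter()
    bulk_insert(rows)  # insert only: all parsing/conversion already done
    elapsed = time.perf_counter() - start
    print(f"10K rows loaded in {elapsed:.2f}s")  # this is your floor

If the full pipeline takes barely longer than this floor, the database, not Python, is the limit.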

How much faster could you make it?
Turning 1 hour into 55 minutes isn't going to change the way you use the program. You're still going to run it and then go to lunch.
You need an order of magnitude improvement (like hours into minutes) to make it worth doing. Shaving off a few % might be academically interesting, but if your time is someone's $, then think carefully.

Next, what's the payback period?
Say you spend a month (say 150 hours) carefully crafting, debugging and testing the best C version ever.
If you made it 10 minutes faster per run, you need on the order of 1000 runs (150 hours is 9,000 minutes; 9,000 / 10 = 900 runs) before you see a return on that time investment.
But bear in mind the previous point: it's only a payback if the users could use those 10 minutes productively. If it's 10 minutes saved in an hour, people are still coming back from lunch after an hour regardless.

commented: Thank you, I realize that I must now profile my system. +1

Yes, it's non-negotiable that you mass-insert records into the database. In my situation, when I mentioned it was more performant to update records in PHP than in MySQL, I was specifically trying to transform strings with regex and some very simple logic. Although there were many, many rows, it was more performant to update them one at a time from PHP because MySQL's IF() and REGEXP() functions aren't anywhere near as performant as PHP's. So, in my case, using the right tool for the job meant recognizing that MySQL was best as a datastore and PHP was best for handling all the business logic.
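In the OP's Python context, the same division of labor might look like this. A minimal sketch, reusing the hypothetical network_records table and conn/cur handles from earlier; the regex runs in application code and MySQL only stores the result:

    import re

    # Read the rows out, transform them in Python, write the results back.
    cur.execute("SELECT id, hostname FROM network_records")
    pattern = re.compile(r"\.example\.com$")  # hypothetical cleanup rule

    updates = [(pattern.sub("", host), row_id) for row_id, host in cur.fetchall()]
    cur.executemany("UPDATE network_records SET hostname = %s WHERE id = %s", updates)
    conn.commit()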

commented: Thanks, mass records insert it is! +1

To improve Python's performance, you may want to utilize libraries like NumPy and pandas for effective data management. Profiling tools like cProfile can help pinpoint any bottlenecks in your code. Employing multiprocessing can make use of multiple CPU cores, and integrating Cython can accelerate vital parts of your Python code. Lastly, for high-performance tasks, incorporating C/C++ extensions can markedly boost performance.

commented: Poster of ChatGPT hand-waving pith with no actionable substance. -4