Recently, I wrote a tool to verify data in a .xlsx spreadsheet. This tool checks each column in the sheet against a specific regex pattern defined for that column. The tool was working great, and I had checked a few of the typical smaller sheets. A few days later, I ran across a larger .xlsx file. I kicked off the script, checked back a few minutes later for the expected result, and saw it was still processing this file. When I returned a few minutes after that and it was still running, I realized I had a performance problem on my hands.
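The core of the tool can be sketched roughly like this — the column names, patterns, and helper functions below are all hypothetical stand-ins, and reading the .xlsx file itself is omitted:

```python
import re

# Hypothetical per-column patterns; the real tool defines one regex
# for each column in the sheet.
PATTERNS = {
    "id": re.compile(r"\d+"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}

def validate_row(row):
    """Return the names of columns whose values fail their pattern."""
    return [col for col, pattern in PATTERNS.items()
            if not pattern.fullmatch(str(row.get(col, "")))]

def validate_rows(rows):
    """Return (row_index, bad_columns) pairs for every failing row."""
    return [(i, bad) for i, row in enumerate(rows)
            if (bad := validate_row(row))]
```

Every cell gets a full regex match, so runtime grows with rows × columns — which is why a large sheet turned this from instant to nearly an hour.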
The time it took the single-threaded version to completely process this large file was:
Single-Threaded
real    54m59.385s
user    54m54.503s
sys     0m4.123s
I immediately began looking at the options available to me, starting with the threading module, which seemed like the obvious solution.
I started all the threads and began monitoring the load on the machine. I was unimpressed with the promised “threading”: it appeared to make no difference in machine load or core utilization. When time came back, I was surprised to see that performance was actually worse than the single-threaded implementation.
Threaded
real    74m39.904s
user    71m49.452s
sys     15m45.056s
A lot of time was spent in kernel space, as you can see from the increased sys time, but there was no performance gain.
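My threaded attempt looked roughly like the sketch below (the pattern and chunking scheme are simplified stand-ins for the real per-column checks). The work is divided across threads, but because regex matching is CPU-bound Python code, only one thread can execute at a time — the others just add scheduling and lock contention, which is where that extra sys time went:

```python
import re
import threading

PATTERN = re.compile(r"\d+")  # stand-in for a real per-column pattern

def check_chunk(chunk, results, index):
    # CPU-bound work: the GIL serializes this across all threads,
    # so the threads take turns instead of running in parallel.
    results[index] = [v for v in chunk if not PATTERN.fullmatch(v)]

def validate_threaded(values, workers=4):
    chunks = [values[i::workers] for i in range(workers)]
    results = [None] * workers
    threads = [threading.Thread(target=check_chunk, args=(c, results, i))
               for i, c in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [bad for part in results for bad in part]
```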
I kept searching for answers and started to study the Global Interpreter Lock, or GIL. I had heard of the GIL before, mostly in articles complaining about Python. The GIL ensures that only one thread executes Python bytecode at a time, which is exactly why my CPU-bound threads could not run in parallel across cores. I started to wonder if I had wasted my time writing this tool in Python and should have chosen my new-found friend Go. Go does not suffer from the GIL and is designed for concurrency, but no one who would have to use or maintain the tool besides me is familiar with Go, which is why I wrote it in Python. I knew there had to be a solution out there somewhere, since Python is a very popular language. Some suggestions included using an alternative interpreter like Jython or IronPython, but that seemed just as extreme as a rewrite in another language.
I finally found the alternative to threading: the multiprocessing package. Multiprocessing advertises itself as “a package that supports spawning processes using an API similar to the threading module.” After my experience with the poor performance of the threading module, I was already very skeptical. Interestingly enough, the arguments passed to multiprocessing.Process are almost identical to those for threading.Thread, meaning this required minimal code changes on my part. I started the benchmark again, still skeptical of what was going to happen. I could tell almost immediately there was going to be a difference, even before opening top, because the fans on my laptop kicked on right away. I could see that all cores were being used, and load was where I expected.
Here is the result:
Multiprocessing
real    17m32.867s
user    137m13.869s
sys     0m18.850s
As you can see, it is a night-and-day difference. With just a couple of changes to the package import and function call from the threading version, I am now seeing the results I expected. This also answered my question about the use case for the threading module: because of the GIL, threads help I/O-bound workloads, where they can overlap time spent waiting on the network or disk, but not CPU-bound work like regex matching. I was able to keep using Python for this project, and I worry less about performance problems I might encounter in future projects.
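For reference, the swap that produced this speedup can be sketched as follows. The pattern and chunking are illustrative stand-ins, not the tool's actual code. The one real API difference from the thread version is that processes do not share memory, so a multiprocessing.Queue replaces the shared results list:

```python
import re
import multiprocessing

PATTERN = re.compile(r"\d+")  # stand-in for a real per-column pattern

def check_chunk(chunk, queue):
    # Each worker is a separate process with its own interpreter
    # and its own GIL, so chunks are matched in parallel across cores.
    queue.put([v for v in chunk if not PATTERN.fullmatch(v)])

def validate_multiprocess(values, workers=4):
    queue = multiprocessing.Queue()
    chunks = [values[i::workers] for i in range(workers)]
    procs = [multiprocessing.Process(target=check_chunk, args=(chunk, queue))
             for chunk in chunks]
    for p in procs:
        p.start()
    # Drain the queue before joining: a child blocked on a full
    # queue buffer would otherwise never exit.
    results = [queue.get() for _ in procs]
    for p in procs:
        p.join()
    return [bad for part in results for bad in part]
```

Note that results can arrive in any order, since whichever process finishes first enqueues first. On platforms that spawn rather than fork child processes (Windows, and macOS by default on recent Python versions), the entry point must also be protected with an `if __name__ == "__main__":` guard.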