Python GIL / Threads / Processes

Recently, I wrote a tool to verify data in a .xlsx spreadsheet. This tool checks each column in the sheet against a specific regex pattern defined for that column. The tool was working great, and I had checked a few of the typical smaller sheets. A few days later, I ran across a larger .xlsx file. I kicked off the script, checked back a few minutes later for the expected result, and saw it was still processing the file. A few minutes later I returned again to see it still running, and I realized I had a performance problem on my hands.
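To make the idea concrete, here is a minimal sketch of that kind of per-column check. It is not the actual tool: the column names, the patterns and the validate_rows helper are all made up for illustration, and the rows are plain lists rather than cells read from a real .xlsx file.

```python
import re

# Hypothetical column rules: header name -> pattern the whole cell must match.
COLUMN_PATTERNS = {
    "id": re.compile(r"\d+"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def validate_rows(header, rows):
    """Return (row_number, column, value) for every cell that fails its pattern."""
    errors = []
    for row_number, row in enumerate(rows, start=2):  # row 1 is the header
        for column, value in zip(header, row):
            pattern = COLUMN_PATTERNS.get(column)
            if pattern is not None and not pattern.fullmatch(str(value)):
                errors.append((row_number, column, value))
    return errors


header = ["id", "email", "date"]
rows = [
    ["1", "a@example.com", "2016-01-31"],
    ["x2", "b@example.com", "31/01/2016"],
]
print(validate_rows(header, rows))
# [(3, 'id', 'x2'), (3, 'date', '31/01/2016')]
```

Every cell costs one regex match, so the run time grows with rows times columns, which is why a large sheet hurts so much more than a small one.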

The time it took the single threaded version to completely process this large file was:

Single-Threaded
real 54m59.385s
user 54m54.503s
sys 0m4.123s

I immediately started looking at the options available to me. I started with the threading module. This seemed like the obvious solution.

I started all the threads and began monitoring the load on the machine. I was unimpressed with the promised “threading”: it appeared to have made no difference in machine load or core utilization. When time came back, I was surprised to see that performance was worse than the single-threaded implementation.

Threaded
real 74m39.904s
user 71m49.452s
sys 15m45.056s

A lot more time was spent in kernel space, as you can see from sys growing from about 4 seconds to almost 16 minutes, but there was no performance increase.
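The effect is easy to reproduce on a small scale. The sketch below is not the original tool: the pattern, the data and the function names are invented for illustration. It runs the same CPU-bound regex check serially and then in four threads. On a standard CPython build with the GIL, I would expect the threaded run to be no faster, and possibly slower.

```python
import re
import threading
import time

PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")  # hypothetical column rule


def count_invalid(values, results, index):
    """CPU-bound work: count values that do not match PATTERN."""
    results[index] = sum(1 for v in values if not PATTERN.fullmatch(v))


def run_serial(chunks):
    """Process every chunk one after another in the calling thread."""
    results = [0] * len(chunks)
    for i, chunk in enumerate(chunks):
        count_invalid(chunk, results, i)
    return sum(results)


def run_threaded(chunks):
    """Process every chunk in its own thread and wait for all of them."""
    results = [0] * len(chunks)
    threads = [
        threading.Thread(target=count_invalid, args=(chunk, results, i))
        for i, chunk in enumerate(chunks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)


if __name__ == "__main__":
    data = ["2016-01-31", "not a date"] * 200_000
    chunks = [data[i::4] for i in range(4)]
    for name, fn in (("serial", run_serial), ("threaded", run_threaded)):
        start = time.perf_counter()
        invalid = fn(chunks)
        print(f"{name}: {invalid} invalid in {time.perf_counter() - start:.2f}s")
```

Only one thread can execute Python bytecode at a time, so the threads take turns on the same work instead of spreading it across cores.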

I kept researching for answers and started to study the Global Interpreter Lock, or GIL. I had heard of the GIL before, mostly in articles complaining about Python. I started to wonder if I had wasted my time writing this tool in Python and should have chosen my newfound friend Go. Go does not suffer from the GIL and is designed for concurrency, but no one other than myself who has to use or maintain the tool is familiar with Go, which is why I wrote it in Python. I knew that there had to be a solution out there somewhere, since Python is a very popular language. Some suggestions included using an alternative interpreter like Jython or IronPython, but that seemed just as extreme as a rewrite in another language.

I finally found the alternative to threading: the multiprocessing package. Its documentation describes it as “a package that supports spawning processes using an API similar to the threading module.” After my experience with the poor performance of the threading module, I was very skeptical. Interestingly enough, the arguments passed to multiprocessing.Process are almost identical to those of threading.Thread, so this required minimal code changes on my part. I started the benchmark again. I could tell almost immediately that there was going to be a difference, even before opening up top, because the fans on my laptop kicked on. I could see that all cores were being used, and the load was where I expected.
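Here is a rough sketch of how small the change can be. The check_chunk and start_workers functions and the sample data are hypothetical, not taken from my tool. The same helper starts either threads or processes, and the only thing that differs is the class passed in.

```python
import multiprocessing
import re
import threading

PATTERN = re.compile(r"\d+")  # hypothetical column rule


def check_chunk(name, values):
    """Report how many values in one chunk fail the pattern."""
    invalid = sum(1 for v in values if not PATTERN.fullmatch(v))
    print(f"{name}: {invalid} invalid")
    return invalid


def start_workers(worker_cls, chunks):
    """Start one worker per chunk; worker_cls is Thread or Process."""
    workers = [
        worker_cls(target=check_chunk, args=(f"chunk-{i}", chunk))
        for i, chunk in enumerate(chunks)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return len(workers)


if __name__ == "__main__":
    chunks = [["1", "2", "x"], ["4", "y", "z"]]
    start_workers(threading.Thread, chunks)         # one interpreter, one GIL
    start_workers(multiprocessing.Process, chunks)  # one interpreter per process
```

Each process has its own interpreter and its own GIL, which is why the work can finally spread across cores. The trade-off is that processes do not share memory, so arguments are copied to each worker and results have to be sent back explicitly, for example through a multiprocessing.Queue.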

Here is the result:

Multiprocess
real 17m32.867s
user 137m13.869s
sys 0m18.850s

As you can see, the difference is night and day. With just a couple of changes to the package import and function call that I had made for threading, I am now seeing the results I expected. I wonder what the use case for the threading package is, since for CPU-bound work like this the multiprocessing package appears to be significantly better. I was able to continue to use Python for this project and, in addition, lessen my worries about performance problems that I might encounter in future projects.


Using parallel to multi thread scripts

Traditional awk, sed or grep commands do not multi-thread by default. Multi-threading a task on a large data set can cut the run time by up to 50%.

For additional information on this package: http://www.gnu.org/software/parallel/

Dependencies: 

Debian 

sudo apt-get install parallel

CentOS 6

sudo wget -P /etc/yum.repos.d/ http://download.opensuse.org/repositories/home:/tange/CentOS_CentOS-6/home:tange.repo

yum install parallel

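Some use cases:
You want to find a string in text:
cat ./logs.csv | parallel --pipe awk '/string/' > ./stringoutput.csv

This is parallel in its most basic form: --pipe splits standard input into blocks and feeds each block to its own awk process, with no other options set.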

Let's say that you want to control the size of the blocks the file is broken into. Add --block followed by the block size (e.g. 100M, 10M). --block only takes effect together with --pipe, so both must be present:

cat ./logs.csv | parallel --pipe --block 100M awk '/string/' > ./stringoutput.csv

Since parallel tries to keep every core busy, you can limit it with --jobs.

By default --jobs is the same as the number of CPU cores. For example:

cat ./logs.csv | parallel --jobs 1 --pipe --block 100M awk '/string/' > ./stringoutput.csv

This limits parallel to one job at a time, so the work runs on a single CPU core.
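For readers who think more easily in Python than in shell, here is a rough sketch of what block splitting and a job limit amount to. It is only an analogy and not how parallel is implemented: split_blocks, grep_block and parallel_grep are names I made up, and the pool of worker processes stands in for the awk jobs.

```python
import re
from multiprocessing import Pool

PATTERN = re.compile(r"string")  # same match as awk '/string/'


def split_blocks(lines, block_size):
    """Group lines into blocks of roughly block_size bytes, never splitting a line."""
    block, size = [], 0
    for line in lines:
        block.append(line)
        size += len(line.encode())
        if size >= block_size:
            yield block
            block, size = [], 0
    if block:
        yield block


def grep_block(block):
    """Keep the lines of one block that contain the pattern."""
    return [line for line in block if PATTERN.search(line)]


def parallel_grep(lines, block_size, jobs):
    """Filter the blocks in `jobs` worker processes, keeping input order."""
    with Pool(processes=jobs) as pool:
        matched = pool.map(grep_block, split_blocks(lines, block_size))
    return [line for block in matched for line in block]


if __name__ == "__main__":
    lines = ["a string\n", "nothing\n", "another string\n", "last\n"]
    print(parallel_grep(lines, block_size=16, jobs=2))
```

One difference worth knowing: this sketch always returns matches in input order, while parallel prints each job's output as the job finishes unless you pass --keep-order.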

Summary:

parallel is a very powerful addition to scripts that require additional or focused resources.


Installation of Fedora 23 on HP EliteBook 840 G1

Before I start with the brief tutorial, I want to list the configuration, since results may vary based on the specs of the machine.

  • 14″ 1920 x 1080 display
  • Intel i5-4300U, 1.9 GHz base clock, 2 cores / 4 threads
  • 8 GB ADATA DDR3 12800 RAM
  • 180 GB Intel 530 2.5″ SSD

I decided to use the latest version of Fedora based on the good experience that I have had with my desktop. The latest version of Fedora runs a very clean, stable and beautiful version of GNOME.

First, I downloaded the .iso and wrote it to a USB drive. Since the laptop might ship with either Windows 7 or 8.1, the BIOS might be set to either UEFI or Legacy Boot. To check this, power off the machine, turn it back on, and press ESC at the HP logo screen.


Once you have pressed ESC, you will see a list of options. Select F10.


This will take you to the BIOS setup screen. Go to the Advanced tab at the top right corner.


Make sure that USB boot is selected.


Scroll down further to Boot Mode and select Legacy Boot. Once you have changed these settings, make sure to save and exit. You should now be able to boot from your bootable USB drive.

Most of the devices worked out of the box. The only devices that did not were:

  • Sound
  • Fingerprint Reader
  • F8 mic-disable hot-key (there were so few I had to list it)

Before I got into resolving the issues with these devices I wanted to get all the updates:

sudo dnf update

After all the updates finished, I went to install Google Chrome. Google has really made this easier than in past years.

https://www.google.com/chrome/browser/desktop/

The download should open in Software Updater. Select Install and enter the root password.

There are a few guides to “Things to Install after Installing Fedora 23”.

What you will want to install will vary based on your usage, but here is one link:

http://www.binarytides.com/better-fedora-23/

The issue with the sound driver was related to an incompatibility with PulseAudio. I was easily able to resolve this with a simple terminal command:

sudo dnf remove alsa-plugins-pulseaudio

Once I rebooted, the issue was immediately resolved. In the few hours I have had the machine, I have tried without success to probe the fingerprint reader to see what drivers might exist for it. I will update this blog if I do find a fix. Please let me know if you think there are any other important additions.
