May 2016 – Rush Simonson III

Traditional awk, sed or grep commends do not multi thread by default. Multi Threading a task on a large data set can improve the time up to 50%.

For additional information this package: http://www.gnu.org/software/parallel/

Dependencies:

Debian

sudo apt-get install parallel

CentOS 6

wget http://download.opensuse.org/repositories/home:/tange/CentOS_CentOS-6/home:tange.repo

yum install parallel

Some use cases:

You want to find a string in text:

cat ./logs.csv | parallel awk ‘/string/’ > ./stringoutput.csv

This parallel contains no additional functionality.

Lets say that you want to add the ability to break the file up into block sizes. After adding –block and size of block (ex. 100M, 10M, etc) you must add –pipe

cat ./logs.csv | parallel –block –pipe 100M awk ‘/string/’ > ./stringoutput.csv

Since parallel is focused on maximizing the threads you can limit this by using: –jobs

By default –jobs is the same as the number of CPU cores. Arguments such as:

cat ./logs.csv | parallel –jobs 1 –pipe 100M awk ‘/string/’ > ./stringoutput.csv

This will limit job to single CPU thread.

Summary:

parallel is very powerful addition to scripts that require additional or focused resources.

Source

Rush Simonson III

A few tips and tricks..

Month: May 2016

Using parallel to multi thread scripts