Using parallel to multi thread scripts

Traditional awk, sed or grep commends do not multi thread by default. Multi Threading a task on a large data set can improve the time up to 50%.

For additional information this package: http://www.gnu.org/software/parallel/

Dependencies: 

Debian 

sudo  apt-get install parallel

CentOS 6

wget http://download.opensuse.org/repositories/home:/tange/CentOS_CentOS-6/home:tange.repo

yum install parallel

Some use cases:
You want to find a string in text:
cat ./logs.csv | parallel  awk ‘/string/’ > ./stringoutput.csv

This parallel contains no additional functionality.

Lets say that you want to add the ability to break the file up into block sizes. After adding –block and size of block (ex. 100M, 10M, etc) you must add –pipe

cat ./logs.csv | parallel –block –pipe 100M  awk ‘/string/’ > ./stringoutput.csv

Since parallel is focused on maximizing the threads you can limit this by using: –jobs

By default –jobs is the same as the number of CPU cores. Arguments such as:

cat ./logs.csv | parallel –jobs 1 –pipe 100M  awk ‘/string/’ > ./stringoutput.csv

This will limit job to single CPU thread.

Summary:

parallel is very powerful addition to scripts that require additional or focused resources.

Source

 

Advertisements
Using parallel to multi thread scripts

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s