Traditional awk, sed, and grep commands are single-threaded by default. Splitting a task on a large data set across multiple CPU cores can reduce the run time by up to 50%.
For additional information about this package, see: http://www.gnu.org/software/parallel/
Dependencies:
Debian
sudo apt-get install parallel
CentOS 6
sudo wget -P /etc/yum.repos.d/ http://download.opensuse.org/repositories/home:/tange/CentOS_CentOS-6/home:tange.repo
sudo yum install parallel
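cat ./logs.csv | parallel --pipe awk '/string/' > ./stringoutput.csv
This basic invocation uses no options beyond --pipe, which tells parallel to split standard input into blocks and feed each block to a separate awk process. Without --pipe, parallel would treat every input line as a command argument instead of as data.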
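A quick way to check the basic invocation is to compare its output with plain awk on a small file. The sketch below is only an illustration: the sample rows are made up, and --keep-order is added so that the output order matches the input order. If GNU parallel is not installed, the comparison is skipped.

```shell
#!/usr/bin/env bash
# Build a small throwaway sample file (hypothetical data, not from the article).
sample=$(mktemp)
printf '%s\n' 'a,string,1' 'b,other,2' 'c,string,3' > "$sample"

# Reference result from plain, single-process awk.
expected=$(awk '/string/' "$sample")

# Same filter through GNU parallel, if it is installed.
# --keep-order makes the output order match the input order.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    actual=$(cat "$sample" | parallel --pipe --keep-order awk '/string/')
else
    echo "GNU parallel not found; skipping comparison" >&2
    actual=$expected
fi

rm -f "$sample"
[ "$actual" = "$expected" ] && echo "outputs match"
```

Without --keep-order the same lines are returned, but blocks that finish early may be printed first.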
Let's say you want to control the size of the blocks the file is split into. Add --block followed by the block size (e.g. 100M, 10M). --block only takes effect together with --pipe, so both options must be present:
cat ./logs.csv | parallel --pipe --block 100M awk '/string/' > ./stringoutput.csv
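To see how the block size affects the split, each job can simply report how many lines it received. This is a minimal sketch under assumptions: the input is generated with seq rather than taken from logs.csv, and the 1k block size is chosen only so that a tiny input still produces several blocks. If GNU parallel is not installed, the lines are counted in a single process instead.

```shell
#!/usr/bin/env bash
# Count lines per block to see how --block splits the input.
# The 1k block size and the seq-generated input are illustrative assumptions.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # Each job receives one block on stdin and prints how many lines it got.
    per_block=$(seq 1000 | parallel --pipe --block 1k wc -l)
else
    echo "GNU parallel not found; counting in a single process instead" >&2
    per_block=$(seq 1000 | wc -l)
fi

# Blocks end on line boundaries, so the per-block counts add up to the input size.
total=$(printf '%s\n' "$per_block" | awk '{ sum += $1 } END { print sum }')
echo "lines per block:" $per_block
echo "total lines: $total"
```

A larger block size means fewer, bigger jobs. A smaller one spreads the work more evenly but starts more processes.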
Because parallel tries to keep every CPU core busy, you can limit the number of simultaneous jobs with --jobs.
By default --jobs is the same as the number of CPU cores. For example:
cat ./logs.csv | parallel --jobs 1 --pipe --block 100M awk '/string/' > ./stringoutput.csv
This limits parallel to one job at a time, so only a single CPU core is used.
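In a script, the job count can be derived from the machine instead of being hard-coded. The sketch below assumes you want to leave half of the cores free for other work. The variable names and the "half the cores" policy are illustrative choices, not something parallel requires. The filter itself only runs when ./logs.csv and GNU parallel are both present.

```shell
#!/usr/bin/env bash
# Number of online CPU cores (nproc is GNU coreutils; getconf is the fallback).
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

# Assumption for illustration: use half the cores, but never fewer than one job.
njobs=$(( cores / 2 ))
[ "$njobs" -ge 1 ] || njobs=1
echo "cores: $cores, jobs: $njobs"

# Run the filter only when the input file and GNU parallel are both present.
if [ -f ./logs.csv ] && parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    cat ./logs.csv | parallel --jobs "$njobs" --pipe --block 100M awk '/string/' > ./stringoutput.csv
fi
```

GNU parallel also accepts a percentage of the available cores directly, for example --jobs 50%.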
Summary:
parallel is a very powerful addition to scripts that need to use more CPU resources, or to limit how many they use.