
Sorting takes 2 hours on Vagrant – approx. 100M lines

What can I do to optimize this sort?

I am running:

mytime="$(time ( cat quotes_2009-04.txt | tr '[:space:]' '\n' | grep -v "^\s*$" | sort | uniq -c  ) 2>&1 1>/dev/null )"

and then:

echo "$mytime"

getting the following output:

real    134m0.107s
user    105m27.274s
sys     42m34.889s

Here's the dataset that I am using:

[image: dataset]

A preview of the original dataset:

[image: preview of the original dataset]

Here are the details on the Vagrant machine:

[image: Vagrant machine details]

What can I do to optimize this sort?


Answer

Split your data into several files, sort each file in parallel, then merge the sorted files together. See here for an example.
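A minimal sketch of the split / parallel sort / merge approach, assuming GNU coreutils (`split -n` is a GNU extension). The function name `count_words`, the default of 4 chunks, and the temporary file names are illustrative, not part of the original question. It produces the same word counts as the pipeline in the question, ordered by byte value because of `LC_ALL=C`:

```shell
# Count word frequencies by sorting chunks in parallel, then merging.
# Usage: count_words FILE [NUMBER_OF_CHUNKS]
# The body runs in a subshell, so variables and background jobs stay local.
count_words() (
    input="$1"
    chunks="${2:-4}"          # illustrative default; tune to the number of cores
    tmp="$(mktemp -d)"

    # One word per line; -s squeezes runs of whitespace, grep drops a
    # possible empty first line caused by leading whitespace.
    tr -s '[:space:]' '\n' < "$input" | grep -v '^$' > "$tmp/words"

    # Split into N pieces on line boundaries (GNU split).
    split -n "l/$chunks" "$tmp/words" "$tmp/chunk."

    # Sort each piece as a background job. LC_ALL=C compares raw bytes,
    # which avoids locale-aware collation.
    for f in "$tmp"/chunk.*; do
        LC_ALL=C sort -o "$f.sorted" "$f" &
    done
    wait

    # Merge the already sorted pieces without re-sorting, then count.
    LC_ALL=C sort -m "$tmp"/chunk.*.sorted | LC_ALL=C uniq -c

    rm -rf "$tmp"
)
```

For the file in the question this would be called as `count_words quotes_2009-04.txt 4`. It may also be worth trying GNU sort's own `--parallel=N` and `-S` (buffer size) options together with `LC_ALL=C` before splitting by hand, since that keeps the original one-line pipeline; whether either variant helps on a given Vagrant box depends on the cores and memory assigned to the VM.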

User contributions licensed under: CC BY-SA