
Run html2text using parallel

I am using html2text from GitHub, and I was able to run it on all the .html files in my folder using for file in *.html; do html2text "$file" > "$file.txt"; done, but it’s somewhat slow. How can I use html2text with parallel on all my .html files?


Answer

The original answer was:

for file in *.html
do
    html2text "$file" > "$file.txt" & 
done

The & at the end of the command tells bash to run it in the background and return control to the shell immediately.

This is unlikely to work well for thousands of files, as it spawns a new process for each file.


However, as the OP asked for this to work for millions of files, that approach is obviously not feasible: it would spawn millions of background processes and could potentially hang the machine.

What you have to understand is that processing millions of files WILL take more time, with the exact amount depending on your hardware and OS limits; roughly speaking, a million files takes on the order of a million times longer than a single file.

The reason the above answer seemed to work instantly for hundreds of files is that you got the command prompt back immediately. That does not mean the work was finished at that point: all those background processes may still be running until they complete, even though you can do something else in the meantime.
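If you want the loop itself to block until every background conversion has finished, the shell builtin wait does exactly that. A minimal sketch, based on the loop above:

for file in *.html
do
    html2text "$file" > "$file.txt" &   # run each conversion in the background
done
wait                                    # block until all background jobs have finished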

You could theoretically divide the file list into chunks and work chunk by chunk (see the sketch below); however, after testing this approach I do not think you will get the end result much faster than with parallel.
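For completeness, here is a minimal sketch of that chunked approach, assuming plain bash and file names without unusual characters: it starts at most 100 background jobs at a time (the 100 is an arbitrary example value) and waits for each batch to finish before starting the next.

n=0
for file in *.html
do
    html2text "$file" > "$file.txt" &
    n=$((n + 1))
    if [ "$n" -ge 100 ]; then   # after every 100 background jobs,
        wait                    # wait for the whole batch to finish
        n=0
    fi
done
wait                            # wait for the last, possibly partial, batch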

So, based on the number of files you have to process, I would suggest running parallel, as you yourself found out, though perhaps with significant tweaking of the number of parallel jobs.

So something like this should work:

find . -type f -name '*.html' > FLIST
parallel -a FLIST -j 1000 'html2text {} > {.}.txt'

Note, this is the syntax for the OP’s Python version of html2text. For the options needed with e.g. the html2text binary package available in the Ubuntu repositories, please see a previous edit of this answer.

This will process your HTML files with up to 1000 jobs running in parallel, and it does not use piping (which can sometimes slow things down considerably).
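If any of the file names contain spaces or newlines, a slightly safer variant (a sketch, assuming GNU find and GNU parallel) is to build a NUL-delimited list and tell parallel to read it with --null:

find . -type f -name '*.html' -print0 > FLIST
parallel --null -a FLIST -j 1000 'html2text {} > {.}.txt'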

If this is too slow, try increasing -j to maybe 10000, but then you are venturing into the hardware and operating-system limits of keeping 10000 parallel processes running at all times.
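As a rough way to gauge those limits on a typical Linux box (standard commands, not specific to this answer), you can check the core count and the per-user process limit:

nproc         # number of CPU cores available
ulimit -u     # maximum number of processes your user may run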
