
How to speed up sed that uses a regex on a very large single-cell BAM file

I have the following simple script that counts the tags encoded with “CB:Z” in a SAM/BAM file:


Typically it needs to process 40 million lines. That code takes around 1 hour to finish.

This line, sed 's/.*CB:Z:\([ACGT]*\).*/\1/', is very time-consuming. How can I speed it up?

I used a regex because the column position of the “CB” tag is not fixed. Sometimes it is in column 20 and sometimes in column 21.
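To illustrate why a position-independent pattern is convenient here, the following minimal sketch runs the sed expression over two made-up, tab-separated records in which the tag sits in different fields. The records are hypothetical and much shorter than real SAM lines; the variable names are chosen only for this example.

```shell
# Two made-up, tab-separated records (not real SAM lines): the CB tag is
# the 4th field of the first record and the 5th field of the second.
sample_lines=$(printf 'r1\tNH:i:1\tAS:i:90\tCB:Z:ACGT\tUB:Z:TTTT\nr2\tNH:i:1\tAS:i:90\tnM:i:0\tCB:Z:GGCA\n')

# The capture group keeps only the barcode letters, wherever the tag appears.
barcodes=$(printf '%s\n' "$sample_lines" | sed 's/.*CB:Z:\([ACGT]*\).*/\1/')
printf '%s\n' "$barcodes"
```

Each line is replaced by its captured barcode, so no field number has to be known in advance.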

Example BAM file can be found HERE.


Update

Speed comparison on the complete 40-million-line file:

My initial code:


James Brown’s with AWK:


James Brown’s with MAWK:



Answer

Another awk, pretty much like @tripleee’s, I’d assume:

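A minimal sketch of this kind of awk approach, assuming the barcode is extracted with match() and tallied in an associative array so that no separate sort and uniq pass is needed. It is not necessarily the exact command from the answer; the function name count_barcodes and the sample records are hypothetical, and real input would come from something like samtools view.

```shell
# count_barcodes (hypothetical helper): reads SAM-like text on stdin and
# prints "<count> <barcode>" once per distinct barcode.
count_barcodes() {
  awk 'match($0, /CB:Z:[ACGT]*/) {
         # RSTART and RLENGTH describe the match; skip the 5-character "CB:Z:" prefix
         count[substr($0, RSTART + 5, RLENGTH - 5)]++
       }
       END { for (bc in count) print count[bc], bc }'
}

# Made-up records for illustration; the last one has no CB tag and is skipped.
printf 'r1\tCB:Z:ACGT\nr2\tNH:i:1\tCB:Z:GGCA\nr3\tCB:Z:ACGT\nr4\tNH:i:1\n' |
  count_barcodes | sort -k2
```

The for-in loop visits array keys in an unspecified order, which is why the example pipes the result through sort.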

Sample output:


Notice: Your sed 's/.*CB:Z:... matches the last instance, whereas my awk 'match($0,/CB:Z:[ACGT]*/)... matches the first.
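The difference can be seen on a single made-up line that contains the pattern twice. This is only an illustrative sketch: the line is hypothetical, and a well-formed SAM record carries a given tag at most once, so the two commands would normally agree.

```shell
# One made-up record carrying the pattern twice
line='r1 CB:Z:AAAA XX:Z:foo CB:Z:CCCC'

# sed: the leading greedy .* runs up to the LAST "CB:Z:"
last=$(printf '%s\n' "$line" | sed 's/.*CB:Z:\([ACGT]*\).*/\1/')

# awk: match() reports the leftmost match, i.e. the FIRST "CB:Z:"
first=$(printf '%s\n' "$line" |
  awk 'match($0, /CB:Z:[ACGT]*/) { print substr($0, RSTART + 5, RLENGTH - 5) }')

printf 'sed: %s\nawk: %s\n' "$last" "$first"
```

Here sed reports CCCC and awk reports AAAA.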

Notice 2: Quoting @Sundeep in the comments: using LC_ALL=C mawk '..' will give even better speed.
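As a sketch of how that prefix is applied, the variable assignment goes directly in front of the command, so the C locale is set for that one process only. The example below assumes mawk may or may not be installed and falls back to the default awk; the AWK variable and the sample records are made up for illustration, and any speed gain would have to be measured on the real file.

```shell
# Use mawk when it is installed, otherwise whatever awk is on PATH
AWK=$(command -v mawk || command -v awk)

# LC_ALL=C applies to this single command and selects the plain C locale
result=$(printf 'r1\tCB:Z:ACGT\nr2\tCB:Z:ACGT\n' |
  LC_ALL=C "$AWK" 'match($0, /CB:Z:[ACGT]*/) { n[substr($0, RSTART + 5, RLENGTH - 5)]++ }
                   END { for (b in n) print n[b], b }')
printf '%s\n' "$result"
```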
