Skip to content
Advertisement

How to speed-up sed that uses Regex on very large single cell BAM file

I have the following simple script that tries to count the tag encoded with “CB:Z” in SAM/BAM file:

samtools view -h small.bam |  grep "CB:Z:" |
    sed 's/.*CB:Z:([ACGT]*).*/1/' |
    sort |
    uniq -c |
    awk '{print $2 " " $1}'

Typically it needs to process 40 million lines. That codes takes around 1 hour to finish.

This line sed 's/.*CB:Z:([ACGT]*).*/1/' is very time consuming. How can I speed it up?

The reason I used the Regex is that the “CB” tag column-wise position is not fixed. Sometimes it’s at column 20 and sometimes column 21.

Example BAM file can be found HERE.


Update

Speed comparison on complete 40 million lines file:

My initial code:

real    21m47.088s
user    26m51.148s
sys 1m27.912s

James Brown’s with AWK:

real    1m28.898s
user    2m41.336s
sys 0m6.864s

James Brown’s with MAWK:

real    1m10.642s
user    1m41.196s
sys 0m6.484s

Advertisement

Answer

Another awk, pretty much like @tripleee’s, I’d assume:

$ samtools view -h small.bam | awk '
match($0,/CB:Z:[ACGT]*/) {               # use match for the regex match
    a[substr($0,RSTART+5,RLENGTH-5)]++   # len(CB:z:)==5, hence +-5
}
END {
    for(i in a)
        print i,a[i]                     # sample output,tweak it to your liking
}' 

Sample output:

...
TCTTAATCGTCC 175
GGGAAGGCCTAA 190
TCGGCCGATCGG 32
GACTTCCAAGCC 76
CCGCGGCATCGG 36
TAGCGATCGTGG 125
...

Notice: Your sed 's/.*CB:Z:... matches the last instance where as my awk 'match($0,/CB:Z:[ACGT]*/)... matches the first.

Notice 2: Quoting @Sundeep in the comments: – – using LC_ALL=C mawk '..' will give even better speed.

Advertisement