How to speed up sed that uses regex on a very large single-cell BAM file

Question

I have the following simple script that tries to count the tags encoded with "CB:Z" in a SAM/BAM file: Typically it needs to process 40 million lines. That code takes around 1 hour to finish. This line, sed 's/.*CB:Z:\([ACGT]*\).*/\1/', is very time-consuming. How can I speed it up? The reason I used the regex is that the "CB" tag column-wise

Accepted Answer

Another awk, pretty much like @tripleee's, I'd assume:

    $ samtools view -h small.bam | awk 'match($0,/CB:Z:[ACGT]*/) {   # use match for the regex match
        a[substr($0,RSTART+5,RLENGTH-5)]++   # len(CB:Z:)==5, hence +-5
    }
    END {
        for(i in a)
            print i,a[i]                     # sample output, tweak it to your liking
    }'

Sample output:

    ...
    TCTTAATCGTCC 175
    GGGAAGGCCTAA 190
    TCGGCCGATCGG 32
    GACTTCCAAGCC 76
    CCGCGGCATCGG 36
    TAGCGATCGTGG 125
    ...

Notice: Your sed 's/.*CB:Z:... matches the last instance, whereas my awk 'match($0,/CB:Z:[ACGT]*/)... matches the first.

Notice 2: Quoting @Sundeep in the comments: using LC_ALL=C mawk '..' will give even better speed.
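To try the approach without a BAM file at hand, here is a minimal sketch that runs the same match()/substr() logic on a few made-up, tab-separated SAM-like records. The function name count_cb, the variable sample_counts and the records themselves are hypothetical and only serve the illustration; sort is added purely to make the output order deterministic, since awk's for (i in a) order is unspecified.

```shell
# Hypothetical helper: count CB:Z: barcodes in SAM-like text read from stdin.
count_cb() {
    # LC_ALL=C makes awk treat the input as plain bytes (no multibyte handling).
    LC_ALL=C awk '
    match($0, /CB:Z:[ACGT]*/) {
        # Skip the 5-character "CB:Z:" prefix and keep only the barcode.
        counts[substr($0, RSTART + 5, RLENGTH - 5)]++
    }
    END {
        for (barcode in counts)
            print barcode, counts[barcode]
    }' | sort
}

# Made-up records: the CB tag sits in a different column on each line,
# and the last record has no CB tag at all, so it is not counted.
sample_counts=$(printf 'r1\t0\tchr1\t100\tNH:i:1\tCB:Z:AAAC\tUB:Z:GGTT\nr2\t0\tchr1\t200\tCB:Z:AAAC\tNH:i:1\nr3\t0\tchr1\t300\tNH:i:1\tUB:Z:CCAA\tCB:Z:GGGT\nr4\t0\tchr1\t400\tNH:i:1\n' | count_cb)

printf '%s\n' "$sample_counts"
```

On real data the same function could be fed from samtools, e.g. samtools view small.bam | count_cb. If mawk is installed, swapping it in for awk inside the function is the variant suggested in the comment quoted above.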


Answer