Skip to content
Advertisement

How to find sequence of characters matching an identifier in Linux/Unix?

I have a fasta file called mytext.fasta.

mytext.fasta

>lcl|NW_001820834.1_gene_4 [locus_tag=SS1G_01081] [db_xref=GeneID:5493597] [partial=5',3'] [location=complement(<6452..>8801)] [gbkey=Gene]
ATGCAATTGGCAGCAGTCCTAAGCCTCGTGGGCTTGGTTACGGCTCAATGTCCGTACGGATTTGACACAC
CACTTCAAAAGCGTGAATCTATTGATGCTCAAGCCAGTAGTTCTAGTTTCTTGAATCAATTCACAATTAA
CGATACCGATGCACACTTTACCACCGACGCAGGTGGGCCTATGCAAGAGGACACTAGTTTGAAAGCTGGG
>lcl|NW_001820834.1_gene_5 [locus_tag=SS1G_01082] [db_xref=GeneID:5493601] [partial=5',3'] [location=<9695..>10785] [gbkey=Gene]
ATGTTTTCCGGTCCCCAGAAACTTGGCAACGCCAAACAAAAATCAATTGGCCTCGCTTGTCACACAATTA
GTCCCCACGAAGCCTTGTACAAACTAGCCACTGGCTCGTCCCGGACCATTAGGGCAATGTTCAACAGAGA
>lcl|NW_001820834.1_gene_6 [locus_tag=SS1G_01083] [db_xref=GeneID:5494096] [partial=5',3'] [location=<12203..>15199] [gbkey=Gene]
ATGAGAGGCAAGCTTGGTGTCACAGTTGCTGCATTTGCGACGGCATTTCTAAATACGACACTTGCTCAAG
ACTCAACATCATCACAAGCGGATGCGGATACTACCACAAGTTATTGTCCCGTTTACACGCTCACAGCTTC
AGTTGATGCCAGCGCACCTATTATCCCAAACATCCACGATCCGCAGGCAATTAATCCACAAGATGTTTGT
CCGGGGTATACTGCATCCAATGTGAAGCGAACCTCTCACGGATTGACGGCTTCTCTGTCATTGGCTGGTG

When I do grep -A1 'SS1G_01082' mytext.fasta, I get:

>lcl|NW_001820834.1_gene_5 [locus_tag=SS1G_01082] [db_xref=GeneID:5493601] [partial=5',3'] [location=<9695..>10785] [gbkey=Gene]
ATGTTTTCCGGTCCCCAGAAACTTGGCAACGCCAAACAAAAATCAATTGGCCTCGCTTGTCACACAATTA

Instead I want to get:

>lcl|NW_001820834.1_gene_5 [locus_tag=SS1G_01082] [db_xref=GeneID:5493601] [partial=5',3'] [location=<9695..>10785] [gbkey=Gene]
ATGTTTTCCGGTCCCCAGAAACTTGGCAACGCCAAACAAAAATCAATTGGCCTCGCTTGTCACACAATTA
GTCCCCACGAAGCCTTGTACAAACTAGCCACTGGCTCGTCCCGGACCATTAGGGCAATGTTCAACAGAGA

If you notice, every sequence starts with > in this file, so I want to get the full length of sequence when I do grep. How can I get this done?

Advertisement

Answer

It is easier with gnu awk using a custom RS:

awk -v RS='(^|n)>' '/SS1G_01082/{print RT $0}' file

>lcl|NW_001820834.1_gene_5 [locus_tag=SS1G_01082] [db_xref=GeneID:5493601] [partial=5',3'] [location=<9695..>10785] [gbkey=Gene]
ATGTTTTCCGGTCCCCAGAAACTTGGCAACGCCAAACAAAAATCAATTGGCCTCGCTTGTCACACAATTA
GTCCCCACGAAGCCTTGTACAAACTAGCCACTGGCTCGTCCCGGACCATTAGGGCAATGTTCAACAGAGA
User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement