Problem filtering 2 files using AWK for a given condition

First of all, thank you for your help. I have a problem filtering two files with AWK conditionals. The first of the two files is Fasta.fa:

>SiiA   lcl|NC_003197.2_prot_NP_463122.1_4111   100.000 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTYKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
>SiiA   lcl|NC_010102.1_prot_WP_000389232.1_4169    99.048  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
>SiiA   lcl|CP052796.1_prot_QJV25805.1_4154 97.143  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIESKTKSTAQNSGANDNSNANEIINKEVNTQDMSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
>SiiA   lcl|NZ_CP009559.1_prot_WP_000389229.1_1106  97.143  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNNGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
>SiiA   lcl|NZ_CP029897.1_prot_WP_000389235.1_4284  97.143  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKIDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
>SiiA   lcl|NZ_CP053416.1_prot_WP_079774927.1_2027  77.619  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMLIMYDNSIKVYKTNIEKHANSKDEKSGDNKKENTNEKVENETISKDSSAESTEMSGKEIGIYDIADDQRIDITSEEKELVITYRGRLRSFSKEDLNKITVWLEDKANSNLLIEMIIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSTASSSTSKAIITTTNKKVPE

The second is species_id (the real file is larger and contains the names of different species):

Salmonella_enterica_subsp_enterica_Infantis lcl|CP052796.1_prot_QJV25805.1_4154
Salmonella_enterica_subsp_enterica_Typhimurium_LT2 lcl|NC_010102.1_prot_WP_000389232.1_4169

I want to use awk to insert the species name into each Fasta.fa record whose $2 matches $2 in species_id, so that the output, written to a new file, looks something like this:

SiiA    Salmonella_enterica_subsp_enterica_Infantis lcl|CP052796.1_prot_QJV25805.1_4154 97.143  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIESKTKSTAQNSGANDNSNANEIINKEVNTQDMSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA    Salmonella_enterica_subsp_enterica_Typhimurium_LT2 lcl|NC_010102.1_prot_WP_000389232.1_4169 99.048  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE

I have tried these two commands, but neither gives the result I expect. The final file contains the same information as Fasta.fa, so nothing has changed:

awk 'FNR==NR{a[NR]=$0;next}{$2=a[FNR]}1' species_id fasta.fa >> final
awk 'NR==FNR {a[$2]=$1; next} $1 in a {$3=$4;$2=$3;$2=a[$1];$4=$5;$5=$6}1' species_id fasta.fa >> final


Answer

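You may use this awk command. It first reads species_id into a lookup keyed on the ID in $2, then prints each Fasta.fa record whose $2 is in that lookup, with the species name placed before the ID: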

awk 'FNR==NR {map[$2] = $1; next} $2 in map {$2 = map[$2] OFS $2; print}' species_id Fasta.fa

>SiiA Salmonella_enterica_subsp_enterica_Infantis lcl|CP052796.1_prot_QJV25805.1_4154 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIESKTKSTAQNSGANDNSNANEIINKEVNTQDMSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
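The same approach can be written out with comments and run on a small sample. This is only a sketch under a few assumptions: both files are taken to be tab-separated, the sequences are shortened to a placeholder (MEDESNPW) to keep the demo readable, and the output file name final is borrowed from the question. Setting OFS to a tab keeps the output tab-separated, because awk rebuilds the record with OFS whenever a field is assigned.

```shell
# Work in a scratch directory so the demo does not touch real files.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Lookup table: species name, then protein ID (assumed tab-separated).
printf '%s\t%s\n' \
  'Salmonella_enterica_subsp_enterica_Infantis' 'lcl|CP052796.1_prot_QJV25805.1_4154' \
  'Salmonella_enterica_subsp_enterica_Typhimurium_LT2' 'lcl|NC_010102.1_prot_WP_000389232.1_4169' \
  > species_id

# Three records laid out like Fasta.fa; sequences are shortened placeholders.
printf '%s\t%s\t%s\t%s\t%s\n' \
  '>SiiA' 'lcl|NC_003197.2_prot_NP_463122.1_4111' '100.000' '100' 'MEDESNPW' \
  '>SiiA' 'lcl|NC_010102.1_prot_WP_000389232.1_4169' '99.048' '100' 'MEDESNPW' \
  '>SiiA' 'lcl|CP052796.1_prot_QJV25805.1_4154' '97.143' '100' 'MEDESNPW' \
  > Fasta.fa

awk '
  BEGIN { OFS = "\t" }                 # rebuild modified records with tabs
  FNR == NR { map[$2] = $1; next }     # first file: ID -> species name
  $2 in map {                          # second file: only IDs found in the lookup
    $2 = map[$2] OFS $2                # put the species name before the ID
    print
  }
' species_id Fasta.fa > final

cat final
```

Records whose ID is not in species_id (the NC_003197.2 line here) are dropped. To keep them unchanged, one option is to replace the trailing `print` with a bare `1` pattern after the block. Note also that `>` overwrites final on each run, whereas the `>>` used in the question appends to whatever is already there.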
User contributions licensed under: CC BY-SA