file1:
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468
chr1 14969 15038 NR_024540_1_r_WASH7P_69
chr1 15795 15947 NR_024540_2_r_WASH7P_152
chr1 16606 16765 NR_024540_3_r_WASH7P_15
chr1 16857 17055 NR_024540_4_r_WASH7P_198
and file2:
NR_024540 11
I need to find each ID from file2 in file1 and print the whole matching line of file1 plus the second column of file2.
So the output is:
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
My bash solution is very slow:
#!/bin/bash
while read line; do
  c=$(echo $line | awk '{print $1}')
  d=$(echo $line | awk '{print $2}')
  grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}' >> output
done < file2
I would prefer any FASTER bash or awk solution. The output format can be changed, but all of the information must be kept (the column order can be different).
EDIT:
Right now this looks like the fastest solution, according to @chepner:
#!/bin/bash
while read -r c d; do
  grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}'
done < file2 > output
Answer
Another solution uses join and sed, under the assumption that file1 and file2 are sorted on the ID being joined:
join <(sed -r 's/[^ _]+_[^_]+/& &/' file1) file2 -1 4 -2 1 -o "1.1 1.2 1.3 1.5 2.2" > output
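To see why the join works, it can help to look at the intermediate result of the sed step. The following is only a sketch under a few assumptions: the `*.demo` file names and the two-line sample are made up for illustration, `sed -E` is used as the more widely supported spelling of `-r`, and temporary files stand in for the process substitution so it also runs in a plain POSIX shell. The explicit `sort` calls are there because join needs both inputs ordered on the join field.

```shell
#!/bin/sh
# Hypothetical sample input, taken from the first two rows in the question.
printf '%s\n' \
  'chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468' \
  'chr1 14969 15038 NR_024540_1_r_WASH7P_69' > file1.demo
printf '%s\n' 'NR_024540 11' > file2.demo

# Duplicate the first "X_Y" token: the bare ID becomes field 4 and the
# full original name moves to field 5. Then sort on the new field 4.
sed -E 's/[^ _]+_[^_]+/& &/' file1.demo | sort -k4,4 > expanded.demo
sort -k1,1 file2.demo > sorted2.demo

# Join field 4 of the expanded file with field 1 of file2 and print
# columns 1-3, the original name (now field 5), and the value from file2.
join -1 4 -2 1 -o 1.1,1.2,1.3,1.5,2.2 expanded.demo sorted2.demo > joined.demo

cat expanded.demo
cat joined.demo
```

The first `cat` shows lines such as `chr1 14361 14829 NR_024540 NR_024540_0_r_DDX11L1,WASH7P_468`, which is the form that join can match against file2.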
If the output order doesn’t matter, you can use awk:
awk 'FNR==NR{d[$1]=$2; next} {split($4,v,"_"); key=v[1]"_"v[2]; if(key in d) print $0, d[key]} ' file2 file1
you get:
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
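For readers who want to follow the awk one-liner step by step, here is the same logic spread over several lines with comments. It is a minimal sketch: the `*.demo` file names and the `NR_000000` row are invented for illustration, the latter to show that lines of file1 without a matching ID in file2 are silently skipped.

```shell
#!/bin/sh
# Made-up sample data; the NR_000000 line has no partner in f2.demo.
printf '%s\n' \
  'chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468' \
  'chr1 100 200 NR_000000_0_f_FAKE_100' > f1.demo
printf '%s\n' 'NR_024540 11' > f2.demo

awk '
  # While reading the first file (f2.demo): remember the value for each ID.
  FNR == NR { d[$1] = $2; next }
  # While reading the second file (f1.demo): rebuild the ID from the
  # first two "_"-separated pieces of column 4 and look it up.
  {
    split($4, v, "_")
    key = v[1] "_" v[2]
    if (key in d) print $0, d[key]
  }
' f2.demo f1.demo > annotated.demo

cat annotated.demo
```

Because file2 is loaded into an array once and file1 is read in a single pass, neither file needs to be sorted for this approach.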