Skip to content
Advertisement

awk combine sequence with substring key

I have two files for combining with tab delimited file. That two files key might be different only for “Reads” number.

I want to compare that two files and combined based on substring key and match.

Forexample,

File1 Key : "Cluster0_Reads255" 
File2 Key : "Cluster0_Reads50"
This case is same because "Cluster0_Reads" is identical.

This case I want to combine that two column with File1 Key name. Please check below example case.

File 1.

       A                B             
Cluster0_Reads255      500
Cluster1_Reads253      300
Cluster2_Reads100      200
Cluster3_Reads100      350

File 2.

        A               C          
Cluster0_Reads50       GE
Cluster1_Reads200      GA
Cluster2_Reads100      GA

Result.

       A                B       C            
Cluster0_Reads255      500     GE
Cluster1_Reads253      300     GA
Cluster2_Reads100      200     GA
Cluster3_Reads100      350     -

I made one awk with exact match find and combine like below,

awk '
BEGIN { FS = OFS = "t" }
{key = $1}
FNR == NR {result[key] = $0; next;}
(key in result) { updated[key]=1 ; for (i=2; i <= NF; i++) result[key] = result[key] FS $i }
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"    # if using GNU awk
    for (key in result) {
            if(!(key in updated)) result[key] = result[key] FS "-"
            if(!(length(key)==0)) print result[key]
    }
}
' File1 File2 > File3

Is there any way to combine after substring?

Thank you.

Advertisement

Answer

Below kinda dirty awk script does the job, but i’m sure you would find a better one ere.

awk -v FS="t" -v OFS="t" '
NR==FNR{f1=$1;sub(/[0-9]*$/,"",f1);file1info[f1]=$0;next}
       {sub(/[0-9]*$/,"",$1);file2info[$1]=$2}
    END{
        for(i in file1info){
        print file1info[i],(i in file2info)?file2info[i]:"-";
        }
       }' File1 File2 | expand -t 20 | sort -nk1

Output

A                   B                   C          
Cluster0_Reads255   500                 GE
Cluster1_Reads253   300                 GA
Cluster2_Reads100   200                 GA
Cluster3_Reads100   350                 -

Edit

Finally managed to get a smaller faster one. The trick was reversing the files considering that file2’s clusters always form a subset of file1’s.

awk -v FS="t" -v OFS="t" '
NR==FNR{sub(/[0-9]*$/,"",$1);file2info[$1]=$2;next}
       {f1=$1;sub(/[0-9]*$/,"",f1);print $0,(f1 in file2info)?file2info[f1]:"-"}
       ' File2 File1   | expand -t 20 |sort -nk1

Output

A                   B                   C          
Cluster0_Reads255   500                 GE
Cluster1_Reads253   300                 GA
Cluster2_Reads100   200                 GA
Cluster3_Reads100   350                 -
User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement