I am a beginner and my work starts to become difficult for me. I explain my problem. I have two tables File1 and File2 (reference table). I don’t have any identical column but I have some similarity between &2File1 and &2File2. File1 is from user and we want to standardize everything so I have a lot of different cases. I

awk merged two files with 2 columns based on string character comparison

I am a beginner and my work starts to become difficult for me. I explain my problem. I have two tables File1 and File2 (reference table).

File1
num, Name
1, 1_1_busteni
13, 23_Doicesti
40, 2_AR_Moreni
47, 2_AR_Moreni_SUD
55, Petrolul_Romanesc
62, castor

File2
ID_ref, Name_ref
R_001,  BUSTENI
R_002,  DOICESTI-23
R_003,  MORENI
R_004,  MORENI-SUD
R_005,  ROMANESC
R_006,  CASTOR

File3
num, Name,ID_ref,Name_ref
1, 1_1_busteni, R_001, BUSTENI
13, 23_Doicesti, R_002, DOICESTI-23
40, 2_AR_Moreni, R_003, MORENI
47, 2_AR_Moreni_SUD, R_004, MORENI-SUD
55, Petrolul_Romanesc, R_005, ROMANESC
62, castor, R_006, CASTOR

JavaScript
​x
 
File1num, Name1, 1_1_busteni13, 23_Doicesti40, 2_AR_Moreni47, 2_AR_Moreni_SUD55, Petrolul_Romanesc62, castor​File2ID_ref, Name_refR_001,  BUSTENIR_002,  DOICESTI-23R_003,  MORENIR_004,  MORENI-SUDR_005,  ROMANESCR_006,  CASTOR​File3num, Name,ID_ref,Name_ref1, 1_1_busteni, R_001, BUSTENI13, 23_Doicesti, R_002, DOICESTI-2340, 2_AR_Moreni, R_003, MORENI47, 2_AR_Moreni_SUD, R_004, MORENI-SUD55, Petrolul_Romanesc, R_005, ROMANESC62, castor, R_006, CASTOR​

I don’t have any identical column but I have some similarity between &2File1 and &2File2. File1 is from user and we want to standardize everything so I have a lot of different cases. I don´t know how to start. My idea was to remove all the “_” in my first file and “-“ in my second and compare them. I managed to do it with

awk 'BEGIN {FS=OFS=","} {gsub(/_/,"",$2)}1' file1.txt and awk 'BEGIN {FS=OFS=","} {gsub(/-/,"",$2)}1’ file2.txt

JavaScript
 
awk 'BEGIN {FS=OFS=","} {gsub(/_/,"",$2)}1' file1.txt and awk 'BEGIN {FS=OFS=","} {gsub(/-/,"",$2)}1’ file2.txt​

separately but I don’t know how to combine and compare my two files.

I know also I have to think about lowercase. A nice guy give me this code above: It works for CASTOR
but How can I associate it with my gsub ???

$ awk '
BEGIN { FS=OFS="," }
NR==FNR {                                                  
    a[tolower($2)]=$0                                      
    next
}
{                                                          
    split($2,b,"[^[:alpha:]]")                             
    print $0 (tolower(b[1]) in a?OFS a[tolower(b[1])]:"")  
}' file2 file1

JavaScript
 
$ awk 'BEGIN { FS=OFS="," }NR==FNR {                                                      a[tolower($2)]=$0                                          next}{                                                              split($2,b,"[^[:alpha:]]")                                 print $0 (tolower(b[1]) in a?OFS a[tolower(b[1])]:"")  }' file2 file1 ​

Maybe it exists a better way, I am open !!!

Answer

Here is one shot at it in awk:

$ awk 'BEGIN { FS=", *"; OFS="," }
NR==FNR {
    a[tolower($2)]=$0
    next
}
{
    for(i in a)               # for every city in file2
        if(tolower($2)~i) {   # compare it to a record from file1
            print $0,a[i]     # print it if there is a match
            next
        }
}1' file2 file1
num, Name
1, 1_1_busteni,R_001,  BUSTENI
13, 23_Doicesti
40, 2_AR_Moreni,R_003,  MORENI
47, 2_AR_Moreni_SUD,R_003,  MORENI
55, Petrolul_Romanesc,R_005,  ROMANESC
62, castor,R_006,  CASTOR

JavaScript
 
$ awk 'BEGIN { FS=", *"; OFS="," }NR==FNR {    a[tolower($2)]=$0    next}{    for(i in a)               # for every city in file2        if(tolower($2)~i) {   # compare it to a record from file1            print $0,a[i]     # print it if there is a match            next        }}1' file2 file1num, Name1, 1_1_busteni,R_001,  BUSTENI13, 23_Doicesti40, 2_AR_Moreni,R_003,  MORENI47, 2_AR_Moreni_SUD,R_003,  MORENI55, Petrolul_Romanesc,R_005,  ROMANESC62, castor,R_006,  CASTOR​

Any better than that would require rules on processing the underscores and dashes in the names or approximate pattern matching with approriate algorithms (see for example Levenshtein distance).

Advertisement

Answer