I am a beginner and my work starts to become difficult for me. I explain my problem. I have two tables File1 and File2 (reference table).
File1 num, Name 1, 1_1_busteni 13, 23_Doicesti 40, 2_AR_Moreni 47, 2_AR_Moreni_SUD 55, Petrolul_Romanesc 62, castor File2 ID_ref, Name_ref R_001, BUSTENI R_002, DOICESTI-23 R_003, MORENI R_004, MORENI-SUD R_005, ROMANESC R_006, CASTOR File3 num, Name,ID_ref,Name_ref 1, 1_1_busteni, R_001, BUSTENI 13, 23_Doicesti, R_002, DOICESTI-23 40, 2_AR_Moreni, R_003, MORENI 47, 2_AR_Moreni_SUD, R_004, MORENI-SUD 55, Petrolul_Romanesc, R_005, ROMANESC 62, castor, R_006, CASTOR
I don’t have any identical column but I have some similarity between &2File1 and &2File2. File1 is from user and we want to standardize everything so I have a lot of different cases. I don´t know how to start. My idea was to remove all the “_” in my first file and “-“ in my second and compare them. I managed to do it with
awk 'BEGIN {FS=OFS=","} {gsub(/_/,"",$2)}1' file1.txt and awk 'BEGIN {FS=OFS=","} {gsub(/-/,"",$2)}1’ file2.txt
separately but I don’t know how to combine and compare my two files.
I know also I have to think about lowercase.
A nice guy give me this code above: It works for CASTOR
but How can I associate it with my gsub ???
$ awk ' BEGIN { FS=OFS="," } NR==FNR { a[tolower($2)]=$0 next } { split($2,b,"[^[:alpha:]]") print $0 (tolower(b[1]) in a?OFS a[tolower(b[1])]:"") }' file2 file1
Maybe it exists a better way, I am open !!!
Advertisement
Answer
Here is one shot at it in awk:
$ awk 'BEGIN { FS=", *"; OFS="," } NR==FNR { a[tolower($2)]=$0 next } { for(i in a) # for every city in file2 if(tolower($2)~i) { # compare it to a record from file1 print $0,a[i] # print it if there is a match next } }1' file2 file1 num, Name 1, 1_1_busteni,R_001, BUSTENI 13, 23_Doicesti 40, 2_AR_Moreni,R_003, MORENI 47, 2_AR_Moreni_SUD,R_003, MORENI 55, Petrolul_Romanesc,R_005, ROMANESC 62, castor,R_006, CASTOR
Any better than that would require rules on processing the underscores and dashes in the names or approximate pattern matching with approriate algorithms (see for example Levenshtein distance).