Finding duplicate rows based on a column in a Unix file

I have a file of about 1 million records. I need to extract the records that have a different FName and LName for the same ID.

Input File

Col1,Col2,Col3,Col4,ID,FName,Col5,LName,Col6,Col7,Col8
AP,abc@gmail.com,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,abc2@gmail.com,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
AP,abc@gmail.com,xyz1,abc1,345,Raman,,Kumar,phn2,fax2,url1
AP,abc@gmail.com,xyz1,abc1,345,Raman,,Kumar,phn2,fax2,url1
AP,abc@gmail.com,xyz1,abc1,567,Alex,,Smith,phn2,fax2,url1
AP,abc@gmail.com,xyz1,abc1,789,Allen,,Prack,phn2,fax2,url1

The result that I want to see

AP,abc@gmail.com,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,abc2@gmail.com,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1

Can any awk or sed command or script help? Thanks!

Answer

Using GNU awk for arrays of arrays:

$ awk -F, '
    { vals[$5][$6 FS $8] = $0 }                 # save a line for each ID and distinct "FName,LName" pair
    END {
        for ( id in vals ) {
            if ( length(vals[id]) > 1 ) {       # this ID has more than one distinct name
                for (name in vals[id]) {
                    print vals[id][name]
                }
            }
        }
    }
' file
AP,abc@gmail.com,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,abc2@gmail.com,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
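
If you don't have GNU awk (true multidimensional arrays and length() on a subarray are gawk extensions), the same idea can be sketched with a single flat array in any POSIX awk. This is a sketch, not part of the original answer; the seen, order, and cnt names are illustrative:

$ awk -F, '
    NR > 1 {
        key = $5 SUBSEP $6 FS $8        # composite key: ID plus "FName,LName"
        if ( !(key in seen) ) {
            seen[key]  = $0             # keep one line per distinct (ID, name) pair
            order[++n] = key            # remember input order for stable output
            cnt[$5]++                   # count distinct names per ID
        }
    }
    END {
        for ( i = 1; i <= n; i++ ) {
            split(order[i], k, SUBSEP)  # k[1] is the ID
            if ( cnt[k[1]] > 1 ) {
                print seen[order[i]]
            }
        }
    }
' file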

Or, if your input file is sorted by “id” as shown in your sample input, then with any awk and without storing the input file in memory:

$ cat tst.awk
BEGIN { FS=OFS="," }
NR > 1 {                                # skip the header line
    id   = $5
    name = $6 FS $8

    if ( id == prevId ) {               # still inside the same ID group
        if ( name != prevName ) {
            if ( firstRec != "" ) {     # print the group's first line exactly once
                print firstRec
                firstRec = ""
            }
            print
        }
    }
    else {                              # a new ID group starts here
        firstRec = $0
    }

    prevId   = id
    prevName = name
}

$ awk -f tst.awk file
AP,abc@gmail.com,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,abc2@gmail.com,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
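
If the file is not already sorted by ID, you can sort it first and pipe the result into the same script. A sketch, assuming a POSIX shell and sort; the header line is kept out of the sort so it stays first:

$ { head -n 1 file; tail -n +2 file | sort -t, -k5,5; } | awk -f tst.awk

A plain lexical sort is enough here: tst.awk only needs equal IDs to be adjacent, not numerically ordered.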