I have a file of about 1 Million records. I need to extract the records which have different FName and LName for id.
Input File
Col1,Col2,Col3,Col4,ID,FName,Col5,LName,Col6,Col7,Col8 AP,abc@gmail.com,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1 AP,abc2@gmail.com,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1 AP,abc@gmail.com,xyz1,abc1,345,Raman,,Kumar,phn2,fax2,url1 AP,abc@gmail.com,xyz1,abc1,345,Raman,,Kumar,phn2,fax2,url1 AP,abc@gmail.com,xyz1,abc1,567,Alex,,Smith,phn2,fax2,url1 AP,abc@gmail.com,xyz1,abc1,789,Allen,,Prack,phn2,fax2,url1
The result that I want to see
AP,abc@gmail.com,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1 AP,abc2@gmail.com,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
Any AWK or Sed command or script can help? Thanks
Advertisement
Answer
Using GNU awk for arrays of arrays:
$ awk -F, ' { vals[$5][$6 FS $8] = $0 } END { for ( id in vals ) { if ( length(vals[id]) > 1 ) { for (name in vals[id]) { print vals[id][name] } } } } ' file AP,abc@gmail.com,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1 AP,abc2@gmail.com,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
or if your input file is sorted by “id” as shown in your sample input then with any awk and without storing the input file in memory:
$ cat tst.awk BEGIN { FS=OFS="," } NR > 1 { id = $5 name = $6 FS $8 if ( id == prevId ) { if ( name != prevName ) { if ( firstRec != "" ) { print firstRec firstRec = "" } print } } else { firstRec = $0 } prevId = id prevName = name }
$ awk -f tst.awk file AP,abc@gmail.com,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1 AP,abc2@gmail.com,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1