Skip to content
Advertisement

How can I select only the rows In file 1 that match column values in file 2?

I have multiple measurements per ‘Subject’ in file 1. I only want to use the highest quality, singular measurement per Subject. In my second file I have the exact list of which measurement is the best for each Subject. This information is contained in the column ‘seriesnumber’. The number in the ‘seriesnumber’ column in file 2 corresponds to the best measurement for a Subject. I Need to extract only these rows from my file 1.

I have tried to use awk, join, and merge to try and accomplish this but came up with errors and strange incomplete files.

join code:

join -j2 file1 file2 

awk code:

awk ' FILENAME=="file1" {arr[$2]=$0; next}
        FILENAME=="file2"  {print arr[$2]} ' file1 file2 > newfile

File 1 Example

Subject   Seriesnumber  
19-1-1001   2   8655    661 15250   60747   8005    3919    7393    2264    1479    1663    22968   4180    1712    689 781 4255    90  1260    7233    154 15643   63421   7361    4384    6932    2062    4526    1742    686 4575    100 1684    0   1194    0   0   5   0   0   147 699 315 305 317 565 1361200 1338210 1338690 304258  308180  612438  250614  255920  506534  66645   802424  1206450 1187010 1185180 1816840 1   1   21  17  38  1765590
19-1-1001   10  8992    507 15722   64032   8728    3929    7208    2075    1529    1529    22503   3993    1819    710 764 3870    87  1247    7361    65  16128   66226   8165    4384    6669    1805    4405    1752    779 4039    103 1705    0   1280    0   0   10  0   0   186 685 300 318 320 598 1370490 1347160 1347520 306588  307188  613775  251704  256521  508225  65808   808802  1208880 1189150 1187450 1827880 1   1   22  26  48  1778960
19-1-1103   2   3303    317 12146   57569   7008    3617    6910    2018    811 1593    18708   4708    1429    408 668 3279    14  1289    2351    85  13730   60206   6731    4137    7034    2038    4407    1483    749 3576    85  1668    0   948 0   0   7   0   0   129 602 288 291 285 748 1250030 1238540 1238820 301810  301062  602872  215029  218080  433108  61555   781150  1107360 1098510 1097220 1635560 1   1   32  47  79  1555850
19-1-1103   9   3236    286 12490   59477   7000    3558    6782    2113    894 1752    19338   4818    1724    387 649 3345    56  1314    2077    133 13885   60414   6628    4078    7063    2031    4269    1709    610 3707    112 1947    0   990 0   0   8   0   0   245 604 279 280 284 693 1269820 1258050 1258320 306856  309614  616469  215658  220876  436534  61859   796760  1124870 1115990 1114510 1630740 1   1   32  42  74  1556790
19-10-1010  2   3344    608 14744   59165   8389    4427    6962    2008    716 1496    21980   4008    1474    769 652 3715    61  1400    3049    1072    15767   61919   8325    4824    7117    1936    4001    1546    684 3935    103 1434    0   1624    0   0   3   0   0   316 834 413 520 517 833 1350760 1337040 1336840 311985  312592  624577  246800  251133  497933  65699   809736  1200320 1189410 1188280 1731270 1   1   17  13  30  1606700
19-10-1010  6   3242    616 15205   61330   8019    4520    6791    2093    735 1558    22824   3981    1546    653 614 3672    96  1227    2992    1070    16450   64189   8489    4407    6953    2099    4096    1668    680 4116    99  1449    0   2161    0   0   19  0   0   263 848 387 525 528 824 1339090 1325830 1325780 309464  311916  621380  239958  244616  484574  65493   810887  1183120 1172600 1171430 1720000 1   1   16  26  42  1587100

File 2 Example

Subject seriesnumber
19-10-1010 2
19-10-1166 2
19-102-10005 2
19-102-10006 2
19-103-10009 2
19-103-10010 2
19-104-10013 11
19-104-10014 2
19-105-10017 6
19-105-10018 6

The desired output would like something like this: Where I no longer have duplicate entries per subject. The second column will look different because the preferred series number will differ per subject.

19-10-1010  2   3344    608 14744   59165   8389    4427    6962    2008    716 1496    21980   4008    1474    769 652 3715    61  1400    3049    1072    15767   61919   8325    4824    7117    1936    4001    1546    684 3935    103 1434    0   1624    0   0   3   0   0   316 834 413 520 517 833 1350760 1337040 1336840 311985  312592  624577  246800  251133  497933  65699   809736  1200320 1189410 1188280 1731270 1   1   17  13  30  1606700
19-10-1166  2   3699    312 15373   61787   8026    4248    6385    1955    608 2194    21394   4260    1563    886 609 3420    25  1101    3415    417 16909   63040   7236    4264    5933    1852    4156    1213    654 4007    53  1336    5   1597    0   0   18  0   0   110 821 300 514 466 854 1193020 1179470 1179420 282241  273236  555477  204883  203228  408111  61343   740736  1036210 1026080 1024910 1563950 1   1   39  40  79  1415890
19-102-10005    2   8733    514 13024   50735   7729    3775    4955    1575    1045    1141    20415   3924    1537    990 651 3515    134 1259    8571    232 13487   51374   7150    4169    5192    1664    3760    1620    596 3919    189 1958    0   1479    0   0   36  0   0   203 837 459 409 439 1072    1224350 1200010 1200120 287659  290445  578104  216976  220545  437521  57457   737161  1095770 1074440 1073050 1637570 1   1   31  22  53  1618600
19-102-10006    2   8347    604 13735   42231   7266    3836    6473    2057    1099    1007    18478   3769    1351    978 639 3332    125 1197    8207    454 13774   43750   6758    4274    6148    1921    3732    1584    614 3521    180 1611    0   1241    0   0   25  0   0   254 813 410 352 372 833 1092800 1069450 1069190 244104  245787  489891  202201  205897  408098  59170   634640  978807  958350  957462  1485600 1   1   19  19  38  1472020
19-103-10009    2   4222    596 14702   52038   7428    4065    6598    2166    835 1854    22613   3397    1387    879 568 3729    93  1315    3414    222 14580   52639   7316    3997    6447    1986    4067    1529    596 3778    113 1689    0   2097    0   0   23  0   0   260 761 326 400 359 772 1204670 1190100 1189780 256560  260381  516941  237316  243326  480642  60653   681040  1070620 1059370 1058440 1605990 1   1   25  23  48  1593730
19-103-10010    2   5254    435 14688   47120   7772    3130    5414    1711    741 1912    20643   3594    1449    882 717 3663    41  999 6465    605 14820   49390   6361    3826    5527    1523    3513    1537    639 3596    80  1261    0   1475    0   0   18  0   0   283 827 383 414 297 627 1135490 1117320 1116990 243367  245896  489263  221809  227084  448893  55338   639719  1009370 994519  993639  1568140 1   1   14  11  25  1542210
19-104-10013    2   7276    341 11836   53018   7912    3942    6105    2334    795 2532    21239   4551    1258    1176    430 3636    83  1184    8811    396 12760   53092   7224    4361    6306    1853    4184    1278    543 3921    175 1814    0   2187    0   0   8   0   0   266 783 381 382 357 793 1011640 987712  987042  206633  228397  435031  170375  191222  361597  61814   601948  879229  859619  859103  1586150 1   1   224 162 386 1557120
19-104-10014    2   5964    355 13297   55439   8599    4081    5628    1730    970 1308    20196   4519    1363    992 697 3474    62  1232    6830    472 14729   59478   7006    4443    6156    1825    4492    1726    827 4017    122 1804    0   1412    0   0   17  0   0   259 672 299 305 319 779 1308470 1288970 1288910 284018  285985  570003  258525  257355  515880  62485   746108  1166160 1149700 1148340 1826660 1   1   33  24  57  1630580
19-105-10017    2   7018    307 13848   53855   8345    3734    6001    2095    899 1932    20712   4196    1349    645 823 4212    72  1475    3346    1119    13970   55202   7411    3975    5672    1737    3778    1490    657 4089    132 1689    0   1318    0   0   23  0   0   234 745 474 367 378 760 1122360 1104380 1104520 235806  233881  469687  217939  220736  438675  61471   639143  985718  970903  969619  1583800 1   1   51  51  102 1558470
19-105-10018    2   16454   1098    12569   52521   8215    3788    5858    1805    788 1147    21028   3496    1492    665 634 3796    39  1614    10700   617 12813   52098   8091    3901    5367    1646    3544    1388    723 3938    47  1819    0   1464    0   0   42  0   0   330 832 301 319 400 788 1148940 1114080 1113560 225179  227218  452397  237056  237295  474351  59172   614884  1019300 986820  986144  1607900 1   1   19  28  47  1591480
19-105-10020    2   4096    451 13042   48597   7601    3228    5665    1582    778 1670    19769   3612    1187    717 617 3672    103 962 2627    467 13208   48466   6619    3461    5217    1360    3575    1388    718 3783    90  1370    0   862 0   0   6   0   0   216 673 386 439 401 682 1081580 1068850 1068890 233290  235396  468686  209666  214472  424139  54781   619447  958522  948737  947554  1493740 1   1   16  11  27  1452900

Advertisement

Answer

For file1 containing (I removed long useless lines):

Subject   Seriesnumber  
19-1-1001   2   8655    661 15250   60747   800
19-1-1001   10  8992    507 15722   64032   872
19-1-1103   2   3303    317 12146   57569   700
19-1-1103   9   3236    286 12490   59477   700
19-10-1010  2   3344    608 14744   59165   838
19-10-1010  6   3242    616 15205   61330   801

and file2 containig:

Subject seriesnumber
19-10-1010 2
19-10-1166 2
19-102-10005 2
19-102-10006 2
19-103-10009 2
19-103-10010 2
19-104-10013 11
19-104-10014 2
19-105-10017 6
19-105-10018 6

The following awk will output:

$ awk 'NR==FNR{a[$1, $2];next} ($1, $2) in a' file2 file1
19-10-1010  2   3344    608 14744   59165   838

Note that the first file argument to awk is file2 not file1 (small optimization)! How it works:

  • NR == FNR – if line number is file line number. Ie. choose only first file passed to awk.
  • a[$1, $2] – remember index $1,$2 in associative array a
  • next – do not parse rest of script and restart with next line
  • ($1, $2) in a – check if $1, $2 is in associative array a
    • because of next this is run only for the second file as passed to awk
    • if this expression returns with true, then the line will be printed (this is how awk works).

Alternatively you could do the follow, but it will store the whole file1 in memory, which is… memory consuming…, the code above only stores $1, $2 indexes in memory.

awk 'NR==FNR{arr[$1, $2]=$0}  NR!=FNR{print arr[$1, $2]}' file1 file2
User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement