
How can I select only the rows in file 1 that match column values in file 2?

I have multiple measurements per ‘Subject’ in file 1, and I only want to use the single highest-quality measurement per Subject. My second file lists exactly which measurement is best for each Subject: the number in its ‘seriesnumber’ column identifies that Subject's best measurement. I need to extract only these rows from file 1.

I have tried awk, join, and merge to accomplish this, but ended up with errors and strange, incomplete files.

join code:

JavaScript

awk code:

JavaScript

File 1 Example

JavaScript

File 2 Example

JavaScript

The desired output would look something like this, with no duplicate entries per Subject. The second column will look different because the preferred series number differs per Subject.

JavaScript


Answer

For file1 containing (I removed long useless lines):

JavaScript

and file2 containing:

JavaScript

The following awk will output:

JavaScript

Note that the first file argument to awk is file2, not file1 (a small optimization). How it works:

  • NR == FNR – true only while the overall line number equals the line number within the current file, i.e. only while reading the first file passed to awk.
  • a[$1, $2] – records the pair $1, $2 as an index in the associative array a.
  • next – skips the rest of the script and restarts with the next input line.
  • ($1, $2) in a – checks whether the pair $1, $2 is an index in the associative array a.
    • Because of next, this is evaluated only for the second file passed to awk.
    • If the expression is true, the line is printed (in awk, a pattern with no action prints the line by default).
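
Put together, the pieces described above form a command along the following lines. This is only a minimal sketch: the sample rows, the temporary directory, and the assumption that columns are whitespace-separated with Subject in column 1 and seriesnumber in column 2 are all illustrative, not the asker's real data.

```shell
#!/bin/sh
# Hypothetical sample data: column 1 = Subject, column 2 = seriesnumber.
workdir=$(mktemp -d)
cat > "$workdir/file1" <<'EOF'
Subject seriesnumber value
S01 3 0.91
S01 5 0.87
S02 2 0.75
S02 4 0.80
EOF
cat > "$workdir/file2" <<'EOF'
Subject seriesnumber
S01 5
S02 2
EOF

# While reading file2 (NR == FNR), record each (Subject, seriesnumber) pair.
# While reading file1, print only the lines whose pair was recorded.
result=$(awk 'NR == FNR { a[$1, $2]; next } ($1, $2) in a' \
    "$workdir/file2" "$workdir/file1")
printf '%s\n' "$result"

rm -rf "$workdir"
```

With this sample data the header line is kept as well, because both files start with the same two column names. If the real files are comma-separated, awk would need `-F,` so that $1 and $2 refer to the intended columns.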

Alternatively, you could do the following, but it stores the whole of file1 in memory, which is memory-hungry; the code above stores only the $1, $2 indexes.

JavaScript
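
One possible shape for such a memory-heavier variant is sketched below. It is an assumption about what this alternative could look like, not necessarily the exact command meant here: it reads file1 first, keeps every full line keyed by its $1, $2 pair, and then looks up each pair from file2. The sample rows are hypothetical, and each (Subject, seriesnumber) pair is assumed to occur at most once in file1.

```shell
#!/bin/sh
# Hypothetical sample data: column 1 = Subject, column 2 = seriesnumber.
workdir=$(mktemp -d)
cat > "$workdir/file1" <<'EOF'
Subject seriesnumber value
S01 3 0.91
S01 5 0.87
S02 2 0.75
S02 4 0.80
EOF
cat > "$workdir/file2" <<'EOF'
Subject seriesnumber
S01 5
S02 2
EOF

# While reading file1 (NR == FNR), store every whole line under its pair.
# While reading file2, print the stored file1 line for each listed pair.
result=$(awk 'NR == FNR { lines[$1, $2] = $0; next }
              ($1, $2) in lines { print lines[$1, $2] }' \
    "$workdir/file1" "$workdir/file2")
printf '%s\n' "$result"

rm -rf "$workdir"
```

Here the output follows the row order of file2 rather than file1, and memory use grows with the size of file1 because every line is held in the array.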
User contributions licensed under: CC BY-SA