
How to separate data from a large TSV file based on conditions and write it to another file using a Linux command

I have a TSV file named userid-timestamp-artid-artname-traid-traname.tsv, more than 2 GB in size, with the following data:

JavaScript

Consider the first input line: the first column is the userid, i.e. user_000022; the second column is the timestamp, i.e. 2007-08-26T20:11:33Z; the third column is the artid, i.e. ede96d89-515f-4b00-9635-2c5f5a1746fb; the fourth column is the artname, i.e. The Aislers Set; the fifth column is the traid, i.e. 9eed842d-5a1a-42c8-9788-0a11e818f35c; and the sixth column is the traname, i.e. The Red Door. The columns are TAB separated.

My goal is to process it and create a file with the following format:

JavaScript

where useridi is the i-th user and traidij is the j-th track ID for that user. The file should contain only the songs that each user has listened to at least 20 times. So every line must contain all the tracks that a user has listened to at least 20 times, and each of those track IDs must appear just once.

I want to know how to write a Linux command for this, since I am using Linux for the first time. I only know how to separate columns using awk.


Answer

Your sample data doesn’t contain any case that satisfies your spec, so I used 0 as the threshold. This also assumes trackIds are unique; otherwise use a composite key (songId,trackId). The file doesn’t need to be sorted.

JavaScript

gives

JavaScript

FINAL UPDATE

Using your updated file and adjusting for the output format, this script produces the listed output:

JavaScript

output

JavaScript
User contributions licensed under: CC BY-SA