I have a tsv file named userid-timestamp-artid-artname-traid-traname.tsv of more than 2 GB of size with the following data
user_000022 2007-08-26T20:11:33Z ede96d89-515f-4b00-9635-2c5f5a1746fb The Aislers Set 9eed842d-5a1a-42c8-9788-0a11e818f35c The Red Door user_000022 2007-08-26T20:09:17Z 2aca01cc-256e-4a0d-b337-2be263ef9302 All Girl Summer Fun Band 722bd5fd-1b27-4ec1-ba21-3e7dc3c514b0 Cut Your Hair user_000022 2007-08-26T20:06:07Z 606da894-aa6e-4a13-b253-e37b4403e9fc Sambassadeur 9af3dae2-f73e-43be-9784-85d9c1690b38 Whatever Season
Consider the first input line: Where first column is userid i.e. user_000022, second column is timestamp i.e. 2007-08-26T20:11:33Z, third column is artid i.e. ede96d89-515f-4b00-9635-2c5f5a1746fb, fourth column is the artname i.e. The Aislers Set, fifth column is the traid i.e 9eed842d-5a1a-42c8-9788-0a11e818f35c and sixth column is the traname i.e The Red Door Here each column is TAB saperated.
My goal is to process it and create a file with the following format:
useridi<TAB>traidi1 traidi2 ...
where useridi is the ith user and traidij is the jth track id that will contain only the songs that each user has listened at least 20 times. So every line must contain all the tracks
that a user has listened to at least 20 times, and each track id of those must appear just once.
I want know how to write Linux command for this since I am using Linux for the first time. I only know how to separate column using awk.
Advertisement
Answer
your sample data doesn’t have condition that satisfies your spec, so I used 0 as the threshold, also assumes trackIds are unique, otherwise use a composite key (songId,trackId). File doesn’t need to be sorted…
awk '{user[$1];track[$4];count[$1,$4]++}
END {for(u in user)
{sep="t";
for(t in track)
{if(count[u,t]>0)
{if(sep=="t") {printf "%s",u sep; sep=OFS}
printf "%s", sep t}}
printf "%s", ORS}}' file
gives
002 id_05 idRC id2 001 id_01 id2
FINAL UPDATE
using your updated file and editing for the format, this script produces the listed output
sed -r 's/ +/t/g' file | # you won't need this step for tsv format
awk -F't' '{print $1,$5}' |
sort |
uniq -c |
awk '$1<1 {next}
$2==p {tracks=tracks FS $3}
$2!=p {if(tracks) print tracks; p=$2; tracks=p "t" $3}
END {print tracks}'
output
user_000022 722bd5fd-1b27-4ec1-ba21-3e7dc3c514b0 9af3dae2-f73e-43be-9784-85d9c1690b38 9eed842d-5a1a-42c8-9788-0a11e818f35c