I have a TSV file named userid-timestamp-artid-artname-traid-traname.tsv,
more than 2 GB in size, with the following data:
user_000022 2007-08-26T20:11:33Z ede96d89-515f-4b00-9635-2c5f5a1746fb The Aislers Set 9eed842d-5a1a-42c8-9788-0a11e818f35c The Red Door
user_000022 2007-08-26T20:09:17Z 2aca01cc-256e-4a0d-b337-2be263ef9302 All Girl Summer Fun Band 722bd5fd-1b27-4ec1-ba21-3e7dc3c514b0 Cut Your Hair
user_000022 2007-08-26T20:06:07Z 606da894-aa6e-4a13-b253-e37b4403e9fc Sambassadeur 9af3dae2-f73e-43be-9784-85d9c1690b38 Whatever Season
Consider the first input line. The first column is the userid (user_000022), the second column is the timestamp (2007-08-26T20:11:33Z), the third column is the artid (ede96d89-515f-4b00-9635-2c5f5a1746fb), the fourth column is the artname (The Aislers Set), the fifth column is the traid (9eed842d-5a1a-42c8-9788-0a11e818f35c) and the sixth column is the traname (The Red Door). The columns are TAB-separated.
My goal is to process it and create a file with the following format:
userid_i<TAB>traid_i1 traid_i2 ...
where userid_i is the i-th user and traid_ij is the j-th track id for that user. The file should contain only the songs that each user has listened to at least 20 times. So every line must contain all the tracks that a user has listened to at least 20 times, and each of those track ids must appear just once.
I want to know how to write a Linux command for this, since I am using Linux for the first time. I only know how to separate columns using awk.
Answer
Your sample data has no case that satisfies your spec, so I used 0 as the threshold. This also assumes track IDs are unique; otherwise use a composite key (songId,trackId). The file doesn't need to be sorted.
awk '{user[$1]; track[$4]; count[$1,$4]++}
     END {for (u in user) {sep = "\t"
            for (t in track) if (count[u,t] > 0) {
              if (sep == "\t") printf "%s", u   # print the user once, before the first track
              printf "%s%s", sep, t; sep = OFS}
            printf "%s", ORS}}' file
gives
002 id_05 idRC id2
001 id_01 id2
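The composite key mentioned above relies on awk joining the subscripts of count[a,b] with the built-in SUBSEP variable. A minimal sketch of that counting step on its own, assuming whitespace-separated input with the user in column 1 and the track in column 2 (the function name pair_counts is only illustrative):

```shell
# Sketch: count how often each (user, track) pair occurs on stdin.
# pair_counts is a hypothetical helper name, not part of the answer above.
pair_counts() {
  awk '{ count[$1,$2]++ }                      # key is $1 SUBSEP $2
       END {
         for (k in count) {
           split(k, p, SUBSEP)                 # p[1] = user, p[2] = track
           print p[1], p[2], count[k]
         }
       }' |
  sort                                         # for-in order is unspecified, so sort
}

printf 'u1 a\nu1 a\nu2 b\n' | pair_counts
```

Because the key is the pair rather than the track alone, the same track played by two different users is counted separately for each of them.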
FINAL UPDATE
Using your updated file and editing for the format, this script produces the listed output:
sed -r 's/ +/\t/g' file |   # you won't need this step for tsv format
awk -F'\t' '{print $1,$5}' |
sort |
uniq -c |
awk '$1<1 {next} $2==p {tracks=tracks FS $3} $2!=p {if(tracks) print tracks; p=$2; tracks=p "\t" $3} END {print tracks}'
output
user_000022 722bd5fd-1b27-4ec1-ba21-3e7dc3c514b0 9af3dae2-f73e-43be-9784-85d9c1690b38 9eed842d-5a1a-42c8-9788-0a11e818f35c
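For the real file with a threshold of 20, a single-function variant might look like the sketch below. It is not the command from the answer above. It assumes the input is genuinely tab-separated with userid in column 1 and traid in column 5, and it skips rows whose traid is empty, which is my assumption rather than something the question states. The name frequent_tracks and the min variable are illustrative.

```shell
# Sketch: per user, list every track id played at least `min` times.
# Assumptions: tab-separated input, userid in column 1, traid in column 5,
# rows with an empty traid are ignored.
frequent_tracks() {
  # $1 = minimum play count, $2 = input TSV file
  awk -F'\t' -v min="$1" '
    $5 != "" { count[$1 FS $5]++ }             # plays per (user, track) pair
    END {
      for (key in count)
        if (count[key] >= min) {
          split(key, part, FS)                 # part[1] = userid, part[2] = traid
          print part[1] FS part[2]
        }
    }' "$2" |
  sort |                                       # group each user on adjacent lines
  awk -F'\t' '
    $1 != prev { if (NR > 1) printf "\n"; printf "%s\t%s", $1, $2; prev = $1; next }
    { printf " %s", $2 }                       # further tracks of the same user
    END { if (NR > 0) printf "\n" }'
}
```

It would be called as: frequent_tracks 20 userid-timestamp-artid-artname-traid-traname.tsv > output.tsv (output.tsv is a placeholder name). The first awk keeps one counter per distinct (user, track) pair in memory; if that turns out to be too much for a 2 GB file, the sort | uniq -c pipeline above does the counting on disk instead.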