Skip to content
Advertisement

How to separate data from large tsv file based on conditions and write on another file using Linux command

I have a tsv file named userid-timestamp-artid-artname-traid-traname.tsv of more than 2 GB of size with the following data

user_000022  2007-08-26T20:11:33Z   ede96d89-515f-4b00-9635-2c5‌​‌​f5a1746fb    The Aislers Set  9eed842d-5a1a-42c8-9788-0a11e818f35c   The Red Door 

user_000022  2007-08-26T20:09:17Z   2aca01cc-256e-4a0d-b337-2be‌​263ef9302  All Girl Summer Fun Band    722bd5fd-1b27-4ec1-ba21-3e7dc3c514b0    Cut Your Hair 

user_000022  2007-08-26T20:06:07Z   606da894-aa6e-4a13-b253-e37‌​‌​b4403e9fc    Sambassa‌​de‌​ur    9af3dae2-f73e‌​-43b‌​e-9784-85d9c16‌​90b38  ‌​Whatever Season

Consider the first input line: Where first column is userid i.e. user_000022, second column is timestamp i.e. 2007-08-26T20:11:33Z, third column is artid i.e. ede96d89-515f-4b00-9635-2c5‌​‌​f5a1746fb, fourth column is the artname i.e. The Aislers Set, fifth column is the traid i.e 9eed842d-5a1a-42c8-9788-0a11e818f35c and sixth column is the traname i.e The Red Door Here each column is TAB saperated.

My goal is to process it and create a file with the following format:

useridi<TAB>traidi1 traidi2 ...

where useridi is the ith user and traidij is the jth track id that will contain only the songs that each user has listened at least 20 times. So every line must contain all the tracks that a user has listened to at least 20 times, and each track id of those must appear just once.

I want know how to write Linux command for this since I am using Linux for the first time. I only know how to separate column using awk.

Advertisement

Answer

your sample data doesn’t have condition that satisfies your spec, so I used 0 as the threshold, also assumes trackIds are unique, otherwise use a composite key (songId,trackId). File doesn’t need to be sorted…

awk      '{user[$1];track[$4];count[$1,$4]++}
      END {for(u in user) 
             {sep="t";
              for(t in track) 
                {if(count[u,t]>0) 
                   {if(sep=="t") {printf "%s",u sep; sep=OFS}
                    printf "%s", sep t}}
           printf "%s", ORS}}' file

gives

002  id_05 idRC id2
001  id_01 id2

FINAL UPDATE

using your updated file and editing for the format, this script produces the listed output

sed -r 's/  +/t/g' file   |  # you won't need this step for tsv format
awk -F't' '{print $1,$5}' | 
sort                       | 
uniq -c                    |
awk '$1<1  {next}
     $2==p {tracks=tracks FS $3} 
     $2!=p {if(tracks) print tracks; p=$2; tracks=p "t" $3}
     END   {print tracks}'

output

user_000022 722bd5fd-1b27-4ec1-ba21-3e7dc3c514b0 9af3dae2-f73e‌​-43b‌​e-9784-85d9c16‌​90b38 9eed842d-5a1a-42c8-9788-0a11e818f35c
User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement