How to separate data from a large TSV file based on conditions and write it to another file using a Linux command

Question

I have a TSV file named userid-timestamp-artid-artname-traid-traname.tsv, more than 2 GB in size, with the following data. Consider the first input line, where the first column is the userid, i.e. user_000022; the second column is the timestamp, i.e. 2007-08-26T20:11:33Z; the third column is the artid, i.e. ede96d89-515f-4b00-9635-2c5f5a1746fb; the fourth column is the artname, i.e. The Aislers Set; the fifth column is the traid, i.e. 9eed842d-5a1a-42c8-9788-0a11e818f35c; and sixth

Accepted Answer

Your sample data doesn't contain a case that satisfies your spec, so I used 0 as the threshold. This also assumes trackIds are unique; otherwise use a composite key (songId,trackId). The file doesn't need to be sorted.

    awk '{user[$1]; track[$4]; count[$1,$4]++}
         END {for (u in user) {
                  sep = "\t"
                  for (t in track) {
                      if (count[u,t] > 0) {
                          # print the user once, before the first matching track
                          if (sep == "\t") {printf "%s", u sep; sep = OFS}
                          printf "%s", sep t
                      }
                  }
                  printf "%s", ORS
              }}' file

gives

    002  id_05 idRC id2
    001  id_01 id2

FINAL UPDATE

Using your updated file and editing for the format, this script produces the listed output:

    sed -r 's/  +/\t/g' file |     # you won't need this step for tsv format
    awk -F'\t' '{print $1,$5}' |
    sort |
    uniq -c |
    awk '$1<1  {next}
         $2==p {tracks=tracks FS $3}
         $2!=p {if(tracks) print tracks; p=$2; tracks=p "\t" $3}
         END   {print tracks}'

output

    user_000022 722bd5fd-1b27-4ec1-ba21-3e7dc3c514b0 9af3dae2-f73e-43be-9784-85d9c1690b38 9eed842d-5a1a-42c8-9788-0a11e818f35c
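For reuse, the same idea can be wrapped in a small shell function with the threshold as a parameter. This is only a sketch, not the answer's exact script: the function name tracks_per_user and the min argument are made up for illustration, and it assumes a genuinely tab-separated file with the userid in column 1 and the traid in column 5. It counts (user, track) pairs in awk instead of using sort | uniq -c, so it keeps one array entry per distinct pair in memory.

```shell
#!/bin/sh
# Hypothetical helper: for each user, list the track ids played more than
# MIN times (default 0). Assumes column 1 = userid and column 5 = traid.
tracks_per_user() {
    file=$1
    min=${2:-0}
    awk -F'\t' -v min="$min" '
        { count[$1 FS $5]++ }            # plays per (user, track) pair
        END {
            for (key in count)
                if (count[key] > min) print key
        }' "$file" |
    LC_ALL=C sort |                      # byte order keeps each user contiguous
    awk -F'\t' '
        $1 != prev { if (NR > 1) print line; prev = $1; line = $1 }
        { line = line "\t" $2 }          # append this track to the user line
        END { if (NR > 0) print line }'
}
```

It could be called as tracks_per_user userid-timestamp-artid-artname-traid-traname.tsv 0 > result.tsv, which writes one line per user: the userid followed by the qualifying track ids, all tab-separated. If the number of distinct pairs is too large for memory, the sort | uniq -c pipeline above is the safer choice.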


Answer