Instead of writing a script, is there a one-liner to shuffle a large tab-separated text file based on the unique elements in the first column? That is, for each unique element in the first column, the output should contain the same number of rows, and that number is specified by the user.
There are two output possibilities: maintaining the row order or randomizing it.
Input:
chr1 3003204 3003454 * 37 +
chr1 3003235 3003485 * 37 +
chr1 3003148 3003152 * 37 -
chr1 3003461 3003711 * 37 +
chr11 71863609 71863647 * 37 +
chr11 71864025 71864275 * 37 +
chr11 71864058 71864308 * 37 -
chr11 71864534 71864784 * 37 +
chrY 90828920 90829170 * 23 -
chrY 90829096 90829346 * 23 +
chrY 90828924 90829174 * 23 -
chrY 90828925 90829175 * 23 -
Output (1 row per category – the number of rows is defined by the user)
Output1 (randomized – row order will change):
chr1 3003235 3003485 * 37 +
chr11 71863609 71863647 * 37 +
chrY 90828925 90829175 * 23 -
Output2 (not randomized – row order will be maintained):
chr1 3003204 3003454 * 37 +
chr11 71863609 71863647 * 37 +
chrY 90828920 90829170 * 23 -
I tried using sort -u with cut on the first column to fetch the unique elements, and then running a combination of grep and head for each element to generate the output file, which can be randomized using shuf. There might be a better solution, as the file can be huge (more than 50 million lines).
Cheers
Answer
Try using awk. The commands below use asort, which requires GNU awk (gawk).
Maintaining row order
awk '!($1 in a) {a[$1]=$0} END { n=asort(a,b); for (i=1; i<=n; i++) print b[i] }' file
Output:
chr1 3003204 3003454 * 37 +
chr11 71863609 71863647 * 37 +
chrY 90828920 90829170 * 23 -
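If the first row of each category is all that is needed, a streaming filter may be enough. The sketch below is an alternative, not the command above: it prints a line the first time its first field is seen, keeps only the keys in memory, and should work in any POSIX awk. The file name sample.tsv and the variable first_rows are made up for the illustration, and the sample is a cut-down copy of the question's input.

```shell
# Build a small tab-separated sample from a few of the question's rows
# (sample.tsv is a hypothetical file name used only for this illustration).
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1  3003204  3003454  '*' 37 + \
  chr1  3003235  3003485  '*' 37 + \
  chr11 71863609 71863647 '*' 37 + \
  chr11 71864025 71864275 '*' 37 + \
  chrY  90828920 90829170 '*' 23 - \
  chrY  90829096 90829346 '*' 23 + > sample.tsv

# seen[$1]++ is 0 (false) only on the first occurrence of a key,
# so the negation prints exactly the first row per first-column value.
first_rows=$(awk '!seen[$1]++' sample.tsv)
printf '%s\n' "$first_rows"
```

Unlike the asort version, this keeps the rows in input order rather than sorting them, which gives the same result when the file is already grouped by the first column.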
Random row order
For this, just pipe the output of shuf into the awk command above:
shuf file | awk '!($1 in a) {a[$1]=$0} END { n=asort(a,b); for (i=1; i<=n; i++) print b[i] }'
Output (different with each run):
chr1 3003148 3003152 * 37 -
chr11 71864025 71864275 * 37 +
chrY 90829096 90829346 * 23 +
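For a file with tens of millions of lines, shuffling everything first may be costly, since shuf generally has to hold its whole input. A possible alternative, swapped in here as a sketch and not part of the answer above, is single-pass reservoir sampling in awk: the i-th row seen for a key replaces the current pick with probability 1/i, so each row of a category ends up equally likely. The names sample.tsv, count, pick and picked are hypothetical.

```shell
# Cut-down tab-separated sample (hypothetical file name, illustration only).
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1  3003204  3003454  '*' 37 + \
  chr1  3003235  3003485  '*' 37 + \
  chr1  3003148  3003152  '*' 37 - \
  chr11 71863609 71863647 '*' 37 + \
  chr11 71864025 71864275 '*' 37 + \
  chrY  90828920 90829170 '*' 23 - > sample.tsv

# Reservoir sampling with a reservoir of one row per key:
# rand() * i < 1 holds with probability 1/i for the i-th row of that key.
# The END loop order is unspecified, so the result is sorted on column 1.
picked=$(awk 'BEGIN { srand() }
              { if (rand() * ++count[$1] < 1) pick[$1] = $0 }
              END { for (k in pick) print pick[k] }' sample.tsv |
         LC_ALL=C sort -k1,1)
printf '%s\n' "$picked"
```

Note that srand() with no argument seeds from the clock, so two runs within the same second can return the same rows.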
Variable number of rows
#!/bin/bash
numRow=3
awk -v numRow="$numRow" 'n[$1]<numRow {a[$1]=(n[$1] ? a[$1] "\n" : "") $0; n[$1]++} END { m=asort(a,b); for (i=1; i<=m; i++) print b[i] }' file
Output:
chr1 3003204 3003454 * 37 +
chr1 3003235 3003485 * 37 +
chr1 3003148 3003152 * 37 -
chr11 71863609 71863647 * 37 +
chr11 71864025 71864275 * 37 +
chr11 71864058 71864308 * 37 -
chrY 90828920 90829170 * 23 -
chrY 90829096 90829346 * 23 +
chrY 90828924 90829174 * 23 -
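The same "first N rows per category" selection can also be sketched as a streaming filter that stores only a counter per key, which may matter for very large inputs. This is an alternative under the assumption that input order is acceptable as output order; sample.tsv, count and first_n are names invented for the example.

```shell
# Cut-down tab-separated sample (hypothetical file name, illustration only).
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1  3003204  3003454  '*' 37 + \
  chr1  3003235  3003485  '*' 37 + \
  chr1  3003148  3003152  '*' 37 - \
  chr11 71863609 71863647 '*' 37 + \
  chr11 71864025 71864275 '*' 37 + \
  chrY  90828920 90829170 '*' 23 - > sample.tsv

numRow=2   # rows to keep per first-column value, chosen by the user

# count[$1]++ yields how many rows of this key were seen before the current
# one, so the pattern is true for the first numRow rows of each key only.
first_n=$(awk -v n="$numRow" 'count[$1]++ < n' sample.tsv)
printf '%s\n' "$first_n"
```

A category with fewer than numRow rows (chrY in this sample) simply contributes all of its rows. Feeding the output of shuf file into the same filter instead of reading the file directly should give a random selection of up to numRow rows per category, in shuffled order.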