Skip to content
Advertisement

Shuffling a large text file without/with group order maintained

Instead of making a script, it there a one liner to shuffle a large tab separated text file, based on the unique elements in the first column. That means, for each unique element in the first column, number of rows will be equal and be specified by the user.

There are two output possibilities, maintaining the row order or randomized row order.

Input :

chr1    3003204 3003454 *   37  +
chr1    3003235 3003485 *   37  +
chr1    3003148 3003152 *   37  -
chr1    3003461 3003711 *   37  +
chr11   71863609    71863647    *   37  +
chr11   71864025    71864275    *   37  +
chr11   71864058    71864308    *   37  -
chr11   71864534    71864784    *   37  +
chrY    90828920    90829170    *   23  -
chrY    90829096    90829346    *   23  +
chrY    90828924    90829174    *   23  -
chrY    90828925    90829175    *   23  -

Output (1 row per category – defined by the user) Output1 (randomized – row order will change) :

chr1    3003235 3003485 *   37  +
chr11   71863609    71863647    *   37  +
chrY    90828925    90829175    *   23  -

Output1 (randomized – row order will be maintained) :

chr1    3003204 3003454 *   37  +
chr11   71863609    71863647    *   37  +
chrY    90828920    90829170    *   23  -

I tried using sort -u with cut on first column to fetch unique elements and then running a combination of grep and head for each element to generate the output file, which can be randomized using shuf, there might be a better solution as the file can be huge > 50 Million lines.

Cheers

Advertisement

Answer

Try using

Maintaining row order

awk '!($1 in a) {a[$1]=$0} END { asort(a,b); for (x in b) print b[x] }' file

Output:

chr1    3003204 3003454 *   37  +
chr11   71863609    71863647    *   37  +
chrY    90828920    90829170    *   23  -

Random row order

For this, just pipe output of shuf to above awk command

shuf file | awk '!($1 in a) {a[$1]=$0} END { asort(a,b); for (x in b) print b[x] }'

Output (different with each run)

chr1    3003148 3003152 *   37  -
chr11   71864025    71864275    *   37  +
chrY    90829096    90829346    *   23  +

Variable number of rows

#!/bin/bash
numRow=3
awk 'n[$1]<'$numRow' {a[$1]=a[$1]"n"$0; n[$1]++} END { asort(a,b); for (x in b) print b[x] }' file

Output:

chr1    3003204 3003454 *   37  +
chr1    3003235 3003485 *   37  +
chr1    3003148 3003152 *   37  -

chr11   71863609    71863647    *   37  +
chr11   71864025    71864275    *   37  +
chr11   71864058    71864308    *   37  -

chrY    90828920    90829170    *   23  -
chrY    90829096    90829346    *   23  +
chrY    90828924    90829174    *   23  -
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement