I have a large genome data file (.txt) in the format below. I would like to split it by the chromosome column (chr1, chr2, ..., chrX, chrY, and so forth),
keeping the header line in every split file. How can I do this with a Unix/Linux command?
genome data
variantId chromosome begin end
1 1 33223 34343
2 2 44543 46444
3 2 55566 59999
4 3 33445 55666
result
file.chr1.txt
variantId chromosome begin end
1 1 33223 34343

file.chr2.txt
variantId chromosome begin end
2 2 44543 46444
3 2 55566 59999

file.chr3.txt
variantId chromosome begin end
4 3 33445 55666
Answer
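Is this data for the human genome (i.e. always chromosomes 1 to 22 plus X and Y)? If so, how’s this:
for chr in $(seq 1 22) X Y
do
    head -n1 data.txt >chr$chr.txt
done
awk 'NR != 1 { print $0 >>("chr"$2".txt") }' data.txt
The loop writes the header line into one file per chromosome, and the awk command then appends every data row to the file named after its second column.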
(This is a second edit, based on @Sasha’s comment above.)
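Note that the parentheses around ("chr"$2".txt") are apparently not needed with GNU awk, but they are with my OS X version of awk. An unparenthesized concatenation after a redirection is ambiguous in awk, so keeping the parentheses is the portable choice.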
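If the set of chromosome values is not known in advance, a single awk pass can write the header the first time each value appears. This is only a sketch: it assumes whitespace-separated columns with the chromosome in column 2, uses the file.chrN.txt naming from the question, and builds its own small data.txt from the sample rows so that it runs as-is.

```shell
# Sample input taken from the question (whitespace-separated columns).
cat > data.txt <<'EOF'
variantId chromosome begin end
1 1 33223 34343
2 2 44543 46444
3 2 55566 59999
4 3 33445 55666
EOF

# Single pass: remember the header, then write it to each output file
# the first time that chromosome value is seen.
awk '
NR == 1 { header = $0; next }      # keep the header line, do not print it yet
{
    out = "file.chr" $2 ".txt"     # assumed naming, following the question
    if (!(out in seen)) {          # first row for this chromosome
        print header > out         # the file is truncated once, when first opened
        seen[out] = 1
    }
    print $0 > out                 # further prints go to the same open file
}
' data.txt
```

Because the output name is stored in a variable first, the redirection needs no parentheses here. Each output file stays open until awk exits, which should be fine for a couple of dozen chromosomes; an awk with a low limit on open files might need a different approach.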