I have a file like this, where each row represents a position on a scaffold and some positions are omitted (there are actually many more rows for each scaffold):
SCF_1 0 1
SCF_1 3 4
SCF_1 9 10
SCF_2 0 1
SCF_2 4 5
SCF_2 12 13
SCF_2 23 24
SCF_2 79 80
SCF_3 2 3
SCF_4 1 2
...
Ultimately I want to make 100kb-sized windows for each scaffold separately (the last window on each scaffold would be smaller than 100kb). This is what it should look like:
SCF_1 0 280000
SCF_1 280000 576300
SCF_1 576300 578000
SCF_2 9002 630000
...
The ranges will not look uniform because some positions are omitted. I was thinking of adding another column with ascending numbers for each scaffold, as shown here, but I’m a newbie to coding and don’t know how:
SCF_1 0 1 0
SCF_1 3 4 1
SCF_1 9 10 2
SCF_2 0 1 0
SCF_2 4 5 1
SCF_2 12 13 2
SCF_2 23 24 3
SCF_2 79 80 4
SCF_3 2 3 0
SCF_3 5 6 1
Answer
All right, I finished a bash script that will do exactly what you need. Go ahead and save the following as num_count.sh (or whatever you want as long as it’s a shell script format) and it should do the trick for you:
#!/bin/bash

# Color declarations
RED='\033[0;31m'
GREEN='\033[0;32m'
LIGHTBLUE='\033[1;34m'
LIGHTGREEN='\033[1;32m'
NC='\033[0m' # No Color

# Ignore the weird spacing. I promise it looks good when it's echoed out to the screen.
echo -e ${LIGHTBLUE}"############################################################"
echo "#             Running string counting script.              #"
echo "#                                                          #"
echo -e "#        ${LIGHTGREEN}Syntax: num_count.sh inputFile outputFile${LIGHTBLUE}         #"
echo "#                                                          #"
echo "#     The script will count the number of instances of     #"
echo "#     the first string and increment the number as it      #"
echo "#  finds a new one, appending it to the end of each line.  #"
echo -e "############################################################"${NC}

numCount=0
oldStr=null

# Both the input file and the output file are required.
if [ -z "$1" ] || [ -z "$2" ]; then
    echo "Insufficient arguments. Please correct your parameters and run the script again."
    exit 1
fi

# Empty (or create) the output file before appending to it.
> "$2"

while IFS= read -r line; do
    # The first whitespace-separated field is the scaffold name.
    firstStr=$(echo "$line" | awk '{print $1;}')
    if [ "$oldStr" == "$firstStr" ]; then
        # Same scaffold as the previous line: increment the counter.
        ((numCount++))
        echo -e "$line\t$numCount" >> "$2"
    else
        # New scaffold: remember it and restart the counter at 0.
        oldStr=$firstStr
        numCount=0
        echo -e "$line\t$numCount" >> "$2"
    fi
done < "$1"
Essentially, you’ll need to run the script with the first argument as the file that contains the lines you want counted, and the second argument as the output file. Be careful because the output file will be overwritten with the output data. I hope this helps!
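Since the real file has many more rows per scaffold, note that the loop above starts one awk process per input line, which can be slow on large inputs. As a rough sketch of the same idea done in a single awk pass (the function name add_counter is made up for illustration, and the input is assumed to be sorted so that all rows of a scaffold are adjacent):

```shell
#!/bin/bash

# add_counter (hypothetical helper name): prints every line of the input
# file with a per-scaffold counter appended as a tab-separated last column.
# Assumption: rows belonging to the same scaffold are adjacent in the file.
add_counter() {
    awk '{
        if (NR == 1 || $1 != prev) {
            # First line or a new scaffold name: restart the counter.
            n = 0
            prev = $1
        } else {
            n++
        }
        print $0 "\t" n
    }' "$1"
}
```

After sourcing the file, it could be run as `add_counter inputFile > outputFile`; unlike the script above, it writes to standard output and leaves the choice of output file to the caller.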
Here’s the before and after:
Before:
SCF_1 0 1
SCF_1 3 4
SCF_1 9 10
SCF_2 0 1
SCF_2 4 5
SCF_2 12 13
SCF_2 23 24
SCF_2 79 80
SCF_3 2 3
SCF_4 1 2
After:
SCF_1 0 1 0
SCF_1 3 4 1
SCF_1 9 10 2
SCF_2 0 1 0
SCF_2 4 5 1
SCF_2 12 13 2
SCF_2 23 24 3
SCF_2 79 80 4
SCF_3 2 3 0
SCF_4 1 2 0
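With the counter column in place, one possible way to get from there to the windows asked about in the question is to integer-divide the counter by the window size and merge all rows that share a scaffold and a window index. The sketch below is only an illustration under assumptions: the name make_windows is invented here, "100kb" is taken to mean 100000 listed positions per window, and each window runs from the start of its first row to the end of its last row, so consecutive windows will not touch if positions are missing between them.

```shell
#!/bin/bash

# make_windows (hypothetical helper name).
#   $1: file with four columns: scaffold, start, end, per-scaffold counter
#   $2: rows per window (assumed default: 100000)
# Prints one tab-separated line per window: scaffold, start, end.
make_windows() {
    local size="${2:-100000}"
    awk -v size="$size" '
        {
            # Window index of this row within its scaffold.
            win = int($4 / size)
            changed = (NR == 1 || $1 != scf || win != cur)
            # A new scaffold or window index closes the previous window.
            if (NR > 1 && changed) {
                print scf "\t" start "\t" end
            }
            # Open a new window at the start coordinate of this row.
            if (changed) {
                scf = $1
                cur = win
                start = $2
            }
            # The window always ends at the most recent row seen.
            end = $3
        }
        END {
            # Flush the last open window, if the file was not empty.
            if (NR > 0) print scf "\t" start "\t" end
        }
    ' "$1"
}
```

After sourcing the file, a call such as `make_windows outputFile 100000 > windows.txt` would write the windows to windows.txt (a file name chosen only for this example). The last window of each scaffold simply holds whatever rows remain, so it may cover fewer positions than the others.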