
How can I add a column of ascending numbers for each scaffold in my BED file?

I have a file like this, where each row represents a position in a scaffold and some positions are omitted. (There are actually many more rows for each scaffold.)

SCF_1     0  1
SCF_1     3  4
SCF_1     9  10
SCF_2     0  1
SCF_2     4  5
SCF_2     12 13
SCF_2     23 24
SCF_2     79 80
SCF_3     2  3
SCF_4     1  2
...

Ultimately I want to make 100 kb windows for each scaffold separately (the last window on each scaffold would be smaller than 100 kb). This is what it should look like:

SCF_1 0       280000
SCF_1 280000  576300
SCF_1 576300  578000
SCF_2 9002    630000
... 

The ranges will not look uniform because some positions are omitted. I was thinking of adding another column with ascending numbers for each scaffold, but I'm a newbie to coding and don't know how. Something like this:

SCF_1     0  1   0     
SCF_1     3  4   1       
SCF_1     9  10  2        
SCF_2     0  1   0       
SCF_2     4  5   1       
SCF_2     12 13  2        
SCF_2     23 24  3        
SCF_2     79 80  4        
SCF_3     2  3   0       
SCF_3     5  6   1


Answer

All right, I finished a bash script that will do exactly what you need. Save the following as num_count.sh (or any name you like, as long as it is a shell script) and it should do the trick:

#!/bin/bash

#Color declarations
RED='\033[0;31m'
GREEN='\033[0;32m'
LIGHTBLUE='\033[1;34m'
LIGHTGREEN='\033[1;32m'
NC='\033[0m' # No Color

#Ignore the weird spacing. I promise it looks good when it's echoed out to the screen.
echo -e ${LIGHTBLUE}"############################################################"
echo "# Running string counting script.                          #"
echo "#                                                          #"
echo -e "# ${LIGHTGREEN}Syntax: num_count.sh inputFile outputFile${LIGHTBLUE}                #"
echo "#                                                          #"
echo "# The script will count the number of instances of         #"
echo "# the first string and increment the number as it          #"
echo "# finds a new one, appending it to the end of each line.   #"
echo -e "############################################################"${NC}

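numCount=0
oldStr=null
if [ -z "$1" ] || [ -z "$2" ]; then
    echo "Insufficient arguments. Please correct your parameters and run the script again."
    exit 1
fi
# Truncate (or create) the output file before appending to it.
> "$2"
while IFS= read -r line; do
    # First whitespace-delimited field of the line (the scaffold name).
    firstStr=$(echo "$line" | awk '{print $1;}')
    if [ "$oldStr" = "$firstStr" ]; then
        # Same scaffold as the previous line: increment the counter.
        ((numCount++))
    else
        # New scaffold: remember it and restart the counter at 0.
        oldStr=$firstStr
        numCount=0
    fi
    # Append the counter to the line as a tab-separated column.
    printf '%s\t%s\n' "$line" "$numCount" >> "$2"
done < "$1"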

To use it, run the script with the input file (the lines you want counted) as the first argument and the output file as the second. Be careful: the output file will be overwritten with the output data. I hope this helps!
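If the file is large, the same per-scaffold counter can probably be produced in a single awk pass, which avoids starting a new process for every line. This is only a minimal sketch, not a replacement for the script above: the function name add_scaffold_index is made up for illustration, and it assumes that all rows of a scaffold sit next to each other, as in the example.

```shell
#!/bin/bash
# Hypothetical helper (the name is illustrative): append a 0-based counter
# that restarts whenever the first column changes.
# Assumption: rows belonging to the same scaffold are adjacent in the input.
add_scaffold_index() {
    awk '{
        if (NR == 1 || $1 != prev) { n = 0; prev = $1 }  # new scaffold: restart at 0
        else { n++ }                                     # same scaffold: increment
        print $0 "\t" n                                  # counter as a tab-separated column
    }' "$1"
}

# Usage (file names are placeholders):
#   add_scaffold_index input.bed > output.bed
```

Unlike the script, this writes to standard output, so the output file is only touched when you redirect into it.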

Here’s the before and after:

Before:

SCF_1     0  1
SCF_1     3  4
SCF_1     9  10
SCF_2     0  1
SCF_2     4  5
SCF_2     12 13
SCF_2     23 24
SCF_2     79 80
SCF_3     2  3
SCF_4     1  2

After:

SCF_1     0  1  0
SCF_1     3  4  1
SCF_1     9  10 2
SCF_2     0  1  0
SCF_2     4  5  1
SCF_2     12 13 2
SCF_2     23 24 3
SCF_2     79 80 4
SCF_3     2  3  0
SCF_4     1  2  0
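
For the windowing goal in the question, one possible approach is to group every N consecutive rows of a scaffold into one window. The following is a rough sketch under several assumptions that the question does not confirm: a "100 kb window" means 100,000 listed rows, a window starts at the start column of its first row and ends at the end column of its last row, and rows of a scaffold are adjacent and sorted. The function name make_windows and its optional size argument are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: collapse every `size` consecutive rows of a scaffold
# into one line "scaffold<TAB>start<TAB>end".
# Assumptions: column 2 is the start, column 3 is the end, and rows of the
# same scaffold are adjacent and sorted by position.
make_windows() {
    awk -v size="${2:-100000}" '
        # Print the window collected so far, if it holds any rows.
        function emit() { if (count > 0) print scf "\t" start "\t" end }
        {
            if (NR == 1 || $1 != scf || count == size) {
                emit()                      # close the previous window
                scf = $1; start = $2; count = 0
            }
            end = $3                        # last row seen closes the window
            count++
        }
        END { emit() }                      # final, possibly shorter, window
    ' "$1"
}

# Usage (file names are placeholders):
#   make_windows input.bed 100000 > windows.bed
```

The last window on each scaffold simply contains whatever rows are left, so it can be shorter than the others.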