Tag: bioinformatics

Filtering on a condition using the column names and not numbers

I am trying to filter a text file with columns based on two conditions. Due to the size of the file, I cannot use the column numbers (there are thousands of columns and they are unnumbered) but need to use the column names. I have searched and tried to come up with multiple ways to do this but nothing is returned to
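One possible way to filter by column name rather than by position is sketched below. The excerpt does not show the file or the two conditions, so the column names (`score`, `group`), the thresholds, and the tab delimiter are all hypothetical placeholders.

```python
import csv
import io


def filter_rows(lines, predicates, delimiter="\t"):
    """Yield the header, then every row for which all named-column tests pass."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    yield reader.fieldnames
    for row in reader:
        # Columns are looked up by header name, never by position.
        if all(test(row[name]) for name, test in predicates.items()):
            yield [row[name] for name in reader.fieldnames]


# Hypothetical sample data and conditions, for illustration only.
sample = io.StringIO("id\tscore\tgroup\na\t5\tx\nb\t12\tx\nc\t20\ty\n")
conditions = {
    "score": lambda value: float(value) > 10,
    "group": lambda value: value == "x",
}
result = list(filter_rows(sample, conditions))
```

For a real file, an open file handle can be passed in place of the `io.StringIO` sample, so rows are streamed rather than loaded into memory at once.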

Lifting over GWAS summary statistic file from build 38 to build 37

bioinformatics gwas linux

I am using the UCSC liftOver tool and the associated chain file to lift over the results of my GWAS summary statistic file (a tab-separated file) from build 38 to build 37. The GWAS summary stat file looks like: Following is the UCSC tool with the associated chain I am using: liftover: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver chain file: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz I want to

How to speed-up sed that uses Regex on very large single cell BAM file

awk bioinformatics linux sed unix

I have the following simple script that tries to count the tag encoded with “CB:Z” in a SAM/BAM file: Typically it needs to process 40 million lines. That code takes around 1 hour to finish. This line sed 's/.*CB:Z:\([ACGT]*\).*/\1/' is very time consuming. How can I speed it up? The reason I used the Regex is that the “CB” tag column-wise
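As a rough illustration of avoiding the greedy regex, the sketch below splits each record on tabs and checks only the optional fields for a `CB:Z:` prefix. It is not the asker's script (which is not shown in the excerpt). The sample records are made up, and the function keeps the whole tag value, whereas the original regex captures only the `[ACGT]*` part.

```python
from collections import Counter


def count_cb_tags(sam_lines):
    """Count CB:Z tag values in SAM-formatted text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        # A SAM record has 11 mandatory fields; optional tags follow them.
        for field in line.rstrip("\n").split("\t")[11:]:
            if field.startswith("CB:Z:"):
                counts[field[5:]] += 1
                break  # assume at most one CB tag per record
    return counts


# Made-up records, for illustration only.
sample = [
    "@HD\tVN:1.6\n",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\tNH:i:1\tCB:Z:AAAC\n",
    "r2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tFFFF\tCB:Z:AAAC\tUB:Z:GGTT\n",
    "r3\t0\tchr1\t300\t60\t4M\t*\t0\t0\tACGT\tFFFF\tCB:Z:TTGC\n",
    "r4\t0\tchr1\t400\t60\t4M\t*\t0\t0\tACGT\tFFFF\tNH:i:1\n",
]
cb_counts = count_cb_tags(sample)
```

The same function could be fed `sys.stdin` when the script is placed after `samtools view` in a pipe. Whether this is faster than the sed pipeline on 40 million lines would need to be measured.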

How can I add a column of ascending numbers for each scaffold in my bed file

bioinformatics dna-sequence indexing linux range

So I have a file like this, with each row representing a position in the scaffolds with some positions omitted. (There are actually a lot more rows for each scaffold): and ultimately I want to make 100 kb windows for each scaffold separately (the last window on each scaffold would be less than 100 kb). This is what it should look like:
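Since neither the input nor the desired output is shown in the excerpt, the following is only a sketch under stated assumptions: each row is taken to be a (scaffold, position) pair with 1-based positions, and the goal is taken to be a window number that restarts at 1 on every scaffold. The scaffold names are hypothetical.

```python
def add_window_column(rows, window_size=100_000):
    """Append a per-scaffold window number to (scaffold, position) rows.

    Positions are assumed to be 1-based, so positions 1..100000 fall in
    window 1, positions 100001..200000 in window 2, and so on.
    """
    for scaffold, position in rows:
        yield scaffold, position, (position - 1) // window_size + 1


# Hypothetical rows, for illustration only.
sample = [
    ("scaffold_1", 1),
    ("scaffold_1", 100_000),
    ("scaffold_1", 100_001),
    ("scaffold_2", 250_000),
]
windowed = list(add_window_column(sample))
```

Because the window number depends only on the position within a scaffold, the last window on each scaffold is automatically shorter than 100 kb when the scaffold length is not a multiple of the window size.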

How to make the bash script work with one command after another?

bash bioinformatics linux shell

I have a bash script like the one below. First it takes sorted.bam files as input and uses the “stringtie” tool to produce a GTF for each sample as output. Then the path of each sample GTF is written into mergelist.txt, and “stringtie merge” is run on them to get “stringtie_merged.gtf”. I have 40 sorted.bam files in total. I separated the commands with ; After running

batch extracting data from files, naming new files according to string in input file

bioinformatics linux

With Linux I want to automatically extract data from .dat files and name the new files according to a string in the input files: I have 300 .dat files with a data structure as follows: . . . DE name1, contig1 . . SQ information1 // . . DE name1, contig2 . . SQ information2 // . where the “.”

blast could not create a unit counts container

bioinformatics blast c++ linux

I built a local BLAST database. However, when I run the blastn command I get this error message: T0 “/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_250088_130.14.22.10_9008__PrepareRelease_Linux64-Centos_1448906370/c++/compilers/unix/../../src/algo/winmask/seq_masker_istat_factory.cpp”, line 170: Error: ncbi::CSeqMaskerIstatFactory::DiscoverStatType() – could not open T0 “/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_250088_130.14.22.10_9008__PrepareRelease_Linux64-Centos_1448906370/c++/compilers/unix/../../src/algo/winmask/seq_masker_istat_factory.cpp”, line 271: Error: ncbi::CSeqMaskerIstatFactory::create() – could not create a unit counts container. I am using this command to create the local BLAST database: And this is my command for executing

What is the best way to evaluate two variables representing a single pipeline command in bash?

bash bioinformatics linux pipe pipeline

I have a function produce which determines whether a file is present and if not it runs the following command. This works fine when the command output simply writes to stdout. However in the command below I pipe the output to a second command and then to a third command before it outputs to stdout. In this scenario I get

split text file (Genome data) based on column values keeping header line

bioinformatics linux unix

I have a big genome data file (.txt) in the format below. I would like to split it based on the chromosome column (chr1, chr2 ... chrX, chrY and so forth), keeping the header line in all of the split files. How can I do this using a Unix/Linux command? genome data result Answer: Is this data for the human genome (i.e. always 46 chromosomes)? If so,
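A minimal sketch of the splitting step follows. It is not the truncated answer's approach. The file layout is not shown in the excerpt, so the sketch assumes a tab-separated file whose first column holds the chromosome name, and the output file names (`chr1.txt` and so on) are hypothetical.

```python
import os
import tempfile


def split_by_chromosome(lines, out_dir, chrom_column=0, delimiter="\t"):
    """Write one file per chromosome, each starting with the header line."""
    lines = iter(lines)
    header = next(lines)
    handles = {}
    try:
        for line in lines:
            chrom = line.rstrip("\n").split(delimiter)[chrom_column]
            if chrom not in handles:
                # First row seen for this chromosome: open its file and
                # copy the header in before any data rows.
                path = os.path.join(out_dir, f"{chrom}.txt")
                handles[chrom] = open(path, "w")
                handles[chrom].write(header)
            handles[chrom].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)


# Made-up rows, for illustration only.
sample = ["chr\tpos\tvalue\n", "chr1\t10\ta\n", "chr2\t20\tb\n", "chr1\t30\tc\n"]
out_dir = tempfile.mkdtemp()
written = split_by_chromosome(sample, out_dir)
```

Keeping one open handle per chromosome means the input is read once and rows do not need to be sorted by chromosome first. This should be fine for a few dozen chromosomes, but could hit the open-file limit on assemblies with many thousands of scaffolds.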