Category Archives: Lab blog

Linux security tips and tricks

So you want to allow users to upload files to your server? This can be dangerous: sooner or later someone will upload a malicious PHP script that gives them access to the directories of your web applications.

Here are some tips and tricks to aid in the safety of your server. We use all of these, and some others that are not included here so that the bad guys can’t figure out all of our security approaches!


Reading and writing to the same file

If you try to modify a file (removing all empty lines for example) using a command like:

 cat file.txt | sed '/^$/d' > file.txt 

you will end up with an empty file.txt. The reason is that bash parses the command line looking for “metacharacters” (“|”, “>” and “space” in this case) that separate words, then groups those words into commands and redirections. Redirections are set up before the command they belong to is run, so “> file.txt” takes effect FIRST: the shell opens file.txt for writing, which creates it or, if it already exists, truncates it to zero length, and connects the standard output of sed to it. By the time “cat file.txt” reads the file, it is almost always already empty. So “cat file.txt” outputs 0 lines, “sed '/^$/d'” deletes all 0 empty lines, and 0 lines get written to file.txt. This works as “intended” and bash outputs no error.
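The ordering is easy to see in a small experiment. The sketch below drops the pipe so that no timing is involved: the shell truncates the output file before sed ever opens it for reading. The scratch file created with mktemp and the variable name demo_file are only there for illustration.

```shell
# Scratch file with three lines, one of them empty (illustrative only).
demo_file=$(mktemp)
printf 'one\n\ntwo\n' > "$demo_file"

# The redirection is applied first, so sed reads an already-truncated file.
sed '/^$/d' "$demo_file" > "$demo_file"

# Report whether anything survived.
if [ -s "$demo_file" ]; then
    echo "file still has content"
else
    echo "file is now empty"
fi
```

With the pipe version the same thing happens in practice, because the redirection for sed is set up while cat is still starting.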

You can get around this by using a temporary file.

 cat file.txt | sed '/^$/d' > tmp_file.txt
mv tmp_file.txt file.txt
 

However, because mv replaces file.txt with what is technically a new file, you might lose some information, in particular the original permissions and, if file.txt was a symbolic link, the link itself.
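One way to keep the permissions and any symbolic link is to copy the filtered text back into the existing file instead of moving a new file over it. The following is a minimal sketch, assuming mktemp is available; the function name strip_empty_lines is made up for this example. Note that the copy back is not atomic, so an interruption at that moment could leave the file partially written.

```shell
# Hypothetical helper: remove empty lines from a file without replacing the file itself.
strip_empty_lines() {
    target=$1
    tmp_file=$(mktemp) || return 1
    # Write the filtered text to a temporary file first; only continue if sed succeeds.
    if sed '/^$/d' "$target" > "$tmp_file"; then
        # Copy the contents back INTO the existing file. The file is truncated and
        # rewritten in place, so its mode is kept and a symbolic link stays a link.
        cat "$tmp_file" > "$target"
    fi
    rm -f "$tmp_file"
}
```

Usage would look like: strip_empty_lines file.txt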

Another option is to use sponge, which is part of moreutils and, sadly, is not installed by default on many systems.

 cat file.txt | sed '/^$/d' | sponge file.txt
 

 

SoCal Hackathon 2018

We are pleased to announce the SoCal Bioinformatics Hackathon.

From 10-12 January, 2018, the NCBI will help run a bioinformatics hackathon in Southern California hosted by San Diego State University!  The hackathon will focus on advanced bioinformatics analysis of next generation sequencing data, proteomics, and metadata. This event is for researchers, including students and postdocs, who have already engaged in the use of bioinformatics data or in the development of pipelines for bioinformatics analyses from high-throughput experiments. Some projects are available to other non-scientific developers, mathematicians, or librarians.


NSF Conflicts

For several years the NSF has been prototyping a spreadsheet-based conflicts reporting system. The spreadsheet typically has the following fields:

 

C | Name: | Organizational Affiliation | Optional (email, Department) | Last Active

The problem is you need to make this file every time you submit a grant. Here is a somewhat trivial solution, but hopefully it will help you create this file.


Splitting and pairing fastq files

A lot of software benefits from paired fastq files that contain mate pair information, and usually you get these from your sequence provider. However, sometimes (e.g. when you download them from the SRA) you get sequences that are not appropriately paired.

There are lots of solutions (e.g. this thread suggests using Trimmomatic and this thread has an awk solution) but none of them both split the sequences and order them. Until now.

We’ve developed a bunch of different solutions to this problem in python (including fastq_pairs.py, pair_fastq_fast.py, pair_fastq_files.py, and pair_fastq_lowmem.py).

Recently, however, we’ve been handling very large files, and the performance of these programs (yes, even the lowmem version) is hindering our ability to process them.

Therefore, we introduce fastq_pair, a C implementation for pairing fastq files and sorting out which reads have matches in both files and which are singletons. This code starts with two fastq files and creates four output files. It is quick and efficient, especially if you tune the size of the hash table (which you can do with a command line option).

It takes advantage of random file access. We open the first file and build an index of the IDs in the file and the positions at which those IDs occur. Then we read the second file and, if an ID matches, we scoot to the start of the appropriate line in the first file and write both sequences to the “pairs” files. We also set a flag in our data structure so we know that we’ve printed that sequence. If the ID doesn’t match, we write the sequence to the “singles” file, and at the end of all the processing we go through the IDs in our data structure and print out those sequences we haven’t printed yet.
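The bookkeeping described above (match, flag, then sweep up the leftovers) can be sketched with awk from the shell. This is not the fastq_pair code and it does not use a position index: it simply keeps every record of the first file in memory, so it only suits small files. It assumes four-line fastq records and IDs that match once anything after the first space and any trailing /1 or /2 are removed. The function name pair_fastq and the output file names are made up for this example.

```shell
# Hypothetical helper: pair_fastq LEFT.fastq RIGHT.fastq
# Output names (INPUT.paired.fq, INPUT.single.fq) are assumptions of this sketch.
pair_fastq() {
    awk -v left="$1" -v right="$2" '
    # Read one four-line record from file f into the globals "record" and "id".
    # Returns 0 at end of file (or on a truncated record), 1 otherwise.
    function read_record(f,    line, i) {
        record = ""
        for (i = 1; i <= 4; i++) {
            if ((getline line < f) <= 0) return 0
            if (i == 1) {
                id = line
                sub(/[ \t].*$/, "", id)   # drop any comment after the ID
                sub(/\/[12]$/, "", id)    # drop a trailing /1 or /2
            }
            record = record line "\n"
        }
        return 1
    }
    BEGIN {
        # Pass 1: remember every record of the left file, keyed by ID.
        n = 0
        while (read_record(left)) { order[++n] = id; stored[id] = record }
        close(left)
        # Pass 2: stream the right file and look each ID up.
        while (read_record(right)) {
            if (id in stored) {
                printf "%s", stored[id] > (left ".paired.fq")
                printf "%s", record > (right ".paired.fq")
                printed[id] = 1      # flag: this left record has been written
            } else {
                printf "%s", record > (right ".single.fq")
            }
        }
        # Finally, write the left records that never found a mate.
        for (i = 1; i <= n; i++)
            if (!(order[i] in printed))
                printf "%s", stored[order[i]] > (left ".single.fq")
    }'
}
```

Called as pair_fastq left.fastq right.fastq, this writes up to four files next to the inputs. Storing file positions instead of whole records, as the text describes, is what keeps memory use low on very large files.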

Take a look and give it a try!