
PhAnToMe

For a long time we ran the project PhAnToMe at the website phantome.org.

Alas, all great things come to an end, and it is with sadness that we are winding down the phage annotation tools and methods project. However, we have not given up and are still working on new challenges.

We have migrated most of our tools to the Edwards’ lab website, but if you can’t find anything, please let us know. We still have all our tools; we are just not maintaining phantome.org any longer.

Memory and Core Usage on SGE

We are running into issues with one of our applications requesting too much memory on the cluster. We need to set appropriate limits and ensure that the application knows how much memory it has available.

To begin, we need some code to test how many cores and how much memory the application thinks it has. The application that is causing us issues is written in Java, but we’re going to do this in python3 to ensure we can debug what is going on.

There are two python3 modules that you can use to test what is available: psutil (python system and process utilities) and resource (basic mechanisms for measuring and controlling system resources utilized by a program). The former gives us access to core system information, while the latter gives us access to the resource limits that apply to our process. Here is some code to print what is, or may be, available. Before we start, here is a little helper function to convert bytes to human-readable format (from this SO post):

def sizeof_fmt(num, suffix='B'):
    for unit in ['','Ki','Mi','Gi','Ti','Pi','Ei','Zi']:
        if abs(num) < 1024.0:
            return "%3.1f%s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)

Here we figure out our host and important information:

import resource
import socket

import psutil

hostname = socket.gethostname()
m = psutil.virtual_memory()
c = psutil.cpu_count()
t = sizeof_fmt(m.total)
a = sizeof_fmt(m.available)
# RLIMIT_AS is the address space (virtual memory) limit for this process
(ms, mh) = resource.getrlimit(resource.RLIMIT_AS)
# an unlimited value (usually -1) is printed as-is; real limits are made human-readable
if ms > 0:
    ms = sizeof_fmt(ms)
if mh > 0:
    mh = sizeof_fmt(mh)
print(f"Running on: {hostname}")
print(f"Number of cpus: {c}")
print(f"Total memory: {t}\nAvailable memory: {a}")
print(f"Memory limit (ulimit): soft: {ms} hard: {mh}")

You may be wondering why we used resource.RLIMIT_AS to get the virtual memory as opposed to resource.RLIMIT_VMEM. Linux systems don’t report RLIMIT_VMEM, and instead use RLIMIT_AS for address space.

Note that we are using both psutil and resource to get information, and they tell us different things. If I run this on my laptop, I see something like this:

Running on: Laptop
Number of cpus: 8
Total memory: 15.6GiB
Available memory: 7.0GiB
Memory limit (ulimit): soft: -1 hard: -1

Note that the memory limit is -1 for both hard and soft limits (from the resource man page: the soft limit is the current limit, and may be lowered or raised by a process over time. The soft limit can never exceed the hard limit. The hard limit can be lowered to any value greater than the soft limit, but not raised.) This value is actually the value of resource.RLIM_INFINITY and so may not be -1 in your case (but probably is)!
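Because the "unlimited" value is a constant rather than a guaranteed -1, a slightly more defensive check compares against resource.RLIM_INFINITY directly. This is only a minimal sketch; the helper name describe_limit is ours and is not part of the resource module.

```python
import resource


def describe_limit(value):
    """Return 'unlimited' for RLIM_INFINITY, otherwise the limit in bytes as text."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes"


# Query the address space limit for the current process
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print(f"soft: {describe_limit(soft)} hard: {describe_limit(hard)}")
```

The same helper could call sizeof_fmt instead of printing raw bytes if you want the human-readable form.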

The equivalent information is pulled from /proc/cpuinfo or /proc/meminfo on a Linux system, and the memory limit comes from ulimit (see the man page).
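If psutil is not installed on a node, the memory numbers can be read from /proc/meminfo by hand. The sketch below assumes the usual "Key: value kB" layout of that file; parse_meminfo is a hypothetical helper name, and the result is a plain dictionary of byte counts.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict mapping field name to a number."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        # skip anything that does not look like "Key: <integer> [unit]"
        if not fields or not fields[0].isdigit():
            continue
        value = int(fields[0])
        # sizes are reported in kB, which the kernel uses to mean 1024 bytes
        if len(fields) > 1 and fields[1] == "kB":
            value *= 1024
        info[key.strip()] = value
    return info


# Only try to read the real file where it exists (i.e. on Linux)
if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    print("MemTotal:", mem.get("MemTotal"), "MemAvailable:", mem.get("MemAvailable"))
```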

So … how does this help us on the cluster? Let’s try a few simple tests. I created a file called mem.sh that basically just runs the python3 code above. When I submit it with default parameters, this is what I get:

$ qsub -cwd -o mem.out -e mem.err ./mem.sh

Running on: node15
At the start:
Number of cpus: 16
Total memory: 125.9GiB
Available memory: 124.1GiB
Memory limit (ulimit): soft: -1 hard: -1

On my cluster, node15 has 16 CPUs and 126 GiB RAM, but some of it is currently being used.

With SGE, you can pass a couple of parameters to adjust memory settings. If we restrict memory usage using the h_vmem setting, we see this answer:

$ qsub -cwd -o mem.out -e mem.err -l h_vmem=1G ./mem.sh

Running on: node48
At the start:
Number of cpus: 16
Total memory: 125.9GiB
Available memory: 123.8GiB
Memory limit (ulimit): soft: 1.0GiB hard: 1.0GiB

In this case, adding the -l h_vmem option has limited the amount of resources available via ulimit, and has set both hard and soft limits.

In contrast, setting s_vmem sets the ulimit soft limit, but leaves the hard limit unchanged:

$ qsub -cwd -o mem.out -e mem.err -l s_vmem=2G ./mem.sh

Running on: node47
At the start:
Number of cpus: 16
Total memory: 125.9GiB
Available memory: 123.8GiB
Memory limit (ulimit): soft: 2.0GiB hard: -1
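One way to let an application know how much memory it really has is to combine the two sources: use the soft limit when one is set, and otherwise fall back to what the node reports as available. The following is a sketch under that assumption; memory_budget is a hypothetical helper, and note that RLIMIT_AS caps address space rather than resident memory, so the number is a planning figure, not a guarantee.

```python
import resource


def memory_budget(available, soft_limit):
    """Return the number of bytes a job can reasonably plan to use.

    available  -- bytes the node reports as available (e.g. from psutil)
    soft_limit -- soft RLIMIT_AS value, or resource.RLIM_INFINITY if unset
    """
    if soft_limit == resource.RLIM_INFINITY:
        return available
    return min(available, soft_limit)


GiB = 1024 ** 3
# Values similar to the s_vmem=2G run above: the 2 GiB soft limit wins
print(memory_budget(int(123.8 * GiB), 2 * GiB) / GiB)
# With no limit set, we fall back to what the node reports
print(memory_budget(int(124.1 * GiB), resource.RLIM_INFINITY) / GiB)
```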

Using Java?

Unfortunately, setting the limit on SGE using -l h_vmem causes Java to crash with a known bug. You will see an error like this:

Error occurred during initialization of VM 
Could not allocate metaspace: 1073741824 bytes

There is a workaround, and on my cluster I have to set both of these:

First, export MALLOC_ARENA_MAX and ensure that your qsub inherits this variable (e.g. qsub -V):

export MALLOC_ARENA_MAX=4

Then append this Java option:

-XX:CompressedClassSpaceSize=64m

This got Java to run, but it would still crash when I tried to do anything remotely complex.
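If the job is driven from a python3 wrapper, both settings can be applied in one place instead of relying on the submitting shell. This is a rough sketch only: java_launch and myapp.jar are placeholders of ours, and the two values are simply the ones that worked on my cluster.

```python
import os


def java_launch(jar_path, arena_max=4, class_space="64m"):
    """Build the argument list and environment for running a jar with the workaround."""
    env = dict(os.environ)
    # glibc reads MALLOC_ARENA_MAX when the child process starts
    env["MALLOC_ARENA_MAX"] = str(arena_max)
    cmd = ["java", f"-XX:CompressedClassSpaceSize={class_space}", "-jar", jar_path]
    return cmd, env


cmd, env = java_launch("myapp.jar")  # myapp.jar is a placeholder name
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, env=env, check=True)
```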

snakemake tutorial

There are a lot of snakemake tutorials out there to get you started.

This tutorial is not one of those! It is quick, hands-on, and we’ll jump right in with little explanation (OK, there will be some). But I encourage you to read those tutorials, especially Titus’s tutorials as you will learn a lot from them.

New features in CentOS8

We have just made the transition of most of the servers from CentOS6 or CentOS7 to CentOS8. Most everything should be unified on CentOS8 (unless you know what you are doing). 

This brings several new changes (as always) and some added benefits. This is a summary and does not reflect all the changes.

To check your server’s operating system version, use this command:

cat /etc/redhat-release

Software Installs

The biggest changes should allow you to install software by yourself! There are two different ways you can easily install software, provided that whatever you are trying to install supports one of them.

Please note that if you do not want to do either of these, that is fine. Just let me know and I am happy to install software for you (and everyone else) to use.

Conda

A lot of bioinformatics software is now available via conda. Conda itself is installed globally, but you cannot install packages globally. Instead, you can create your own environment and install packages there.

The first time you use conda, you will need to create a local environment. Start with:

source /usr/local/anaconda3/bin/activate
conda create --name <username>

But use your username instead of <username>!

After this has run, any time you need to use conda, you can use the command

conda activate <username>

And you will get into your environment. 

A simple test is to install my fastq-pair package and see if it works:

conda install -c bioconda fastq-pair

Once it has installed, this command should give some output:

fastq-pair

Docker

Another popular way of sharing software is by using docker. We don’t support docker, but we support a drop-in replacement called podman.

Anywhere you see docker, you can use podman instead. For example, we created a focus docker image for the CAMI challenge, described here: https://hub.docker.com/r/linsalrob/cami-focus, and you can install that with:

podman pull linsalrob/cami-focus

pip

If you are trying to run some python code and don’t have the appropriate library, you should be able to use pip install as a user to add it. For example:

pip3 install --user xmlschema

This will install the appropriate libraries into your account. Of course, if you want them globally installed, just let me know.
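To confirm that a user-level install worked, you can ask python3 where it keeps per-user packages and whether it can find the module. A small sketch; is_installed is a helper name of ours, and xmlschema is just the example package from above.

```python
import importlib.util
import site


def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# Directory that "pip3 install --user" installs into for this interpreter
print("User site-packages:", site.getusersitepackages())
print("xmlschema installed:", is_installed("xmlschema"))
```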

Deprecated software and alternatives

| Deprecated | Alternative | Used for | Notes |
|---|---|---|---|
| screen | tmux | Virtual terminals. You should use this! | tmux has similar keys to screen but uses ctrl-b instead of ctrl-a to access them, e.g. create a new window: ctrl-b c |
| cd-hit | mmseqs | Clustering sequences | cd-hit is still an option if you want, but mmseqs2 appears to be much better |