People often ask me for tips on learning new programming languages. Here, we talk about some of the broader concepts in learning a language.
In your bioinformatics program you’ll probably use Java, MATLAB, Python, and maybe C/C++, Perl, Julia, and some other languages. If you have not already, you will learn the core concepts of programming (e.g., flow control, data structures, and algorithms) in classes. Although the syntax for these things changes from language to language, the theory is the same.
You should also have learnt the core concepts of computational thinking in a class: don’t sit down and try to write the program. Sit down and think about what the program has to do. How are you going to approach the problem? Which things happen in which order? How can you simplify the problem or break it into smaller pieces?
Instead of sitting down and trying to learn a language, it is always better to have a problem in hand and ask “how can I solve it using this language?” or even “Is this the right language to solve the problem?”.
The difference between languages is more in the construction of your code. In that way programming languages are analogous to spoken languages. All spoken languages have a grammar and syntax, but the wording, phrasing, and ordering is specific to each.
When starting with a new language, one of the first things I do is think about how I am going to write and debug the code. You should use an Integrated Development Environment (IDE) for this. IDEs are really helpful when you are starting out because they help you correct the syntax, most have autocompletion that shows you the available options, and they help you find the documentation for methods. Each language has its own favorite, and here are some of them that you can try:
The next thing to do is to watch some videos. I always highly recommend videos by the New Boston for beginners. They have YouTube videos on a whole bunch of different languages [ Java | Python | C++ ] and they are a good place to start. Most of their videos are short, and you can pick and choose the things you want to learn. There is also a series of videos from eXscript that includes a complete series on Python.
The How to Think Like a Computer Scientist: Interactive Edition has a complete Python class.
Finally, learn by doing. Write some code! When you are writing code you are solving problems and learning your craft.
When you get stuck, Google is your friend. Before having someone else go over your code, search the web using whatever error message you are getting. Part of learning to program is learning how to identify and fix bugs in your code. The best way to do that is to look at the actual error message that you get, remove the things that are specific to your code (e.g., variable names), and search Google for what is left, in quotes. In a list of Google results, my first go-to site is StackOverflow. If you want to get some practice learning a language you can also answer some questions there!
For bioinformatics, here are some simple things that you should be able to do. Write code in different languages to try to accomplish these tasks. For some of these we’ve provided pointers on things to consider when you’re thinking about the problem.
- Read a fasta file and print out all the definition lines
- Read a fasta file and summarize the longest and the shortest sequence (which one is it? how long is it?), and report the mean and median sequence lengths
- Read a fasta file and make a hash with all kmers of length k (start with k=7) and their frequency
- Read a fastq file and print out all the definition lines
- Read a fastq file and summarize the longest and the shortest sequence (which one is it? how long is it?), and report the mean and median sequence lengths
- Read a fastq file and print out the quality scores as a number
- Reverse complement a DNA sequence
- Calculate the Shannon entropy for a DNA sequence
- Calculate the N50 of a set of sequences (for example, the contigs in an assembly)
- Here are some more tutorials you should work through.
- Here are some resources for teaching bioinformatics
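Two of the tasks above hide details that are easy to miss: how many k-mers a sequence contains, and how a fastq quality character becomes a number. Here is a minimal Python sketch of both. The function names `count_kmers` and `phred_scores` are made up for illustration, and the default offset of 33 assumes the Phred+33 encoding, so check which encoding your own files use.

```python
from collections import Counter


def count_kmers(sequence, k=7):
    """Count every overlapping k-mer in a sequence (case-insensitive)."""
    sequence = sequence.upper()
    # A sequence of length n has n - k + 1 windows of width k,
    # and none at all if it is shorter than k.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def phred_scores(quality_line, offset=33):
    """Convert a fastq quality string into a list of integer scores.

    Assumption: Phred+33 encoding. Older Illumina files used an offset of 64.
    """
    return [ord(char) - offset for char in quality_line.strip()]
```

A `Counter` behaves like the hash (dictionary) the task asks for, so `count_kmers(seq)["ACGTACG"]` gives the frequency of that kmer, or 0 if it never occurs.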
Remember, don’t just try to hack away until you get the answer. Think about the problem. For example, the first problem we pose is “Read a fasta file and print out all the definition lines”. Here is how we might break that problem down:
- List the tasks that must be executed. Also, are there functions in your language that help perform each task? For example, using Python:
  - Open a file. Use the open() function.
  - Iterate through each line in the file. Loop through the file using a for loop.
  - Only process definition lines and skip sequence lines. Look for the “>” sign at the beginning of a line. Use an if statement to decide what to do next. Check whether the “>” is at the beginning of the line or somewhere else in the line.
  - Print out the definition line when you find it. Use the print() function.
- What data structures are required for each task?
  - We are mainly dealing with reading in a file and printing information per line, so we don’t need to remember data or store it in variables.
  - We will have a File object that is iterated through using the for loop.
  - We will deal with Strings when we read from the file and print to the screen.
- How is input data formatted or organized?
  - Data is in FASTA format.
  - Each definition line begins with a “>”.
  - There could be more than one sequence line after each definition line. This affects the decision about when to skip a line and when to print it.
- How do we want the output formatted or organized? Options to consider are:
  - Print out definition lines to the screen one line at a time.
  - Print out definition lines to the screen without the “>” sign.
  - Write out definition lines to a file without the “>” sign. We’ll need to use the open() function to write to a file.
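The breakdown above can be sketched in Python roughly as follows. This is one way to do it, not the only one: the function name `definition_lines` and the file name `example.fasta` are hypothetical, and the sketch assumes a plain-text (uncompressed) fasta file.

```python
def definition_lines(path, strip_marker=False):
    """Yield each definition line of a fasta file, one at a time."""
    with open(path) as handle:            # open the file
        for line in handle:               # iterate through each line
            # Only definition lines start with ">". A ">" anywhere else
            # in the line does not count, so we test the start only.
            if line.startswith(">"):
                line = line.rstrip("\n")  # drop the trailing newline
                yield line[1:] if strip_marker else line


# Example usage (assumes a file called example.fasta exists):
#
#     for defline in definition_lines("example.fasta"):
#         print(defline)
```

Setting `strip_marker=True` covers the output option of printing definition lines without the “>” sign. Nothing is stored between lines, which matches the observation that we don’t need to remember any data for this task.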
I suggest that you make a module or library in your chosen language that solves each of those problems. At some point in your bioinformatics career, your future self will thank you!
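As a starting point, here is a sketch of what such a module might look like in Python, covering three of the tasks above. The function names are my own choices for illustration. The N50 function assumes you pass it a list of sequence lengths, and it uses the common definition of N50: the length L such that sequences of length L or longer contain at least half of all the bases.

```python
import math
from collections import Counter

# Maps each base to its complement; N (unknown base) pairs with N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def shannon_entropy(sequence):
    """Return the Shannon entropy of a sequence, in bits per base.

    H = -sum(p * log2(p)), where p is the frequency of each distinct base.
    """
    if not sequence:
        return 0.0
    counts = Counter(sequence.upper())
    total = len(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # Stop at the first length where we have covered half the bases.
        if running >= half:
            return length
    return 0  # no sequences were given
```

For example, `reverse_complement("ATGC")` returns `"GCAT"`, a sequence with equal numbers of A, C, G, and T has an entropy of 2 bits, and `n50([2, 3, 4, 5, 6])` returns 5 because the two longest sequences (6 + 5 = 11) cover at least half of the 20 total bases.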