Category Archives: Research

Virus Hunting in the Cloud Codeathon v2!

We are pleased to announce the second installment of the Virus Hunting Codeathon!

From 4-6 November, 2019, the NCBI will help run a bioinformatics codeathon in College Park, MD, hosted by UMIACS and CBCB at the University of Maryland. We are going to put a few hundred thousand metagenomic datasets on cloud infrastructure and further identify known, taxonomically definable and novel viruses with even faster approaches! We’re specifically looking for folks who have experience in computational virus hunting or adjacent fields. If this describes you, please apply! This event is for researchers, including students and postdocs, who are already engaged in the use of bioinformatics data or in the development of pipelines for virological analyses from high-throughput experiments. The event is open to anyone selected for the codeathon and willing to travel to College Park (see below).

Participants will be organized into five to eight teams of five to six individuals each. These teams will build pipelines to analyze large datasets within a cloud infrastructure.

Topics

  • Fast, federated indexing
    • BigQuery
  • Metadata features 
  • Genome graphs for viruses
  • Approximate taxonomic analysis
  • Domain/HMM Boundary and Taxonomic Refinement
  • Bringing together approximate taxonomy and domain models
  • Sequence data quality metrics
  • Phage-host interactions

The final list of projects will be announced before the codeathon starts and will build on work from previous NCBI codeathons.

Organization

After a brief organizational session, teams will spend three days addressing a challenging set of scientific problems related to a group of datasets. Participants will analyze and combine datasets in order to work on these problems. We will be writing code and solving problems.

Throughout the three days, we will break out to discuss progress on each of the topics, bioinformatics best practices, coding styles, etc.

Datasets

Datasets will come from public repositories, with a focus on metagenomics datasets in the Sequence Read Archive (SRA) that have been ported to cloud infrastructure, as well as contigs derived from those datasets.

Products

All pipelines and other scripts, software, and programs generated in this codeathon will be added to a public GitHub repository designed for that purpose (currently github.com/NCBI-Hackathons, but a new one may exist by the event).

Manuscripts describing the design and usage of the software tools constructed by each team may be submitted to an appropriate journal such as the F1000Research hackathons channel, BMC Bioinformatics, GigaScience, Genome Research or PLoS Computational Biology.  Ideally, we will present a searchable, streamlined virological index from these datasets on cloud infrastructure.

How To Apply

To apply, please complete this form (approximately 10 minutes to complete). Initial applications are due Monday, October 7th, 2019 by 3 pm ET. Participants will be selected based on the experience and motivation they provide on the form.

Prior participants and applicants are especially encouraged to apply. The first round of accepted applicants will be notified on October 8th by 11:59 pm ET, and have until October 11th at 4 pm ET to confirm their participation.  International applicants or those with particular skillsets may be accepted early. If you confirm, please make sure it is highly likely you can attend, as confirming and not attending prevents other data scientists from attending this event. Please include a monitored email address, in case there are follow-up questions.

Note: Participants will need to bring their own laptop to this program. A working knowledge of scripting (e.g., Shell, Python, R) is useful but not necessary to be successful in this event. Employment of higher level scripting or programming languages may also be useful. Applicants must be willing to commit to all three days of the event.

No financial support for travel, lodging or meals is available for this event. Also, note that the codeathon may extend into the evening hours each day. Please make any necessary arrangements to accommodate this possibility. Depending on the number of people that need accommodation, we will attempt to get a group rate at one of the local hotels. Please indicate on the registration form if you need a hotel room.

There will be no registration fee or cost associated with attending this event.

For more information, or with any questions, please contact Ben Busby (ben.busby@nih.gov).

Phage Identification

We are interested in phages, the viruses that infect bacteria. For years the Edwards lab has been looking for new, undiscovered phages.

Recently, we identified the crAssphage, a new type of virus that has never been seen before. By looking at the sequences in metagenomes we were able to identify a set of contigs that were common among many different metagenomes. When we assembled them, they looked like a phage. We could compare them to other known phages in our database of sequences.

Working with folks in the biology department we proved that this is a circular virus by using PCR. However, we have so far been unable to culture the virus in vivo. We’re working on it, and hopefully others are too, but until that point we don’t have an image of the virus or an idea of what it does.

Halophile Genome Sequencing

Together with the Eisen and Facciotti labs at UC Davis, we sequenced and annotated the genomes of eight halophilic Archaea. These are bugs that love to grow in environments with very high salt, and are often found in solar salterns, the crystallizer ponds where salt is dried out of seawater. The challenge for this project was that all of the sequence annotation and analysis was completed at the American Society for Microbiology’s 2008 annual meeting in Philadelphia, PA. We love these kinds of crazy challenges, and were able to annotate the data, make new discoveries and present them to the audience in hours. For more information, visit the halophile website.

Marine Sciences

The US-Brazilian Consortium for Marine Sciences is funded by the Department of Education through its Fund for the Improvement of Postsecondary Education (FIPSE), and the Fundacao Coordenacao de Aperfeicoamento de Pessoal de Nivel Superior (CAPES) from the Brazilian Ministry of Education. We’ve assembled a team of marine sciences researchers from San Diego State University and Scripps Institution of Oceanography, together with a team from the Federal University of Rio de Janeiro (UFRJ), the Universidade Federal de Pernambuco, and Universidade Federal da Paraíba, together with FIOCRUZ and the Rio de Janeiro Botanical Gardens. Together, we will develop a completely new marine sciences course to be held in Brazil in 2011 and 2012, and exchange students between San Diego, Rio de Janeiro, Pernambuco, and Paraíba.

Dark Matter

The viral dark matter is all the sequences that we find in metagenomes that we don’t know what to do with. In a project funded by the National Science Foundation, together with Dr. Forest Rohwer and Dr. Anca Segall in the SDSU Biology Department, and Dr. Alex Burgin, we will tackle some of this dark matter. We’re going to combine metagenomics, metaproteomics, metabolomics, and structural biology to unearth the functions of sets of genes whose roles are currently unknown.

Identifying Prophages in Bacterial Genomes

Finding prophages in microbial genomes remains a problem with no definitive answer. Most existing tools rely on detecting genomic regions enriched in proteins with known phage homologs, which hinders the de novo discovery of phage regions. In this study, a weighted phage detection algorithm, Phage_detector, was developed based on seven distinctive characteristics of prophages: protein length, transcription strand directionality, customized AT and GC skew, the abundance of unique phage words, phage insertion points, and the similarity of phage proteins. The first five characteristics can identify prophages without any sequence similarity to known phage genes. Phage_detector locates prophages by ranking genomic regions enriched in distinctive phage traits, which led to the successful prediction of 92% of prophages (including 33 previously unidentified prophages) in 95 complete bacterial genomes, with an 8% false negative rate and an 18% false positive rate.
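
As a rough illustration of two of the ideas above, the sketch below computes AT and GC skew over sliding windows and ranks regions by a weighted sum of feature scores. It is not the Phage_detector implementation: the skew formula used here is the standard (X - Y) / (X + Y) rather than the "customized" skew mentioned above, and the function names, window sizes, feature names, and weights are all hypothetical.

```python
def skew(seq, a, b):
    """Standard skew (count_a - count_b) / (count_a + count_b); 0.0 if neither base occurs."""
    seq = seq.upper()
    n_a, n_b = seq.count(a), seq.count(b)
    total = n_a + n_b
    return (n_a - n_b) / total if total else 0.0


def windowed_skews(genome, window=1000, step=500):
    """Yield (start, gc_skew, at_skew) for each sliding window over the genome."""
    last_start = max(len(genome) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = genome[start:start + window]
        yield start, skew(chunk, "G", "C"), skew(chunk, "A", "T")


def rank_regions(region_features, weights):
    """Rank region names by the weighted sum of their feature scores, highest first.

    region_features: {region_name: {feature_name: score}}
    weights:         {feature_name: weight}  (hypothetical values, not from the study)
    """
    scored = []
    for name, features in region_features.items():
        total = sum(weights.get(key, 0.0) * value for key, value in features.items())
        scored.append((total, name))
    return [name for _, name in sorted(scored, reverse=True)]


if __name__ == "__main__":
    for start, gc, at in windowed_skews("GGGCAAAT", window=4, step=4):
        print(start, gc, at)
```

In a real pipeline the per-region scores would come from all seven characteristics, and the weights would have to be tuned against genomes with known prophages.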

PHACTS: Phage Classification Tool Set

There are two distinct phage lifestyles: lytic and lysogenic. The lysogenic lifestyle has many implications for phage therapy, genomics, and microbiology; however, it is often very difficult to determine whether a newly sequenced phage isolate grows lytically or lysogenically just from the genome. Using the ~200 known phage genomes, a supervised random forest classifier was built to determine which phage proteins are important for determining lytic and lysogenic traits. A similarity vector is created for each phage by comparing each protein from a random sampling of all known phage proteins to each phage genome. Each value in the similarity vector is the highest similarity score that the corresponding sampled protein achieves against that phage genome. This vector is used to train a random forest to classify phages according to their lifestyle. To test the classifier, each phage is removed from the data set one at a time and treated as a single unknown (leave-one-out validation). The classifier correctly classified 188 of the 196 phages for which the lifestyle is known, giving the algorithm an estimated 4% error rate. The classifier also identifies the most important genes for determining lifestyle; in addition to integrases, which were expected to be important, the composition of the phage (capsid and tail) also determines the lifestyle. A large number of hypothetical proteins are also involved in determining whether a phage is lytic or lysogenic.
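
The similarity vector and the leave-one-out test described above can be sketched in a few lines. This is an illustration under stated assumptions, not the PHACTS code: a toy k-mer Jaccard score stands in for a real protein similarity score, and a 1-nearest-neighbour rule stands in for the random forest so the example needs only the standard library. All function names are hypothetical.

```python
def kmers(protein, k=3):
    """Set of overlapping k-length substrings of a protein sequence."""
    return {protein[i:i + k] for i in range(len(protein) - k + 1)}


def similarity(p, q, k=3):
    """Toy similarity: Jaccard index of k-mer sets (stand-in for an alignment score)."""
    a, b = kmers(p, k), kmers(q, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_vector(sampled_proteins, genome_proteins):
    """One value per sampled protein: its best score against any protein in the genome."""
    return [
        max((similarity(s, g) for g in genome_proteins), default=0.0)
        for s in sampled_proteins
    ]


def nearest_label(vector, training):
    """1-nearest-neighbour stand-in for the random forest; training is [(vector, label)]."""
    def squared_distance(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    return min(training, key=lambda item: squared_distance(vector, item[0]))[1]


def leave_one_out_error(vectors, labels):
    """Hold out each phage in turn, classify it from the rest, return the error fraction."""
    errors = 0
    for i, (vector, label) in enumerate(zip(vectors, labels)):
        training = [
            (v, l) for j, (v, l) in enumerate(zip(vectors, labels)) if j != i
        ]
        if nearest_label(vector, training) != label:
            errors += 1
    return errors / len(vectors)
```

With this scheme, 8 misclassified phages out of 196 gives 8 / 196, or about 4%, which is how the error rate above is estimated.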

Metagenome Sequence Matcher

Metagenome analysis spans a large range of different methods and tools in the bioinformatics community. These tools provide scientists with biological information present in a sequenced environmental sample, more specifically the genetic functions encoded in the DNA of the sampled metagenome. Most often those tools have been developed to compare a specific metagenome file against databases that are filled with sequences and annotation data.

This project performs a comparative analysis between multiple metagenomic FASTA files. By loading n-length pieces of the sequences from one file into a hash table, sequences from other metagenome files can be compared against it quickly and precisely. Finding similar sequences and structures across numerous metagenomes can give insight into which biological functions are shared between related and unrelated organisms.
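
A minimal sketch of the hash-table approach follows, assuming exact matching of n-length pieces and plain-text FASTA input. The function names and the default piece length of 12 are hypothetical choices for illustration, not details of the project's code.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def build_index(records, n=12):
    """Hash table mapping each n-length piece to the ids of sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in records:
        for i in range(len(seq) - n + 1):
            index[seq[i:i + n]].add(seq_id)
    return index


def shared_counts(index, records, n=12):
    """Count distinct n-length pieces shared by each (query id, indexed id) pair."""
    counts = defaultdict(int)
    for query_id, seq in records:
        seen = set()
        for i in range(len(seq) - n + 1):
            piece = seq[i:i + n]
            if piece in seen:
                continue  # count each distinct piece once per query sequence
            seen.add(piece)
            for hit_id in index.get(piece, ()):
                counts[(query_id, hit_id)] += 1
    return dict(counts)
```

To compare two files, one could pass an open file handle to `read_fasta` for each, build the index from the first, and call `shared_counts` with the records of the second. Each lookup is a constant-time dictionary access, which is what makes the comparison fast.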

Pangenomes

A project that started with the question, “how many microbial genes are there in the world?” has grown to potentially lead to answers to this and broader questions about the microbial universe. First, known taxa (E. coli) were organized into matrices, with strains as rows and proteins as columns. Hamming distances between the rows define a metric for organizing strains into phylogenetic trees. The phylogenetic distance is the importance of the split between the strains, or the alpha score, as it is referred to in the d-splits literature. This approach became our main focus when we attempted the same heuristic with viral data, with surprisingly strong results. At present, we are taking “pie slices” of the phage proteomic tree and seeing to what extent we can recreate the observed internal structure, as a “proof of concept” for viral applicability. Reading and work on splits trees, d-splits, and the consecutive ones property will drive the next developments. In addition, this coming week, on August 18th, our group will be attending a lecture on whole genome taxonomy, which should help drive further progress on our project.
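
The strain-by-protein matrix and the Hamming distance between its rows can be sketched as follows. The strain names and the presence/absence values in the usage example are made up for illustration, and the function names are hypothetical; building splits or trees from the resulting distances is not shown.

```python
from itertools import combinations


def hamming(row_a, row_b):
    """Number of protein columns in which two strains differ."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must cover the same set of protein columns")
    return sum(x != y for x, y in zip(row_a, row_b))


def pairwise_distances(matrix):
    """Map each pair of strain names (in sorted order) to its Hamming distance.

    matrix: {strain_name: [0 or 1 per protein column]}
    """
    return {
        (a, b): hamming(matrix[a], matrix[b])
        for a, b in combinations(sorted(matrix), 2)
    }


if __name__ == "__main__":
    # Illustrative presence (1) / absence (0) rows, not real data.
    strains = {
        "strain_1": [1, 1, 0, 1],
        "strain_2": [1, 0, 1, 1],
        "strain_3": [1, 1, 0, 0],
    }
    for pair, distance in pairwise_distances(strains).items():
        print(pair, distance)
```

The resulting pairwise distances are the input that a tree-building or split-decomposition method would then work from.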