Blog

COVID-19 didn’t cancel these summer internships

by David Brownlee

With economies locked down this past summer and companies reluctant to hire, internship opportunities looked rather grim for many students. Despite the shutdown of the McGill campus, the Canadian Centre for Computational Genomics (C3G) shifted gears and revved up for a summer of work-from-home internships.

For Solomia Yanishevsky to complete her Master’s of Health Bioinformatics at Dalhousie University, she either had to do an internship or write a thesis. Desiring practical experience, Solomia expanded her job search beyond Halifax, NS and landed at the Montreal node of C3G. After many companies backed out of their hiring intentions, she was one of only four of the 20 students in her program to find an internship position. Over the summer, Solomia worked with the Data Team to generate artificial FHIR datasets to successfully test a data ingestion algorithm. She also mapped the mCODE data standard to C3G’s Phenopacket metadata service to broaden its compatibility. As part of Genome Canada’s Canadian COVID Genomics Network (CanCOGen) project, Solomia worked on integrating live COVID-19 data elements from six different sources into a cohesive framework, a project that is as ongoing as the global health pandemic. Seeing her work put to use and understanding how it integrated into the bigger picture, Solomia embraced the responsibility. She found the work environment highly collaborative and engaging. Her coworkers were welcoming and her mentor was patient with her learning process.

Not every Queen’s University computer engineering student can tell the difference between DNA and RNA, but Soulaine Theocharides is the exception. Currently pursuing a secondary degree in biology, her summer work here at McGill (Queen’s University’s great rival) occupied the space at the intersection of the two fields. Working alongside the TechDev Team, Soulaine used computing clusters, Unix and Python to develop a searchable organizational structure for massive epigenomic datasets. Her project integrated databases and file hierarchies with genetic analysis pipelines. Soulaine rigorously documented her resilient data organizational structure so that it survives well beyond her internship.

McGill software engineering student Sebastian Ballesteros didn’t let his lack of genomic experience interfere with his summer. Instead, the novelty of genetics was a strong motivational factor for him. Leveraging previous internships working with computer graphics, Sebastian developed an online tool to visualize genomic variations as part of the Bento platform. He also transformed complicated genomic policy maps into a privacy assessment tool named D-Path (Data Privacy Assessment Tool), which allows data stewards worldwide to easily comply with regional regulations governing genetic data storage and use. Despite living in the McGill ghetto only a few blocks away from the C3G office, his internship was carried out entirely online. Though it was initially difficult to get a feel for his coworkers and associations, Sebastian found remote work good preparation for his current semester, which is being taught entirely online. He loves Montreal’s unique personality, multi-cultural diversity, bilingualism and Latin flair. With its strong university focus, good population density and available housing, he wouldn’t live in any other Canadian city.

After his school, Collège Ahuntsic, cancelled all their biotechnology laboratory technique internships, Étienne Collette’s summer was uncertain. During the previous semester at school, C3G Services Team Bioinformatics Manager François Lefebvre and C3G Bioinformatics Specialist Emmanuel Gonzalez had been guest lecturers, introducing his class to the computational side of bioinformatics. About to complete his DEC, during the three months when school was restricted to online theory classes, Étienne used the spare time to teach himself the C programming language. One thing led to another and a three-week placement at C3G turned into a four-month internship where Étienne engaged in no fewer than six different bioinformatics projects. Coming from an interdisciplinary background, he found the variety and concurrency stimulating. Along the way, he was supported by responsive feedback from the Services Team and daily updates with François. Étienne found purpose in the real-world detective work required by genuinely unsolved bioinformatics problems. He has just started studying biochemistry at l’Université de Montréal, where he intends to specialize in genetics.

C3G and the Bourque Lab at McGill University routinely hire students for summer internships. Many continue to work part-time beyond the summer as they complete their studies. We are a bilingual lab with people from many backgrounds, countries and skill-sets represented. Summer internships are posted around December with applications accepted until mid-February. Sometimes internships happen unexpectedly too. If you, or someone you know, is interested in learning beyond the classroom, drop us a line. See what happens, eh?

https://www.computationalgenomics.ca/internships/

Published: October 19, 2020


Making New Discoveries Using Public Data

by Audrey Baguette

Every research project is composed of three key elements: a question to answer, the analyses to perform and the data to use. Often, that last component is limiting. Indeed, producing new data is expensive and sometimes even time-consuming. Thankfully, a solution exists in the form of huge libraries accessible with a few clicks: public data.

Advantages of Using Public Data

Databases are full of useful data covering various techniques, technologies, and organisms. For example, the Encyclopedia of DNA Elements (ENCODE) harbors more than 17,000 datasets on human, mouse, worm and fly, from RNA sequencing to whole-genome sequencing through protein binding [1,2]. Not only is public data easily accessible and free, it may also be stored in both raw and pre-processed forms, reducing the time and cost of subsequent analyses (do not forget to perform quality controls first!). While more bioinformatics-inclined papers often use pre-existing datasets to compare tools, these datasets are otherwise greatly overlooked, either because we assume all that could be done with them has been done or because they do not have the exciting spark of novelty. But datasets may have been analyzed from only one angle and could still hold many secrets, even if they are a few years old. Additionally, the ever-growing performance of new algorithms may make it possible to extract information that was previously hidden in the data. For those reasons, it can be valuable to re-analyse public data, and this can lead to new discoveries.

How to Use Public Data Efficiently

Public databases contain information about many diseases, cell types, organisms and techniques, but that information is still limited to what has been explored before. One must therefore slightly change one’s approach to the data in order to find a new angle to analyze. Instead of the typical “formulate question -> how to answer the question -> produce data” workflow, the preparation requires scouting the existing datasets to find some that have potential for new discoveries. The analyses have to be centered around the available data rather than the opposite.

Example of New Discovery from “Old” Data

1- Formulate the research question

For my research project, I wanted to study the relationship between transcription and 3D conformation of the DNA in the nuclear space. Various studies tried to explore this before, some of the earliest dating from 1993 [3], but the mechanism is complex and there are still many unknowns.

2- Explore datasets

One of the most common diseases in humans is lung cancer. Because of its prevalence and mortality rate, it is also one of the most studied diseases, and thus the data produced is widely available. I therefore chose to use A549, a lung cancer cell line. Various data types have been generated from it (RNA-seq, ChIP-seq, Hi-C), making it possible to explore both the transcription events and the architecture in these cells. Moreover, being a cell line, it should have less cell-to-cell variability than cells coming from a patient biopsy.
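As an illustration of what scouting datasets can look like in practice, here is a minimal sketch that builds search URLs for A549 experiments. It assumes the datasets are hosted on the ENCODE portal cited above and that its search endpoint accepts the query fields shown; the field names (biosample_ontology.term_name, assay_title), the assay titles and the helper name build_search_url are assumptions for illustration and should be checked against the portal's own documentation. The script only prints URLs and makes no network request.

```python
from urllib.parse import urlencode

# Base address of the ENCODE portal search page (assumed endpoint).
ENCODE_SEARCH = "https://www.encodeproject.org/search/"


def build_search_url(cell_line, assay_title, limit=25):
    """Return a search URL for released experiments of one assay in one cell line.

    The query field names below are assumptions about the portal's schema.
    """
    params = [
        ("type", "Experiment"),
        ("biosample_ontology.term_name", cell_line),  # assumed field name
        ("assay_title", assay_title),                  # assumed field name
        ("status", "released"),
        ("limit", str(limit)),
        ("format", "json"),
    ]
    return ENCODE_SEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    # Assay titles are illustrative; the portal may label them differently.
    for assay in ("total RNA-seq", "TF ChIP-seq", "Hi-C"):
        print(build_search_url("A549", assay))
```

Opening the printed URLs in a browser, or fetching them with any HTTP client, is one way to get a quick inventory of what already exists before committing to a research angle.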

3- Adapt the angle of exploration

As many other studies, including the ones from which the data I used were produced, had tried to explore the inter-relation between transcription and 3D folding, a new angle had to be found. A literature review showed there were still many unknowns regarding the different types of boundaries limiting co-regulation between genes.

4- Discover!

A striking tendency seen while exploring the data was that the relative orientation of genes seems to influence their probability of co-regulation. We thus proposed a model stating simple “rules” that affect the probability of co-regulation of two genes in A549 cells (Figure 1). Genes located on the same strand have a very high chance of co-regulation, as the transcription machinery could just slide from one gene to the other. When genes are located on different strands, there is less chance of co-regulation, as the machinery would have to completely un-bind, then re-bind to the opposite strand. The change of strand thus introduces a type of co-regulation boundary. Finally, stronger boundaries that have been described before, such as TAD boundaries or the co-localization of Cohesin and CTCF, disrupt the probability of co-regulation more strongly. The tendencies that serve as the basis of the proposed model were all discovered using public data.

Figure 1: (A) Same-strand genes are very likely to be co-expressed, as the RNA pol II just needs to continue its path along the strand. Divergent and convergent genes are less likely to be co-expressed, as the RNA pol II would need to detach and reattach itself to go from one gene to the other. (B) When genes are separated by a barrier (CTCF and Cohesin or TAD boundary), there is complete disruption of co-expression.
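The qualitative rules of this model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Gene class, the function names and the three labels ("high", "reduced", "disrupted") are hypothetical, and the model described above gives no numeric probabilities, so none are used here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """A gene on a single chromosome (hypothetical minimal record)."""
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def orientation(a, b):
    """Classify a gene pair as same-strand, divergent or convergent."""
    first, second = sorted((a, b), key=lambda g: g.start)
    if first.strand == second.strand:
        return "same-strand"
    # Upstream gene on "-" and downstream gene on "+": transcription
    # proceeds away from the other gene on both sides (divergent).
    return "divergent" if first.strand == "-" else "convergent"


def coregulation_class(a, b, boundary_between=False):
    """Apply the model's rules as an ordinal label, not a probability."""
    if boundary_between:  # TAD boundary, or co-localized CTCF and Cohesin
        return "disrupted"
    return "high" if orientation(a, b) == "same-strand" else "reduced"


if __name__ == "__main__":
    g1 = Gene("geneA", 1_000, 5_000, "+")
    g2 = Gene("geneB", 8_000, 12_000, "-")
    print(orientation(g1, g2), coregulation_class(g1, g2))
```

Whether a boundary lies between two genes would have to come from separate annotations (for example TAD calls or ChIP-seq peaks), which is why it is passed in as a flag rather than computed.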

References:

  • 1. The ENCODE Project Consortium. An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature 489, 57–74 (2012).
  • 2. Davis, C. A. et al. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Research 46, D794–D801 (2018).
  • 3. Jackson, D. A., Hassan, A. B., Errington, R. J. & Cook, P. R. Visualization of focal sites of transcription within human nuclei. EMBO J 12, 1059–1065 (1993).

Posted September 2020


 

Disambiguating mixed-species graft samples

by Senthilkumar Kailasam

As a bioinformatician, I often get to work with patient-derived xenograft (PDX) cancer samples. I’ve recently been reading about samples containing genome admixture, and was revisiting the strategies that we commonly use for analyzing these biological data. Presented here is a summary of the existing software tools used for this purpose.

What are Xenografts?

Studying and understanding cancer is very challenging, and animal model systems help in addressing some of the common research bottlenecks. Mice harboring human cancer cells, also known as patient-derived xenograft (PDX) models, are an excellent model system available to researchers. A small number of cancer cells collected from a patient are injected into an immunocompromised mouse, where they grow to form tumors. PDX systems provide a controlled platform to study tumor biology and are especially useful for testing chemotherapeutic approaches.

Technical issue with xenograft samples

The grafted sample obtained from these mouse tumors can be subjected to NGS for genomics and transcriptomics studies. Despite meticulous efforts, it is difficult to prevent contamination of the graft samples with the host (mouse) stromal tissue, and the sequencing data obtained is usually contaminated with host DNA and RNA. This contamination could hinder correct interpretation of the data. Removing reads of host origin before downstream analysis becomes essential to ensuring accurate conclusions.

What are the methods available?

Various algorithms exist to separate host-derived reads from the rest of the sample. Almost all methods require the input as two or more BAM files: one aligned to the host genome and the other aligned to the graft genome. The type of aligner used also strongly influences the choice of algorithm. I was able to find four well-documented software packages. Most of these algorithms compare the quality of each read’s alignment to the two genomes and then assign the read to either host or graft; ambiguous reads are discarded. Some of the key features of these packages are highlighted in Table 1.

Table 1.

Package/software name | Compatible aligner | Comparison | Remarks | Multicore | Reference
Sargasso | Bowtie2, STAR | Multispecies | Custom filtering by threshold | Yes | [1]
Xenosplit | Subread, Bowtie2, Subjunc, TopHat2, BWA and STAR | Maximum two species | Goodness of mapping scores | No | [2]
Disambiguate | Hisat2, TopHat, BWA and STAR | Maximum two species | - | No | [3]
XenoCP | BWA | Maximum two species | Cloud-based | Yes | [4]
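The comparison step that most of these tools share (score each read against both genomes, keep the better assignment, discard ties) can be sketched as follows. This is a simplified illustration and not the implementation of any package in Table 1: the function names, the min_margin parameter and the use of a single numeric alignment score per read are assumptions made for the example.

```python
def assign_read(host_score, graft_score, min_margin=1):
    """Assign one read to 'host', 'graft' or 'ambiguous'.

    Scores are alignment scores where higher is better; None means the
    read did not align to that genome.
    """
    if host_score is None and graft_score is None:
        raise ValueError("read aligned to neither genome")
    if host_score is None:
        return "graft"
    if graft_score is None:
        return "host"
    if graft_score - host_score >= min_margin:
        return "graft"
    if host_score - graft_score >= min_margin:
        return "host"
    return "ambiguous"  # too close to call: such reads are discarded


def split_reads(host_scores, graft_scores, min_margin=1):
    """Group read names by assigned origin.

    Each argument maps read name -> alignment score for one genome.
    """
    groups = {"host": [], "graft": [], "ambiguous": []}
    for name in sorted(set(host_scores) | set(graft_scores)):
        label = assign_read(host_scores.get(name), graft_scores.get(name), min_margin)
        groups[label].append(name)
    return groups


if __name__ == "__main__":
    mouse = {"r1": 60, "r2": 10, "r3": 40}            # host alignments
    human = {"r1": 20, "r2": 55, "r3": 40, "r4": 50}  # graft alignments
    print(split_reads(mouse, human))
```

Raising min_margin makes the assignment more stringent, which trades sensitivity (fewer graft reads recovered) for specificity.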


I tried three of these four packages (XenoCP omitted) in an active project. I selected a sample dataset that had issues with poor read alignment to the reference genome (GRCh38). I chose this dataset to see whether the unaligned reads were of host origin, but that was not the case. I used the number of reads recovered (assigned as graft-origin) as the parameter to compare the tools. Sargasso and Xenosplit perform very similarly and are stringent in assigning reads to the graft. Disambiguate, the oldest program of those tried, gave a slight improvement compared to standard reference genome-based alignment alone. In the future, I plan to compare these packages using a synthetic dataset with known proportions of mouse and human reads. Until then, Sargasso and Xenosplit seem promising if you are interested in specificity rather than sensitivity.

Table 2.

Metric | Sample 1 | Sample 2 | Sample 3 | Sample 4
raw_reads | 187,299,238 | 140,927,692 | 328,021,530 | 342,910,322
trimmed_reads | 187,126,248 | 140,809,146 | 327,809,782 | 342,667,064
sargasso_Human (%) | 36,469,134 (19.49) | 261,686 (0.19) | 55,689,176 (16.99) | 18,074,244 (5.27)
xenosplit_Human (%) | 35,751,426 (19.11) | 361,651 (0.19) | 54,616,560 (16.99) | 18,028,870 (5.27)
disambiguate_Human (%) | 46,050,656 (26.61) | 12,384,076 (8.79) | 82,209,686 (25.09) | 41,830,148 (12.21)
GRCh38(unique) (%) | 51,406,614 (24.47) | 13,458,944 (9.56) | 83,642,740 (25.52) | 49,287,080 (14.38)

References:

  • 1. Qiu, J., et al., Mixed-species RNA-seq for elucidation of non-cell-autonomous control of gene transcription. Nature Protocols, 2018. 13(10): p. 2176-2199.
  • 2. Giner, G. and A. Lun, https://github.com/goknurginer/XenoSplit. 2019.
  • 3. Ahdesmaki, M.J., et al., Disambiguate: An open-source application for disambiguating two species in next generation sequencing data from grafted samples. F1000Res, 2016. 5: p. 2741.
  • 4. Rusch, M., et al., XenoCP: Cloud-based BAM cleansing tool for RNA and DNA from Xenograft. bioRxiv, 2020: p. 843250.

Posted August 2020.

 


 

 

The ANCHOR pipeline

ANCHOR is a high-resolution metagenomics pipeline, the result of a multi-disciplinary collaboration between C3G and researchers at the Institut de recherche en biologie végétale (IRBV) at Université de Montréal. Published in 2019 in the journal Environmental Microbiology, the pipeline demonstrated unprecedented accuracy in its characterization of microbiome samples [1]. Dr. Emmanuel Gonzalez, co-creator of the pipeline and the metagenomics specialist at C3G, was motivated by his observation that existing tools were not performing well on real-life datasets. “C3G’s strategy towards microbiome analysis had been to run commonly used pipelines, but the resolution would sometimes deliver foggy results that made the interpretation challenging. When I joined C3G, another pipeline was buzzing, using a machine learning algorithm to remove sequence noise – a characteristic of metagenomics samples. That was quite the thing back in 2015. It was clever, light and incredibly fast, except… I happened to realize that its accuracy was dropping significantly with real-world experiments involving multiple samples.”

In designing ANCHOR, Dr. Gonzalez and co-creator Dr. Nicholas Brereton resolved that the tool would be written with a strong focus on understanding the biology of metagenomics experiments. Dr. Gonzalez recalls that “before I even began writing down any code, we sat down together and set up some ground rules for this new pipeline, based on the respect towards the inherent complexity of any biological system and a reassessment of the capacity and ability of technical advances to handle biological complexity. I think this starting point is what made ANCHOR stand out amidst similar metagenomic pipelines. For example, our first rule went absolutely against the trend: let’s not modify sequences!”

Confined habitats

In the publication introducing ANCHOR, Drs. Gonzalez and Brereton reanalysed metagenomic data from surface swabs taken within the International Space Station (ISS). The ISS experiment had originally used the QIIME pipeline and concluded that there were no differences between the surface microbial ecosystems of the Destiny (US laboratory) and Harmony (crew sleeping quarters) modules. The reanalysis using ANCHOR substantially improved the scale of data capture as well as the accuracy and resolution of the findings, providing microbial classifications at the level of individual species. The reanalysis not only led to exciting novel discoveries regarding the ISS environment but also fundamentally changed the major conclusion of the experiment, with significant differences clearly identified between the modules (Figure 1). These significant differences detected by ANCHOR included increases in microbiome bacteria associated with the laboratory animals within the Destiny module, such as Helicobacter typhlonius, a species endemic to rodent research laboratories on Earth [1].

Figure 1. ISS Destiny and Harmony Module differential abundance. Significantly differentially abundant taxa, their ISS location and illustrations indicating their known associations.

“The researchers from the first analysis did a great job, but the pipeline they used limited their ability to analyse what was inside the with precision,” adds Emmanuel.

Human microbiome health

The ANCHOR pipeline was designed to improve data accuracy in high-complexity real-world systems, and its applicability quickly moved from the ISS to terrestrial health concerns. “We transitioned this approach directly to the field of human health later in 2019 after Dr. Amir Minerbi, a medical doctor from the Alan Edwards Pain Management Unit, showed great interest to use our pipeline to characterise the bacterial microbiota of women suffering of fibromyalgia”, says Emmanuel. That collaboration led to the discovery of the first microbiome association with chronic pain, through comparative analysis of the gut microbiomes of healthy women and fibromyalgia patients [2]. Patients with fibromyalgia showed significant increases in species such as Clostridium scindens and Butyricicoccus desmolans, which are bottlenecks in secondary bile production (via 7α-dehydrogenase activity) as well as in bacterial-derived androgens (20α- and 20β-hydroxysteroid dehydrogenase activity), which may interact with the endocrine system (Figure 2). These important discoveries have sparked a new research direction in the pain field, and the research has already been recognised around the world.

Long-term Space Travel

At the end of 2019, the team behind ANCHOR responded to a grant call made by the Canadian Space Agency to promote projects on Health & Life Science data and sample mining. They proposed to analyse astronaut microbiomes using improved metagenomics to study the impact of long-term space travel upon astronaut health. “We’re grateful to the Canadian Space Agency for giving us the chance to apply high resolution microbiome analysis within the space sciences again and particularly excited to see whether there is an observable impact of long term confinement on the microbiome health of astronauts.”


1. Gonzalez E, Pitre F, Brereton N. ANCHOR: A 16S rRNA gene amplicon pipeline for microbial analysis of multiple environmental samples. Environmental Microbiology. 2019.
2. Minerbi A, Gonzalez E, Brereton NJ, Anjarkouchian A, Dewar K, Fitzcharles M-A, et al. Altered microbiome composition in individuals with fibromyalgia. Pain. 2019.


Learn more about ANCHOR
Author: Emmanuel Gonzalez

Science Quebec article: Read now

Posted: July 22 2020



The Biobanque Québécoise de la COVID-19 (BQC19) is a province-wide biobank whose mission is to provide high-quality data and samples to the scientific and medical community in order to better understand, combat and limit the impact of the coronavirus disease 2019 (COVID-19).

This initiative is led by Dr. Vincent Mooser, Director of its Executive Committee and Canada Excellence Research Chair in Genomic Medicine at the Faculty of Medicine at McGill University. It was made possible with the contribution and involvement of inter-institutional and multidisciplinary teams, including the McGill Genome Center and C3G.

The BQC19 is available to researchers across the country and around the world. Since having access to high-quality data and samples is essential to win the war against the COVID-19 pandemic, the biobank is committed to the principles of Open Science to make all data accessible to qualified researchers.

Visit: bqc19.ca
McGill article: read more

 

 



COVID-19 Resources Canada

C3G has been part of a group of volunteer researchers, students, activists and web developers, led by Guillaume Bourque of McGill University and Tara Moriarty of the University of Toronto, who created COVID-19 Resources Canada, a website to facilitate the sharing of information, expertise and resources in the fight against COVID-19. The group aims to “Serve as a reliable source of information and expertise for COVID-19 research in Canada; Support & facilitate coordination of Canadian COVID-19 research efforts; Support COVID-19 capacity-building in public health, research and grassroots initiatives.” The initiatives include a database of volunteers; a tool for sharing reagents used by clinicians and researchers; and a compilation of all active Canadian research into COVID-19 and funding opportunities, among others.
visit COVID-19 Resources Canada

 

 


 

 

COVID-19 updates at C3G

We would like to provide a quick update amidst the coronavirus pandemic. While our host institutions (McGill University, SickKids Toronto) have put various measures in place to ensure the safety of our staff, our platform remains fully operational and available to support genomics research. Researchers should therefore not hesitate to contact us and inquire about our analysis and data management services, seek help to plan experiments, get support for a grant application or request free consultations with our bioinformaticians.

If your lab is conducting or planning to conduct COVID-19 related genomics research, our platform can help. Contact us at info@c3g.ca.

Please remember to stay safe and healthy.

The C3G Team



Infant Glioma: Characterizing the landscape of genetic drivers and their clinical impact

Recently published in Nature Communications, this paper presents work by members of the C3G Toronto node that integrates genomic and transcriptomic analyses to assess the molecular and clinical features of infant glioma patients. Examining single nucleotide variants, changes in copy number, fusion formation and other transcriptomic analyses revealed three clinical glioma subgroups in infants, each with distinct genetic drivers, locations in the brain and responses to treatment. Gliomas in infants have substantially different treatment outcomes compared to those that occur in children and adults, yet little is understood about the molecular basis of these differences. This paper gives a comprehensive molecular analysis of infant gliomas to help ascertain the biological mechanisms driving their oncogenesis and to help guide future diagnostics and treatment approaches for these patients.


Methylation signatures investigations

The C3G Toronto node has also been involved in two projects to investigate methylation signatures for specific conditions. We have recently published a manuscript in BMC Medical Genomics in which we describe specific DNA methylation signatures for Nicolaides-Baraitser syndrome (NBS), a rare childhood condition that affects physical features and intellectual ability (Chater-Diehl et al., 2019). We showed that specific methylation patterns are associated with pathogenic variants of the NBS causal gene, SMARCA2, which encodes the catalytic domain of a chromatin remodeling complex. We have also identified DNA methylation signatures associated with autism spectrum disorder risk loci, in work recently published in Clinical Epigenetics (Siu et al., 2019). We show that methylation signatures can be used to identify and distinguish individuals with specific autism-associated mutations and can help determine whether specific gene variants are pathogenic or benign, improving autism diagnostics.


 

Genetics & Genomics Analysis Platform: version 2

 

We are pleased to announce that the new version of GenAP is now available. This release offers a completely re-engineered platform that leverages Cloud resources at Compute Canada, and will eventually be deployed on other HPC resources as well. It already offers two types of applications: Data Hubs (as in GenAP1) and a new graphical “File Browser” that allows file transfers to and from your workspaces. A new Galaxy application, including up-to-date tools and pipelines, will eventually be added.


 

ForCasT: a fully integrated and open source pipeline to design CRISPR mutagenesis experiments

ForCas Tool (ForCasT) is a comprehensive tool for the design, evaluation and collection of CRISPR/Cas9 guide RNAs and primers. Using robust parameters, it generates guide RNAs for target loci, assesses their quality for any potential off-target effects and designs associated primers. The results are then stored in a local database that serves as a shared resource for users within a research team, and is constantly being updated to reflect the quality of guides and primers based on additional computational and wet-lab results. ForCasT is a single tool that research teams from various fields of biology can use to build and maintain a collection of guide RNAs and primers for Cas-mediated genome editing that are suited to their specific needs. It is currently available as a web-app and as a Dockerized version, and can be found at https://github.com/ccmbioinfo/CasCADe



Last summer was a productive summer for our members in terms of publications!
Read more about highlighted publications

GenPipes: an open-source framework for distributed and scalable genomic analyses

It started in June with the publication of our beloved GenPipes framework and set of NGS data analysis pipelines in GigaScience. We use these pipelines on a daily basis for data production and routine analysis and hope the community will find them useful. While GenPipes is the product of several years of teamwork, kudos to co-first authors Mathieu Bourgey and Rola Dali who worked very hard to get this long-awaited paper out!

Read paper

 


 

 

Altered microbiome composition in individuals with fibromyalgia

September 2019

This summer has also been very special for C3G’s metagenome specialist Emmanuel Gonzalez with a publication in Pain highlighting a strong potential link between the microbiome and fibromyalgia, a terrible and elusive disorder affecting a large fraction of the population. This study drew quite a nice amount of media attention, notably from the Montreal Gazette and the CBC. We are very proud to say that through Emmanuel, C3G provided first-rate analysis services for experimental design, species identification, statistical analysis, machine learning and finally for working hand-in-hand with Drs. Minerbi and Brereton on biological interpretation. Importantly, Emmanuel applied ANCHOR here, a method he also published earlier this year, which enables the identification of microbial species at a higher resolution than other common 16S sequencing data analysis methods.

Read paper

 


Altered differentiation is central to HIV-specific CD4+ T cell dysfunction in progressive disease

Another noteworthy publication to which C3G members contributed as authors was published in Nature Immunology this summer. An important focus of the study was the comparison of HIV-specific CD4+ T-cell subpopulations from patients who have undergone antiretroviral therapy and patients who spontaneously suppress HIV viral load below detectable limits (a.k.a. elite controller patients). This comparison contributes to an understanding of why viral control is lost once antiretroviral therapy is interrupted.

Read Paper


GSoC 2019 is over!

Again this year, C3G was a Google Summer of Code organization. For people unfamiliar with it, GSoC is, in Google’s own words, “a global program that matches students up with open source, free software and technology-related organizations to write code and get paid to do it!”

We would like to thank participating students this year for their contributions.

Jiahuang Lin (TBD) – Human history and genome evolution
Konstantinos Kyriakidis (AUTh) – Batchtools for Compute Canada
Madhav Vats (IIIT Delhi) – Flowchart creator for GenPipes
Pranav Tharoor (MAHE) – MiCM Project Match
SriHarshitha Ayyalasomayajula (KMIT) – GenPipes single-cell pipeline


Tip of the Month

Introduction

There is a huge library of common bioinformatics software available on Compute Canada resources via the modules maintained by C3G staff and distributed via the CernVM-File System (CVMFS). Despite the breadth of the C3G CVMFS library, there may be times when using the provided software isn’t ideal.

For example:

  • you might want to use software that we haven’t yet made available via CVMFS, and you don’t want to repeatedly install it at each HPC facility
  • you might want to guarantee comparability of results by running exactly the same software stack on Compute Canada, your workstation, or on infrastructure from a cloud provider such as Amazon Web Services or Google Cloud
  • there is a more recent version of the software already available in a container. Listings of existing images are available from community efforts such as biocontainers, but images might also be built directly from the source repository.

In circumstances such as these, containers offer an excellent solution by packaging up your software and its dependencies into a single image that contains all the software needed for a particular analysis or workflow.

The process for running containerized software on Compute Canada can be described in three steps:

  • Ensure singularity is available
  • Download a container
  • Run your containerized software

Step 1: Ensure singularity is available

At all Compute Canada facilities, singularity is available as a module. Loading the module is as simple as running “module load singularity”.

If you’re running singularity on your Linux laptop or workstation, installation instructions are available in the Singularity documentation.

Step 2: Download a container

Many software stacks are already available as Docker images at repositories such as Docker Hub or Quay.io. Unfortunately, running Docker on shared clusters introduces potential security vulnerabilities. Fortunately for us, Singularity can use Docker images to build new singularity containers. For example, let’s say that we wanted to run the genometools suite. The biocontainers repository shows me that the latest version (1.5.10) is available at quay.io/repository/biocontainers/genometools-genometools as a Docker image. To download the image to my Compute Canada instance, I can run “singularity pull” with the image’s docker:// URI.

This produces a singularity image (“.sif”) file in the current directory.

Step 3: Run your containerized software

To run the genometools suite from inside the new container, prepend your command with “singularity exec” followed by the path to the image file.

That’s it! You have a perfectly reproducible software stack running without needing to worry about installation or dependencies.
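The three steps can be summarized in a short sketch that assembles the command lines and prints them, so they can be reviewed before being pasted into a shell. Several details are assumptions rather than part of the original post: the output file name genometools.sif, the placeholder TAG (to be replaced with the exact tag listed by biocontainers), the pull URI derived from the Quay.io page address above, and the tool invocation “gt -help”. Nothing is executed by this script.

```python
import shlex


def load_module_command():
    """Step 1: make singularity available (shell command on the cluster)."""
    return ["module", "load", "singularity"]


def pull_command(sif_path, repository, tag):
    """Step 2: build a .sif image from a Docker image in a registry."""
    uri = f"docker://{repository}:{tag}"
    return ["singularity", "pull", sif_path, uri]


def exec_command(sif_path, tool_argv):
    """Step 3: run a command inside the container image."""
    return ["singularity", "exec", sif_path, *tool_argv]


if __name__ == "__main__":
    sif = "genometools.sif"  # hypothetical output file name
    steps = [
        load_module_command(),
        # "TAG" is a placeholder, not a real tag.
        pull_command(sif, "quay.io/biocontainers/genometools-genometools", "TAG"),
        # "gt -help" is an assumed example invocation of the genometools binary.
        exec_command(sif, ["gt", "-help"]),
    ]
    for argv in steps:
        print(shlex.join(argv))
```

Keeping the commands as argument lists makes it straightforward to hand the second and third to subprocess.run later; “module load” is a shell function on most clusters, so it is normally typed in the login shell or job script rather than launched from Python.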

Next Steps and Getting Help

As you might imagine, there are plenty of details we don’t have time to cover in this short blog post. If you’d like to learn more, or if you’re having trouble, there are plenty of ways to find help.

  • The Compute Canada wiki has an excellent page on running containers on their infrastructure (en/fr)
  • The Singularity docs are the definitive guide
  • The C3G has a weekly open door session to which you are welcome to bring questions about containers and reproducible bioinformatics analyses.

 

Why Better Data Sharing Means Better Health Care

The future of personalized medicine is dependent on data sharing, according to Yann Joly, Research Director of the Centre of Genomics and Policy, and Guillaume Bourque, Director of the Canadian Centre for Computational Genomics.

Using big data techniques to analyze the function of human genes is already helping develop treatments tailored to individual patients. The more data researchers can access from across the world, the better the chances of treating even rare diseases. But privacy and consent regulations differ by country, making sharing this information across borders slow and frustrating.

Learn more