Anesthesia-related Tweets during COVID-19
Twitter has become a social media nexus for sharing information, and in an Anesthesia & Analgesia Journal article published in April 2021, members of C3G’s Toronto Node at SickKids examined how Twitter was used to share anesthesia-related information during the COVID-19 pandemic. Of more than 240 million English-language Tweets related to COVID-19 published from January to October 2020, just over 23,000 (0.01%) concerned anesthesiology. Among this subset, the most frequent topics included airway management, personal protective equipment, and COVID-19 testing. More than half of the Tweets contained a hyperlink, typically to scientific journal articles, news publications, anesthesia associations, or other social media websites. Average daily postings generally paralleled COVID-19 incidence and death rates in the US, with weekends typically seeing reduced activity compared to weekdays. While this study primarily reflects North American Twitter usage, the results suggest that most anesthesiology-related information disseminated via Twitter comes from largely reputable sources, and they give insight into how social media may be used to effectively share and discuss accurate scientific information in times of global crisis.
A Coordinated Progression of Progenitor Cell States Initiates Urinary Tract Development
Oraly Sanchez-Ferras from the Goodman Cancer Research Centre recently published a paper in Nature Communications.
The paper reveals how tissue formation proceeds through a progression of distinct progenitor states. This understanding is important for generating replacement organs in regenerative medicine, and also for identifying regulators of progenitor progression that drive tissue morphogenesis and, potentially, cancer progression.
This work was made possible with the contribution of Alain Pacis and Mathieu Bourgey from C3G, who performed the single-cell analyses.
Whole Genome STR Analysis
About Short Tandem Repeats
In early 2021, Jeffrey Hyacinthe, another student here in Guillaume Bourque’s lab, wrote about repetitive sequences, focusing on transposable elements (TEs). Here I will discuss another type of repetitive sequence: short tandem repeats (STRs), also known as microsatellites. Unlike TEs, STRs do not move around the genome; they consist of multiple copies of short motifs 2-6 base pairs long (Gymrek et al., 2012; 2016). The number of repeats at a given locus can vary within an individual or across a population. With an estimated ~700 000 STRs in the human genome (Gymrek et al., 2016), they are quite common. STRs are the main topic of my M.Sc. thesis research, which I began last fall in the Bourque lab.
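To make the definition concrete, here is a toy Python sketch (not any published tool) that scans a sequence for perfect tandem repeats of 2-6 bp motifs; real STR callers additionally handle impure repeats, overlapping motifs and genome-scale input:

```python
def find_strs(seq, min_motif=2, max_motif=6, min_copies=3):
    """Naively scan a sequence for perfect short tandem repeats.

    Returns (start, motif, copies) tuples for runs of a motif repeated
    at least `min_copies` times. Toy illustration only.
    """
    hits = []
    i = 0
    n = len(seq)
    while i < n:
        best = None
        for k in range(min_motif, max_motif + 1):
            motif = seq[i:i + k]
            if len(motif) < k:
                continue
            copies = 1
            while seq[i + copies * k: i + (copies + 1) * k] == motif:
                copies += 1
            # keep the longest run (in bases) found at this position
            if copies >= min_copies and (best is None or copies * k > best[2] * len(best[1])):
                best = (i, motif, copies)
        if best:
            hits.append(best)
            i = best[0] + best[2] * len(best[1])  # skip past the run
        else:
            i += 1
    return hits

# Four copies of CAG starting at position 2:
print(find_strs("TTCAGCAGCAGCAGGG"))  # [(2, 'CAG', 4)]
```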
STRs have been associated with a variety of disease phenotypes such as Huntington’s Disease (HD) and Fragile X Syndrome (FXS), as well as other effects like expression variation (Gymrek et al., 2016). In HD, an expanded, in-frame C-A-G repeat in the protein-coding HTT gene triggers the onset of the disorder, typically in the individual’s 30s or 40s (MedlinePlus, 2020; see Table 1). HD is associated with motor dysfunction and cognitive impairment, eventually resulting in death. Only one copy of the pathogenic allele is needed to cause disease onset.
| Number of Repeats | Phenotype |
| --- | --- |
| 36-39 | Potential onset of HD |
| 40+ | Almost guaranteed onset of HD |
Table 1: Repeat counts of CAG in the HTT gene (GRCh38 coordinates: chr4:3074876-3074933) and associated phenotypes. See MedlinePlus, 2020.
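The thresholds in Table 1 translate directly into a trivial lookup. A sketch (counts below 36 fall outside the ranges shown in the table, so they are labelled generically here; see MedlinePlus, 2020 for the full clinical categories):

```python
def htt_phenotype(cag_copies):
    """Map a CAG repeat count in HTT to the phenotype ranges of Table 1.

    Counts below 36 are outside the table's pathogenic ranges and are
    labelled generically in this sketch.
    """
    if cag_copies >= 40:
        return "Almost guaranteed onset of HD"
    if 36 <= cag_copies <= 39:
        return "Potential onset of HD"
    return "Below pathogenic ranges in Table 1"
```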
Researchers have put significant work into annotating repetitive regions in current reference genomes. Tandem Repeats Finder, originally published in 1999 by Gary Benson, is used to catalog ‘simple’ repeats in reference genomes; these annotations can be used by other tools as targets for genotyping or expansion detection (Mousavi et al., 2019).
Even with annotations, repetitive DNA is difficult to analyze with the genomic technologies currently available to us. One of the de-facto standard sequencing technologies in 2021 is the Illumina short-read sequencing platform (Amarasinghe et al., 2020), which typically produces paired reads up to ~150 base pairs long (Illumina, 2020). When a repetitive region is longer than a read, an obvious problem emerges (Figure 1).
Figure 1: Types of read pairs from Illumina sequencing and how they align to different repeat sizes. Taken from Dolzhenko et al., 2020. The diagram shows that for repetitive sequences longer than 150 bp, no reads flank both ends of the repeat, and we must rely on approximate methods using in-repeat reads (IRRs) to determine repeat count. Paired IRRs can potentially map to multiple regions in the reference genome, since they contain no anchoring sequence.
Long stretches of the same motif do not necessarily uniquely occur at only one position in the human genome (Dolzhenko et al., 2017). Without broader context of the surrounding regions, reads from one locus can potentially map to other loci, reducing their usefulness.
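The spanning constraint can be stated as simple arithmetic: a single read anchors a repeat only if it covers the whole repeat plus some unique flanking sequence on each side. A simplified check (the 10 bp minimum flank is an illustrative assumption, not a value from any tool):

```python
def read_can_span(repeat_len, read_len=150, min_flank=10):
    """Return True if a single read can cover the full repeat plus at
    least `min_flank` unique bases on each side (simplified model)."""
    return repeat_len + 2 * min_flank <= read_len

print(read_can_span(100))  # True:  100 + 20 <= 150
print(read_can_span(140))  # False: 140 + 20 >  150
```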
Adding more complexity to the calling process, we are primarily diploid creatures: most repeat loci are present in two copies in the human genome. If these loci are long, or if sequencing coverage in a region is low, calling can become ambiguous, and we may experience allelic dropout or other difficulties determining a precise genotype (Gymrek et al., 2012; 2016; Mousavi et al., 2019).
Other techniques that show promise for STR genotyping are based on long-read sequencing technologies such as Oxford Nanopore (ONT) or Pacific Biosciences (PacBio) SMRT sequencing. Long reads are typically on the order of kilobases long (Amarasinghe et al., 2020), giving information on the overall repeat structure within each read and eliminating the read-mapping issues presented above, except in extreme cases. This comes at the cost of higher sequencing prices, generally lower coverage, and higher error rates (Amarasinghe et al., 2020). Comparatively few tools have been developed for targeted genotyping and genome-wide scanning of STRs using long-read data.
Technologies to ascertain STRs
Specific tools are used to profile STR variation and call genotypes. When calling an STR genotype, we typically use the number of repeats of the motif unit to represent a particular allele. For the HTT repeat, a genotype of 20/35 would represent 20 and 35 copies of CAG, or 60 and 105 bases, on the two copies of chromosome 4 in the individual.
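As a quick worked example of this representation (an illustrative helper, not part of any tool), converting a repeat-count genotype into allele lengths in bases is just multiplication by the motif length:

```python
def genotype_bases(genotype, motif_len):
    """Convert a repeat-count genotype string like '20/35' into allele
    lengths in base pairs, given the repeat motif length."""
    return tuple(int(allele) * motif_len for allele in genotype.split("/"))

# HTT CAG repeat (motif length 3): 20/35 -> 60 and 105 bases
print(genotype_bases("20/35", 3))  # (60, 105)
```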
Some of the earliest of these computational methods operating on NGS reads were released in the early 2010s. One of the first examples is lobSTR (Gymrek et al., 2012), which was designed to work on high-throughput sequencing data of the era and was validated against traditional DNA electrophoresis techniques. Other STR-focused tools introduced over the past decade include hipSTR, Tredparse, and STRetch. These tools rely on repeat catalogs and generally either cannot determine repeat counts for STRs longer than an Illumina read, or report only expansions (Mousavi et al., 2019).
More recently, tools like ExpansionHunter (EH; Dolzhenko et al., 2017) and gangSTR (Mousavi et al., 2019) have introduced new techniques which yield a greater number of higher-quality calls, faster performance, and confidence intervals for repeat regions longer than the reads themselves. These techniques use a mix of anchored/flanking reads and in-read repeats (see Figure 1) to both precisely call shorter STRs and call longer STR regions in a ‘fuzzy’ manner.
An alternative approach to using short reads to analyze long STR regions was introduced by the authors of EH in 2020 under the name ExpansionHunter Denovo (EHDn; Dolzhenko et al., 2020). This tool is not a caller at all, as it does not attempt to discern the number of repeats at a given locus. Instead, it takes advantage of patterns in in-repeat read (IRR) quantities (Figure 1) to look for outliers or case/control distribution differences. This means that EHDn only functions with regions longer than the read length, relying on the fact that pathogenic repeat alleles are often expansions, and generally long (Dolzhenko et al., 2020). This approach means a catalog of regions is unnecessary, which facilitates discoveries that would be impossible with other tools.
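As a conceptual sketch of the outlier idea (this is not EHDn's actual statistical model), one could flag samples whose IRR count at a locus is extreme relative to the rest of a cohort:

```python
from statistics import mean, stdev

def irr_outliers(irr_counts, z_cutoff=3.0):
    """Flag samples whose in-repeat-read (IRR) count at a locus is an
    outlier relative to the cohort, via a simple z-score.

    `irr_counts` maps sample name -> IRR count at one locus.
    Conceptual sketch only; EHDn's actual statistics differ.
    """
    counts = list(irr_counts.values())
    mu = mean(counts)
    sigma = stdev(counts)
    if sigma == 0:
        return []
    return [s for s, c in irr_counts.items() if (c - mu) / sigma > z_cutoff]

# A cohort of 30 samples with 2-3 IRRs, plus one sample with 40:
cohort = {f"s{i}": 2 + (i % 2) for i in range(30)}
cohort["case"] = 40
print(irr_outliers(cohort))  # ['case']
```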
The performance of these approaches has a direct impact on their usefulness and applicability. gangSTR is much faster than EH (Mousavi et al., 2019), and as a result includes a catalog of ~830 000 loci versus EH’s catalog of only 30 loci. This allows gangSTR to be used for whole-genome profiling, something which is computationally infeasible with EH, and facilitates new association discoveries and non-targeted STR exploration studies.
Genome-wide scanning tools have also been developed for long-read sequencing data. RepeatHMM-scan (Liu, Tong, & Wang; 2020) can use ONT or PacBio data to scan a sample’s whole genome and estimate repeat counts. Testing it myself, however, I found that it is still too slow to scan a whole sample within the same order of magnitude of time as gangSTR.
We are at a point where precise whole-genome STR profiles can be computed for short STRs, with repeat counts becoming fuzzier as regions get longer. Given that longer STRs tend to be associated with pathogenic phenotypes, further work will be required to elucidate the complete link between tandem repeats and putative phenotypic effects in humans.
Where do we go from here?
Despite a variety of approaches being available for STR analysis, no one solution currently approaches a hypothetical “ideal” tandem repeat caller which quickly and accurately resolves the genotypes of any (short to very long) tandem repeat region.
Currently, existing tools are unable to precisely determine long STR repeat counts or are not fast enough to do this across a whole genome sequencing dataset. An ideal tool would combine high accuracy and precision of these repeat counts with extremely fast performance, for use in discovery and association studies across all STR loci in the genome.
A perfect STR caller would not require a pre-existing catalog, permitting the identification of novel expansions and allowing for easy migrations between reference genomes. Catalogs can be incomplete, especially since the current standard reference genome (GRCh38) has large unresolved regions which are only recently starting to be addressed (see Nurk et al., 2021 preprint).
Any high-quality tool analyzing biological data needs to be extensively validated with real-world datasets. An unfortunate reality of tandem repeats is that there is currently a lack of “ground truth” datasets to validate against (Mousavi et al., 2019), so it is essential to fully use what is available to us (existing tools, simulated reads, forensic panels, capillary electrophoresis, etc.) and to develop better validation approaches.
Mousavi et al. also suggest that a hybrid approach utilizing both short- and long-read data may yield the most accurate assessment of STRs, taking advantage of the additional structural context long reads provide and short reads’ typically lower error rate. This could take the form of a completely new tool, or it could be a consensus-based approach which uses the varied strengths of existing methods.
As part of my project with Dr. Bourque here at the McGill Genome Centre, I am exploring existing STR tools and how they can be used to characterize a Québec cohort. I hope to develop new techniques to better understand this form of genetic variation.
- 1. Gymrek, M. et al. lobSTR: A short tandem repeat profiler for personal genomes. Genome Res., 2012. https://dx.doi.org/10.1101/gr.135780.111
- 2. Gymrek, M. et al. Abundant contribution of short tandem repeats to gene expression variation in humans. Nat. Genet., 2016. https://dx.doi.org/10.1038/ng.3461
- 3. Huntington disease. MedlinePlus, 2020. https://medlineplus.gov/genetics/condition/huntington-disease/
- 4. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res., 1999. https://dx.doi.org/10.1093/nar/27.2.573
- 5. Dolzhenko, E. et al. ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data. Genome Biology, 2020. https://dx.doi.org/10.1186/s13059-020-02017-z
- 6. Dolzhenko, E. et al. Detection of long repeat expansions from PCR-free whole-genome sequence data. Genome Res., 2017. https://dx.doi.org/10.1101/gr.225672.117
- 7. Mousavi, N. et al. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res., 2019. https://doi.org/10.1093/nar/gkz501
- 8. Illumina. Maximum read length for Illumina sequencing platforms. 2020. https://emea.support.illumina.com/bulletins/2020/04/maximum-read-length-for-illumina-sequencing-platforms.html
- 9. Liu, Q., Tong, Y., and Wang, K. Genome-wide detection of short tandem repeat expansions by long-read sequencing. BMC Bioinformatics, 2020. https://dx.doi.org/10.1186/s12859-020-03876-w
- 10. Nurk, S. et al. The complete sequence of a human genome. bioRxiv preprint, 2021.
- 11. Amarasinghe, S. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biology, 2020.
Published: July 28 2021
New study analyzes different mechanisms in the establishment of sex-phenotype dependent methylation in mouse livers
By Hector Galvez and Qinwei Zhuang
AlOgayil, N., Bauermeister, K., Galvez, J.H. et al. Distinct roles of androgen receptor, estrogen receptor alpha, and BCL6 in the establishment of sex-biased DNA methylation in mouse liver. Sci Rep 11, 13766 (2021). https://doi.org/10.1038/s41598-021-93216-6
Recently, using mice with different combinations of genetic and phenotypic sex, we were able to identify sex-associated differentially methylated regions (sDMRs) that depended on the sex phenotype (see this related study). However, the mechanisms behind the establishment of those differentially methylated regions are still being studied. In a new paper by Najla AlOgayil from Prof. Anna Naumova’s lab, including contributions from members of our own lab group, we show that androgen receptor (Ar), estrogen receptor alpha (Esr1), and the transcriptional repressor Bcl6 each play a distinct role in the process.
Focusing on a panel of validated sex-phenotype dependent male- and female-biased sDMRs, we tested the developmental dynamics of sex bias in liver methylation and the impacts of mutations in these three genes of interest. Our data show that sex bias in methylation either coincides with, or follows, sex bias in the expression of sDMR-proximal genes, suggesting that sex bias in gene expression may be required for demethylation at certain sDMRs. Briefly, AR and ESR1 influence DNA methylation through their impacts on gene expression, whereas BCL6 not only regulates expression, but also serves as a bridge between expression and demethylation at intragenic regions.
For these analyses, we relied on pyrosequencing methylation assays to compare DNA methylation levels in female and male livers at three different ages, representing fetal (E14.5), prepubertal (4 weeks), and adult (8 weeks) life. Additionally, we analyzed motif enrichment in all sex-phenotype dependent sDMRs using HOMER. Finally, to understand the relationship between sex-biased methylation and sex-biased expression, we studied the expression profiles of sex-biased genes using RNA-seq data from the livers of knockout mice strains for the genes Esr1 and Bcl6, and compared them to control groups using the GenPipes RNA-seq pipeline.
Taken together, the data we published serve as evidence that sex phenotype-dependent autosomal DNA methylation levels in the mouse liver depend on not one, but several factors, including AR, ESR1, and BCL6. However, more research is needed to further elucidate how multiple signaling pathways affecting gene regulation, including steroid sex hormones, can modify methylation, at least in the early ages when methylation patterns are being established.
Published: July 28 2021
How do jumping genes contribute to human diversity?
In recognition of Transposons Day 2021 (June 16th), let’s take a look at the importance of active mobile genomic elements (once called “junk DNA”) to human health and disease.
Illustration: ANDRZEJ KRAUZE
We are hosts to genomic hitchhikers
Of the three billion base pairs of the human genome, more than half consist of a heterogeneous crowd of repeated sequences called transposable elements (TEs, or transposons). These genomic hitchhikers have fascinated researchers since their discovery in corn by Barbara McClintock in the late 1940s (McClintock was eventually awarded the Nobel Prize for her discovery… in 1983). As a matter of fact, TEs can indeed “jump”, or transpose, from one locus to another, sometimes affecting the expression of nearby genes.
TE jumps are not benign, since new insertions can disrupt genes and other crucial regulatory features. Indeed, TEs can cause disease: in germline cells, TE insertions can cause sporadic conditions, while in somatic cells, TE jumps can lead to cancer and are associated with aging and neurodegeneration.
Fortunately, most of the TEs present in our genomes today are nothing more than DNA fossils, inert witnesses of past waves of transposition. Most TEs in the human genome have either accumulated inactivating mutations or remain silent under the control of epigenetic defense mechanisms.
Because TEs carry their own genes (transposases, reverse transcriptases, envelopes) and regulatory sequences (promoters, binding motifs), many have also been domesticated (or “co-opted”) throughout the evolution of eukaryotes. In the human lineage, this “cherry-picking” of useful TEs has led to important innovations, for example in the immune system.
Given all these examples, the fate of a new TE insertion is often difficult to predict. In humans, the overwhelming majority of TEs are fixed in the global population, meaning that most TE loci are shared between humans – most, but not all!
So, what are the consequences of contemporary TE activity in humans? How many are there? What do they do? Why does it matter? Hang on!
Active Transposable Elements generate structural variation
In humans, a small number of elements belonging to the Alu, LINE1 and SVA superfamilies (groups of homologous copies) escape the host defenses and retain their “jumping” capabilities. These TEs belong to the class of retrotransposons (Class I), since they all rely on the reverse transcription of their RNA to colonize new loci.
Active TEs in humans are estimated to generate ~1 new insertion every 50 births, contributing to more than 40,000 known insertion polymorphisms (and even this is most likely a large underestimate). These structural variants recapitulate the genetic diversity of the human population, as can be observed with single nucleotide polymorphisms (SNPs).
Figure 1. TE insertion polymorphism in humans accounts for tens of thousands of variants.
If TE insertions present further analogies to SNPs, then a subset of TE variants should also be crucial in the regulation of gene expression. Understanding how TEs affect genomic regulation in humans is decisive, considering that a growing body of evidence ties TE insertion polymorphisms to medical conditions.
Recent research, as discussed below, suggests that polymorphic TE insertions create functional variation among genomes and actively contribute to the emergence of new phenotypes.
New TE insertions can modulate gene expression
Expression QTL identifies hundreds of potential regulatory TEs
One popular approach to investigating the link between genomic variation and gene expression is to perform population-wide analyses called molecular QTL (quantitative trait loci) mapping. This new generation of QTL analyses searches for statistical connections between genotypes and molecular phenotypes throughout the genome. Molecular phenotypes are continuous variables typically drawn from -omics data, such as gene expression (eQTL), splice variant quantification (sQTL), chromatin accessibility (caQTL) or methylation (meQTL).
By swapping SNPs for TEs, researchers have recently been able to apply this framework to insertional polymorphisms of Alu, LINE1 and SVA among human genomes. The result? Hundreds of “TE-QTL”, where the presence of a given TE is statistically correlated to the expression of a nearby gene. First reported by Wang et al. (2017) in immune cell lines (LCLs), the GTEx consortium recently showed that TE-QTL, like SNPs, can either promote tissue- and organ-specific expression or display organism-wide (housekeeping) effects on gene expression. In addition, an exploration of TE-QTL in 44 post-mortem tissues showed that TE variants can also generate splice variation.
Figure 2. Recent findings using TE-eQTL. A Manhattan plot from Wang et al., (2017) displaying the significance of hundreds of Alu, L1 and SVA TE-eQTL, according to TE genomic location. B Heatmap of TE-eQTL (blue-red) and gene expression (green-pink) correlations across 44 tissues from the GTEx consortium. Adapted from Cao et al., (2020).
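The core statistical test behind an eQTL is simple: regress expression on genotype dosage. A minimal sketch of a single TE-eQTL test (illustrative only; real pipelines add covariates, normalization, permutations and multiple-testing correction):

```python
import numpy as np

def te_eqtl_assoc(genotypes, expression):
    """Test association between TE insertion genotypes (0/1/2 copies)
    and a gene's expression level via ordinary least squares.

    Returns (slope, r): the effect size per insertion allele and the
    Pearson correlation. Minimal sketch, not a production eQTL test.
    """
    g = np.asarray(genotypes, dtype=float)
    e = np.asarray(expression, dtype=float)
    slope, _intercept = np.polyfit(g, e, 1)  # fit e = slope*g + intercept
    r = np.corrcoef(g, e)[0, 1]
    return slope, r

# Expression rises by ~1 unit per insertion allele in this toy cohort:
slope, r = te_eqtl_assoc([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 3, 3])
print(slope, r)
```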
Leveraging multi-omics data to infer mechanisms
To understand how new TE insertions modulate nearby gene expression, I took advantage of the TE-QTL framework to layer epigenomic data and examine whether gene regulation by TEs could occur through chromatin remodeling. Using ATAC-seq, I tested this hypothesis by applying both expression and chromatin accessibility QTL (e- and ca-QTL) in LCLs derived from the 1000 Genomes Project.
This study found that hundreds of TEs are statistically associated with chromatin accessibility in humans, and that a subset of these elements also affect the expression of local genes. A great example of this relationship is the case of MAP3K13, an up-regulator of the proto-oncogene c-Myc for which both chromatin accessibility and gene expression are reduced in the presence of an AluYb8 insertion within an annotated enhancer. Though further investigation is needed, this example illustrates the potential of TEs to provide protective alleles by generating epigenomic variation.
Figure 3. A candidate AluYb8 insertion in an enhancer of MAP3K13 is correlated with reduced chromatin accessibility at three ATAC-seq peaks mapped to TSS and CTCF sites (ATAC peaks 1, 2 and 3; blue boxplots), as well as reduced expression of the gene as a whole (as seen by RNA-seq; purple boxplot). Adapted from Goubert et al., (2020a).
The importance of genotype quality
A critical aspect in all association studies (studies relying heavily on data correlations such as GWAS or QTL) is the quality of the genotypes. Bias in the initial genotyping can lead to missed or false positive signals during functional analyses.
While most genotyping algorithms rely on likelihood ratios after mapping reads onto a reference genome (an approach well-suited for SNPs), structural variants like TEs are represented only by presence or absence alleles. Given that the majority of datasets (large cohorts in particular) still rely on reads shorter than a typical TE (active human TEs range from 300 bp to 6,000 bp), there is a pressing need to improve TE genotyping.
Figure 4. Errors in TE genotyping reduce the ability to detect TE-eQTL. In this example, only 1 homozygote for the TE insertion (“2”) was detected with a method based on a single reference genome (left, Sudmant et al., 2015). Correction using a composite reference genome made of pairs of presence/absence alleles for each locus (centre) enhances the ability to detect a correlation between genotypes and ALS2 expression, as recapitulated by a SNP in linkage disequilibrium (right).
At the McGill Genome Centre, we develop new methods to improve TE genotyping and obtain a better understanding of their effects in humans:
- With short reads, TE genotypes can be improved by remapping reads over a composite genome made of pairs of “presence” and “absence” alleles using the linear reference genome as a background (project homepage, in development).
- If long reads and alternate genome assemblies are available, then personalized and graphed genomes are the next avenue to explore.
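As a schematic of the composite-genome idea above (illustrative only; the project's actual implementation may differ), each TE locus contributes a “presence” allele (flanking sequence plus the TE) and an “absence” allele (flanking sequence only) for reads to be remapped against:

```python
def composite_alleles(ref, te_seq, ins_pos, flank=500):
    """Build presence/absence allele sequences for one TE insertion locus.

    ref:     reference chromosome sequence (TE absent in this sketch)
    te_seq:  the TE insertion sequence
    ins_pos: 0-based insertion point within `ref`
    flank:   bases of flanking context to keep on each side

    Returns (presence, absence): the locus with and without the TE.
    """
    left = ref[max(0, ins_pos - flank):ins_pos]
    right = ref[ins_pos:ins_pos + flank]
    absence = left + right
    presence = left + te_seq + right
    return presence, absence

# Tiny example: insert "TTT" at position 4 of an 8 bp "chromosome":
print(composite_alleles("AAAACCCC", "TTT", 4, flank=4))
# ('AAAATTTCCCC', 'AAAACCCC')
```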
- McClintock B., The origin and behavior of mutable loci in maize – Proceedings of the National Academy of Sciences Jun 1950, 36 (6) 344-355; DOI: 10.1073/pnas.36.6.344
- Payer, L.M., Burns, K.H. Transposable elements in human genetic disease – Nat Rev Genet 20, 760–772 (2019). DOI: 10.1038/s41576-019-0165-8
- Burns, K.H., 2017. Transposable elements in cancer. Nature Reviews Cancer, 17(7), pp.415-424.
- Andrenacci, D., Cavaliere, V. and Lattanzi, G., 2020. The role of transposable elements activity in aging and their possible involvement in laminopathic diseases. Ageing research reviews, 57, p.100995.
- Jönsson, M.E., Garza, R., Johansson, P.A. and Jakobsson, J., 2020. Transposable elements: a common feature of neurodevelopmental and neurodegenerative disorders. Trends in Genetics.
- Smit, A.F., Riggs, A.D. Tiggers and DNA transposon fossils in the human genome – Proceedings of the National Academy of Sciences Feb 1996, 93 (4) 1443-1448; DOI: 10.1073/pnas.93.4.1443
- Slotkin, R., Martienssen, R. Transposable elements and the epigenetic regulation of the genome. Nat Rev Genet 8, 272–285 (2007). DOI: 10.1038/nrg2072
- Cosby, R.L., Judd, J., Zhang, R., Zhong, A., Garry, N., Pritham, E.J. and Feschotte, C., 2021. Recurrent evolution of vertebrate transcription factors by transposase capture. Science, 371(6531).
- Chuong, E.B., Elde, N.C. and Feschotte, C., 2017. Regulatory activities of transposable elements: from conflicts to benefits. Nature Reviews Genetics, 18(2), p.71.
- Feusier, J., Watkins, W.S., Thomas, J., Farrell, A., Witherspoon, D.J., Baird, L., Ha, H., Xing, J. and Jorde, L.B., 2019. Pedigree-based estimation of human mobile element retrotransposition rates. Genome research, 29(10), pp.1567-1577.
- Watkins, W.S., Feusier, J.E., Thomas, J., Goubert, C., Mallick, S. and Jorde, L.B., 2020. The Simons Genome Diversity Project: a global analysis of mobile element diversity. Genome biology and evolution, 12(6), pp.779-794.
- Cao, X., Zhang, Y., Payer, L.M., Lords, H., Steranka, J.P., Burns, K.H. and Xing, J., 2020. Polymorphic mobile element insertions contribute to gene expression and alternative splicing in human tissues. Genome biology, 21(1), pp.1-19.
- Goubert, C., Zevallos, N.A. and Feschotte, C., 2020a. Contribution of unfixed transposable element insertions to human regulatory variation. Philosophical Transactions of the Royal Society B, 375(1795), p.20190331
- Li, H., 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), pp.2987-2993.
- Goubert, C., Thomas, J., Payer, L.M., Kidd, J.M., Feusier, J., Watkins, W.S., Burns, K.H., Jorde, L.B. and Feschotte, C., 2020b. TypeTE: a tool to genotype mobile element insertions from whole genome resequencing data. Nucleic acids research, 48(6), pp.e36-e36.
- Groza, C., Kwan, T., Soranzo, N., Pastinen, T. and Bourque, G., 2020. Personalized and graph genomes reveal missing signal in epigenomic data. Genome biology, 21, pp.1-22.
Published: June 21 2021
‘CHANGE’ through the eyes of a Business Analyst
By Mary Ann Kizhakechethipuza
“There is nothing more difficult to take in hand, more perilous to conduct or more uncertain in its success than to take the lead in the introduction of a new order of things.”
– Niccolò Machiavelli
The Business Analyst Body of Knowledge (BABOK) describes business analysis as the “practice of enabling change in an enterprise by defining needs and recommending solutions that deliver value to stakeholders”. The major focus for a business analyst during the lifetime of a project is ensuring that the team is equipped with the tools needed to move the project forward. A business analyst’s ability to understand, communicate, model and navigate “change” is instrumental in channelling the team’s efforts towards project success.
In the organizational context, change can be planned and modelled, but it is usually associated with greater unpredictability, risk and ambiguity. The impact of pursuing change can be positive or negative, and negative impacts may not become evident until the later stages of implementation. This raises the question of why so much importance is placed on the ability to change and adapt.
The author Braden Kelley, in his book “Charting Change” observes that “while there is risk to change, just like with innovation, there is often potentially more risk associated with doing nothing.” Progress in technology, shifts in consumer behaviour and even political movements have created a rate of external change which has in turn triggered internal change in organizations and fuelled the speed of innovation in many businesses, across industries and over the globe. Although it could be argued that the intention behind participating in this race as a means to stay ahead of the competition and the changing environment is “survival”, it is also worth noting that, propelled by innovation, the same rate of change has brought several improvements to the general quality of life and the quality of services provided by industries like Healthcare and Information Technology.
Twenty years ago, “work from home” was a term uncommon among the workforce, but today, amidst the pandemic, it is the new norm. The latest developments in pharmaceuticals and drug discovery have enabled companies to formulate and bring to market highly potent vaccines against the vicious COVID-19 virus, all in record time. Other organizations like the Canadian Centre for Computational Genomics (C3G) are one step ahead and have adopted the mission of providing cutting edge support in bioinformatics analysis and high performance computing services for the life science research community through “leveraging innovation”.
One good example in this case would be COVID19 Resources Canada, an online platform to coordinate pandemic response efforts at a national level. The idea for this platform started as a casual conversation on Twitter between two great minds, which quickly evolved into a website aiming to serve the public and the research community as an information hub, a platform to coordinate research efforts, and a tool for COVID-19 public health capacity building.
Another project worth mentioning in this regard is the Bento platform. Compliant with GA4GH standards, it allows users to ingest, organize, store, retrieve and navigate genetic -omics datasets and associated clinical/phenotypic metadata. I have been fortunate to work on both of these projects and have had the opportunity to witness firsthand the positive impact they have brought to the expert and non-expert community through their approach of catering and adapting to the changing needs of the external environment.
Recent trends in the external environment have reinforced the fact that change is inevitable. Companies with financial and organizational agility have a better chance to quickly adapt to changing market conditions. The same applies at the individual level, where, in addition to agility, the attitude and ability to embrace change will prove a necessary foundation for navigating the chaos.
- Kelley, Braden. (2016). Charting Change: A Visual Toolkit for Making Change Stick. Palgrave Macmillan
- International Institute of Business Analysis. (2015). Business Analysis Body of Knowledge (BABOK V3)
Published: June 21 2021
McGill researchers awarded Azrieli Science Grant for RNA and the brain
Dr. Wayne Sossin, Professor at The Montreal Neurological Institute, McGill University, Montreal, has been awarded a grant through the Azrieli Science Grants Program as principal investigator for a research project titled “Determining how stalled polysomes are generated for transport in RNA granules and regulated local translation”. He will collaborate on this project with Dr. Joaquin Ortega and Dr. Guillaume Bourque, also of McGill University, and Dr. Christos G. Gkogkas of IMBB-FORTH in Greece. The total budget awarded is $450,000 over 3 years.
Local translation in the axons and dendrites of neurons is critical for wiring up a functioning nervous system. Neuronal RNA granules are densely packed clusters of mRNA and ribosomes that transport mRNAs for local translation at synapses. Mutations in proteins necessary for the function of neuronal RNA granules lead to miswiring and neurodevelopmental disorders. Learning how these granules are generated and structured, and how they function, is necessary for developing strategies to ameliorate the neurodevelopmental disorders caused by their dysregulation.
This project will determine how protein translation (the process of creating proteins from mRNA) in the nervous system differs in neurodevelopmental disorders.
The Azrieli Foundation carries forward the philanthropic legacy of David J. Azrieli. The foundation’s mission is to empower people by supporting a broad range of organizations, facilitating innovative outcomes, and increasing knowledge and understanding in the search for practical and novel solutions.
Dr. Senthilkumar Kailasam from C3G will act as the primary staff member working on this project. C3G is happy to contribute genetic computational support to this undertaking.
Astronaut gut microbiome alterations induced by long-duration confinement
Throughout long-duration spaceflight, maintaining astronaut health is crucial to the feasibility of a manned mission to Mars. In the longest controlled human confinement study conducted to date, the ground-based Mars500 experiment investigated the health effects of long-duration missions by isolating six astronauts for 520 days.
A study by Emmanuel Gonzalez from C3G and his collaborators from the University of Montreal, Nicholas Brereton and Frédéric Pitre revealed that crew members who took part in the Mars500 experiment showed significant changes in their gut microbiota from their 520 days in confinement.
Published: May 28, 2021
McGill and Genome Canada announce new Canadian SARS-CoV-2 Data Portal
The development and implementation of the new Canadian SARS-CoV-2 Data Portal will be led by McGill University’s Dr. Guillaume Bourque, Professor in the Department of Human Genetics, and Director, Canadian Center for Computational Genomics. The Data Portal will manage and facilitate data sharing of anonymous viral genome sequences among Canadian public health labs, researchers and other groups interested in accessing the data for research and innovation purposes.
Architecture beyond buildings
Hi there! Could you please tell us a little bit about yourself and what you do at C3G?
My name is Ksenia Zaytseva and I am a Data Architect within the Data Team at C3G. I have a Master’s degree in Information Science. Before C3G I worked in a variety of academic research disciplines building data and metadata management systems mostly for their Open Data initiatives.
In the Data Team, we develop various tools and services for data infrastructures. We organize experimental metadata and patients’ clinical and phenotypic data following international data standards from the biomedical and genomics health domains. We also build systems and user interfaces for exploring the data and making large-scale genomics data analysis possible.
What kind of projects are you involved in?
One of the main projects I am part of is the Bento portal. The Bento portal is a platform for sharing and exploring –omics data. My role on this project is to develop clinical and phenotypic data management services based on the existing standards in genomics, healthcare and information science. Bento is a suite of microservices where each microservice addresses a specific problem. The advantage of this approach is that depending on a project’s specific requirements, each microservice can be used separately and plugged into other software architectures. I have been working on the Katsu service – it’s an API service with a database backend used to store phenotypic metadata about patients and/or biosamples and their related genomic and disease information. The service is partly based on the Phenopackets GA4GH standard. It also stores experiment metadata, administrative metadata about the dataset itself (e.g. provenance, access rights) and reference resources (e.g. what ontologies and controlled vocabularies are used to annotate the data).
We aim to implement Bento as a generic platform for various projects in genomics. This approach, and our data model’s adoption of standards, enables us to set up a project portal and to transform and import the project-specific data relatively quickly. It also provides the possibility for integrative and federated data analysis in the future. Currently, Bento is deployed in several projects, among them iCHANGE, Signature and BQC19.
Another project I am involved in is the Canadian Distributed Infrastructure for Genomics (CanDIG). Similarly to Bento, I am working on a clinical metadata service using the OMOP data model. Besides genomics projects, I am also a part of the Canadian Open Neuroscience Portal (CONP) project. The CONP is an open data portal for datasets and pipelines in neuroscience. I developed and maintain the metadata validation tool. When each dataset is submitted to the portal, the tool checks if it contains all the required data descriptions, for example, information about its creators or the license under which the dataset has been made available. I have also worked on implementing semantic web technologies within CONP, such as making its metadata available in Google dataset searches and providing SPARQL endpoint access. I am currently working on integrating CONP terms into the Neuroscience Information Data Model.
What do you enjoy most about your work?
Besides developing as a technology professional, my favorite part of my work at C3G is that I get to learn a lot about human genomics, different sequencing technologies and methods, and how other things work in healthcare and biomedical research. I find it personally very interesting as a general context for my work.
Ksenia tells us about data and metadata standards in the health domain and some of the challenges in that field.
All standards or data models can be divided into two groups: first, those that apply to the data itself and, second, those that apply to the metadata (the data about the data). The first group includes definitions and relationships among biomedical and health concepts – for example, the definitions of Individual, Biosample/Specimen, Condition, how those elements are related to each other and what properties they have. The standards I am working with are GA4GH Phenopackets for phenotypic data in genomics, OMOP common data model for observational medical data, HL7 FHIR – healthcare records exchange standard and mCODE data elements for oncology-related data.
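As a concrete illustration of the kind of record such a standard defines, here is a minimal Phenopackets-style document sketched in Python. The field names follow the Phenopackets v2 style, but the record is simplified for illustration and is not a schema-compliant Phenopacket.

```python
import json

# An illustrative record in the spirit of the GA4GH Phenopackets standard:
# a subject, a phenotypic feature annotated with a Human Phenotype Ontology
# (HPO) term, and a biosample annotated with an UBERON anatomy term.
phenopacket_like = {
    "id": "example-phenopacket-1",
    "subject": {"id": "patient-1", "sex": "FEMALE"},
    "phenotypicFeatures": [
        {"type": {"id": "HP:0002090", "label": "Pneumonia"}}
    ],
    "biosamples": [
        {"id": "biosample-1",
         "sampledTissue": {"id": "UBERON:0002048", "label": "lung"}}
    ],
    # Administrative metadata: which reference resources annotate the data.
    "metaData": {
        "resources": [
            {"id": "hp", "name": "Human Phenotype Ontology"},
            {"id": "uberon", "name": "Uber Anatomy Ontology"},
        ]
    },
}

serialized = json.dumps(phenopacket_like, indent=2)
```

A service like Katsu stores and queries structures of this shape, which is what makes data described with the same ontology terms comparable across projects.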
The second group includes metadata standards and models for describing meta-information about the data. For example, provenance describes how and when the data was created or collected (e.g. by a person, a machine or software). Also described are the creators or authors of the data, applicable access rights, where the data is stored (e.g. a repository or archive) and what the data is about. I work specifically with the DATS model for dataset descriptions as well as schema.org and W3C standards (e.g. PROV-O).
Besides data standards there are many reference resources (ontologies and controlled vocabularies) and databases in the biomedical field that we use to reconcile our data descriptors. Some of these ontologies are SNOMED-CT, Human Phenotype Ontology (HP), National Cancer Institute Thesaurus (NCIT) and Uber Anatomy Ontology (UBERON), among others.
What are the challenges of working with health science data?
The main challenge is that there is no one-size-fits-all data model satisfying every use case. Most projects have their own systems and data requirements tied to their research questions and goals. Using data/metadata standards and interoperability guidelines allows us to bring data together by identifying common data elements, which facilitates large-scale data aggregation, analysis and new findings. That’s why it’s important to build a community of researchers, clinicians and data experts who pool their knowledge and expertise to develop common data solutions that cover various use cases and better prepare us for unforeseen challenges, such as the current pandemic.
Published: April 26, 2021
In a Bioinformatician’s Shoes
Meet Hector in this short video interview as he shares his passion as a bioinformatician and the exciting challenges he faces at C3G in the era of COVID-19. One of his many projects tracks COVID-19 mutations and variant evolution in near real-time to better defend against the pandemic.
Matching single cells to a reference cell type
By Maxime Caron
For more than a decade it has been possible to profile the transcriptome of single cells, and numerous analytical methods have emerged over these years. C3G is currently undertaking one such single-cell transcriptomic project. A useful analysis technique has been to project individual cells onto a cell type reference to find the most similar cell type for unlabeled and/or malignant cells. One simple approach uses R and two dimensionality reduction algorithms (PCA and UMAP) as well as k-nearest neighbours (kNN) to project cells onto a cell type reference and assess the accuracy of these projections. This approach focuses on the independent processing of the training and test sets to limit over-fitting and showcases the superior performance of UMAP compared to PCA in this application.
A set of 10,000 cells from nine cell types is used. The PCA and UMAP representations of these cells are shown below, and we observe a more refined representation using UMAP.
The projection procedure starts by iteratively randomly splitting the dataset into 70% training and 30% test sets. For each iteration we:
1) Normalize for sequencing depth and log transform the expression counts on the training and test sets independently.
2) Feature select (most variable genes) on the training set and intersect these features in the test set.
3) Run PCA on the standardized expression values of the training set and apply the trained PCA object to the test set to obtain the principal components of each cell.
Three iterations of this procedure using PCA only are shown below. The training cells appear in grey and the test cells appear coloured. We assign the projected cell type of each test cell using k-nearest neighbours (e.g., k=10) on the PC1 and PC2 values. In this example, the accuracy of the projections is defined as the overall fraction of cells that are correctly identified. Using PCA we observe low projection accuracy for some cell types and only moderate overall accuracy, due to its coarse representation.
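Steps 1–3 and the kNN assignment can be sketched with plain NumPy on synthetic counts. The cell types, gene counts and parameters below are invented for illustration (the actual analysis was done in R on real data), and the UMAP step is omitted since it requires the separate umap-learn package.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression matrix: 300 cells x 50 genes drawn from three synthetic
# "cell types" with different mean expression profiles.
labels = np.repeat(np.arange(3), 100)
means = rng.uniform(1, 20, size=(3, 50))
counts = rng.poisson(means[labels])

# 70/30 train/test split.
idx = rng.permutation(len(labels))
train, test = idx[:210], idx[210:]

def normalize_log(c):
    # Step 1: depth-normalize each cell, then log-transform.
    depth = c.sum(axis=1, keepdims=True)
    return np.log1p(c / depth * 1e4)

Xtr = normalize_log(counts[train])
Xte = normalize_log(counts[test])

# Step 2: select the most variable genes on the training set only,
# then keep the same genes in the test set.
var_genes = np.argsort(Xtr.var(axis=0))[-20:]
Xtr, Xte = Xtr[:, var_genes], Xte[:, var_genes]

# Step 3: fit PCA (via SVD) on standardized training data, then apply the
# training set's centering, scaling and loadings to the test cells.
mu, sd = Xtr.mean(axis=0), Xtr.std(axis=0) + 1e-8
_, _, Vt = np.linalg.svd((Xtr - mu) / sd, full_matrices=False)
pcs_train = ((Xtr - mu) / sd) @ Vt[:2].T
pcs_test = ((Xte - mu) / sd) @ Vt[:2].T

# Assign each test cell the majority label among its k nearest training
# cells in PC space, and compute the overall projection accuracy.
k = 10
d = ((pcs_test[:, None, :] - pcs_train[None, :, :]) ** 2).sum(-1)
nn = np.argsort(d, axis=1)[:, :k]
pred = np.array([np.bincount(labels[train][n]).argmax() for n in nn])
accuracy = (pred == labels[test]).mean()
```

Fitting the normalization, feature selection and PCA on the training set alone, as above, is what keeps the test-set accuracy an honest estimate.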
We then generate a UMAP representation of the training set using the previously computed principal components and use the trained UMAP object to get the UMAP coordinates of the test cells:
4) Run UMAP on the training set using the top principal components of PCA (e.g., 30).
5) Obtain the UMAP coordinates of the test cells using the trained UMAP object and the principal components of the test cells generated in step 3.
Three iterations of this procedure are shown below. This time we assign the projected cell type of each test cell using k nearest neighbors on the UMAP1 and UMAP2 values.
Using UMAP, we observe more defined cell type clusters and an increase in overall projection accuracies. This process can be repeated enough times to obtain a distribution of either cell-type-specific values or overall accuracy values.
- Tang, F., Barbacioru, C., Wang, Y., Nordman, E., Lee, C., Xu, N., … & Surani, M. A. (2009). mRNA-Seq whole-transcriptome analysis of a single cell. Nature Methods, 6(5), 377-382.
- R Core Team (2013). R: A language and environment for statistical computing.
- Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559-572.
- McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
- Fix, E. (1951). Discriminatory analysis: Nonparametric discrimination, consistency properties. USAF School of Aviation Medicine.
Published: February 23, 2021
Don’t Skip the Repeats
Junk DNA: vestigial remains or the genome’s dark matter? For a long time, genetic repeats and transposable elements were characterized as such – useless, nuisances and unknowable. There was some sense to that vision. Once the 20 000 genes of the human genome were identified, these being the parts that actually did something, what could explain all the rest? Current research points towards regulation. There are promoters, enhancers and sequences contributing to structural integrity, but what about all the repeated sequences? In recent years, the body of evidence for the role and importance of repeats has grown considerably such that they should no longer be ignored.
What are repeats?
Repetitive sequences include simple repeats and transposable elements (TEs). These genomic sequences are duplicated throughout the genome and encompass more than 50% of the human genome. Simple repeats are short sequences repeated in succession, while TEs are DNA sequences that are able to relocate within the genome. The two main methods of transposition define the two TE classes. Retrotransposons transpose through a “copy and paste” mechanism whereby DNA is transcribed to RNA and that RNA is reverse-transcribed elsewhere within the genome. DNA transposons use a “cut and paste” mechanism where a DNA intermediate is used1. Their various origins, along with the mutations they undergo, have led to a complex phylogeny of TE families and subfamilies, each with its own properties and features. Through these two mechanisms, TEs spread multiple copies of themselves throughout the genome, most having lost the ability to be expressed or further transposed.
Why were they overlooked?
While DNA sequences that hop around might seem like obvious elements of interest, it is worth noting that their repeated presence makes their analysis a challenge. Most sequencing approaches rely on mapping reads to a reference genome. Since repeated fragments map to multiple locations, TEs cannot be appropriately placed and are often discarded2.
In addition, TEs transpose simply because they can. Where they end up could disrupt normal gene function, but for the most part they do not affect anything and become degraded by genetic drift. In summary, TEs are mostly silent genomic sequences that do not code for relevant host genes, degrade over time and challenge our current genomic analysis approaches. They are not the most intuitive elements, are they?
Why are they worth considering?
Current approaches building upon databases such as RepeatMasker3 enable increasingly accurate TE measurements that reveal their involvement in regulatory activity. In fact, the evidence of their role in regulation is so strong that instead of doubting TEs’ utility, the question is now how wide-reaching their influence is. The most definitive impact of TEs is co-option, the integration of TEs as part of the host regulatory genome. For example, a MER41 TE contributed an interferon-inducible enhancer to the absent in melanoma 2 (AIM2) gene, which regulates inflammation in response to viral infection. It has also been found that some TEs can still be expressed, and their transcripts interact as non-coding RNA, which can lead to regulation of distant genes4. Furthermore, TE insertions are not random, owing to their preferences for various genome features and compartments1. Thus, TEs can be associated with other features of the genome such as the epigenome. LINE-1 TEs have been found to be hypomethylated in cancer, and without methylation TEs are more likely to be expressed. The resulting increase in expression could be used as a cancer biomarker and lead to clinical applications5. In some of my own preliminary work, I find that TEs tend to be differentially represented across cell and tissue types in histone mark ChIP-seq, suggesting that TE involvement in regulation may be cell type specific. These examples are only a few of the many ways in which TEs have shaped, and continue to shape, our genome.
It is worth remembering that even if repeats account for the majority of the genome, they are not the answer to everything. They remain largely outside of genes and are mostly inactive. However, they could also be the overlooked component that just might explain your latest genetic discovery.
- 1. Bourque, G. et al. Ten things you should know about transposable elements. Genome Biol. 19, (2018).
- 2. Goerner-Potvin, P. & Bourque, G. Computational tools to unmask transposable elements. Nat. Rev. Genet. 19, 688–704 (2018).
- 3. Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0. (2013).
- 4. Chuong, E. B., Elde, N. C. & Feschotte, C. Regulatory activities of transposable elements: from conflicts to benefits. Nat. Rev. Genet. 18, 71–86 (2017).
- 5. Ardeljan, D., Taylor, M. S., Ting, D. T. & Burns, K. H. The Human Long Interspersed Element-1 Retrotransposon: An Emerging Biomarker of Neoplasia. Clin. Chem. 63, 816–822 (2017).
Published: January 18, 2021
The decision announced by the Danish prime minister, Mette Frederiksen, on November 4th to cull all minks in the northern country is projected to result in the destruction of more than 15 million animals. The Danish government claimed the decision was supported by evidence that a mutated variant of SARS-CoV-2 spreading among minks had infected humans and that this variant could interfere with future vaccine effectiveness. However, the severity of the mink viral strain is still under debate, and slaughtering all minks nationwide, with its far-reaching implications for the whole fur industry, has been described as both aggressive and overly cautious.
While the move caused surprise around the world and raised concerns about its severity, the spread of SARS-CoV-2 among farmed mink did not happen overnight. The first reported cases of COVID-19 among farmed minks in Denmark occurred on June 17th and led to more than 5,000 minks being euthanized. On September 18th, Denmark’s infectious disease research institute, the Statens Serum Institut, reported that the mutated virus from mink had formed transmission chains in humans and warned that these virus mutations could attenuate community immunity. Urged to take immediate action to minimize the risk of wider spread of the new virus mutant, the Danish prime minister announced the order to cull all minks nationwide on November 4th.
Strain tracing and variant calling
One natural question to ask is how virus strains are traced, especially for mutations that might have the unwanted potential to boost transmission and/or lethality. An earlier study1 of SARS-CoV-2 variant tracing during the early outbreak of COVID-19 in northern Germany illustrates bioinformatics’ role in virus strain tracing during this pandemic. Specifically, the techniques applied include variant calling and comparative analysis of single nucleotide polymorphisms (SNPs). Genomic data was collected via metagenomic RNA-sequencing and amplicon-sequencing before subsequent variant calling identified frequently observed clusters of SNPs. Mutations that enhance the replication and transmission of the virus are more likely to take precedence and can therefore serve as signatures for identifying prevalent strains (Figure 1). Strain clusters are critical for evaluating the dynamics of viral mutation, for diagnostic and therapeutic purposes, and to inform policies aiming to contain the spread of the virus.
Figure 1: Clustering of viral variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences recovered from the index patient, patient 1, patient 2 and 19 SARS-CoV-2 sequences from respiratory swabs collected in the same time period, in comparison to the reference sequence NC_045512. Nucleotide positions are indicated at the bottom. Only variants with sufficient coverage (>10) and single nucleotide polymorphisms (SNPs) present in more than 33% of all reads in at least one sample are included. I-III summarize sequence patterns as defined by SNPs. The frequency of variants is indicated by the heat map, ranging from grey (reference) through yellow to dark blue (variant). The quality score per individual site is indicated at the top. * indicates members of one family. Sampling dates are indicated on the right, with the sampling dates of the cases in the index cluster labelled in red.
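The idea of grouping samples by shared SNP signatures can be sketched in a few lines of Python. The sample names and variants below are invented for illustration; a real analysis would start from variant calls made against the SARS-CoV-2 reference sequence.

```python
# Each sample is summarized as its set of SNPs relative to the reference
# (notation: reference base, position, alternate base). Invented data.
samples = {
    "index":    {"C241T", "C3037T", "A23403G"},
    "patient1": {"C241T", "C3037T", "A23403G"},
    "patient2": {"C241T", "C3037T", "A23403G", "G25563T"},
    "swab19":   {"C1059T", "G25563T"},
}

def jaccard(a, b):
    # Similarity between two SNP sets: |intersection| / |union|.
    return len(a & b) / len(a | b)

# Greedy clustering: place each sample in the first cluster whose founding
# sample it resembles above a threshold, otherwise start a new cluster.
clusters = []
for name, snps in samples.items():
    for cluster in clusters:
        if jaccard(samples[cluster[0]], snps) >= 0.5:
            cluster.append(name)
            break
    else:
        clusters.append([name])
```

Here the index patient and both contacts fall into one cluster sharing a SNP signature, while the unrelated swab forms its own, mirroring how shared variant patterns define the transmission chains shown in Figure 1.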
1. Potential Overinterpretation
Comparative genomics is common practice for detecting similarities and differences between samples. However, a bigger challenge following variant discovery is evaluating the significance or potential impact of SARS-CoV-2 mutations on pandemic containment efforts. Scientists have expressed skepticism about the claim that mutations in one of the clusters identified in Denmark’s farmed minks could attenuate the efficacy of vaccines under active development. More rigorous trials are needed to evaluate the risk of the newly prevalent strain interfering with vaccination efforts.
2. Lack of universal guidelines
Risky incidents of intra- and interspecies transmission of SARS-CoV-2 have been captured and spotlighted. However, universal guidelines are still needed to prevent interspecies transmission between humans and other natural hosts. In similar emergencies, detailed criteria would be immeasurably valuable for supporting or halting such extreme public health measures as euthanizing large herds of animals. Despite its enormous costs, overcaution can still be greatly beneficial when well guided.
1. Pfefferle, S., Günther, T., Kobbe, R., Czech-Sioli, M., Nörz, D., Santer, R., Oh, J., Kluge, S., Oestereich, L., Peldschus, K. and Indenbirken, D., 2020. SARS Coronavirus-2 variant tracing within the first Coronavirus Disease 19 clusters in northern Germany. Clinical Microbiology and Infection.
Published: November 18, 2020
COVID-19 didn’t cancel these summer internships
With economies locked down this past summer and companies reluctant to hire, for many students, internship opportunities looked rather grim. Despite the shutdown of the McGill campus, the Canadian Center for Computational Genomics (C3G) shifted gears and revved up for a summer of working from home internships.
For Solomia Yanishevsky to complete her Master’s of Health Bioinformatics at Dalhousie University, she either had to do an internship or write a thesis. Desiring practical experience, Solomia expanded her job search beyond Halifax, NS and landed at the Montreal node of C3G. After many companies backed out of their hiring intentions, of the 20 students in her program, she was one of only four to find an internship position. Over the summer, Solomia worked with the Data Team to generate artificial FHIR datasets to successfully test a data ingestion algorithm. She also mapped the mCODE data standard to C3G’s Phenopacket metadata service to broaden its compatibility. As part of Genome Canada’s Canadian COVID Genomics Network (CanCOGeN) project, Solomia worked on integrating live COVID-19 data elements from six different sources into a cohesive framework, a project that is as ongoing as the global health pandemic. Seeing her work put to use and understanding how it fit into the bigger picture, Solomia embraced the responsibility. She found the work environment highly collaborative and engaging. Her coworkers were welcoming and her mentor was patient with her learning process.
Not every Queen’s University computer engineering student can tell the difference between DNA and RNA, but Soulaine Theocharides is the exception. Currently pursuing a secondary degree in biology, her summer work here at McGill (Queen’s University’s great rival) occupied the space at the intersection of the two fields. Working alongside the TechDev Team, Soulaine used computing clusters, Unix and Python to develop a searchable organizational structure for massive epigenomic datasets. Her project integrated databases and file hierarchies with genetic analysis pipelines. Soulaine rigorously documented her resilient data organizational structure so that it survives well beyond her internship.
McGill software engineering student Sebastian Ballesteros didn’t let his absence of genomic experience interfere with his summer. Instead, the novelty of genetics was a strong motivational factor for him. Leveraging previous internships working with computer graphics, Sebastian developed an online tool to visualize genomic variations as part of the Bento platform. He also transformed complicated genomic policy maps into a privacy assessment tool named D-Path (Data Privacy Assessment Tool), which allows data stewards worldwide to easily comply with regional regulations governing genetic data storage and use. Despite living in the McGill ghetto only a few blocks away from the C3G office, his internship was carried out entirely online. Though it was initially difficult to get a feel for his coworkers, Sebastian found remote work good preparation for his current semester, which is being taught entirely online. He loves Montreal’s unique personality, multicultural diversity, bilingualism and Latin flair. With its strong university focus, good population density and available housing, he wouldn’t live in any other Canadian city.
After his school, Collège Ahuntsic, cancelled all of its biotechnology laboratory technique internships, Étienne Collette’s summer was uncertain. During the previous semester at school, C3G Services Team Bioinformatics Manager François Lefebvre and C3G Bioinformatics Specialist Emmanuel Gonzalez had been guest lecturers, introducing his class to the computational side of bioinformatics. About to complete his DEC, during the three months when school was restricted to online theory classes, Étienne used the spare time to teach himself the C programming language. One thing led to another, and a three-week placement at C3G turned into a four-month internship where Étienne engaged in no fewer than six different bioinformatics projects. Coming from an interdisciplinary background, he found the variety and concurrency stimulating. Along the way, he was supported by responsive feedback from the Services Team and daily updates with François. Étienne found purpose in the real-world detective work required by genuinely unsolved bioinformatic problems. He has just started studying biochemistry at l’Université de Montréal, where he intends to specialize in genetics.
C3G and the Bourque Lab at McGill University routinely hire students for summer internships. Many continue to work part-time beyond the summer as they complete their studies. We are a bilingual lab with people from many backgrounds, countries and skill-sets represented. Summer internships are posted around December with applications accepted until mid-February. Sometimes internships happen unexpectedly too. If you, or someone you know, is interested in learning beyond the classroom, drop us a line. See what happens, eh?
Published: October 19, 2020
Making New Discoveries Using Public Data
Every research project is composed of three key elements: a question to answer, the analyses to perform and the data to use. Often, that last component is limiting. Indeed, producing new data is expensive and often time-consuming. Thankfully, a solution exists in the form of huge libraries accessible with a few clicks: public data.
Advantages of Using Public Data
Databases are full of useful data covering various techniques, technologies and organisms. For example, the Encyclopedia of DNA Elements (ENCODE) harbors more than 17,000 datasets on human, mouse, worm and fly, from RNA sequencing to whole-genome sequencing through protein binding1,2. Not only is public data easily accessible and free, it may also be stored in both raw and pre-processed forms, requiring less time and cost in subsequent analyses (do not forget to perform quality controls first!). While bioinformatics-inclined papers often use pre-existing datasets to compare tools, such datasets are otherwise greatly overlooked, either because we assume all that could be done with them has been done or because they do not have the exciting spark of novelty. But datasets may have been analyzed from only one angle and could still hold many secrets, even if they are a few years old. Additionally, the ever-growing performance of new algorithms may extract information that was previously hidden in the data. For these reasons, it can be valuable to re-analyse public data, and this can lead to new discoveries.
How to Use Public Data Efficiently
Public databases contain information about many diseases, cell types, organisms and techniques, but they are still limited to what has been explored before. One must thus slightly change one’s approach to the data in order to find a new angle of analysis. Instead of the typical “formulate question -> decide how to answer it -> produce data” workflow, the preparation requires scouting existing datasets to find some with potential for new discoveries. The analyses have to be centered around the available data rather than the opposite.
Example of New Discovery from “Old” Data
1- Formulate the research question
For my research project, I wanted to study the relationship between transcription and 3D conformation of the DNA in the nuclear space. Various studies tried to explore this before, some of the earliest dating from 19933, but the mechanism is complex and there are still many unknowns.
2- Explore datasets
One of the most common diseases in humans is lung cancer. Because of its prevalence and mortality rate, it is also one of the most studied diseases, and thus the data produced is widely available. I therefore chose to use the A549 cell line, a lung cancer cell line. Various data types have been generated from it (RNA-seq, ChIP-seq, Hi-C), permitting exploration of both the transcription events and the architecture in these cells. Moreover, being a cell line, it should have less cell-to-cell variability than cells coming from a patient biopsy.
3- Adapt the angle of exploration
Since many other studies, including the ones that produced the data I used, had tried to explore the interrelation between transcription and 3D folding, a new angle had to be found. A literature review showed there were still many unknowns regarding the different types of boundaries limiting co-regulation between genes.
A striking tendency seen while exploring the data was that the relative orientation of genes seems to influence their probability of co-regulation. We thus proposed a model stating simple “rules” that affect the probability of co-regulation of two genes in A549 cells (Figure 1). Genes located on the same strand have a very high chance of co-regulation, as the transcription machinery can simply slide from one gene to the other. When genes are located on different strands, there is less chance of co-regulation, as the machinery would have to completely unbind, then re-bind to the opposite strand. The change of strand thus introduces a type of co-regulation boundary. Finally, stronger boundaries that have been described before, such as TAD boundaries or the co-localization of Cohesin and CTCF, more strongly disrupt the probability of co-regulation. The tendencies that serve as the basis for the proposed model were all discovered using public data.
Figure 1: (A) Same-strand genes are very likely to be co-expressed, as the RNA Pol II just needs to continue its path along the strand. Divergent and convergent genes are less likely to be co-expressed, as the RNA Pol II would need to detach and reattach itself to go from one gene to the other. (B) When genes are separated by a barrier (CTCF and Cohesin or a TAD boundary), co-expression is completely disrupted.
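The proposed rules can be written down as a toy classifier. This is a sketch only: the strand symbols and the barrier flag are illustrative inputs, which in practice would come from gene annotation, Hi-C TAD calls and CTCF/Cohesin ChIP-seq peaks.

```python
def coregulation_expectation(strand_a, strand_b, barrier_between=False):
    """Qualitative co-regulation expectation for two adjacent genes,
    following the simple rules proposed for A549 cells."""
    if barrier_between:
        # A TAD boundary or co-localized CTCF+Cohesin site disrupts
        # co-regulation regardless of gene orientation.
        return "low"
    if strand_a == strand_b:
        # Same strand: the transcription machinery can slide from one
        # gene to the next, so co-regulation is very likely.
        return "high"
    # Divergent or convergent pair: the machinery must fully unbind and
    # re-bind to the opposite strand, reducing the chance of co-regulation.
    return "medium"

pair_same = coregulation_expectation("+", "+")
pair_opposite = coregulation_expectation("+", "-")
pair_blocked = coregulation_expectation("+", "+", barrier_between=True)
```

Ordering the rules from strongest boundary to weakest mirrors the model: a physical barrier dominates, and only then does relative orientation matter.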
- 1. The ENCODE Project Consortium. An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature 489, 57–74 (2012).
- 2. Davis, C. A. et al. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Research 46, D794–D801 (2018).
- 3. Jackson, D. A., Hassan, A. B., Errington, R. J. & Cook, P. R. Visualization of focal sites of transcription within human nuclei. EMBO J 12, 1059–1065 (1993).
Posted September 2020
Disambiguating mixed-species graft samples
As a bioinformatician, I often work with PDX cancer samples. I’ve recently been reading about samples containing genome admixture, and was revisiting strategies that we commonly use for analyzing such data. Presented here is a summary of the existing software tools used for this purpose.
What are Xenografts?
Studying and understanding cancer is very challenging, and animal model systems help in addressing some of the common research bottlenecks. Mice harboring human cancer cells, known as patient-derived xenograft (PDX) models, are an excellent model system available to researchers. A small number of cancer cells collected from a patient are injected into an immunocompromised mouse, where they grow to form tumors. PDX systems provide a controlled platform to study tumor biology and are especially useful for testing chemotherapeutic approaches.
Technical issue with xenograft samples
The grafted sample obtained from these mouse tumors can be subjected to NGS for genomics and transcriptomics studies. Despite meticulous efforts, it is difficult to prevent contamination of the graft samples with the host (mouse) stromal tissue, and the sequencing data obtained is usually contaminated with host DNA and RNA. This contamination can hinder correct interpretation of the data, so removing reads of host origin before downstream analysis becomes essential to ensuring accurate conclusions.
What are the methods available?
Various algorithms exist to separate host-derived reads from the rest of the sample. Almost all methods require two or more BAM files as input: one aligned to the host genome and the other aligned to the graft genome. The type of aligner used also strongly influences the choice of algorithm. I was able to find several well-documented software packages. Most of these algorithms compare the quality of read alignment and then categorize each read as either host or graft; ambiguous reads are discarded. Some of the key features of these packages are highlighted in Table 1.
| Package/software name | Compatible aligners | Comparison | Remarks | Multicore | References |
|---|---|---|---|---|---|
| Sargasso | Bowtie2, STAR | Multispecies | Custom filtering by threshold | Yes | |
| XenoSplit | Subread, Bowtie2, Subjunc, TopHat2, BWA and STAR | Maximum two species | Goodness-of-mapping scores | No | |
| Disambiguate | HISAT2, TopHat, BWA and STAR | Maximum two species | | No | |
| XenoCP | BWA | Maximum two species | Cloud-based | Yes | |
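The comparison strategy most of these tools share can be illustrated with a small sketch. This is purely illustrative and not the actual implementation of any of the packages above: each read’s best alignment score against the host and graft genomes is compared, and reads without a clear winner are discarded.

```python
def classify_reads(host_scores, graft_scores, min_margin=5):
    """Toy read classifier mirroring the idea behind tools like Disambiguate.

    host_scores / graft_scores map read name -> best alignment score against
    that genome (higher is better, e.g. the BAM 'AS' tag); a read absent from
    a dict did not align to that genome at all.
    """
    graft, host, ambiguous = [], [], []
    for name in sorted(set(host_scores) | set(graft_scores)):
        h = host_scores.get(name, float("-inf"))
        g = graft_scores.get(name, float("-inf"))
        if g - h >= min_margin:
            graft.append(name)       # clearly better on the graft genome
        elif h - g >= min_margin:
            host.append(name)        # clearly better on the host genome
        else:
            ambiguous.append(name)   # no clear winner: discard
    return graft, host, ambiguous

# A read aligning much better to the graft is kept; near-ties are discarded.
kept, removed, dropped = classify_reads(
    host_scores={"r1": 10, "r2": 50, "r3": 40},
    graft_scores={"r1": 60, "r2": 48},
)
print(kept, removed, dropped)  # ['r1'] ['r3'] ['r2']
```

The real tools differ mainly in how they score alignments and in how strictly they treat the ambiguous category, which is why Sargasso and XenoSplit come across as more stringent than Disambiguate in the comparison below.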
I tried three of these four packages (XenoCP omitted) in an active project. I selected a sample dataset that had poor read alignment to the reference genome (GRCh38), to see whether the unaligned reads were of host origin; that turned out not to be the case for this dataset. I used the number of reads recovered (assigned as graft-origin) as a parameter to compare the tools. Sargasso and XenoSplit performed very similarly and are stringent in assigning reads to the graft. Disambiguate, the oldest program of those tried, gave a slight improvement compared to standard reference genome-based alignment alone. In the future, I plan to compare these packages using a synthetic dataset with known proportions of mouse and human reads. Until then, Sargasso and XenoSplit seem promising if you are interested in specificity rather than sensitivity.
| | Sample 1 | Sample 2 | Sample 3 | Sample 4 |
|---|---|---|---|---|
| raw_reads | 187,299,238 | 140,927,692 | 328,021,530 | 342,910,322 |
- Qiu, J., et al., Mixed-species RNA-seq for elucidation of non-cell-autonomous control of gene transcription. Nature Protocols, 2018. 13(10): p. 2176-2199.
- Giner, G. and A. Lun, https://github.com/goknurginer/XenoSplit. 2019.
- Ahdesmaki, M.J., et al., Disambiguate: An open-source application for disambiguating two species in next generation sequencing data from grafted samples. F1000Res, 2016. 5: p. 2741.
- Rusch, M., et al., XenoCP: Cloud-based BAM cleansing tool for RNA and DNA from Xenograft. bioRxiv, 2020: p. 843250.
Posted August 2020.
The ANCHOR pipeline
ANCHOR is a high-resolution metagenomics pipeline, the result of a multi-disciplinary collaboration between C3G and researchers at the Institut de recherche en biologie végétale (IRBV) at Université de Montréal. Published in 2019 in the journal Environmental Microbiology, the pipeline demonstrated unprecedented accuracy in its characterization of microbiome samples (1). Dr. Emmanuel Gonzalez, co-creator of the pipeline and the metagenomics specialist at C3G, was motivated by his observation that existing tools were not performing well on real-life datasets. “C3G’s strategy towards microbiome analysis had been to run commonly used pipelines, but the resolution would sometimes deliver foggy results that made interpretation challenging. When I joined C3G, another pipeline was buzzing, using a machine learning algorithm to remove sequence noise – a characteristic of metagenomics samples. That was quite the thing back in 2015. It was clever, light and incredibly fast, except… I happened to realize that its accuracy was dropping significantly with real-world experiments involving multiple samples.”
In designing ANCHOR, Dr Gonzalez and co-creator Dr. Nicholas Brereton had resolved that the tool would be written with a strong focus on understanding the biology of metagenomics experiments. Dr Gonzalez recalls that “before I even began writing down any code, we sat down together and set up some ground rules for this new pipeline, based on the respect towards the inherent complexity of any biological system and a reassessment of the capacity and ability of technical advances to handle biological complexity. I think this starting point is what made ANCHOR stand out amidst similar metagenomic pipelines. For example, our first rule went absolutely against the trend: let’s not modify sequences!”
In the publication introducing ANCHOR, Drs Gonzalez and Brereton reanalysed metagenomic data from surface swabs within the International Space Station (ISS). The ISS experiment had originally used the Qiime pipeline, and the conclusion of that analysis was that no differences were present when comparing the surface microbial ecosystems of the Destiny (US laboratory) and Harmony (crew sleeping quarters) modules. The reanalysis using ANCHOR substantially improved the scale of data capture as well as the accuracy and resolution of the findings, providing microbial classifications at the level of individual species. The reanalysis not only led to exciting novel discoveries regarding the ISS environment but also fundamentally changed the major conclusion of the experiment, with significant differences clearly identified between the modules (Figure 1). These significant differences detected by ANCHOR included increases in microbiome bacteria associated with the laboratory animals within the Destiny module, such as Helicobacter typhlonius, a species endemic to rodent research laboratories on Earth (1).
Figure 1. ISS Destiny and Harmony Module differential abundance. Significantly differentially abundant taxa, their ISS location and illustrations indicating their known associations.
“The researchers from the first analysis did a great job, but the pipeline they used limited their ability to analyse what was inside the ISS with precision,” adds Emmanuel.
Human microbiome health
The ANCHOR pipeline was designed to improve data accuracy in high-complexity real-world systems, and its applicability quickly moved from the ISS to terrestrial health concerns. “We transitioned this approach directly to the field of human health later in 2019, after Dr. Amir Minerbi, a medical doctor from the Alan Edwards Pain Management Unit, showed great interest in using our pipeline to characterise the bacterial microbiota of women suffering from fibromyalgia,” says Emmanuel. That collaboration led to the discovery of the first microbiome association with chronic pain, through comparative analysis of the gut microbiomes of women with fibromyalgia and healthy women (2). Patients with fibromyalgia showed significant increases in species such as Clostridium scindens and Butyricicoccus desmolans, which are bottlenecks in secondary bile production (via 7α-dehydrogenase activity) as well as in bacteria-derived androgens (20α- and 20β-hydroxysteroid dehydrogenase activity) that may interact with the endocrine system (Figure 2). These important discoveries have sparked a new research direction in the pain field, and the work has already been recognised around the world.
Long-term Space Travel
At the end of 2019, the team behind ANCHOR responded to a grant call made by the Canadian Space Agency to promote projects on Health & Life Science data and sample mining. They proposed to analyse astronaut microbiomes using improved metagenomics to study the impact of long-term space travel upon astronaut health. “We’re grateful to the Canadian Space Agency for giving us the chance to apply high resolution microbiome analysis within the space sciences again and particularly excited to see whether there is an observable impact of long term confinement on the microbiome health of astronauts.”
Les microbes de la Station spatiale internationale
1. Gonzalez E, Pitre F, Brereton N. ANCHOR: A 16S rRNA gene amplicon pipeline for microbial analysis of multiple environmental samples. Environmental Microbiology. 2019.
2. Minerbi A, Gonzalez E, Brereton NJ, Anjarkouchian A, Dewar K, Fitzcharles M-A, et al. Altered microbiome composition in individuals with fibromyalgia. Pain. 2019.
ANCHOR: a 16S rRNA gene amplicon pipeline for microbial analysis of multiple environmental samples
Science Quebec article: Read now
Posted: July 22 2020
The Biobanque Québécoise de la COVID-19 (BQC19) is a province-wide biobank whose mission is to provide high-quality data and samples to the scientific and medical community in order to better understand, combat and limit the impact of coronavirus disease 2019 (COVID-19).
This initiative is led by Dr. Vincent Mooser, Director of its Executive Committee, and a Canada Research Excellence Chair in Genomic Medicine of the Faculty of Medicine at McGill University. It was made possible with the contribution and involvement of inter-institutional and multidisciplinary teams including McGill Genome Center and C3G.
The BQC19 is available to researchers across the country and around the world. Since access to high-quality data and samples is essential to winning the fight against the COVID-19 pandemic, the biobank is committed to the principles of Open Science, making all data accessible to qualified researchers.
COVID-19 Resources Canada
C3G has been part of a group of volunteer researchers, students, activists and web developers, led by Guillaume Bourque of McGill University and Tara Moriarty of the University of Toronto, who created COVID-19 Resources Canada, a website to facilitate the sharing of information, expertise and resources in the fight against COVID-19. The group aims to “Serve as a reliable source of information and expertise for COVID-19 research in Canada; Support & facilitate coordination of Canadian COVID-19 research efforts; Support COVID-19 capacity-building in public health, research and grassroots initiatives.” Its initiatives include a database of volunteers, a tool for sharing reagents used by clinicians and researchers, and a compilation of all active Canadian COVID-19 research and funding opportunities, among others.
visit COVID-19 Resources Canada
Covid-19 updates at C3G
We would like to provide a quick update amidst the coronavirus pandemic. While our host institutions (McGill University, SickKids Toronto) have put various measures in place to ensure the safety of our staff, our platform remains fully operational and available to support genomics research. Researchers should therefore not hesitate to contact us and inquire about our analysis and data management services, seek help to plan experiments, get support for a grant application or request free consultations with our bioinformaticians.
If your lab is conducting or planning to conduct COVID-19 related genomics research, our platform can help. Contact us at email@example.com.
Please remember to stay safe and healthy.
Infant Glioma: Characterizing the landscape of genetic drivers and their clinical impact
Recently published in Nature Communications, this paper presents work by members of the C3G Toronto node that integrates genomic and transcriptomic analyses to assess the molecular and clinical features of infant glioma patients. Examination of single-nucleotide variants, copy-number changes, fusion formation and other transcriptomic features revealed three clinical glioma subgroups in infants, each with distinct genetic drivers, locations in the brain and responses to treatment. Gliomas in infants have substantially different treatment outcomes compared to those that occur in children and adults, yet little is understood about the molecular basis of these differences. This paper provides a comprehensive molecular analysis of infant gliomas that helps to ascertain the biological mechanisms driving their oncogenesis and to guide future diagnostics and treatment approaches for these patients.
Methylation signatures investigations
The C3G Toronto node has also been involved in two projects investigating methylation signatures for specific conditions. We recently published a manuscript in BMC Medical Genomics describing specific DNA methylation signatures for Nicolaides-Baraitser syndrome (NBS), a rare childhood condition that affects physical features and intellectual ability (Chater-Diehl et al., 2019). We showed that specific methylation patterns are associated with pathogenic variants of the NBS causal gene, SMARCA2, which encodes the catalytic domain of a chromatin remodeling complex. We have also identified DNA methylation signatures associated with autism spectrum disorder risk loci, recently published in Clinical Epigenetics (Siu et al., 2019). We show that methylation signatures can be used to identify and distinguish individuals with specific autism-associated mutations, and can help determine whether specific gene variants are pathogenic or benign to improve autism diagnostics.
Genetics & Genomics Analysis Platform: version 2
We are pleased to announce that the new version of GenAP is now available [HERE]. This release offers a completely re-engineered platform that leverages Cloud resources at Compute Canada and will eventually be deployed on other HPC resources as well. It already offers two types of applications: Data Hubs (as in GenAP1) and a new graphical “File Browser” allowing file transfers to and from your workspaces. A new Galaxy application, including up-to-date tools and pipelines, will eventually be added.
ForCasT: a fully integrated and open source pipeline to design CRISPR mutagenesis experiments
ForCas Tool (ForCasT) is a comprehensive tool for the design, evaluation and collection of CRISPR/Cas9 guide RNAs and primers. Using robust parameters, it generates guide RNAs for target loci, assesses their quality for any potential off-target effects and designs associated primers. The results are then stored in a local database that serves as a shared resource for users within a research team, and is constantly updated to reflect the quality of guides and primers based on additional computational and wet-lab results. ForCasT is a single tool that research teams from various fields of biology can use to build and maintain a collection of guide RNAs and primers for Cas-mediated genome editing suited to their specific needs. It is currently available as a web app and as a Dockerized version, and can be found at https://github.com/ccmbioinfo/CasCADe
GenPipes: an open-source framework for distributed and scalable genomic analyses
It started in June with the publication of our beloved GenPipes framework and set of NGS data analysis pipelines in GigaScience. We use these pipelines on a daily basis for data production and routine analysis and hope the community will find it useful. While GenPipes is the product of several years of teamwork, kudos to co-first authors Mathieu Bourgey and Rola Dali who worked very hard to get this long-awaited paper out!
Altered microbiome composition in individuals with fibromyalgia
This summer has also been very special for C3G’s metagenomics specialist Emmanuel Gonzalez, with a publication in Pain highlighting a strong potential link between the microbiome and fibromyalgia, a terrible and elusive disorder affecting a large fraction of the population. The study drew considerable media attention, notably from the Montreal Gazette and the CBC. We are very proud to say that, through Emmanuel, C3G provided first-rate analysis services for experimental design, species identification, statistical analysis and machine learning, and worked hand-in-hand with Drs. Minerbi and Brereton on biological interpretation. Importantly, Emmanuel applied ANCHOR here, a method he also published earlier this year, which enables the identification of microbial species at a higher resolution than other common 16S sequencing data analysis methods.
Altered differentiation is central to HIV-specific CD4+ T cell dysfunction in progressive disease
Another noteworthy publication to which C3G members contributed as authors appeared in Nature Immunology this summer. An important focus of the study was the comparison of HIV-specific CD4+ T-cell subpopulations from patients who have undergone antiretroviral therapy and patients who spontaneously suppress HIV viral load below detectable limits (a.k.a. elite controllers). This comparison contributes to an understanding of why viral control is lost once antiretroviral therapy is interrupted.
GSoC 2019 is over!
Again this year, C3G was a Google Summer of Code organization. For people unfamiliar with it, GSoC is, in Google’s own words, “a global program that matches students up with open source, free software and technology-related organizations to write code and get paid to do it!”
We would like to thank participating students this year for their contributions.
Jiahuang Lin (TBD) – Human history and genome evolution
Konstantinos Kyriakidis (AUTh) – Batchtools for Compute Canada
Madhav Vats (IIIT Delhi) – Flowchart creator for GenPipes
Pranav Tharoor (MAHE) – MiCM Project Match
SriHarshitha Ayyalasomayajula (KMIT) – GenPipes single-cell pipeline
Tip of the Month
There is a huge library of common bioinformatics software available on Compute Canada resources via the modules maintained by C3G staff and distributed via the CernVM-File System (CVMFS). Despite the breadth of the C3G CVMFS library, there may be times when using the provided software isn’t ideal.
For example:

- you might want to use software that we haven’t yet made available via CVMFS, without having to repeatedly install it at each HPC facility;
- you might want to guarantee comparability of results by running exactly the same software stack on Compute Canada, your workstation, or infrastructure from a cloud provider such as Amazon Web Services or Google Cloud;
- a more recent version of the software may already be available in a container. Listings of existing images are available from community efforts such as biocontainers, or images may be built directly from a source repository.

In circumstances such as these, containers offer an excellent solution by packaging up your software and its dependencies into a single image that contains all the software needed for a particular analysis or workflow.
The process for running containerized software on Compute Canada can be described in three steps:
- Ensure singularity is available
- Download a container
- Run your containerized software
Step 1: Ensure singularity is available
At all Compute Canada facilities, singularity is available as a module. Loading the module is as simple as running:
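For instance, on a Compute Canada login node this typically looks like the following (the module name `singularity` is an assumption; check `module spider singularity` for the versions available at your site):

```shell
# Load the Singularity module provided by the cluster
module load singularity
# Confirm the executable is now on your PATH
singularity --version
```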
If you’re running Singularity on your Linux laptop or workstation, download instructions are available here.
Step 2: Download a container
Many software stacks are already available as Docker images in repositories such as Docker Hub or Quay.io. Unfortunately, running Docker on shared clusters introduces potential security vulnerabilities. Fortunately for us, Singularity can use Docker images to build new Singularity containers. For example, let’s say we wanted to run the genometools suite. The biocontainers repository shows that the latest version (1.5.10) is available at quay.io/repository/biocontainers/genometools-genometools as a Docker image. To download the image to my Compute Canada instance, I can run “singularity pull”:
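The pull command looks something like this (the exact image tag below is illustrative; check the biocontainers listing on Quay.io for the current tag):

```shell
# Pull the Docker image from Quay.io and convert it into a Singularity image
singularity pull docker://quay.io/biocontainers/genometools-genometools:1.5.10--h470a237_1
```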
This produces the Singularity image (a “.sif” file) in the current directory.
Step 3: Run your containerized software
To run the genometools suite from inside the new container, prepend your command with “singularity exec”:
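For example, to check the genometools version from inside the image (the .sif filename follows singularity pull’s default naming; adjust it to match the file you actually downloaded):

```shell
# Run the genometools CLI ("gt") from inside the container
singularity exec genometools-genometools_1.5.10--h470a237_1.sif gt --version
```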
That’s it! You have a perfectly reproducible software stack running without needing to worry about installation or dependencies.
Next Steps and Getting Help
As you might imagine, there are plenty of details we don’t have time to cover in this short blog post. If you’d like to learn more, or if you’re having trouble, there are plenty of ways to find help.
- The Compute Canada wiki has an excellent page on running containers on their infrastructure (en/fr)
- The Singularity docs are the definitive guide
- C3G holds a weekly open-door session to which you are welcome to bring questions about containers and reproducible bioinformatics analyses.
Why Better Data Sharing Means Better Health Care
The future of personalized medicine depends on data sharing, according to Yann Joly, Research Director of the Centre of Genomics and Policy, and Guillaume Bourque, Director of the Canadian Centre for Computational Genomics.
Using big data techniques to analyze the function of human genes is already helping to develop treatments tailored to individual patients. The more data researchers can access from across the world, the better the chances of treating even rare diseases. But privacy and consent regulations differ by country, making sharing this information across borders slow and frustrating.