Computer scientists at Rice University introduced Emu, an algorithm that uses long reads of genomes to identify the species of bacteria in a community. The program could simplify sorting harmful bacteria from helpful ones in microbiomes such as those in the gut, in agriculture and in the environment. Credit: Kristen Curry / Rice University
Part of a gene is better than none when identifying a species of microbe. But for Rice University computer scientists, part was not enough in their search for a program that identifies all the species in a microbiome.
Emu, their microbial community profiling software, effectively identifies bacterial species using long DNA sequences that span the entire length of the gene under study.
The Emu project, led by computer scientist Todd Treangen and graduate student Kristen Curry of Rice’s George R. Brown School of Engineering, facilitates the analysis of a key gene that microbiome researchers use to sort out species of bacteria that could be harmful, or helpful, to humans and the environment.
Their target, 16S, is a subunit of the rRNA (ribosomal ribonucleic acid) gene, the use of which was pioneered by Carl Woese in 1977. This region is highly conserved in bacteria and archaea and also contains variable regions that are critical for separating different genera and species.
“It is commonly used for microbiome analysis because it is present in all bacteria and most archaea,” said Curry, a third-year student in the Treangen group. “Because of that, there are regions that have been conserved over the years that make it easy to target. In DNA sequencing, we need parts of the gene to be the same in all bacteria so we know what to look for, and then we need parts to be different so we can tell bacteria apart.”
The study by the Rice team, with collaborators in Germany and at the Houston Methodist Research Institute, Baylor College of Medicine and Texas Children’s Hospital, appears in the journal Nature Methods.
A diagram illustrates the relative simplicity of whole-genome shotgun (WGS) sequencing and Emu, a technique developed at Rice University to identify bacterial species using long DNA sequences from the common 16S gene, which is highly conserved in bacteria. The program could simplify sorting harmful bacteria from helpful ones in microbiomes. Credit: Kristen Curry / Rice University
“Years ago we tended to focus on the bad bacteria, or what we thought was bad, and we didn’t care about the others,” Curry said. “But there’s been a shift in the last 20 years to the point where we think maybe some of these other bacteria that turn up mean something.
“This is what we refer to as the microbiome, all the microscopic organisms in an environment,” Curry said. “The environments studied typically include water, soil and the intestinal tract, and microbes have been shown to affect crops, carbon sequestration and human health.”
Emu, whose name derives from its task of “expectation-maximization,” analyzes full-length 16S sequences from bacteria processed by an Oxford Nanopore MinION handheld sequencer and uses sophisticated error correction to identify species based on nine distinct hypervariable regions.
“With the previous technology we could only read a portion of the 16S gene,” Curry explained. “It has about 1,500 base pairs, and with short-read sequencing you can only sequence up to 25%-30% of this gene. However, you really need the full-length gene to achieve species-level accuracy.”
But even the newest technology isn’t perfect, as it allows errors to slip into sequences.
“Although error rates have dropped in recent years, reads can still have up to 10% error within an individual DNA sequence, while species can be separated by a handful of differences in their 16S gene,” said Treangen, an assistant professor of computer science who specializes in tracking infectious disease. “Distinguishing sequencing error from true differences represented the main computational challenge of this research project.
“One of the problems is that much of the error is not random, meaning it can happen repeatedly in specific positions and then start to look like real differences instead of sequencing error,” he said.
“Another problem is that there can be thousands of bacterial species in a given sample, creating a complex mix of microbes that can exist at abundances well below the sequencing error rate,” Treangen said. “That means we can’t just rely on ad hoc cutoffs to distinguish the signal from the error.”
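The figures quoted above can be set side by side with some back-of-envelope arithmetic. The sketch below is illustrative only: the gene length (about 1,500 base pairs), the short-read coverage (25%-30%) and the worst-case error rate (10%) come from the article, while the value of five true differences standing in for "a handful" is an assumption.

```python
# Back-of-envelope arithmetic using figures quoted in the article.
GENE_LENGTH_BP = 1500      # approximate length of the 16S gene
MAX_ERROR_PERCENT = 10     # "up to 10%" error within a single read
TRUE_DIFFERENCES = 5       # ASSUMPTION: stand-in for "a handful" of differences

# Portion of the gene a short read can cover (25%-30% of its length).
short_read_span_bp = (GENE_LENGTH_BP * 25 // 100, GENE_LENGTH_BP * 30 // 100)

# Worst-case number of erroneous bases in one full-length read.
expected_errors = GENE_LENGTH_BP * MAX_ERROR_PERCENT // 100

# How many sequencing errors there may be for every genuine difference.
errors_per_true_difference = expected_errors / TRUE_DIFFERENCES

print(f"Short reads cover roughly {short_read_span_bp[0]}-{short_read_span_bp[1]} bp")
print(f"Worst-case errors per full-length read: {expected_errors}")
print(f"Errors per true difference: {errors_per_true_difference:.0f}")
```

Under these assumptions, errors in a single read can outnumber real species-level differences many times over, which is why a simple cutoff is not enough.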
Instead, Emu learns to distinguish between signal and error by comparing a multitude of long sequences, first against a template and then against each other, refining its error correction iteratively as it profiles microbial communities. In the experiments performed, false positives dropped significantly with Emu compared to other approaches analyzing the same data sets.
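For readers curious about what "expectation-maximization" looks like in practice, here is a minimal toy sketch of the general technique applied to abundance estimation. It is not Emu's code and makes no claim about how Emu is implemented: the function name `estimate_abundances` and the table of per-read likelihoods are hypothetical, and a real profiler would derive such likelihoods from alignments of reads against a reference database.

```python
def estimate_abundances(likelihoods, n_iter=200, tol=1e-9):
    """Toy EM. likelihoods[i][s] is P(read i | species s); returns abundances."""
    n_reads = len(likelihoods)
    n_species = len(likelihoods[0])
    abundances = [1.0 / n_species] * n_species  # start from a uniform guess

    for _ in range(n_iter):
        totals = [0.0] * n_species
        for row in likelihoods:
            # E-step: share each read among species in proportion to
            # (current abundance) x (likelihood of the read under that species).
            weights = [a * p for a, p in zip(abundances, row)]
            norm = sum(weights)
            for s in range(n_species):
                totals[s] += weights[s] / norm
        # M-step: new abundance is the average share of reads per species.
        updated = [t / n_reads for t in totals]
        converged = max(abs(u - a) for u, a in zip(updated, abundances)) < tol
        abundances = updated
        if converged:
            break
    return abundances


# HYPOTHETICAL likelihoods for six reads against three reference species.
reads = [
    [0.90, 0.09, 0.01],
    [0.85, 0.14, 0.01],
    [0.80, 0.19, 0.01],
    [0.10, 0.89, 0.01],
    [0.50, 0.49, 0.01],  # ambiguous read
    [0.55, 0.44, 0.01],  # ambiguous read
]
profile = estimate_abundances(reads)
print([round(p, 3) for p in profile])
```

In this toy setup, the third species is only ever weakly supported, so its estimated abundance shrinks toward zero over the iterations instead of being removed by a fixed threshold, while the ambiguous reads are divided between the first two species according to the evolving abundance estimates.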
“Long reads represent a disruptive technology for microbiome research,” Treangen said. “The goal of Emu was to take advantage of all the information contained across the full-length 16S gene, without masking anything, to see if we could get more accurate genus- or species-level calls. And that’s exactly what we got with Emu, thanks to a fruitful and multidisciplinary collaborative effort.”
Alexander Dilthey, Professor of Microbiology and Genomic Immunity at Heinrich Heine University, Düsseldorf, Germany, is the corresponding author of the paper.
More information: Kristen Curry et al., Emu: species-level microbial community profiling of full-length 16S rRNA Oxford Nanopore sequencing data, Nature Methods (2022). DOI: 10.1038/s41592-022-01520-4. www.nature.com/articles/s41592-022-01520-4
Provided by Rice University
Citation: Emu software uses a common gene to profile microbial communities (2022, June 30) retrieved July 2, 2022 from