Meraculous: Deciphering the ‘Book of Life’ With Supercomputers

by Linda Vu, Berkeley Lab News Center

Genomes are like the biological owner’s manual for all living things. Cells read DNA instantaneously, getting instructions necessary for an organism to grow, function and reproduce. But for humans, deciphering this “book of life” is significantly more difficult.

Nowadays, researchers typically rely on next-generation sequencers to translate the unique sequences of DNA bases (there are only four) into letters: A, G, C and T. While DNA strands can be billions of bases long, these machines produce very short reads, about 50 to 300 characters at a time. To extract meaning from these letters, scientists need to reconstruct portions of the genome—a process akin to rebuilding the sentences and paragraphs of a book from snippets of text.
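The snippets-to-sentences analogy can be made concrete with a toy example. The sketch below is a simple greedy overlap merge, chosen only for illustration; it is not the algorithm Meraculous uses, and the function names `overlap` and `greedy_assemble` are hypothetical. It assumes error-free reads from a single strand and a sequence without repeats.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments with the largest overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return frags  # nothing left to merge
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)


# Three short, overlapping "reads" taken from one made-up 14-base sequence.
reads = ["ATGGCGT", "GCGTGCA", "TGCAATCC"]
contigs = greedy_assemble(reads)
print(contigs)
```

Comparing every fragment against every other fragment is workable for three reads but not for billions, which is part of why assembly at genome scale calls for different algorithms and large machines.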

But this process can quickly become complicated and time-consuming, especially because some genomes are enormous. For example, while the human genome contains about 3 billion bases, the wheat genome contains nearly 17 billion bases and the pine genome contains about 23 billion bases. Sometimes the sequencers will also introduce errors into the dataset, which need to be filtered out. And most of the time, the genomes need to be assembled de novo, or from scratch. Think of it like putting together a ten billion-piece jigsaw puzzle without a complete picture to reference.
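One common heuristic for filtering sequencing errors is to count every length-k substring (k-mer) across all reads and keep only those seen more than once, since a random error rarely produces the same wrong k-mer twice. The sketch below illustrates that idea under stated assumptions: the article does not say how Meraculous filters errors, and the names `kmers` and `solid_kmers` and the threshold `min_count` are hypothetical.

```python
from collections import Counter


def kmers(read: str, k: int) -> list[str]:
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def solid_kmers(reads: list[str], k: int, min_count: int = 2) -> set[str]:
    """K-mers that occur at least `min_count` times across all reads."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, c in counts.items() if c >= min_count}


# Three identical reads plus one read with a single-base error (G -> T).
reads = ["ATGGCG", "ATGGCG", "ATGGCG", "ATGTCG"]
trusted = solid_kmers(reads, k=4)
print(sorted(trusted))
```

In this toy input the k-mers that span the erroneous base each appear only once, so they fall below the threshold and are dropped.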

A multi-institutional team of researchers has cut genome assembly from a months-long process to minutes with a parallelized version of Meraculous, an assembly tool that combines new parallel algorithms and computational methods with the Unified Parallel C (UPC) programming language.

“Using the parallelized version of Meraculous, we can now assemble the entire human genome in about eight minutes using 15,360 computer processor cores,” says University of California, Berkeley graduate student Evangelos Georganas. “With this tool, we estimate that the output from the world’s biomedical sequencing capacity could be assembled using just a portion of [the National Energy Research Scientific Computing Center’s] Edison supercomputer.”

To make efficient use of massively parallel systems, Georganas developed an algorithm for de novo assembly that exploits the one-sided communication and Partitioned Global Address Space (PGAS) capabilities of the UPC programming language. PGAS enables researchers to treat the physically independent memories of each supercomputer node as one address space, reducing the time and energy the supercomputer spends swapping information between nodes.
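The idea of one address space spread over many memories can be sketched in a few lines. The following is a single-process Python simulation, not UPC code and not the Meraculous implementation: each "node" is modeled as a separate dictionary, and the class name `PartitionedTable` and its methods are hypothetical. A hash table is used purely as an illustration of data that lives in a partitioned global space.

```python
import zlib


class PartitionedTable:
    """Toy model of a key-value table spread across several nodes.

    Every key has exactly one owning node, yet callers read and write
    through one interface, as if all partitions were a single memory.
    """

    def __init__(self, num_nodes: int) -> None:
        # One dict per simulated node: the "physically independent memories".
        self.partitions = [{} for _ in range(num_nodes)]

    def owner(self, key: str) -> int:
        # Deterministic hash so a key always maps to the same node.
        return zlib.crc32(key.encode()) % len(self.partitions)

    def put(self, key: str, value: int) -> None:
        # Models a one-sided write: the caller stores directly into the
        # owner's partition; the owner runs no matching receive call.
        self.partitions[self.owner(key)][key] = value

    def get(self, key: str, default=None):
        # Models a one-sided read from the owner's partition.
        return self.partitions[self.owner(key)].get(key, default)


table = PartitionedTable(num_nodes=4)
table.put("ATGG", 3)
print(table.get("ATGG"), "stored on node", table.owner("ATGG"))
```

On a real machine the partitions would sit on different nodes and the runtime would move the data over the network; the simulation only shows the addressing model, not the performance.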

Leonid Oliker, a researcher in Lawrence Berkeley National Laboratory’s Computational Research Division, says the new parallel algorithms allow assembly calculations to be performed rapidly. Meraculous developer Jarrod Chapman thinks this milestone could make it possible to use the tool for metagenome analysis.
