Using genomics to uncover risk factors for major illnesses or to search for relatives is expensive and time-consuming because it requires analyzing large numbers of genomes. A team led by a computer scientist from Johns Hopkins University has leveled the playing field by developing a cloud-based platform that gives genomics researchers quick access to one of the world’s biggest datasets.
The new platform, known as AnVIL (Genomic Data Science Analytic, Visualization, and Informatics Lab-space), provides access to hundreds of analysis tools, patient information, and over 300,000 genomes to any researcher with an Internet connection. The study, which was funded by the National Human Genome Research Institute (NHGRI), was published in Cell Genomics today.
“AnVIL is inverting the concept of genomics data sharing, promising to allow exciting new discoveries by linking researchers and datasets in new ways,” stated project co-leader Michael Schatz, Bloomberg Distinguished Professor of Computer Science and Biology at Johns Hopkins University.
Typically, genomic analysis begins with researchers downloading large volumes of data from centralized warehouses to their own data centers, a procedure that is not only time-consuming, inefficient, and costly, but also makes collaboration with researchers from other universities difficult.
“AnVIL will have a huge impact on institutions of all sizes, particularly those with little means to develop their own data centers. It is our aim that AnVIL would level the playing field, allowing everyone to make discoveries on an equal footing,” Schatz said.
The genetic risk factors for diseases like cancer and cardiovascular disease are typically modest, requiring researchers to examine the genomes of thousands of people to find new connections. Because a single human genome contains roughly 40GB of raw data, downloading thousands of genomes may take several days to weeks. According to Schatz, a single genome takes up around 10 DVDs’ worth of data, so transferring thousands entails moving “tens of thousands of DVDs worth of data.”
Furthermore, many studies require integrating data acquired at various institutions, which means each institution must download its own copy while still protecting patient data. As researchers embark on ever-larger projects requiring the simultaneous analysis of hundreds of thousands to millions of genomes, this problem is projected to become much worse.
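The scale of those downloads can be checked with some quick arithmetic. The sketch below is illustrative only: the 40GB-per-genome figure comes from the article, while the 4.7GB single-layer DVD capacity, the 1 gigabit-per-second sustained link speed, the 5,000-genome cohort, and the function names are all assumptions chosen for the example, not figures from the AnVIL team.

```python
import math

GENOME_GB = 40.0   # raw data per human genome, as stated in the article
DVD_GB = 4.7       # assumed: capacity of a single-layer DVD
LINK_GBPS = 1.0    # assumed: sustained download speed in gigabits per second


def dvds_needed(n_genomes, genome_gb=GENOME_GB, dvd_gb=DVD_GB):
    """Number of whole DVDs required to hold n_genomes of raw data."""
    return math.ceil(n_genomes * genome_gb / dvd_gb)


def transfer_days(n_genomes, genome_gb=GENOME_GB, link_gbps=LINK_GBPS):
    """Days needed to download n_genomes at a constant link speed."""
    total_gigabits = n_genomes * genome_gb * 8  # 8 bits per byte
    seconds = total_gigabits / link_gbps
    return seconds / 86_400                     # seconds per day


if __name__ == "__main__":
    cohort = 5_000  # hypothetical study size
    print(f"DVDs for one genome: {dvds_needed(1)}")
    print(f"DVDs for {cohort} genomes: {dvds_needed(cohort)}")
    print(f"Days to download {cohort} genomes: {transfer_days(cohort):.1f}")
```

Under these assumptions, one genome fills about nine DVDs, and a 5,000-genome cohort takes more than two weeks to download, in line with the "several days to weeks" estimate. Slower or shared connections would stretch that further.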
“Connecting to AnVIL remotely avoids the need for these large downloads and saves time and money,” Schatz explains. “Instead of painstakingly transferring data to researchers, we make it simple for them to access data on the cloud. It also makes it much simpler to share information, allowing data to be combined in novel ways to discover new relationships, and it simplifies a number of computing difficulties, such as providing robust encryption and privacy for patient records.”
AnVIL also provides researchers with a number of major analysis tools, including Galaxy, which was developed in part at Johns Hopkins, as well as other popular tools like R/Bioconductor, Jupyter notebooks, WDL workflows, Gen3, and Dockstore, which can be used for both interactive and large-scale batch computing. Taken together, these technologies enable researchers to undertake even the most complex projects without having to set up their own computing infrastructure.
The platform is presently being used by researchers from all around the globe to investigate a number of hereditary illnesses, including autism spectrum disorders, cardiovascular disease, and epilepsy. Schatz’s Telomere-to-Telomere Consortium team utilized it to reanalyze hundreds of human genomes using the new reference genome and find over 1 million additional variations.
With plans to host many more projects in the near future, the AnVIL team has already collected petabytes of data from several of the largest NHGRI projects, including hundreds of thousands of genomes from the Genotype-Tissue Expression (GTEx), Centers for Mendelian Genetics (CMG), and Centers for Common Disease Genomics (CCDG) projects.
The AnVIL team includes researchers from Johns Hopkins University, the Broad Institute of MIT and Harvard, Harvard University, Vanderbilt University, the University of Chicago, Oregon Health & Science University, Yale University School of Medicine, the University of California, Santa Cruz, Roswell Park Comprehensive Cancer Center, Penn State University, the City University of New York, the Carnegie Institution, and Washington University in St. Louis.
The Broad Institute and Johns Hopkins University were supported by NHGRI cooperative agreement grants, as well as co-funding from the National Institutes of Health’s Office of Data Science Strategy.