The future of genome sequencing is object storage

By Davide Villa, Director of Business Development EMEAI, Western Digital.

Tuesday, 23rd August 2022. Posted by Phil Alsop.

Did you know that a single human genome sequence produces over 200 gigabytes of raw data? That means, if 100 million genomes are sequenced in the next two to three years, researchers will have collected roughly 20 billion gigabytes of raw data worldwide. But why is collecting and storing human genomes so important?
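As a quick check on that arithmetic, the 20 billion gigabyte total corresponds to about 200 gigabytes of raw data per genome across 100 million genomes. The short sketch below reproduces the calculation; the per-genome figure is treated as a round-number assumption, and the conversion to exabytes uses decimal units.

```python
# Back-of-the-envelope estimate of worldwide raw sequencing data.
GENOMES = 100_000_000      # 100 million genomes (figure from the article)
GB_PER_GENOME = 200        # assumed round figure: ~200 GB of raw data per genome


def total_raw_data_gb(genomes: int, gb_per_genome: int) -> int:
    """Total raw data in gigabytes for a given number of genomes."""
    return genomes * gb_per_genome


total_gb = total_raw_data_gb(GENOMES, GB_PER_GENOME)
total_eb = total_gb / 1_000_000_000   # decimal units: 1 exabyte = 10^9 gigabytes

print(f"{total_gb:,} GB")    # 20,000,000,000 GB
print(f"{total_eb:.0f} EB")  # 20 EB
```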

Researching human genomes

Research around the human genome and how it affects health and disease has advanced dramatically in recent years. It has been over ten years, and millions of pounds in investment, since the first reference human genome sequences were produced and cautiously studied by academics. Now, the genomes of thousands of people from various ethnic backgrounds have been sequenced. This enormous surge in activity has been made possible by breakthroughs in sequencing technology. For example, a group of Stanford scientists achieved a Guinness World Record for the fastest DNA sequencing technique, with a time of five hours and two minutes, thanks to a newly developed whole genome nanopore sequencing approach. This ultra-rapid DNA sequencing technology, combined with significant cloud computing and storage, means scientists and doctors can diagnose rare genetic conditions in an average of eight hours.

According to a scientific study recently published in the New England Journal of Medicine, whole genome sequencing (WGS) can provide new diagnoses for patients across the widest range of rare diseases studied to date, and has the potential to have a significant positive impact on the NHS. These positive impacts could be advanced further through the COVID-19 Genomics UK (COG-UK) consortium, a UK-wide public health surveillance initiative that produces and analyses large-scale SARS-CoV-2 sequencing datasets to map the virus's occurrence and spread in the UK, and which received a significant investment from the UK Government in March 2020.

More than 100 million genomes are expected to have been sequenced through genomic projects by 2025. Big pharma and government population genomics initiatives are currently gathering enormous volumes of data, and this quantity is expected to keep rising. With the right analysis and interpretation, this data has the potential to usher in a new era of precision medicine. It is therefore crucial to rethink how this data is managed, stored, and presented.

Genome sequencing and data

According to research conducted by Cardiff University, COVID-19 significantly disrupted the research sector, as lockdowns required scientists to perform their research outside of secure labs. Because of the magnitude of genome data that is available and stored across numerous locations in an IT environment, scientists were worried about data loss when transitioning between each step of the genome sequencing process.
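One common way to detect loss or corruption when data moves between workflow steps is to compare checksums before and after each transfer. The article does not describe a specific method, so the following is only a minimal sketch of that general idea using Python's standard `hashlib`; the in-memory streams stand in for real files, and the function name and chunk size are illustrative assumptions.

```python
import hashlib
import io
from typing import BinaryIO

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time so large files need not fit in memory


def sha256_of_stream(stream: BinaryIO) -> str:
    """Return the SHA-256 hex digest of everything readable from the stream."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(CHUNK_SIZE), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Illustrative data only: in-memory stand-ins for a file before and after transfer.
original = io.BytesIO(b"ACGT" * 1000)
copied_ok = io.BytesIO(b"ACGT" * 1000)
copied_truncated = io.BytesIO(b"ACGT" * 999)   # simulates data lost in transit

expected = sha256_of_stream(original)
ok = sha256_of_stream(copied_ok) == expected
truncated_detected = sha256_of_stream(copied_truncated) != expected

print("intact copy verified:", ok)
print("truncated copy detected:", truncated_detected)
```

Recording the digest alongside each file at the end of one step lets the next step verify its input before starting work.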

The administration, storage, and display of sequence data must evolve to keep pace with changes in data volumes and formats, which presents a challenge for bioinformatics (the computer applications and analysis techniques that collect and study biological data).

Bioinformatics workflow data

Recent advances in the life sciences have led to technological advancements in high-resolution imaging, genome sequencing, and analysis tools. A large volume of data is being pushed through the bioinformatics workflow on the research path to providing treatments, precision medicine, and superior healthcare:

· Primary Analysis: Data from devices like sequencers and microscopes is gathered in the first step of the bioinformatics workflow, frequently as multi-terabyte files. Pre-processing is the initial step before data analysis. To prepare the data for analysis, it involves quality checking, trimming, and filtering. Once the data is prepared, it is fed into supercomputers to produce DNA sequences. High performance computing (HPC) is used in this stage to distribute the workloads among HPC compute clusters by utilising parallel file systems. Solid state drives (SSDs) with consistent ultra-low latency and extreme performance support such clusters, helping to speed up analysis, and can be used as overflow for data sets larger than physical memory.

· Secondary Analysis: Numerous pre-processing steps are necessary for the secondary analysis as well. These include actions like analysing and merging files. Researchers then perform sequence alignment, a method of arranging DNA, RNA, or protein sequences to find similarities and build links between various sequences. The output is kept in either the Sequence Alignment Map (SAM) format or its compressed binary equivalent, the Binary Alignment Map (BAM) format. Both formats produce very large files and demand high-capacity storage. Object storage is often used as it is extremely durable and has the capacity to support those file types.

· Tertiary Analysis: The tertiary analysis stage involves a variety of post-processing steps depending on the research focus (genome, protein, or molecular). For easier comprehension, the outcomes of analysis processes such as variant identification, annotation, interpretation, and data mining are frequently presented visually. These files are also kept in large archives, such as object storage systems.

· Cyclical and Collaborative Analysis: Research data is shared collaboratively and reused concurrently and persistently. By sharing BAM or other file types, the scientific community can access larger genetic pools and test beds. As a result, many academics can access data to perform secondary or tertiary analysis without first performing primary analysis. Most bioinformatics environments will have a substantial collection of BAM files from national genome repositories, research universities, and other sources.
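The quality checking, trimming, and filtering described under Primary Analysis above can be sketched roughly as follows. This is a simplified illustration, not the pipeline any particular sequencer or tool uses: it assumes reads carry one Phred+33 quality character per base (as in the common FASTQ convention), and the `Read` type, function names, and thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List

PHRED_OFFSET = 33  # assumption: Phred+33 quality encoding


@dataclass
class Read:
    name: str
    bases: str
    quality: str  # one ASCII quality character per base


def phred_scores(quality: str) -> List[int]:
    """Quality checking: decode quality characters into numeric Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality]


def trim_tail(read: Read, min_q: int) -> Read:
    """Trimming: drop bases from the 3' end while their score is below min_q."""
    scores = phred_scores(read.quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return Read(read.name, read.bases[:end], read.quality[:end])


def passes_filter(read: Read, min_len: int, min_mean_q: float) -> bool:
    """Filtering: keep reads that are long enough and of good average quality."""
    if not read.bases or len(read.bases) < min_len:
        return False
    scores = phred_scores(read.quality)
    return sum(scores) / len(scores) >= min_mean_q


def preprocess(reads: Iterable[Read], min_q: int = 20,
               min_len: int = 4, min_mean_q: float = 25.0) -> List[Read]:
    """Trim every read, then keep only those that pass the filter."""
    trimmed = (trim_tail(r, min_q) for r in reads)
    return [r for r in trimmed if passes_filter(r, min_len, min_mean_q)]


# Illustrative reads: 'I' decodes to 40, '5' to 20, '#' to 2.
reads = [
    Read("r1", "ACGTACGT", "IIIIII##"),  # low-quality tail is trimmed, read kept
    Read("r2", "ACGT", "I###"),          # too short after trimming, dropped
    Read("r3", "ACGTAC", "555555"),      # mean quality too low, dropped
]
clean = preprocess(reads)
print([(r.name, r.bases) for r in clean])  # [('r1', 'ACGTAC')]
```

Real pipelines operate on multi-terabyte inputs and spread this work across compute clusters, but the per-read logic follows the same check, trim, filter pattern.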

How to store?

The default storage tier for bioinformatics operations has traditionally been Network Attached Storage (NAS). For the life sciences sector, however, storage needs are outgrowing what NAS can offer. The amount of data in the life sciences is growing rapidly, and NAS can be an expensive primary storage option that is challenging to scale.

Object storage, a type of unstructured data storage, has now become the industry standard for storing genome sequencing data. It successfully handles data expansion, long-term data preservation, and random access to data when scientists and other medical professionals need it. It removes the scaling restrictions of traditional file storage and aligns with cloud workflows. It makes finding files within a single global namespace over petabytes of storage more efficient. Object storage is frequently more cost-effective and offers a cloud-ready environment that makes it easier to collaborate with other institutions or internally. As the technology advances, object storage could also help extend the useful lifespan of genomic data.
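To make the idea of a single global namespace more concrete, the toy model below treats an object store as a flat mapping from keys to objects, each carrying its own metadata. It is a hypothetical in-memory sketch for illustration only, not the API of any real object storage product; the class, method names, keys, and metadata fields are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StoredObject:
    data: bytes
    metadata: Dict[str, str] = field(default_factory=dict)


class ToyObjectStore:
    """Hypothetical in-memory stand-in for an object store (not a real API)."""

    def __init__(self) -> None:
        # One flat namespace: every object is addressed by a unique key.
        self._objects: Dict[str, StoredObject] = {}

    def put(self, key: str, data: bytes, **metadata: str) -> None:
        """Store an object together with descriptive metadata."""
        self._objects[key] = StoredObject(data, dict(metadata))

    def get(self, key: str) -> bytes:
        """Random access: fetch any object directly by its key."""
        return self._objects[key].data

    def list_keys(self, prefix: str = "") -> List[str]:
        """List keys sharing a prefix; there are no real directories."""
        return sorted(k for k in self._objects if k.startswith(prefix))

    def find(self, **criteria: str) -> List[str]:
        """Find keys whose metadata matches every given criterion."""
        return sorted(
            key for key, obj in self._objects.items()
            if all(obj.metadata.get(name) == value
                   for name, value in criteria.items())
        )


# Illustrative keys and placeholder contents.
store = ToyObjectStore()
store.put("cohort-a/sample-001.bam", b"placeholder", fmt="BAM", stage="secondary")
store.put("cohort-a/sample-001.sam", b"placeholder", fmt="SAM", stage="secondary")
store.put("cohort-b/sample-002.bam", b"placeholder", fmt="BAM", stage="secondary")

print(store.list_keys("cohort-a/"))
print(store.find(fmt="BAM"))
```

Because every object is addressed by key and described by metadata rather than by its position in a directory tree, capacity can grow by adding storage behind the same namespace.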

Data security

While looking at someone's DNA can frequently aid in the prevention, diagnosis, and treatment of many diseases, obtaining that person's genetic fingerprint also exposes the personal information contained in the genome. This is the dilemma facing the future of precision medicine. Suddenly, individuals are sharing their six billion base pair genomes with those who are sequencing them. Whatever the objective, privacy is at risk whenever genome mapping and sequencing are carried out.

To stop hackers or unauthorised users from accessing data, most parties must employ a variety of security measures, including secure access control and encryption. No matter where sensitive data is located, it needs to be protected as part of a bigger, organisation-wide plan. Inadequate storage management can pose serious risks to a healthcare organisation and its patients because the majority of end users are unaware of IT security risks. This is especially true for genetic information.

What next?

New potential in healthcare has been unlocked by the explosion of data, technological advancements such as genome sequencing systems, and the convergence of cutting-edge research and data analytics. As a result, the right storage infrastructure has never been more crucial to support the demands of today's clinical workloads and provide quick, high-quality patient outcomes. Object storage and SSDs, and the continued evolution of these storage technologies, give scientists and researchers confidence in the way clinical data can be accessed, stored, and secured. Correlating genomic, clinical, or behavioural data with pioneering treatment knowledge and research can aid in providing patients with accurate diagnoses, assessments, and medication while potentially identifying future risk factors and diseases.