Innovative Shared Research Computing Storage Project Takes Shape

Powered by storage technology supplied by Microway, the Northeast Storage Exchange is changing the way Boston-area universities approach research data storage.

  • Saturday, 4th July 2020, posted by Phil Alsop

Born out of a groundbreaking regional high-performance computing project, the Northeast Storage Exchange (NESE) aims to break new ground of its own: a long-term, growing, self-sustaining data storage facility serving regional researchers as well as national and international-scale science and engineering projects.

To achieve these goals, a diverse team has built a regional technological milestone: New England’s largest data lake.

The story of creating this data lake is a lesson in cross-organizational collaboration, the growth of oceans of research data, changes in storage technology, and even vendor management.

Finding the right technology – hardware, firmware, and software – for a large-scale project that must meet such a diverse range of data storage needs is challenging. Now that the project has launched, though, both the NESE team and industry partners like Microway are confident in its capacity to meet growing research computing storage demands in a way that encourages end-user buy-in and unprecedented collaboration.

The Beginnings of MGHPCC and NESE

The Massachusetts Green High Performance Computing Center, or MGHPCC for short, is among the most innovative large-scale computing projects in the country. This project brings together the major research computing deployments from five Boston-area universities into a single, massive datacenter in Holyoke, Massachusetts.

The 15 megawatt, 780-rack datacenter is built to be an energy- and space-efficient hub of research computing, with a single computing floor shared by thousands of researchers from Boston University, Harvard University, Massachusetts Institute of Technology, Northeastern University, and the entire University of Massachusetts system. Because the datacenter runs on hydroelectric and nuclear power, it has a virtually zero carbon footprint. By sharing the Holyoke site, all of the member institutions gain lower space and energy costs, as well as the significant intangible benefit of simpler collaboration across research teams and institutions.

As of 2018, the facility was more than two-thirds full, housing a total of 330,000 computing cores. It currently hosts the main research computing systems of the five founding universities, as well as those of national and international collaborative data science teams.

It follows, naturally, that an innovative research computing project like MGHPCC would require an equally innovative data storage solution. Enter NESE, the Northeast Storage Exchange project, supported by the National Science Foundation. The partner institutions include Boston University, MGHPCC, Massachusetts Institute of Technology, Northeastern University, and the entire University of Massachusetts system. Scott Yockel and his team of 25 at Harvard’s Faculty of Arts and Sciences Research Computing, which includes a dedicated NESE Storage Engineer, lead development, deployment, and operations of NESE for the whole collaboration. NESE is already New England’s largest data lake, with over 20 PB of storage capacity and rapid growth both planned and projected.

An Innovative Data Architecture

NESE doesn’t rely on traditional storage design. Its architects have instead chosen Ceph: an innovative object storage platform that runs on non-proprietary hardware.

By avoiding proprietary enterprise storage solutions, which are expensive and can make high-performance data retrieval difficult, and by eliminating the need for individual research teams or institutions to manage their own storage infrastructure, NESE meets MGHPCC’s storage needs economically and efficiently. It breaks new ground for cost and collaboration at once.

The project design has attracted notice: NESE was launched with funding from the National Science Foundation’s Data Infrastructure Building Blocks (DIBBs) program, which aims to foster data-centric infrastructure that accelerates interdisciplinary and collaborative research in science, engineering, education, and economic development.

In addition, NESE has attracted major industry partners who help the team achieve the goals of both the individual projects and the NSF as a whole. Microway, which designs and builds customized, fully integrated, turn-key computational clusters, servers, and workstations for demanding HPC and AI users, has supplied NESE’s hardware and will continue to partner with NESE as it grows. Red Hat, which develops and commercially supports Ceph, has likewise worked with the NESE team from design and testing through to implementation.

Building NESE

Of course, such a large storage infrastructure presented challenges that had to be met in designing and building out the solution. Building an immense data lake requires knowledgeable project management and partners committed to delivering a solution tailored to research computing users.

First, the research done at MGHPCC and at each of its member institutions is highly diverse in its storage demands. From data types and volumes to retrievability and front-end needs, the NESE team has had to account for many different users in building out the new storage infrastructure. What’s more, the system needed to be easily scalable; while the initial storage capacity is large, the NESE team expects it to grow rapidly over the next several years. Finally, with such a huge volume of data and so many users, the system needed to be resilient to failures, so that outages do not affect huge swaths of data.

With these challenges in hand, the NESE team, including Saul Youssef of Boston University and Scott Yockel, reached out to Microway for help in designing the ideal solution. Yockel and others at Harvard had previously worked with Microway on dense GPU computing solutions. On the strength of that relationship, they tasked Eliot Eshelman, Microway’s Vice President of Strategic Accounts and HPC Initiatives, and the rest of the Microway team with helping them design and deploy the right data storage solution for the NESE project’s unique challenges. The team went through multiple rounds of consultation and design iteration before selecting the final system design.

Originally, Yockel explained, the NESE team was interested in dense, deep hardware systems with 80-90 drives per node. After learning from the extended Ceph community that this kind of configuration could lead to backlogs, failures, and ultimately system outages, they instead selected single-socket, 1U, 12-drive systems. He noted that the smaller, though still dense, systems are far more resilient to complete filesystem failures than the initial design, and still support the storage flexibility that NESE needs.
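
To see why the smaller nodes are more forgiving, consider a rough back-of-the-envelope comparison of the two designs. The node counts and drive size in the sketch below are illustrative assumptions, not NESE’s actual figures; the point is simply how much data a single node failure puts in motion under each layout.

```python
# Rough illustration of why smaller Ceph nodes shrink the failure domain.
# All figures are assumptions for illustration, not NESE's actual numbers.

DRIVE_TB = 12  # assumed capacity per drive, in TB


def node_failure_impact(total_nodes: int, drives_per_node: int) -> None:
    """Show how much raw data one failed node forces Ceph to backfill."""
    cluster_raw_tb = total_nodes * drives_per_node * DRIVE_TB
    lost_raw_tb = drives_per_node * DRIVE_TB
    share = lost_raw_tb / cluster_raw_tb
    print(f"{drives_per_node:>2} drives/node: one node failure affects "
          f"{lost_raw_tb} TB raw ({share:.1%} of the cluster), all of which "
          f"must be re-replicated onto the surviving nodes")


# Dense-and-deep design originally considered: ~85 drives per node
node_failure_impact(total_nodes=15, drives_per_node=85)

# Design actually selected: single-socket, 1U, 12-drive nodes
node_failure_impact(total_nodes=100, drives_per_node=12)
```

Under roughly comparable raw capacity, losing one 12-drive node forces Ceph to backfill about 1% of the cluster rather than roughly 7%, which helps explain the community’s warning about backlogs and cascading outages.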

The Microway team then made this type of hardware available for testing, allowing Youssef and Yockel to validate both the performance and the reliability of the platform before committing their ambitious project to the final architecture.

“Microway understands our particular approach and needs. They provided us quotes that we could use throughout the consortium to gain significant buy-in, and worked with us to iterate design based on Ceph best practices and this project’s specific demands,” Youssef said. “Our relationship with them has been straightforward in terms of purchasing, but the systems we’ve created are really at the edge of what’s possible in data storage.”

The initial NESE deployment comprises five racks, each with space for 36 nodes; as of September 2019, it includes roughly 100 nodes in total. Every node is connected to MGHPCC’s data network via dual 10GbE links and contains high-density storage in a mix of traditional drives and high-speed SSDs.

The net result is over 20 PB of overall capacity, which can expand seamlessly, by as much as 10X, as future needs require.
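
For a sense of how figures like these come together, raw capacity is simply nodes × drives per node × drive size, while usable capacity depends on the replication or erasure-coding scheme the Ceph pools use. The drive size and protection profiles in the sketch below are assumptions for illustration; the article does not publish NESE’s actual settings.

```python
# Back-of-the-envelope capacity math for a Ceph cluster of roughly NESE's size.
# Drive size and protection schemes are assumptions, not published NESE settings.

nodes = 100          # roughly the initial deployment
drives_per_node = 12
drive_tb = 16        # assumed drive capacity, in TB

raw_pb = nodes * drives_per_node * drive_tb / 1000
print(f"Raw capacity: ~{raw_pb:.1f} PB")

# Usable capacity depends on how Ceph protects the data.
for scheme, efficiency in [("3x replication", 1 / 3),
                           ("8+3 erasure coding", 8 / 11)]:
    print(f"Usable with {scheme}: ~{raw_pb * efficiency:.1f} PB")
```

Erasure coding is the usual way large Ceph deployments keep usable capacity close to raw capacity without sacrificing durability.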

The overall solution also provides the diversity of storage that NESE needs, enabling a mix of high-performance, active, and archival storage across users. This allows for cost optimization, while Ceph ensures that all of that data remains easily retrievable, regardless of how a user’s data is stored.
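
Because each tier is ultimately just another Ceph pool, a client retrieves data the same way whether it sits on a fast SSD-backed pool or an archival one. The sketch below uses Ceph’s Python librados bindings to make that concrete; the pool and object names are hypothetical, and the article does not specify which access interfaces (native RADOS, the S3-compatible gateway, or CephFS) NESE exposes to its users.

```python
# Minimal sketch: reading objects from two Ceph pools with identical code,
# whether the pool is SSD-backed or an archival HDD-backed tier.
# Pool and object names are hypothetical, not actual NESE resources.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()


def fetch(pool_name: str, object_name: str) -> bytes:
    """Read an entire object from the given pool."""
    ioctx = cluster.open_ioctx(pool_name)
    try:
        size, _mtime = ioctx.stat(object_name)  # object size in bytes
        return ioctx.read(object_name, size)    # read the full object
    finally:
        ioctx.close()


# The call is the same whether the data lives on a high-performance
# or an archival pool; only the pool name changes.
hot_data = fetch("lab-active-ssd", "genome_run_042.bam")
cold_data = fetch("lab-archive-hdd", "genome_run_001.bam")

cluster.shutdown()
```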

Impact of an Innovative Data Storage Solution

With the implementation of NESE within MGHPCC, Massachusetts data science researchers now have a data storage resource that is large, that can grow over time, and that never forces them to migrate data from one physical storage system to another. The project’s distributed Ceph architecture lets the NESE team add new resources or decommission old ones while the system remains active.

Having data storage managed by a single team within the consortium lowers administrative effort and costs, adds flexibility for backups, and makes it easy to double the storage allocated to a lab or project.

The NESE team has elected to begin (relatively) small, with the 20 PB of storage currently used by a small portion of the consortium’s labs and researchers. Even so, the project has significant buy-in from throughout the MGHPCC consortium. “It’s not unreasonable to expect our storage capacity to grow five-fold in the next few years,” Youssef said.

Harvard’s overall data storage needs alone have grown by 10 PB per year for each of the last four years; other member institutions have seen similarly skyrocketing storage needs. Research is creating vast amounts of data, and the growth isn’t linear. New generations of instrumentation in the life sciences increase data production 5-10x every few years; even the social sciences and humanities, fields that once needed little in the way of data storage, have begun to generate large datasets through new research methodologies and projects such as library digitization.

With such vast amounts of data being generated yearly, cost concerns become more significant too. NESE is in a unique position to provide savings on data storage, thanks to its efficiently run home within the MGHPCC building and its dedicated management team. Youssef estimates that storage within NESE costs roughly 1/6 to 1/10 as much as comparable commercial data storage solutions deployed on campus. Being on the MGHPCC floor also makes high-bandwidth connectivity to the storage affordable. With more competitive costs, NESE is freer to grow and expand into the future. These cost benefits, taken together with trust in Yockel’s operations team, are the basis for NESE’s potential rapid growth: seventy percent of the initial storage has come from external buy-in, and more is expected.

Future Pathways

Though Youssef and Yockel aren’t sure exactly how large NESE will become, they’re certain it will grow, and that it has significant capacity to do so. The current racks were provisioned for more nodes than they currently house, leaving about a third of the space free for buy-in. While the capacity has mostly served Harvard research teams to date, it will be allocated among all of the member universities as shared project space in the future. The initial NESE storage is used mainly for Globus endpoints across the collaboration, storage for laboratories across the Harvard campus, and storage for the Large Hadron Collider project at CERN. With the whole consortium on board, usage is expected to keep expanding.

As NESE grows, it opens the door to cross-institution collaboration that is currently unwieldy, and in some cases unprecedented. Shared data storage makes sharing datasets across research teams and universities far easier, because the challenges of data locality disappear: a researcher at Harvard can simply point a collaborator at BU to a dataset that already lives on the same storage.
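
In practice, “pointing” a collaborator at a dataset can be as simple as handing over a bucket and object name. Ceph includes an S3-compatible object gateway, so if NESE exposes that interface (the article does not say), a BU collaborator could fetch a Harvard lab’s shared dataset with standard S3 tooling; the endpoint, bucket, key, and credentials below are placeholders.

```python
# Hypothetical sketch: a BU collaborator reads a dataset that a Harvard lab
# shared on the same Ceph-backed storage, via Ceph's S3-compatible gateway.
# The endpoint URL, bucket, key, and credentials are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://nese-gateway.example.edu",  # placeholder endpoint
    aws_access_key_id="BU_COLLABORATOR_KEY",
    aws_secret_access_key="BU_COLLABORATOR_SECRET",
)

# The Harvard researcher only shares the bucket and key; no data is copied
# between campuses, because both sides use the same underlying storage.
s3.download_file(
    Bucket="harvard-lab-shared",
    Key="experiments/2019/trial_7/results.parquet",
    Filename="results.parquet",
)
```

Nothing here is NESE-specific configuration; it simply illustrates that once two groups share a storage system, sharing a dataset reduces to sharing a name and the appropriate access rights.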

The effect of such shared data locality could be transformative, opening a pathway to more innovative, collaborative research that spans some of the nation’s top universities.

Universities are not the only avenue for further collaboration that NESE makes possible. Already, Red Hat has conducted large-scale Ceph testing on the NESE system that was impossible on its in-house systems; that performance testing has driven changes to Ceph that were contributed back to the Ceph community. Youssef and Yockel noted that the NESE team is open to finding other such opportunities for industry collaboration as the project expands.

For now, what’s certain is this: NESE will remain at the heart of MGHPCC’s innovative research computing space, and the team will continue to collaborate with Microway on the project’s expansion.