I/O Bottlenecks in Hyperspace

Big Data IoT Forum | April 6, 2012

I/O Bottlenecks in Hyperspace by Dan Gatti, Big Data I/O Forum

The Johns Hopkins Data-Scope project reminds me of Han Solo and the Wookiee blasting into hyperspace. You can imagine that looking at 6 petabytes of Big Data blasting past at 500 gigabytes per second will mean overcoming some I/O bottlenecks.

From the Datanami article by Nicole Hemsoth: On the topic of data-intensive research hubs, earlier this year we highlighted the work of high performance computing, physics and big data researcher Dr. Alexander Szalay. From his office at Johns Hopkins University, he is funneling a $2 million grant from the NSF into a unique research endeavor called the Data-Scope.

As Szalay describes, this instrument will be both a microscope and telescope for big data. At the core of the project is the system itself, which will boast over 6 petabytes of storage, around 500 gigabytes per second of aggregate sequential I/O, about 20 million IOPS and something in the area of 130 teraflops.
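As a rough back-of-the-envelope illustration (not a figure from the project), the quoted capacity and bandwidth imply how long one full sequential pass over the data would take. The sketch below assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and that the full 500 GB/s aggregate rate is sustained, which real workloads may not achieve. The function name is made up for this example.

```python
# Back-of-the-envelope: time for one full sequential scan of the Data-Scope store.
# Assumptions (not from the article): decimal units, sustained peak bandwidth.
PETABYTE = 10**15   # bytes
GIGABYTE = 10**9    # bytes


def full_scan_seconds(capacity_bytes, bandwidth_bytes_per_s):
    """Seconds needed to stream the whole capacity once at the given rate."""
    return capacity_bytes / bandwidth_bytes_per_s


capacity = 6 * PETABYTE      # "over 6 petabytes of storage"
bandwidth = 500 * GIGABYTE   # "around 500 gigabytes per second"

scan_s = full_scan_seconds(capacity, bandwidth)
print(f"{scan_s:.0f} s ({scan_s / 3600:.2f} h)")  # prints "12000 s (3.33 h)"
```

Under these assumptions, a single pass over the entire collection takes a little over three hours.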

During our interview, he described a system that would be capable of offering “high I/O performance of traditional disk drives with a smaller number of very high throughput SSD drives with high performance GPGPU cards and a 10G Ethernet interconnect. Various vendors (NVIDIA, Intel, OCZ, SolarFlare, Arista) have actively supported our experiments.”

The system, which will be unveiled fully at the end of the year, will offer a digital “multiverse” or database that tracks all astronomical objects known to mankind. Outside of the data curation aspect, the project will allow global astronomers to conduct analyses remotely across the entire database, without forcing them to download the many-terabyte database.

According to OCZ Technology, the success of the project hinges on scientists’ ability to simultaneously build statistical aggregations over petabytes of data, yet explore the smallest aspects of the underlying collections.

The company says that the unique advantage of this system is its ability to function both as a “microscope” and as a “telescope” for data, as well as its storage capacity of 6 petabytes, its 500 gigabytes per second of sequential I/O performance and its 20 million IOPS. In addition to raw bandwidth, SSDs provide a smaller operating footprint than traditional HDDs, greatly reducing power consumption while still delivering the same IOPS performance.
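To make the footprint argument concrete, here is a small sketch of how many drives it would take to reach the 20 million IOPS target on random access alone. The per-drive HDD figure of 150 random IOPS is a hypothetical placeholder, not a number from the article or from any vendor datasheet. The SSD figure is derived from the "over 500x" ratio that OCZ's Schuette cites later in this article. All names are illustrative.

```python
# Rough drive-count comparison for a random-IOPS target.
# HDD_IOPS is a HYPOTHETICAL placeholder; SSD_IOPS applies the article's
# "over 500x" ratio to that placeholder. Neither is a measured figure.
TARGET_IOPS = 20_000_000      # "20 million IOPS" from the article
HDD_IOPS = 150                # assumption: random IOPS of one hard disk
SSD_TO_HDD_RATIO = 500        # "over 500x better than HDDs"
SSD_IOPS = HDD_IOPS * SSD_TO_HDD_RATIO


def drives_needed(target_iops, iops_per_drive):
    """Smallest whole number of drives whose combined IOPS meets the target."""
    return -(-target_iops // iops_per_drive)  # integer ceiling division


hdd_count = drives_needed(TARGET_IOPS, HDD_IOPS)
ssd_count = drives_needed(TARGET_IOPS, SSD_IOPS)
print(f"HDDs: {hdd_count:,}  SSDs: {ssd_count:,}")
```

Whatever per-drive figure is plugged in, the two drive counts differ by roughly the same 500x factor, which is the point of the footprint and power argument.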

Leveraging the benefits of General Purpose Computing on GPUs (GPGPU) for scientific and engineering computing, random access data is streamed directly from SSDs into the co-hosted GPUs over the system backplane. The two major benefits of this architecture are the elimination of access latency by the SSD tier of the storage hierarchy, and the elimination of the network bottleneck by co-locating storage and processing on the same server.
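A quick calculation suggests why co-locating storage and compute matters. The sketch below assumes the 500 GB/s aggregate is spread evenly over 100 servers (the article says "nearly one hundred"), that each server has a single 10G Ethernet link, and that protocol overhead is ignored. These are simplifying assumptions for illustration, not details of the actual deployment.

```python
# Per-server storage bandwidth versus a single 10G Ethernet link.
# Assumptions: even split across servers, one link per server, no protocol overhead.
AGGREGATE_GB_PER_S = 500        # aggregate sequential I/O from the article
SERVERS = 100                   # "nearly one hundred servers", rounded
TEN_GBE_GB_PER_S = 10 / 8       # 10 Gbit/s line rate expressed in GB/s

per_server_gb_per_s = AGGREGATE_GB_PER_S / SERVERS
links_to_match = per_server_gb_per_s / TEN_GBE_GB_PER_S

print(f"local I/O per server: {per_server_gb_per_s} GB/s")
print(f"one 10GbE link:       {TEN_GBE_GB_PER_S} GB/s")
print(f"links needed to match local I/O: {links_to_match}")
```

Under these assumptions each server can read from its local drives about four times faster than one 10G link could deliver the same data, so shipping data over the network to a separate compute node would leave the GPUs waiting.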

Amidst the recognizable list of names Szalay rattled off, including NVIDIA and Intel, a lesser-known HPC company, OCZ Technology Group, was also named. The San Jose company, which caters to the HPC market with its SSD offerings, says that SSDs are capable of delivering a new approach to big data problems in scientific computing. To highlight this, they point to their recent selection as the SSD vendor for the Data-Scope research project at Johns Hopkins University.

According to Michael Schuette, who is working with the leads of the Data-Scope project via his role as VP of Technology Development at OCZ Technology Group, there are solid reasons behind the choice of SSDs to serve big (and complex) data scientific endeavors like the Data-Scope project.

As Schuette told us, the key issue of big data computing is that the data are too big to fit into any standard volatile system memory and they are too random to be served efficiently by a conventional HDD array. In other words, big data computing is comparable to “seeing the forest despite the trees while being able to single out individual parameters of importance”.

He went on to note that “Historical approaches to scientific computing had to either focus on the forest or else on the trees but it was not possible to deliver sufficient amounts of data to keep the computational resources busy, even with standard CPU-based architectures. In that case, only logically coherent data were delivered by the hard disk drives but the decision of what ‘logically coherent’ means was left up to the individual investigator and, by its very nature, it was biased towards integrating new data points into known factual frameworks.”

Schuette feels that the next level of scientific computing, driven by big data, is a revolutionary turnaround from this kind of scientific computing. He points to general-purpose GPU (GPGPU)-based computing as the key to massively parallel processing of large numbers of individual data sets that are part of a superset, regardless of whether those data appear on the surface to be related or not. Schuette explains that “in order to avoid starvation of the computational resources, three tiers of storage need to work together in a local, direct-connected integration.”

He unraveled the issue further, saying that a petabyte-scale storage array holds all of the available raw data and is directly connected to a massive non-volatile second tier comprising SSDs. The SSDs, with a random IOPS capability over 500x better than that of HDDs, allow buffering of any random data entity. This buffer is then locally connected via the system logic (chipset) to the system memory and, by extension, to the CPU and the massively parallel GPU cores.
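The three-tier arrangement described here can be pictured as a read-through hierarchy. The toy model below is only a sketch of that idea and not the Data-Scope software: the class name, the slot counts, and the least-recently-used eviction policy are all assumptions made for illustration, and plain Python dictionaries stand in for the HDD array, the SSD buffer, and system memory.

```python
from collections import OrderedDict


class TieredStore:
    """Toy read-through hierarchy: small DRAM tier, larger SSD tier, HDD archive.

    Hypothetical illustration only; the eviction policy (LRU) is an assumption.
    """

    def __init__(self, archive, dram_slots, ssd_slots):
        self.archive = archive        # stands in for the petabyte-scale HDD array
        self.dram = OrderedDict()     # small, fast, first to forget
        self.ssd = OrderedDict()      # larger non-volatile buffer
        self.dram_slots = dram_slots
        self.ssd_slots = ssd_slots
        self.hits = {"dram": 0, "ssd": 0, "hdd": 0}

    @staticmethod
    def _put(tier, slots, key, value):
        """Insert into a tier, evicting the least recently used entry if full."""
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > slots:
            tier.popitem(last=False)

    def read(self, key):
        """Return the value for key, promoting it toward the faster tiers."""
        if key in self.dram:
            self.hits["dram"] += 1
            self.dram.move_to_end(key)
            return self.dram[key]
        if key in self.ssd:
            self.hits["ssd"] += 1
            value = self.ssd[key]
            self.ssd.move_to_end(key)
        else:
            self.hits["hdd"] += 1
            value = self.archive[key]
            self._put(self.ssd, self.ssd_slots, key, value)
        self._put(self.dram, self.dram_slots, key, value)
        return value


# Usage: item 1 is pushed out of the tiny DRAM tier by items 2 and 3,
# but the second request for it is served by the SSD tier, not the archive.
archive = {i: i * i for i in range(1000)}
store = TieredStore(archive, dram_slots=2, ssd_slots=100)
for key in [1, 2, 3, 1]:
    store.read(key)
print(store.hits)  # prints {'dram': 0, 'ssd': 1, 'hdd': 3}
```

In this toy run the repeat request never goes back to the slow archive, which is the role the SSD buffer plays between the disk array and system memory.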

A similar performance could be achieved in theory by using massive arrays of DRAM, for example DDR3 dual inline memory modules (DIMMs). However, those DRAM-based modules are orders of magnitude more expensive, they use volatile memory technology, meaning that data are lost upon power failure, and with increasing capacity they become more and more inefficient because of refresh overhead, not to mention extremely high power consumption per byte stored.

As Schuette explained:

“This is where NAND flash-based solid state drives are offering the best of both worlds solution in that they are capable of coping with the randomness of the data within each superset. In particular, often enough cross-links between seemingly unconnected data sets show commonalities that are only discovered after digesting the data by the massive parallel GPGPU array, which means that the spatial and temporal locality of the data has been disrupted by thousands of intermediate data sets that were also computed.

Even huge DRAM-arrays would have purged the relevant data aspects by the time the commonalities emerge. However, NAND flash-based SSDs have enough capacity and, more importantly, retain the data so that instant access for additional cross-correlation and analysis using secondary paradigms derived “on the fly” from the data-driven compute approach can be executed immediately and without access penalty. This cascading transition from – figuratively speaking – the forest to randomly distributed anomalies of individual trees can be handled extremely well by SSDs which, therefore, are one of the absolutely critical key elements in big-data computing.”

But let’s bring this all back to the science, specifically, data-intensive science….

During our interview at the beginning of the year, Szalay summed up a number of the challenges facing scientists across disciplines, as well as the core concepts the OCZ Technology lead states above. As he described:

“Scientific computing is increasingly dominated by data. More and more scientific disciplines want to perform numerical simulations of the complex systems under consideration. Once these simulations are run, many of them become a “standard reference”, that others all over the world want to use, or compare against. These large data sets generated on our largest supercomputers must, therefore, be stored and “published”. This creates a whole new suite of challenges, similar in nature to data coming off of the largest telescopes or accelerators. So it is easy to come up with examples of convergence.

The differences are felt by the scientists, who are working on their science problems “in the trenches”, where it is increasingly harder to move and analyze the large data sets they can generate. As a result, they are still building their own storage systems, since the commercially available ones are either too expensive or too slow (or both). We need the equivalent of the Beowulf revolution, a simple, off-the-shelf recipe, based on commodity components, to solve this new class of problems.”

True to form, OCZ says it has helped shape an affordable, powerful computational environment that can be used as a blueprint for future science applications. The company says the JHU project comprises a system of nearly one hundred servers using hundreds of OCZ Deneva 2 SSDs combined with regular hard disk drives with two tiers for storage and computing.
