Five Ways Big Data Can Help HPC Operators Run More Efficiently in the Cloud

March 24, 2021

By Robert Lalonde

When most of us hear the term big data, we tend to think of things like social media platforms, seismic data, or weather modeling. An essential use of big data, however, is to analyze data from complex systems to make them more efficient.

The airline industry is a good example. When the Boeing 787 entered service in 2011, it produced an estimated 500 GB of data per flight. Just three years later, the Airbus A350 entered service with ~6,000 sensors, quadrupling data collection to 2.0 TB per flight, and data collection continues to rise. Fuel and maintenance costs account for ~35% of airline expenses. Even tiny improvements in fuel efficiency and maintenance operations easily justify the cost of collecting, storing, and analyzing all this big data.

HPC Clusters Generate a Vast Amount of Data

Interestingly, the same HPC infrastructure typically used to analyze all this telemetry for airlines and other industries is itself a prolific producer of big data. HPC clusters frequently run millions of jobs per day across thousands of servers, generating enormous amounts of log and event data. Much like an airline, HPC analysts can dramatically improve the efficiency of their operations by studying this data. Cloud spending is the fastest-growing budget line item for most HPC operators. With annual HPC cloud spending projected to double to more than $7.4 billion by 2023, there is a strong focus on optimizing spending in the cloud.

The idea of gathering insights from HPC environments is hardly new. Most HPC workload managers have supported basic reporting for decades. They provide tools to ingest raw log data into SQL databases for ease of reporting and analysis. HPC operators routinely capture high-level metrics related to jobs, users, queues, and software licenses, along with dozens of infrastructure utilization metrics.

HPC Demands Analytic Techniques Pioneered in the Enterprise

As with most problems in analytics, gathering, cleansing, curating, and representing data so that it can be used to make good decisions is surprisingly hard. For example, workload manager and OS logs generally consist of raw, cryptic, time-series data. HPC business analysts want to know things like “What are the top factors driving excessive pending times for jobs associated with a critical project?” or “What groups reserved more resources than they consumed last month, and what was the cost of this unnecessary over-provisioning?” Answering these business-level questions requires careful analysis.
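To make the second question concrete, here is a minimal sketch of how reserved versus consumed resources might be rolled up by group. The record layout, the group names, and the flat per-core-hour rate are all assumptions made for illustration; real workload manager accounting data and cloud pricing are considerably messier.

```python
from collections import defaultdict

# Hypothetical, already-cleansed job records:
# (group, cores_reserved, average_cores_used, runtime_hours)
JOBS = [
    ("cfd", 64, 40.0, 2.0),
    ("cfd", 32, 30.0, 1.0),
    ("genomics", 16, 4.0, 10.0),
]

# Assumed flat rate in dollars per core-hour, for illustration only.
COST_PER_CORE_HOUR = 0.05


def overprovisioning_by_group(jobs, rate):
    """Return {group: (unused_core_hours, estimated_cost)}."""
    unused = defaultdict(float)
    for group, reserved, used, hours in jobs:
        # Only count reserved-but-idle capacity; ignore jobs that used more.
        unused[group] += max(reserved - used, 0.0) * hours
    return {g: (core_hours, core_hours * rate) for g, core_hours in unused.items()}


report = overprovisioning_by_group(JOBS, COST_PER_CORE_HOUR)
for group, (core_hours, cost) in sorted(report.items()):
    print(f"{group}: {core_hours:.1f} unused core-hours, ~${cost:.2f}")
```

The arithmetic is trivial; the hard part, as noted above, is producing clean per-job records like these from raw logs in the first place.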

Like their enterprise brethren, HPC users initially embraced data collection and analytic techniques pioneered in commercial data warehousing. HPC analytics solutions frequently involved complex ETL workflows that would extract, transform, and load data into a carefully crafted SQL database. The schema was designed to efficiently support the queries that business analysts were expected to pose. Analysts would also generate OLAP cubes and use popular BI tools such as Tableau or Power BI. With these familiar tools, they could perform multidimensional analysis and get additional insights from operational data.
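As a rough illustration of the extract, transform, and load pattern, the sketch below parses raw accounting lines, derives pend and run times, and loads them into an in-memory SQLite table that can then be queried per queue. The pipe-delimited log format, the table schema, and the sample values are hypothetical and do not correspond to any particular workload manager.

```python
import sqlite3

# Hypothetical raw accounting lines:
# job_id|username|queue|submit|start|end  (times in epoch seconds)
RAW_LINES = [
    "1001|alice|short|1000|1060|1660",
    "1002|bob|long|1000|1900|5500",
    "bad line",
    "1003|alice|short|1200|1230|1530",
]


def parse(line):
    """Transform one raw line into a typed row, or None if it is malformed."""
    parts = line.strip().split("|")
    if len(parts) != 6:
        return None
    job_id, username, queue, submit, start, end = parts
    try:
        submit, start, end = int(submit), int(start), int(end)
        # Store derived metrics (pend and run seconds) rather than raw timestamps.
        return (int(job_id), username, queue, start - submit, end - start)
    except ValueError:
        return None


def load(lines):
    """Load all well-formed lines into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, username TEXT, "
        "queue TEXT, pend_s INTEGER, run_s INTEGER)"
    )
    rows = [row for row in map(parse, lines) if row is not None]
    conn.executemany("INSERT INTO jobs VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


conn = load(RAW_LINES)
query = "SELECT queue, AVG(pend_s), COUNT(*) FROM jobs GROUP BY queue ORDER BY queue"
for queue, avg_pend, n in conn.execute(query):
    print(f"{queue}: {n} jobs, mean pend {avg_pend:.0f}s")
```

Computing pend and run times at load time reflects the warehouse approach described here: the schema is shaped in advance around the questions analysts are expected to ask.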

As analytic tools have improved, HPC reporting and analytics have come to look more like modern data lakes. Rather than deploy an expensive ETL infrastructure, organizations store data in more cost-efficient distributed file systems, object stores, or key-value stores. Modern BI tools can query these data sources directly. As storage costs have fallen, analysts can keep data indefinitely, rather than purging and aggregating data as they were required to do in the data warehouse.

Below are five ways that a data analytics infrastructure can help large-scale computing users operate more efficiently in the cloud.

  1. Predict resource requirements more accurately. When HPC users submit workloads, they generally include detailed “resource requirements” – things like cores, memory, storage, and GPU capabilities. These requirements are often embedded in scripts or application profiles and not visible to end-users. Application architects frequently err on the side of caution and request more resources than applications need. By analyzing resource requests vs. actual resource consumption over millions of past jobs, organizations can fine-tune resource requests. This helps reduce unnecessary cloud spending. Also, by having applications request only what they need, more workloads can run simultaneously, delivering higher throughput.
  2. Optimize on-premise vs. cloud instance utilization. In hybrid clouds, pay-per-use cloud resources tend to be much more expensive than local infrastructure. Organizations would ideally like to keep on-premise resources fully utilized before spending incrementally in the cloud. By monitoring utilization and workloads both on-premise and in the cloud, and studying historical workload patterns, scheduling and cloud bursting policies can be tailored to maximize the use of on-premise infrastructure. This helps ensure that more expensive cloud resources are used only when necessary.
  3. Align spending to business priorities within fixed budgets. A key challenge for hybrid cloud operators is that data related to cloud spending is typically siloed from workload-related data. In a recent Univa survey, 76% of customers indicated that they had no automated solution to attribute cloud spending by project, application, or department. Accounting is complicated by the variety of billing models offered by cloud providers. By continuously extracting billing-related information from cloud billing APIs and correlating cloud spending with application workloads, HPC administrators can have up-to-date visibility into spending by project, group, and application. They can also leverage the analytic environment to impute a “cost per job” based on historical usage patterns to help avoid overshooting IaaS budgets.
  4. Optimize turnaround time to boost productivity. Optimizing productivity is more complicated than merely reducing workload runtime. Job turnaround time is the metric that matters most to end-users. It is driven by a variety of factors, including the time that jobs pend in queues, data transfer times, the time required to provision instances, pull container images from registries, and extract data from object stores. By mining and analyzing job-related data, administrators can discover the real issues impacting turnaround time, address those barriers, and dramatically improve overall productivity.
  5. Leverage data assets for cloud automation. Historically, data-intensive analytic techniques in HPC have been used for activities such as decision support, capital planning, and showback / chargeback accounting. In this model, BI tools informed decisions made by human analysts. While this remains a critical use case, in large-scale HPC it is not feasible to have humans in the loop for every decision. Workload managers increasingly need to decide in real time where to run jobs, how to move data, and whether to provision or scale back cloud infrastructure. Just as big data can be used to train predictive models to make better decisions, it can play a critical role in cloud automation. HPC operators have the opportunity to leverage data-driven automation to maximize throughput, minimize cloud spending, and get closer to a NoOps infrastructure management model.
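The first item in the list, comparing requested and consumed resources across past jobs, can be sketched in a few lines. The example below is one possible heuristic, not a description of any vendor's method: it suggests a memory request equal to a high percentile of observed peak usage plus some headroom, and never raises a request above its current value. The application names, history values, percentile, and headroom are all assumptions chosen for illustration.

```python
import math

# Hypothetical history per application:
# (currently requested GB, [peak memory in GB observed on past jobs])
HISTORY = {
    "solver": (64, [20, 22, 25, 21, 30, 24, 23, 26, 22, 28]),
    "mesher": (16, [14, 15, 13, 15, 16, 14, 15, 15, 14, 15]),
}


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def recommend(requested_gb, observed_gb, pct=95, headroom_pct=20):
    """Suggest a request: a high percentile of observed peaks plus headroom.

    The result is capped at the current request, so this sketch only ever
    trims over-requests; it does not detect under-provisioned applications.
    """
    base = percentile(observed_gb, pct)
    suggested = math.ceil(base * (100 + headroom_pct) / 100)
    return min(suggested, requested_gb)


for app, (requested, observed) in sorted(HISTORY.items()):
    print(f"{app}: requested {requested} GB, suggested {recommend(requested, observed)} GB")
```

With real data, the history would come from millions of job records rather than ten samples, and the percentile and headroom would be tuned to how costly a failed or rescheduled job is compared with idle reserved capacity.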

This sounds complicated, but it turns out that instrumenting and analyzing data from hybrid clouds is a much more narrowly scoped problem than building an enterprise data lake. Commercial off-the-shelf tools increasingly support these capabilities. Much as aircraft manufacturers offer analytic platforms to help airlines make better decisions from collected data, HPC vendors are increasingly doing the same.

As the use of cloud for HPC continues to grow, having a data infrastructure that supports business planning, as well as operational workload and cloud automation, is critical.
