Data Warehouse Analytics
Latency is the killer of real-time data, which requires a steady flow of more complex reads from the data warehouse. Data warehousing pros are looking for a new solution to a never-ending problem.
Data Warehouse “Bottleneck” in Real-Time Analytics
By Antone Gonsalves InformationWeek
August 18, 2008 02:00 AM
An enterprise data warehouse is often the cornerstone of a business intelligence environment, but a new Forrester Research report says that can make the data store either a key enabler or a painful bottleneck in the delivery of urgent analytics to decision makers.
In a report entitled “Really Urgent Analytics: The Sweet Spot for Real-Time Data Warehousing,” Forrester analyst and report author Jim Kobielus advises intelligence and knowledge management professionals to familiarize themselves with the various approaches for adapting a data warehouse to meet real-time requirements, and, if necessary, consider bypassing the data warehouse altogether.
Unfortunately, real-time is too often an afterthought or a bolt-on to many data-warehouse environments, which are mostly used for accessing historical data sets batch-loaded overnight, or as frequently as every few minutes, but not continuously, Kobielus writes. That’s because most business intelligence use cases are satisfied by such scenarios.
Real-time data, on the other hand, requires a continuous flow of data, and traditional enterprise data warehouse deployments introduce slow data delivery to BI environments, the research firm said. The latency is typically the result of “window-constrained batch operations; large-batch, full-table-load pipeline processing; intermediate data-staging repositories; client-side caching; hardware resource constraints;
Nevertheless, many data-warehousing pros have managed to adapt or optimize their environments to support low-latency BI applications, which usually means end-to-end delays as low as several seconds. While true real-time operations deal in intra-second latencies, the near-real-time flows are often sufficient and avoid the cost of speed-of-light operations, the researcher said.
But if true real-time data delivery is needed, then enterprises should prepare for the cost and complexity of either preparing a data warehouse for the task or using an alternative approach. Here’s Kobielus’ list of real-time scenarios:
–Reducing latency through an enterprise data warehouse. This is achieved through best practices, middleware changes, and/or workload management; trickle-feed extract, transform and load technology; incremental batch; and change data capture.
–Operational data store. This is current, non-persistent, transaction-level data consolidated in a hub for operational reporting.
–Event stream processing. This guarantees sub-second end-to-end latency from source to consumer app or target store through the deployment of an ESP infrastructure. ESP provides the middleware fabric for complex event processing and guarantees intra-second latency on data streams transmitted directly from operational sources to target applications or repositories.
–Data federation. This is near-real-time query/update across a heterogeneous data environment through the semantic layer.
–And information fabric. This is a real-time, in-memory, distributed caching infrastructure embedded in a service-oriented architecture or enterprise service bus for analytic and transactional apps.
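To make the first scenario more concrete, here is a minimal Python sketch of an incremental load driven by a high-water mark, one simple way to move only new rows instead of reloading a full table. It is an illustration, not something taken from the Forrester report: the `orders` and `orders_fact` tables, the `incremental_load` function and the use of in-memory SQLite are all assumptions made for the example. It also only picks up inserts; log-based change data capture products handle updates and deletes as well.

```python
import sqlite3

def incremental_load(source, target, last_seen_id):
    """Copy only rows newer than last_seen_id and return the new high-water mark."""
    rows = source.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    target.executemany(
        "INSERT INTO orders_fact (id, amount) VALUES (?, ?)", rows
    )
    target.commit()
    # If nothing new arrived, the mark stays where it was.
    return rows[-1][0] if rows else last_seen_id

# Hypothetical operational source and warehouse target, both in memory.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders_fact (id INTEGER PRIMARY KEY, amount REAL)")

# First pass: two orders exist, so both are loaded.
source.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (20.0,)])
source.commit()
mark = incremental_load(source, target, 0)

# Second pass: only the order added since the last pass is moved.
source.execute("INSERT INTO orders (amount) VALUES (?)", (30.0,))
source.commit()
mark = incremental_load(source, target, mark)
```

Running a pass like this every few seconds rather than once a night is, roughly, the idea behind trickle-feed loading: each pass is small, so the delay between source and warehouse shrinks.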
Of course, each of the scenarios has its pros and cons, but here’s when each should be considered, according to Forrester.
A real-time data warehouse should be used when an enterprise seeks to extend its existing technology to support real-time refresh in addition to or in lieu of traditional overnight batch. Other reasons to use this approach are to support management of both real-time and historical data through a central hub with consistent quality, governance and transformation; and to minimize the impact of real-time latency on operational systems.
The operational data store approach should be used when enterprises require the ability to provide real-time or near-real-time reporting on current operational data and need to support consolidated reporting of operational data managed in two or more systems of record. The approach also minimizes the impact of real-time refresh on operational systems.
Event stream processing should be used when enterprises need to offer guaranteed, intra-second update/refresh latency without the need for an intermediary persistence node, such as an enterprise data warehouse or operational data store.
ESP also makes sense to provide flexibility to incorporate an intermediary persistence node as a source or target in true real-time scenarios, and to implement a common event processing middleware infrastructure across BI, business activity monitoring, and SOA environments.
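The core idea of processing events in flight, with no intermediary store, can be sketched in a few lines of Python. The example below is a hypothetical illustration rather than any ESP product's API: the `SlidingWindowAverage` class and the sample timestamps are assumptions, and a toy like this offers none of the latency guarantees a real ESP infrastructure is built to provide.

```python
from collections import deque

class SlidingWindowAverage:
    """Running average over events seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def push(self, timestamp, value):
        """Add one event and return the average over the current window."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)

# Hypothetical stream of (seconds, order amount) events.
window = SlidingWindowAverage(window_seconds=1.0)
averages = [
    window.push(0.0, 100.0),
    window.push(0.5, 200.0),
    window.push(1.2, 300.0),  # the event at 0.0 has aged out by now
]
```

Each result is available as soon as the event arrives, because everything is held in memory and nothing is written to a warehouse or operational data store first.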
Data federation should be used when an enterprise is looking to implement a consolidated view, reporting, query and dashboarding of master operational data managed in two or more systems of record. The approach also can provide flexibility in managing master system-of-record data sets in decentralized, distributed, or federated environments, in which line-of-business data owners retain control over respective entities and domains.
Other scenarios in which data federation is useful are to enable a common data quality, governance, and transformation environment across distributed master data sets and to commit write-back updates to operational/transactional source systems, Forrester said.
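A consolidated view across two systems of record can be illustrated with a short Python sketch. Everything named here is an assumption for the example: the `crm` and `billing` databases, their tables, and the `federated_customer_totals` function stand in for real source systems and a real semantic layer, and in-memory SQLite is used only to keep the example self-contained. Write-back to the sources is not shown.

```python
import sqlite3

# Hypothetical system of record #1: customer master data.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
crm.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
)

# Hypothetical system of record #2: invoices.
billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")
billing.executemany(
    "INSERT INTO invoices VALUES (?, ?)", [(1, 50.0), (1, 25.0), (2, 10.0)]
)

def federated_customer_totals():
    """Query both sources at request time and join the results in memory."""
    names = dict(crm.execute("SELECT customer_id, name FROM customers"))
    totals = billing.execute(
        "SELECT customer_id, SUM(amount) FROM invoices GROUP BY customer_id"
    )
    return {names[customer_id]: total for customer_id, total in totals}

report = federated_customer_totals()
```

The data is never copied into a central store; each source keeps ownership of its own records, and the combined view is assembled only when someone asks for it.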
Finally, an information-fabric approach should be used when an enterprise is looking to address many of the scenarios listed under the ESP and data federation approaches, and also needs a “SOA-enabled middleware infrastructure that supports dynamically optimized integration of analytical and transaction functions across all source, target and intermediary nodes,” Forrester said.
Which approach an enterprise adopts should depend on its business imperatives. Forrester offers the following guidelines for implementing real-time analytics under evolving enterprise data warehouse, BI and business optimization strategies:
–Determine the actual urgency of up-to-the-second, direct-from-the-source streams of business intelligence.
–See how far existing EDW, ETL and BI technologies will take you toward satisfying real-time or near-real-time needs.
–Plan a real-time analytics strategy with an eye on end-to-end architectural changes.
–Implement other real-time deployment models where needed. “These other models can represent stepping stones, complements, or extensions to your EDW, rather than stovepipe infrastructures,” Forrester said.
–Understand the implications of real-time analytics on service-level requirements.
–And establish a long-term initiative to scale, accelerate and optimize dependent environments.
“Clearly, you can take measured, incremental steps into real-time analytics, or bet the business on a bleeding-edge futuristic approach that dovetails with your enterprise SOA initiatives,” Forrester said. “The path you take depends on how key real-time responsiveness is to your business strategy.”