Hadoop: Metadata Scalability

June 5, 2011

As the cluster grows, so does its metadata, all of which the NameNode keeps in memory: roughly one gigabyte of memory is needed for every petabyte of storage.
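
To see how that rule of thumb arises, here is a back-of-the-envelope sketch in Python. The per-object figure of roughly 150 bytes of NameNode heap per file, directory, or block is a commonly cited estimate, and the block and file sizes below are assumptions chosen for illustration, not measurements from any particular cluster.

    # Rough NameNode heap estimate -- a sketch, not a sizing tool.
    # Assumes ~150 bytes of heap per namespace object (file, directory, or
    # block), a commonly cited rule of thumb; real usage varies by version.

    BYTES_PER_OBJECT = 150        # assumed heap cost per namespace object
    BLOCK_SIZE = 256 * 1024**2    # assumed HDFS block size: 256 MB
    AVG_FILE_SIZE = 1024**3       # assumed average file size: 1 GB

    def namenode_heap_bytes(total_storage_bytes):
        """Estimate the heap needed to track total_storage_bytes of data."""
        files = total_storage_bytes / AVG_FILE_SIZE
        blocks = total_storage_bytes / BLOCK_SIZE
        return (files + blocks) * BYTES_PER_OBJECT

    one_pb = 1024**5
    print("Heap for 1 PB: ~%.0f MB" % (namenode_heap_bytes(one_pb) / 1024**2))

With these assumptions a petabyte works out to roughly 750 MB of heap, in the same ballpark as the one-gigabyte figure; smaller files or smaller blocks push the object count, and therefore the memory requirement, up quickly.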

Discovering data across different systems becomes a significant challenge.

The Road Ahead, as described by Anil Madan, eBay's Director of Engineering, Analytics Platform Development:

“Here are some of the challenges we are working on as we build out our infrastructure:

Scalability
In its current incarnation, the master server NameNode has scalability issues. As the file system of the cluster grows, so does the memory footprint as it keeps the entire metadata in memory. For 1 PB of storage approximately 1 GB of memory is needed. Possible solutions are hierarchical namespace partitioning or leveraging Zookeeper in conjunction with HBase for metadata management.

Availability
NameNode’s availability is critical for production workloads. The open source community is working on several cold, warm, and hot standby options like Checkpoint and Backup nodes; Avatar nodes switching avatar from the Secondary NameNode; journal metadata replication techniques. We are evaluating these to build our production clusters.

Data Discovery
Support data stewardship, discovery, and schema management on top of a system which inherently does not support structure. A new project is proposing to combine Hive’s metadata store and Owl into a new system, called Howl. Our effort is to tie this into our analytics platform so that our users can easily discover data across the different data systems.

Data Movement
We are working on publish/subscription data movement tools to support data copy and reconciliation across our different subsystems like the Data Warehouse and HDFS.

Policies
Enable good Retention, Archival, and Backup policies with storage capacity management through quotas (the current Hadoop quotas need some work). We are working on defining these across our different clusters based on the workload and the characteristics of the clusters.

Metrics, Metrics, Metrics
We are building robust tools which generate metrics for data sourcing, consumption, budgeting, and utilization. The existing metrics exposed by some of the Hadoop enterprise servers are either not enough, or transient which make patterns of cluster usage hard to see.”
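
To make the quota mechanism mentioned under Policies above a little more concrete, here is a minimal sketch that wraps the standard "hadoop fs -count -q" command and summarizes name- and space-quota usage for a few directories. The directory paths are hypothetical, and the field positions follow that command's usual output (name quota, remaining name quota, space quota, remaining space quota, directory count, file count, content size, path).

    # Minimal quota/usage report -- a sketch wrapping "hadoop fs -count -q".
    # Directory names below are hypothetical examples.
    import subprocess

    DIRS = ["/user/analytics", "/data/warehouse"]

    def quota_report(path):
        out = subprocess.check_output(["hadoop", "fs", "-count", "-q", path])
        fields = out.decode().split()
        name_quota, name_remaining, space_quota, space_remaining = fields[:4]
        bytes_used = fields[6]               # CONTENT_SIZE column
        return {
            "path": path,
            "name_quota": name_quota,        # "none" when no name quota is set
            "name_remaining": name_remaining,
            "space_quota": space_quota,
            "space_remaining": space_remaining,
            "bytes_used": bytes_used,
        }

    if __name__ == "__main__":
        for d in DIRS:
            print(quota_report(d))

Quotas themselves are set by an administrator with hadoop dfsadmin -setQuota and hadoop dfsadmin -setSpaceQuota; reports like the one above can feed the capacity-management and utilization metrics the post calls for.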
