Facebook recently deployed Facebook Messages, its first ever
user-facing application built on the Apache Hadoop platform.
Apache HBase is a database-like layer built on Hadoop designed
to support billions of messages per day. This paper describes the
reasons why Facebook chose Hadoop and HBase over other
systems such as Apache Cassandra and Voldemort and discusses
the application’s requirements for consistency, availability,
partition tolerance, data model and scalability. We explore the
enhancements made to Hadoop to make it a more effective
realtime system, the trade-offs made while configuring the system, and
how this solution has significant advantages over the sharded MySQL
database scheme used in other applications at Facebook.
Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.
Complete with case studies, the book also illustrates how Hadoop solves specific problems in practice.
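The MapReduce model mentioned above can be illustrated with a toy word count: a map step emits key-value pairs, a shuffle step groups them by key, and a reduce step aggregates each group. This is a minimal single-process sketch of the programming model only, not Hadoop's distributed implementation; all names here are illustrative.

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle step: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

In Hadoop, the map and reduce functions run in parallel across a cluster and the shuffle is performed by the framework; the structure of the computation, however, is exactly this.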
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware. It has many similarities with existing distributed file systems.
However, the differences from other distributed file systems are significant. HDFS is highly
fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high
throughput access to application data and is suitable for applications that have large data sets.
HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
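The design described above rests on two mechanisms: files are split into large fixed-size blocks, and each block is replicated across several datanodes for fault tolerance. The following toy model sketches only that placement idea; the sizes, node names, and round-robin policy are illustrative, not HDFS internals (real defaults are a 128 MB block size and a replication factor of 3).

```python
# Toy model of HDFS-style block placement: a file is split into
# fixed-size blocks, and each block is stored on several datanodes.
BLOCK_SIZE = 4          # bytes per block (tiny, for illustration only)
REPLICATION = 3         # copies kept of each block
DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]  # hypothetical node names

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the raw bytes into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=DATANODES, replication=REPLICATION):
    """Assign each block to `replication` distinct datanodes, round-robin."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hdfs!")   # 11 bytes -> 3 blocks
placement = place_blocks(blocks)
print(len(blocks))    # 3
print(placement[0])   # ['dn1', 'dn2', 'dn3']
```

Because any block survives the loss of all but one of its replicas, the system tolerates datanode failures on low-cost hardware, which is the property the abstract emphasizes.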
There has been a great deal of hype about Amazon's Simple Storage
Service (S3). S3 promises virtually unlimited scalability and high
availability at low cost. Currently, S3 is used mostly to store
multimedia documents (videos, photos, audio) which are shared by a
community of people and rarely updated. The purpose of this paper is
to demonstrate the opportunities and limitations of using S3 as a
storage system for general-purpose database applications which involve
small objects and frequent updates. Read, write, and commit protocols
are presented, and their cost, performance, and consistency properties
are studied.
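The distinction between writes and commits that such protocols draw can be shown with a toy model: clients buffer updates in a pending log, reads see only committed state, and a commit step applies the log to the shared store as a batch. This is a deliberately simplified sketch of that separation, not the paper's actual protocols; all names are illustrative.

```python
class ToyS3Store:
    """Toy model of decoupling writes from commits on a shared store.
    Not the paper's actual protocols -- just the core idea that
    uncommitted updates stay in a client-side log until commit."""

    def __init__(self):
        self.committed = {}   # stands in for objects stored on S3
        self.pending = []     # stands in for a client's pending-update log

    def write(self, key, value):
        # A write only appends to the pending log; nothing is visible yet.
        self.pending.append((key, value))

    def read(self, key):
        # Reads observe the last committed state, never uncommitted writes.
        return self.committed.get(key)

    def commit(self):
        # Commit applies the pending log to the shared store in order.
        for key, value in self.pending:
            self.committed[key] = value
        self.pending.clear()

store = ToyS3Store()
store.write("account", 100)
print(store.read("account"))  # None: written but not yet committed
store.commit()
print(store.read("account"))  # 100
```

The point of the separation is that the frequency and batching of commits becomes a tunable trade-off between cost, performance, and how stale a reader's view may be.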
The emergence of the Cloud system has simplified the deployment of
large-scale distributed systems for software vendors. It provides a
simple and unified interface between vendor and user, allowing vendors
to focus on the software itself rather than on the underlying
framework. Existing Cloud systems seek to improve performance by
increasing parallelism. In this paper, we explore an alternative
solution, proposing an indexing framework for the Cloud system based
on a structured overlay network.
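Structured overlays commonly partition data across nodes with consistent hashing: each node owns an arc of a hash ring, so index entries spread evenly and adding a node relocates only one arc's keys. The sketch below shows that building block only; it is a generic illustration under assumed names, not the paper's indexing framework.

```python
import hashlib
from bisect import bisect_right

def ring_position(name):
    """Hash a string to a deterministic position on a fixed-size ring."""
    return int(hashlib.sha256(name.encode()).hexdigest(), 16) % 2**32

class HashRing:
    """Minimal consistent-hashing ring: a key belongs to the first node
    whose ring position follows the key's position (wrapping around)."""

    def __init__(self, nodes):
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def owner(self, key):
        positions = [p for p, _ in self.ring]
        idx = bisect_right(positions, ring_position(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])  # hypothetical node names
owners = {k: ring.owner(k) for k in ["user:1", "user:2", "order:9"]}
print(owners)  # each key deterministically maps to exactly one node
```

Real systems add virtual nodes per physical node to smooth out the arc sizes; the routing idea is unchanged.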
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.