Facebook recently deployed Facebook Messages, its first ever
user-facing application built on the Apache Hadoop platform.
Apache HBase is a database-like layer built on Hadoop designed
to support billions of messages per day. This paper describes the
reasons why Facebook chose Hadoop and HBase over other
systems such as Apache Cassandra and Voldemort and discusses
the application’s requirements for consistency, availability,
partition tolerance, data model and scalability. We explore the
enhancements made to Hadoop to make it a more effective realtime system.
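For concreteness, a minimal sketch of what storing and fetching a message through the HBase client API of that era can look like; the "messages" table, "m" column family, and row-key scheme are illustrative assumptions, not the schema the paper describes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MessageStoreSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table and column family names, for illustration only.
        HTable table = new HTable(conf, "messages");

        // Row key: user id plus an inverted timestamp, a common trick so a
        // user's newest messages sort first in a scan.
        byte[] rowKey = Bytes.toBytes(
                "user42:" + (Long.MAX_VALUE - System.currentTimeMillis()));

        Put put = new Put(rowKey);
        put.add(Bytes.toBytes("m"), Bytes.toBytes("body"), Bytes.toBytes("hello"));
        table.put(put);

        Get get = new Get(rowKey);
        Result result = table.get(get);
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("m"), Bytes.toBytes("body"))));
        table.close();
    }
}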
Large-scale data analysis has become increasingly important for many enterprises. Recently, a new distributed computing paradigm, called MapReduce, and its open source implementation Hadoop, have been widely adopted due to their impressive scalability and flexibility to handle structured as well as unstructured data. In this paper, we describe our data warehouse system, called Cheetah, built on top of MapReduce. Cheetah is designed specifically for our online advertising application to allow various simplifications and custom optimizations. First, we take a fresh look at the data warehouse schema design.
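To ground the MapReduce paradigm the abstract refers to, here is a minimal Hadoop job in the spirit of the advertising workload: it sums impressions per ad id. The tab-separated input layout and field names are assumptions for illustration, not Cheetah's actual format:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ImpressionCount {
    // Assumed input: one log line per record, tab-separated as
    // adId \t impressions \t ...
    public static class ImpressionMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            ctx.write(new Text(fields[0]),
                      new LongWritable(Long.parseLong(fields[1])));
        }
    }

    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text adId, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            ctx.write(adId, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "impression-count");
        job.setJarByClass(ImpressionCount.class);
        job.setMapperClass(ImpressionMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}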
Replication is a widely used method for achieving high availability in database systems. Due to the nondeterminism inherent in traditional concurrency control schemes, however, special care must be taken to ensure that replicas don’t
diverge. Log shipping, eager commit protocols, and lazy synchronization protocols are well-understood methods for
safely replicating databases, but each comes with its own cost in availability, performance, or consistency.
In this paper, we propose a distributed database system which combines a simple deadlock avoidance technique with concurrency control schemes that guarantee equivalence to a predetermined serial ordering of transactions.
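A minimal sketch of the deadlock-avoidance idea in isolation, assuming every replica consumes the same totally ordered transaction log: acquiring per-record locks in a canonical (sorted) key order makes deadlock impossible, and applying the same log deterministically keeps replicas from diverging. This illustrates the general technique, not the paper's actual system:

import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class DeterministicExecutor {
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();
    private final Map<String, String> store = new ConcurrentHashMap<>();

    // Each replica calls this with the identical, pre-agreed log,
    // so all replicas make identical state transitions.
    public void execute(List<Txn> orderedLog) {
        for (Txn txn : orderedLog) {
            // Sorting the write set yields a global lock-acquisition
            // order: the classical deadlock-avoidance trick.
            TreeSet<String> keys = new TreeSet<>(txn.writes.keySet());
            for (String k : keys) lockFor(k).lock();
            try {
                store.putAll(txn.writes); // deterministic effect
            } finally {
                for (String k : keys) lockFor(k).unlock();
            }
        }
        // A real engine would hand transactions to worker threads as soon
        // as their locks are granted; they run inline here for brevity.
    }

    private ReentrantLock lockFor(String key) {
        return locks.computeIfAbsent(key, k -> new ReentrantLock());
    }

    public static class Txn {
        final Map<String, String> writes;
        public Txn(Map<String, String> writes) { this.writes = writes; }
    }
}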
Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.
Complete with case studies that illustrate how Hadoop solves specific problems, this book helps you:
We have been using HBase for around a year in our development work and projects, from 0.17.x to 0.19.x. We, like the rest of the community, are well aware of the critical performance and reliability issues of these releases.
Now, the great news is that HBase-0.20.0 will be released soon. Jonathan Gray from Streamy, Ryan Rawson from StumbleUpon, Michael Stack from Powerset/Microsoft, Jean-Daniel Cryans from OpenPlaces, and other contributors have done a great job redesigning and rewriting many of HBase's core components.
As a document-oriented database for the web, CouchDB already differs fundamentally from classical relational databases. It relies consistently on the popular MapReduce algorithm and on Internet standards such as the JSON interchange format and the REST protocol. In this article we discuss the background of what a highly scalable data architecture for the web could look like today, and how we can realize one using CouchDB as an example.
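As a small illustration of the REST-plus-JSON interface described above: a CouchDB document lives at a URL, is created with an HTTP PUT, and is fetched with a GET. The host, database name, and document id below are placeholders (5984 is CouchDB's default port):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class CouchSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical database "demo" and document id "message-1".
        URL doc = new URL("http://localhost:5984/demo/message-1");

        // Create the document with a PUT of a JSON body.
        HttpURLConnection put = (HttpURLConnection) doc.openConnection();
        put.setRequestMethod("PUT");
        put.setDoOutput(true);
        put.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = put.getOutputStream()) {
            out.write("{\"body\": \"hello\"}".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("PUT status: " + put.getResponseCode());

        // Fetch it back with a GET; the response is the JSON document.
        HttpURLConnection get = (HttpURLConnection) doc.openConnection();
        try (InputStream in = get.getInputStream()) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}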
The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop [3] is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse. In this paper, we present Hive, an open-source data warehousing solution built on top of Hadoop.
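To illustrate the level of abstraction Hive adds over hand-written map-reduce programs, here is a sketch of running a HiveQL aggregation through Hive's JDBC driver. The connection URL, table, and columns are illustrative, and the driver shown is the later HiveServer2 interface, not something the paper itself specifies:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveSketch {
    public static void main(String[] args) throws Exception {
        // Registers the HiveServer2 JDBC driver (auto-loaded on modern JVMs
        // when hive-jdbc is on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        try (Statement stmt = conn.createStatement();
             // Hive compiles this declarative query into map-reduce jobs.
             ResultSet rs = stmt.executeQuery(
                 "SELECT ad_id, SUM(impressions) FROM ad_logs GROUP BY ad_id")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
        conn.close();
    }
}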
There has been a great deal of hype about Amazon's Simple Storage Service (S3). S3 promises infinite scalability and high availability at low cost. Currently, S3 is used mostly to store multi-media documents (videos, photos, audio) which are shared by a community of people and rarely updated. The purpose of this paper is to demonstrate the opportunities and limitations of using S3 as a storage system for general-purpose database applications which involve small objects and frequent updates. Read, write, and commit protocols are presented.
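A hedged sketch of the setting, using the AWS SDK for Java: each S3 object is treated as one database page, read and rewritten in full (the bucket and key names are placeholders). The protocol questions the paper studies start exactly here, since plain last-writer-wins puts give no transactional guarantees on their own:

import java.io.ByteArrayInputStream;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.S3Object;

public class S3PageStore {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Overwrites the whole page object; concurrent writers silently
    // race, which is why a commit protocol is needed on top.
    public void writePage(String key, byte[] page) {
        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentLength(page.length);
        s3.putObject("my-db-bucket", key, new ByteArrayInputStream(page), meta);
    }

    // Fetches the full page object; a reader may observe a stale
    // version under S3's (then) eventual consistency.
    public byte[] readPage(String key) throws Exception {
        try (S3Object obj = s3.getObject("my-db-bucket", key)) {
            return obj.getObjectContent().readAllBytes();
        }
    }
}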