We demonstrate a powerful and easy-to-use tool called Dedoop (Deduplication with Hadoop) for MapReduce-based entity resolution (ER) of large datasets. Dedoop supports a browser-based specification of complex ER workflows including blocking and matching steps as well as the optional use of machine learning for the automatic generation of match classifiers. Specified workflows are automatically translated into MapReduce jobs for parallel execution on different Hadoop clusters. To achieve high performance Dedoop supports several advanced load balancing strategies.
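To make the blocking and matching steps concrete, the following plain-Java sketch shows the two kinds of functions such an ER workflow chains together; the blocking-key length, field handling, and Jaccard threshold are illustrative assumptions, not Dedoop's API.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only (not Dedoop's API): a blocking key that groups
// candidate records and a similarity test standing in for a (possibly
// learned) match classifier.
public class MatchStepSketch {

    // Blocking: records sharing the same key become candidate pairs.
    static String blockingKey(String name) {
        String n = name.trim().toLowerCase();
        return n.substring(0, Math.min(3, n.length()));
    }

    // Matching: token-based Jaccard similarity against a fixed threshold.
    static boolean matches(String a, String b, double threshold) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> inter = new HashSet<>(ta);
        inter.retainAll(tb);
        int union = ta.size() + tb.size() - inter.size();
        return union > 0 && (double) inter.size() / union >= threshold;
    }

    public static void main(String[] args) {
        String r1 = "John A Smith", r2 = "John Smith";
        if (blockingKey(r1).equals(blockingKey(r2))) {   // same block -> candidate pair
            System.out.println(matches(r1, r2, 0.5));    // prints true (2/3 >= 0.5)
        }
    }
}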
We present an automatic skew mitigation approach for user-defined MapReduce programs and SkewTune, a system that implements this approach as a drop-in replacement for an existing MapReduce implementation. Such a system faces three key challenges: it must (a) require no extra input from the user yet work for all MapReduce applications, (b) be completely transparent, and (c) impose minimal overhead if there is no skew.
The effectiveness and scalability of MapReduce-based implementations of complex data-intensive tasks depend on an even redistribution of data between map and reduce tasks. In the presence of skewed data, sophisticated redistribution approaches thus become necessary to achieve load balancing among all reduce tasks to be executed in parallel. For the complex problem of entity resolution, we propose and evaluate two approaches for such skew handling and load balancing.
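As a rough illustration of such skew handling (a generic key-salting sketch, not the specific strategies proposed in the paper), a mapper can split a known hot blocking key into several sub-keys so that its oversized block is spread over multiple reduce tasks; the hot key, fan-out, and record layout below are assumptions.

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Generic key-salting sketch: a known hot blocking key is rewritten into
// NUM_SPLITS sub-keys so the entities it covers are spread over several
// reduce tasks instead of overloading a single one.
public class SaltingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final int NUM_SPLITS = 4;        // fan-out for hot keys (assumed)
    private static final String HOT_KEY = "smith";  // hypothetical skewed blocking key
    private final Random random = new Random();
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String[] fields = record.toString().split("\t", 2); // blockingKey \t entity
        String key = fields[0];
        if (key.equals(HOT_KEY)) {
            // Spread the oversized block across several reducers; the sub-blocks
            // must later be recombined (or records replicated) so that all pairs
            // of the original block are still compared.
            outKey.set(key + "#" + random.nextInt(NUM_SPLITS));
        } else {
            outKey.set(key);
        }
        context.write(outKey, new Text(fields.length > 1 ? fields[1] : ""));
    }
}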
Facebook recently deployed Facebook Messages, its first ever user-facing application built on the Apache Hadoop platform. Apache HBase is a database-like layer built on Hadoop designed to support billions of messages per day. This paper describes the reasons why Facebook chose Hadoop and HBase over other systems such as Apache Cassandra and Voldemort and discusses the application's requirements for consistency, availability, partition tolerance, data model and scalability. We explore the enhancements made to Hadoop to make it a more effective realtime system.
Hadoop has become an attractive platform for large-scale data analytics. In this paper, we identify a major performance bottleneck of Hadoop: its lack of ability to colocate related data on the same set of nodes. To overcome this bottleneck, we introduce CoHadoop, a lightweight extension of Hadoop that allows applications to control where data are stored. In contrast to previous approaches, CoHadoop retains the flexibility of Hadoop in that it does not require users to convert their data to a certain format (e.g., a relational database).
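The colocation idea can be sketched with a hypothetical locator table: files tagged with the same locator id are meant to have their blocks placed on the same nodes, so related partitions can be processed together without network transfer. The API below is invented for illustration and is not CoHadoop's actual interface.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of data colocation: paths assigned the same
// locator id should end up with their blocks stored on the same datanodes.
public class LocatorTableSketch {
    // locator id -> files whose blocks should be stored together
    private final Map<Integer, List<String>> locatorTable = new HashMap<>();

    public void assign(String hdfsPath, int locator) {
        locatorTable.computeIfAbsent(locator, k -> new ArrayList<>()).add(hdfsPath);
    }

    public List<String> colocatedWith(int locator) {
        return locatorTable.getOrDefault(locator, Collections.emptyList());
    }

    public static void main(String[] args) {
        LocatorTableSketch t = new LocatorTableSketch();
        // Partitions of two datasets that are joined daily share a locator,
        // so a map task can join them locally without a shuffle.
        t.assign("/logs/2011-06-01/part-00000", 1);
        t.assign("/refdata/users/part-00000", 1);
        System.out.println(t.colocatedWith(1));
    }
}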
Near duplicate detection benefits many applications, e.g., on-line news selection over the Web by keyword search. The purpose of this demo is to show the design and implementation of MapDupReducer, a MapReduce based system capable of detecting near duplicates over massive datasets efficiently.
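A generic flavor of MapReduce-based near-duplicate detection (not necessarily MapDupReducer's exact algorithm) is to have each document emit a few signature tokens so that documents sharing a signature meet at the same reducer for exact verification; the prefix length and the lexicographic token order below are simplifying assumptions.

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Signature-based candidate generation: each document emits its first few
// tokens (in a globally consistent order) as keys, so near-duplicate
// candidates are grouped at the same reducer for a precise similarity check.
public class SignatureMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final int PREFIX_LEN = 3; // number of signature tokens (assumed)

    @Override
    protected void map(LongWritable offset, Text doc, Context context)
            throws IOException, InterruptedException {
        String[] tokens = doc.toString().toLowerCase().split("\\W+");
        // Lexicographic order keeps the sketch simple; real prefix filtering
        // usually orders tokens by document frequency.
        String[] sorted = Arrays.stream(tokens).distinct().sorted().toArray(String[]::new);
        for (int i = 0; i < Math.min(PREFIX_LEN, sorted.length); i++) {
            // key = signature token; value = byte offset (stand-in doc id) + text
            context.write(new Text(sorted[i]), new Text(offset + "\t" + doc));
        }
    }
}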
The MapReduce distributed programming framework has become popular, despite evidence that current implementations are inefficient, requiring far more hardware than a traditional relational database to complete similar tasks. MapReduce jobs are amenable to many traditional database query optimizations (B+Trees for selections, column-store-style techniques for projections, etc.), but existing systems do not apply them, substantially because free-form user code obscures the true data operation being performed. For example, a selection in SQL is easily detected, but a selection in a MapReduce program is buried inside arbitrary user code along with the rest of the program logic.
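The point is easy to see in code: the predicate that SQL would express as WHERE price > 100 becomes just another branch inside a user-written map function, invisible to the framework. The column layout in this sketch is assumed.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The selection is a single, analyzable predicate in SQL, but here it is just
// an if-statement buried in free-form user code that no optimizer inspects.
public class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] cols = line.toString().split(",");
        double price = Double.parseDouble(cols[2]);   // assumed column layout
        if (price > 100.0) {                          // the hidden selection
            context.write(new Text(cols[0]), NullWritable.get());
        }
    }
}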
Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate challenges and possible solutions of using the MapReduce programming model for parallel entity resolution. In particular, we propose and evaluate two MapReduce-based implementations for Sorted Neighborhood blocking that either use multiple MapReduce jobs or apply a tailored data replication.
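A simplified sketch of the reduce-side sliding window used by Sorted Neighborhood is shown below; it compares each record with the previous w-1 records seen by the same reduce task and deliberately ignores windows that span task boundaries, which is precisely what the paper's multi-job and data-replication variants address. The window size and record format are assumptions.

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Records arrive sorted by blocking key; each record is compared with the
// previous WINDOW-1 records seen by this reduce task. Windows crossing
// reduce-task boundaries are not handled in this sketch.
public class SortedNeighborhoodReducer extends Reducer<Text, Text, Text, Text> {
    private static final int WINDOW = 5;              // window size w (assumed)
    private final Deque<String> window = new ArrayDeque<>();

    @Override
    protected void reduce(Text blockingKey, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        for (Text record : records) {
            String current = record.toString();
            for (String previous : window) {
                // Emit each candidate pair; a match function would score it here.
                context.write(new Text(previous), new Text(current));
            }
            window.addLast(current);
            if (window.size() >= WINDOW) {
                window.removeFirst();                 // keep only the last w-1 records
            }
        }
    }
}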
Large-scale data analysis lies at the core of modern enterprises and scientific research. With the emergence of cloud computing, the use of an analytical query processing infrastructure (e.g., Amazon EC2) can be directly mapped to monetary value. MapReduce has been a popular framework in the context of cloud computing, designed to serve long running queries (jobs) which can be processed in batch mode. Taking into account that different jobs often perform similar work, there are many opportunities for sharing. In principle, sharing similar work reduces the overall amount of work and thus the monetary cost of using the processing infrastructure.
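One simple form of such sharing is a shared scan: two logical queries over the same input are merged into a single job whose mapper tags each output with the query it belongs to, so the data is read and parsed only once. The sketch below illustrates this idea only; it is not MRShare's cost-based plan merging, and the input layout is assumed.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Two logical queries over the same log -- hits per user and hits per URL --
// are served by one job. A tag prefixed to each key routes the record to the
// right logical query downstream, so the input is scanned once instead of twice.
public class SharedScanMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] cols = line.toString().split("\t"); // assumed layout: user \t url \t ...
        if (cols.length < 2) return;
        context.write(new Text("Q1|" + cols[0]), ONE); // query 1: hits per user
        context.write(new Text("Q2|" + cols[1]), ONE); // query 2: hits per URL
    }
}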
Large-scale data analysis has become increasingly important for many enterprises. Recently, a new distributed computing paradigm, called MapReduce, and its open source implementation Hadoop, has been widely adopted due to its impressive scalability and flexibility to handle structured as well as unstructured data. In this paper, we describe our data warehouse system, called Cheetah, built on top of MapReduce. Cheetah is designed specifically for our online advertising application to allow various simplifications and custom optimizations. First, we take a fresh look at the data warehouse schema design.