MapReduce is a computing paradigm that has gained a lot of attention
in recent years from industry and research. Unlike parallel DBMSs,
MapReduce allows non-expert users to run complex analytical tasks
over very large data sets on very large clusters and clouds. However,
this comes at a price: MapReduce processes tasks in a scan-oriented
fashion. Hence, the performance of Hadoop, an open-source
implementation of MapReduce, often does not match that of a
well-configured parallel DBMS. In this
paper we propose a new type of system named Hadoop++: it boosts
task performance without changing the Hadoop framework at all
(Hadoop does not even ‘notice it’). To reach this goal, rather than
changing a working system (Hadoop), we inject our technology at
the right places through UDFs only and affect Hadoop from inside.
This has three important consequences: First, Hadoop++ significantly
outperforms Hadoop. Second, any future changes to Hadoop
may directly be used with Hadoop++ without rewriting any glue
code. Third, Hadoop++ does not need to change the Hadoop
interface. Our experiments show the superiority of Hadoop++ over
both Hadoop and HadoopDB for tasks related to indexing and join
processing.
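
To illustrate why scan-oriented processing is costly for selective tasks, the following minimal sketch contrasts a full scan of one input split with a lookup over the same split sorted by key. It is not the Hadoop++ implementation and does not use any Hadoop API; the data, the function names (`scan_select`, `build_sorted_index`, `index_select`), and the use of a sorted array as a stand-in for an index are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical toy data: one input "split" of (key, value) records.
records = [(k, f"row-{k}") for k in range(0, 1000, 5)]


def scan_select(split, lo, hi):
    """Scan-oriented access: read every record, keep those with lo <= key <= hi."""
    touched = 0
    out = []
    for key, value in split:
        touched += 1
        if lo <= key <= hi:
            out.append((key, value))
    return out, touched


def build_sorted_index(split):
    """Sort the split by key once (for example at load time) and keep the keys
    in a separate array that can be binary-searched."""
    ordered = sorted(split)
    keys = [key for key, _ in ordered]
    return ordered, keys


def index_select(ordered, keys, lo, hi):
    """Index-style access: binary-search the key array, then read only the
    qualifying slice of records."""
    start = bisect_left(keys, lo)
    end = bisect_right(keys, hi)
    out = ordered[start:end]
    return out, len(out)


scan_result, scan_touched = scan_select(records, 100, 120)
ordered, keys = build_sorted_index(records)
index_result, index_touched = index_select(ordered, keys, 100, 120)

print(scan_touched, index_touched)
```

Both functions return the same records, but the scan reads the whole split while the lookup reads only the qualifying range. In this sketch, the one-time sorting cost is paid when the data is prepared rather than at query time.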
Attachment | Size |
---|---|
Setty2010HadoopMakingaYellowElephantRunLikeaCheetahWithout.pdf | 1.08 MB |