Traditionally, the goal of benchmarking a software system is to evaluate its performance under a particular workload for a fixed configuration. The most prominent examples for evaluating transactional database systems, as well as other components on top (such as application servers or web servers), are the various TPC benchmarks.
Hadoop has become an attractive platform for large-scale data analytics.
In this paper, we identify a major performance bottleneck of
Hadoop: its inability to colocate related data on the same set
of nodes. To overcome this bottleneck, we introduce CoHadoop,
a lightweight extension of Hadoop that allows applications to control
where data are stored. In contrast to previous approaches, CoHadoop
retains the flexibility of Hadoop in that it does not require
users to convert their data to a certain format (e.g., a relational
One of the main reasons cloud computing has gained
so much popularity is its ease of use and its ability
to scale computing resources on demand. As a result, users
can now rent computing nodes on large commercial clusters
through several vendors, such as Amazon and Rackspace.
However, despite the attention paid by Cloud providers,
performance unpredictability is a major issue in Cloud computing
for (1) database researchers performing wall-clock experiments,
and (2) database applications providing service-level
agreements. In this paper, we carry out a study of the
Today, growing datasets require new technologies, as standard technologies
such as parallel DBMSs do not easily scale to this
level. On the one hand, there is the MapReduce paradigm, allowing
non-expert users to easily define large distributed jobs. On the
other hand, there is Cloud Computing, providing a pay-as-you-go
infrastructure for such computations. This PhD project aims at improving
the combination of both technologies, especially for the
following issues: (i) predictability of performance, (ii) runtime op-
Cloud computing provides services to potentially numerous remote users with diverse requirements. Although
predictable performance can be obtained through the provision of carefully delimited services,
it is straightforward to identify applications in which a cloud might usefully host services that support
the composition of more primitive analysis services or the evaluation of complex data analysis requests.
In such settings, a service provider must manage complex and unpredictable workloads. This paper
Cloud computing is Internet-based system development in which large scalable computing resources
are provided “as a service” over the Internet to users. The concept of cloud computing incorporates
web infrastructure, software as a service (SaaS), Web 2.0 and other emerging technologies, and has
attracted more and more attention from industry and the research community. In this paper, we describe our
experience and lessons learnt in the construction of a cloud computing platform. Specifically, we design a
Cloud computing is an increasingly popular paradigm for accessing computing resources. A popular
class of computing clouds is Infrastructure as a Service (IaaS) clouds, exemplified by Amazon’s Elastic
Compute Cloud (EC2). In these clouds, users are given access to virtual machines on which they can
install and run arbitrary software, including database systems. Users can also deploy database appliances,
which are virtual machines with pre-installed, pre-configured database systems, on these clouds.
With the tremendous growth in the volume of semi-structured and unstructured content within enterprises
(e.g., email archives, customer support databases, etc.), there is increasing interest in harnessing this
content to power search and business intelligence applications. Traditional enterprise infrastructure
for analytics is geared towards analytics on structured data (in support of OLAP-driven reporting and
analysis) and is not designed to meet the demands of large-scale compute-intensive analytics over semi-
Yahoo! is building a set of scalable, highly available data storage and processing services, and deploying
them in a cloud model to make application development and ongoing maintenance significantly
easier. In this paper we discuss the vision and requirements, as well as the components that will go into
the cloud. We highlight the challenges and research questions that arise from trying to build a comprehensive
web-scale cloud infrastructure, emphasizing data storage and processing capabilities. (The