Hadoop has become an attractive platform for large-scale data ana-
lytics. In this paper, we identify a major performance bottleneck of
Hadoop: its lack of ability to colocate related data on the same set
of nodes. To overcome this bottleneck, we introduce CoHadoop,
a lightweight extension of Hadoop that allows applications to con-
trol where data are stored. In contrast to previous approaches, Co-
Hadoop retains the flexibility of Hadoop in that it does not require
users to convert their data to a certain format (e.g., a relational