This paper describes Nova, a workflow manager developed and
deployed at Yahoo that pushes continually arriving data
through graphs of Pig programs executing on Hadoop clusters.
(Pig is a structured dataflow language and runtime for the
Hadoop map-reduce system.)
Nova is like data stream managers in its support for
stateful incremental processing, but unlike them in that it
deals with data in large batches using disk-based processing.
Batched incremental processing is a good fit for a large
fraction of Yahoo’s data processing use cases, which deal with