Typing MapReduce by Jens Doerre
MapReduce is a framework for processing large distributed data sets. It is based on the combinators map and reduce from functional programming, which can be strongly typed using parametric polymorphism. Companies with large data centers, such as Google, Yahoo, and Amazon, have made the combination of map and reduce practical for processing very large data sets, typically spread across a cluster of computers. Their MapReduce frameworks are implemented in imperative languages and are quite complex in the interest of efficiency. The price is a partial loss of structure, which increases the danger of undetected programming errors, in particular type errors. The goal of project MapReduceFoundation is to make MapReduce type-safe in an imperative setting and more efficient in a functional setting (where type safety is a given).
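To make the typing of these combinators concrete, here is a minimal sketch in Java generics, written in a Java 5 style without lambdas; the interfaces Fun and Fun2 and the class Combinators are made up for illustration and are not part of any framework.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative function interfaces (Java 5 has no lambdas or
    // java.util.function); the names Fun and Fun2 are hypothetical.
    interface Fun<A, B> { B apply(A a); }
    interface Fun2<A, B, C> { C apply(A a, B b); }

    final class Combinators {
        // map :: (a -> b) -> [a] -> [b], strongly typed via the
        // type parameters A and B (parametric polymorphism).
        static <A, B> List<B> map(Fun<A, B> f, List<A> xs) {
            List<B> ys = new ArrayList<B>();
            for (A x : xs) {
                ys.add(f.apply(x));
            }
            return ys;
        }

        // reduce (a left fold) :: (b -> a -> b) -> b -> [a] -> b
        static <A, B> B reduce(Fun2<B, A, B> f, B z, List<A> xs) {
            B acc = z;
            for (A x : xs) {
                acc = f.apply(acc, x);
            }
            return acc;
        }
    }

Misusing these combinators, for instance reducing a List<String> with a Fun2 over Integer, is rejected at compile time; it is exactly this kind of guarantee that is lost at the phase boundary in Hadoop, as discussed next.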
As a first step, we look into Hadoop, the open-source Java MapReduce framework developed largely at Yahoo. In Hadoop, the connection between the two phases of a MapReduce computation is unsafe: there is no static check of the generic type parameters involved. We provide such a static check for Hadoop programs. To this end, we use strongly typed higher-order functions, which the standard Java 5 type checker verifies together with the rest of the Hadoop program.
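The following sketch illustrates the problem and the remedy using simplified, hypothetical interfaces that only mirror the shape of Hadoop's Mapper and Reducer; it is not Hadoop's actual API. Configuring a job via raw Class objects, in the style Hadoop uses, erases the generic parameters, whereas a typed job description lets the Java 5 type checker verify that the mapper's output types match the reducer's input types.

    import java.util.List;

    // Simplified stand-ins for Hadoop's Mapper, Reducer, and output
    // collector; names and shapes are illustrative only.
    interface Collector<K, V> { void collect(K key, V value); }

    interface Mapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, Collector<K2, V2> out);
    }

    interface Reducer<K2, V2, K3, V3> {
        void reduce(K2 key, List<V2> values, Collector<K3, V3> out);
    }

    // Untyped configuration in the Hadoop style: the generic parameters
    // of the two classes are unrelated, so a mapper that emits
    // (String, Integer) can be combined with a reducer that expects
    // (String, Long) without any compile-time complaint.
    final class UntypedJob {
        Class<? extends Mapper> mapperClass;   // raw types: connection unchecked
        Class<? extends Reducer> reducerClass;
    }

    // Typed job description using strongly typed higher-order values:
    // the shared parameters K2 and V2 force the mapper's output types
    // to match the reducer's input types, so a mismatch is a
    // compile-time error.
    final class TypedJob<K1, V1, K2, V2, K3, V3> {
        final Mapper<K1, V1, K2, V2> mapper;
        final Reducer<K2, V2, K3, V3> reducer;

        TypedJob(Mapper<K1, V1, K2, V2> mapper, Reducer<K2, V2, K3, V3> reducer) {
            this.mapper = mapper;
            this.reducer = reducer;
        }
    }

For example, a TypedJob built from a word-count mapper and a summing reducer only compiles if the key/value types the mapper emits are exactly the ones the reducer consumes; with the untyped configuration, the same mismatch surfaces only as a run-time failure inside the framework.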
In future work, we plan to provide domain-specific debugging and type-checking facilities. To this end, we will design a DSL for MapReduce that will ultimately be able to model workflows of multiple MapReduce computations connected by data flow.