Abstract:
A system and method for performing large-scale data processing using a statistical programming language are disclosed. One or more high-level statistical operations may be received and dynamically translated into a graph of low-level data operations. Unnecessary operations may be removed from the graph, and the remaining operations may be fused or chained together. The operations may then be grouped into distributed data processing operations, and the low-level operations may then be run.
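
The pipeline described above (translate, remove, fuse, group, run) can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the names `Node`, `Graph`, `sum_sq_dev`, `live_chain`, `fuse_maps`, and `run` are hypothetical, the graph is restricted to a single linear chain of element-wise "map" operations ending in a "sum", and the distributed execution is only simulated locally by splitting the data into partitions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(eq=False)
class Node:
    """One low-level data operation (hypothetical structure)."""
    kind: str                                   # "source", "map", or "sum"
    parent: Optional["Node"] = None
    fn: Optional[Callable[[float], float]] = None
    data: Optional[List[float]] = None


class Graph:
    """Records every low-level operation that has been requested."""
    def __init__(self) -> None:
        self.nodes: List[Node] = []

    def add(self, kind, parent=None, fn=None, data=None) -> Node:
        node = Node(kind, parent, fn, data)
        self.nodes.append(node)
        return node


def sum_sq_dev(graph: Graph, src: Node, center: float) -> Node:
    """High-level statistical operation, translated lazily into low-level nodes."""
    centered = graph.add("map", src, fn=lambda x: x - center)
    squared = graph.add("map", centered, fn=lambda x: x * x)
    return graph.add("sum", squared)


def live_chain(output: Node) -> List[Node]:
    """Keep only the nodes the output depends on; all others are unnecessary."""
    chain = []
    node = output
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain[::-1]                          # source first


def fuse_maps(chain: List[Node]) -> List[Node]:
    """Fuse consecutive element-wise maps into a single map stage."""
    fused: List[Node] = []
    for node in chain:
        if node.kind == "map" and fused and fused[-1].kind == "map":
            f, g = fused.pop().fn, node.fn
            fused.append(Node("map", fn=lambda x, f=f, g=g: g(f(x))))
        else:
            fused.append(node)
    return fused


def run(stages: List[Node], num_partitions: int = 2):
    """Simulate a distributed run: map each partition, then combine partial sums."""
    data = stages[0].data
    size = max(1, -(-len(data) // num_partitions))   # ceiling division
    parts = [data[i:i + size] for i in range(0, len(data), size)]
    for stage in stages[1:]:
        if stage.kind == "map":
            parts = [[stage.fn(x) for x in part] for part in parts]
        elif stage.kind == "sum":
            return sum(sum(part) for part in parts)
    return [x for part in parts for x in part]


graph = Graph()
src = graph.add("source", data=[1.0, 2.0, 3.0, 4.0])
unused = graph.add("map", src, fn=abs)          # requested but never consumed
total = sum_sq_dev(graph, src, center=2.5)

chain = live_chain(total)                       # drops the unused map
stages = fuse_maps(chain)                       # two maps become one
result = run(stages, num_partitions=2)
print(len(graph.nodes), len(chain), len(stages), result)  # 5 4 3 5.0
```

In this sketch, five operations are recorded, four survive removal of the unused branch, and fusion leaves three stages: a source, one combined map, and a reduction. The map-then-sum grouping stands in for the distributed data processing operations; a real system would dispatch each partition to a separate worker.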