Programming Languages for Pipelined Analytics

Mon 19 October 2015 by Jeff Fischer

At my startup, we do a lot of "pipelined analytics" — gathering and processing of data across multiple processes over multiple stages. Our current system uses Python, with some C extensions. We've been asked in the past, "Why Python?" The answer is partly developer productivity and partly the nature of the problems we are solving. In particular, our system is generally I/O bound, on either the source side or the data store where the final result is kept. Thus, the raw performance of the language is not so critical to our application.

I thought it would be interesting to compare the suitability of different languages for pipelined analytics. To illustrate some of the differences, I will use a concrete problem: walking a file tree and obtaining the attributes of each file and directory in it. This involves a mixture of light I/O and computation. While self-contained, it highlights several issues, such as control flow, error handling, system interfaces, and concurrency.

Outline

I will look at Python (2.7), Java, OCaml, and Go. All of the code is available on GitHub: this folder in my blog-examples project contains a README file and implementations in each of these languages.

The analysis will be spread over four posts:

  1. This post, which provides an introduction
  2. Commentary on the Python and Java implementations
  3. Commentary on the Go and OCaml implementations
  4. Performance results and conclusions

In the future, I may also consider other languages, such as C/C++, Scala, Rust and Python 3, as well as multi-threaded performance.

What does a pipelined analytics component look like?

For our purposes, each step in an analytics pipeline consists of a block of code that must be deployed and started on some machine in a cluster. Once running, it continuously processes data from some number of inputs (either a data source or another step) and writes the results to some set of outputs (either another step or a data sink). Each data "record" is usually small, but many may be batched together for efficiency. When passed between steps, records may be serialized to a string-based representation like JSON or to a schema-based binary representation such as Apache Thrift or Google's Protocol Buffers. In our case, the throughput of a step is more important than its latency, but that is not always the case.
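
As a rough illustration (a minimal sketch, not code from our system), such a step in Python might read line-delimited JSON records on stdin, process them in batches, and emit them on stdout. The batch size and the process() transformation here are made up for this example:

    import json
    import sys

    BATCH_SIZE = 100  # individual records are small, so group them for efficiency

    def process(record):
        # Placeholder transformation; a real step would do actual work here.
        record["processed"] = True
        return record

    def flush(batch, outfile):
        for record in batch:
            outfile.write(json.dumps(process(record)) + "\n")

    def run_step(infile, outfile):
        batch = []
        for line in infile:
            batch.append(json.loads(line))
            if len(batch) >= BATCH_SIZE:
                flush(batch, outfile)
                batch = []
        if batch:
            flush(batch, outfile)  # emit any final partial batch

    if __name__ == "__main__":
        run_step(sys.stdin, sys.stdout)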

Some impacts of language choice on these components include:

  • Raw performance - how quickly can the component obtain its input data, process it, and output the results? This involves I/O capability, compute efficiency, and the cost of converting to/from serialized representations.
  • Expressiveness - often, developer productivity is as important as the language's inherent performance. Developer productivity not only helps get features implemented sooner, but also facilitates experimentation with the algorithms used to solve a given problem. This may have more impact on performance than low-level optimizations.
  • Error handling and correctness - real data can be messy, and you want good support for dealing with unexpected conditions. It is also desirable to have a strong static type system to avoid programming errors, as long as it does not sacrifice too much in the way of expressiveness.
  • Ecosystem - how good is the standard library and what is available from third parties? Can you interface with important external systems such as messaging queues, databases, and storage (e.g. HDFS)?
  • Deployment - how hard is it to get your component, with all its dependencies, repeatably deployed on a remote server in your cluster?
  • Concurrency - what are the options for concurrency in the language? Do they scale, or are they hobbled by things like global locks? Some problems can be easily solved using loosely-coupled message-passing concurrency, while others require close coordination and are best solved using shared-state concurrency models (multi-threading). A small message-passing sketch follows this list.
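
To make the message-passing point concrete, here is a minimal sketch in Python: two loosely-coupled processes communicating over a queue via the standard multiprocessing module, which sidesteps CPython's global interpreter lock. The record format is invented for illustration:

    from multiprocessing import Process, Queue

    def producer(out_q):
        # Emit a handful of small records, then a sentinel to signal completion.
        for i in range(10):
            out_q.put({"id": i})
        out_q.put(None)

    def consumer(in_q):
        # Pull records until the sentinel arrives; no shared state is touched.
        while True:
            record = in_q.get()
            if record is None:
                break
            print("processing record %d" % record["id"])

    if __name__ == "__main__":
        q = Queue()
        workers = [Process(target=producer, args=(q,)),
                   Process(target=consumer, args=(q,))]
        for w in workers:
            w.start()
        for w in workers:
            w.join()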

The File Tree Walker

Let us now look at the example program we will implement in the four languages. A more complete specification for our file tree walker may be found here.

An implementation consists of a walk() function that does the work, plus the surrounding code to process command line arguments, check for errors, and then call walk() with the appropriate arguments.

The walk() function takes as parameters an absolute root path from which the traversal starts and a set of three callbacks: one for directories, one for files, and one for errors. For each file or subdirectory in a directory, it obtains the POSIX file attributes (e.g. via lstat) and then calls the associated callback, passing it the file path and the file attributes. For each subdirectory, it then walks into that directory. If an error is encountered, the error callback is called. Upon completion of the traversal, walk() returns the count of directories encountered (including the root), the count of files encountered (including any symbolic links and special files), and the count of errors encountered.
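
In Python, that contract might be sketched as follows; the callback and parameter names here are my own invention, not taken from the specification:

    def walk(root, on_dir, on_file, on_error):
        """Traverse the tree rooted at the absolute path `root`.

        on_dir(path, attrs)   -- called for each directory, including the root
        on_file(path, attrs)  -- called for each file, symlink, or special file
        on_error(path, exc)   -- called when an OS call fails

        Returns a tuple (num_dirs, num_files, num_errors).
        """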

To handle especially deep file trees, the file hierarchy must not be stored on the program stack. This means that implementations in languages that do not perform tail call optimization (TCO) must not use recursion and should instead track the traversal state in an explicit, heap-allocated data structure.
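
For example, a Python version of walk() could keep the directories still to be visited in an explicit queue rather than recursing. This is a sketch consistent with the interface above, not the implementation from the repository:

    import os
    import stat
    from collections import deque

    def walk(root, on_dir, on_file, on_error):
        dirs = files = errors = 0
        # Directories still to be visited live on the heap, not the call
        # stack, so arbitrarily deep trees cannot overflow the stack.
        pending = deque()
        try:
            pending.append((root, os.lstat(root)))
        except OSError as e:
            on_error(root, e)
            return (0, 0, 1)
        while pending:
            dirpath, attrs = pending.popleft()
            on_dir(dirpath, attrs)
            dirs += 1
            try:
                entries = os.listdir(dirpath)
            except OSError as e:
                on_error(dirpath, e)
                errors += 1
                continue
            for name in entries:
                path = os.path.join(dirpath, name)
                try:
                    attrs = os.lstat(path)
                except OSError as e:
                    on_error(path, e)
                    errors += 1
                    continue
                if stat.S_ISDIR(attrs.st_mode):
                    pending.append((path, attrs))  # walk into it later
                else:
                    on_file(path, attrs)  # files, symlinks, special files
                    files += 1
        return (dirs, files, errors)

Note that lstat() does not follow symbolic links, so a symlink to a directory is reported to on_file rather than walked into, matching the specification's file count.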

The main walker program should accept two command line arguments: the file path from which to start the traversal and an optional number of iterations. The iterations parameter should default to one. If it is greater than one, the traversal is repeated that many times, allowing for more accurate performance measurements when the runtime of a single traversal is small. For the purposes of our tests, the main program will just pass empty callbacks to the walk() function.
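
Putting it together, a driver might look like the following sketch (the walker module name is hypothetical; it refers to the walk() sketched above):

    import sys
    from walker import walk  # the walk() sketched above; module name is hypothetical

    def noop(path, info):
        pass  # empty callback: the tests measure traversal, not processing

    def main(argv):
        if len(argv) < 2 or len(argv) > 3:
            sys.stderr.write("Usage: walker ROOT_PATH [ITERATIONS]\n")
            return 1
        root = argv[1]
        iterations = int(argv[2]) if len(argv) == 3 else 1
        for _ in range(iterations):
            num_dirs, num_files, num_errors = walk(root, noop, noop, noop)
        print("%d directories, %d files, %d errors" % (num_dirs, num_files, num_errors))
        return 0

    if __name__ == "__main__":
        sys.exit(main(sys.argv))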

Next up: Python and Java

That's all for the introduction. Thanks for reading this far! In Part 2, we will look in more detail at Python and Java. To know when the next part is ready, please follow me on Twitter.