Programming Languages for Pipelined Analytics: Python and Java

Tue 10 November 2015 by Jeff Fischer

This is part 2 of my series on pipelined analytics language choices. In this part, we will look in more detail at Python and Java. Here is the outline of the entire series:

Introduction
Commentary on the Python and Java implementations (this post)
Commentary on the Go and OCaml implementations
Performance results and conclusions

Python

Python is a dynamic-typed language with support for object-oriented and procedural programming styles, with some support for a functional style as well. The most widely-used implementation of Python is CPython, a byte code interpreter. For performance-critical code or external library integration, CPython can call out to C libraries. There are two (incompatible) versions of python: 2.x and 3.x. Python 3 was released in December of 2008, but the majority of Python code in use today is still based on 2.

Like most dynamic languages, Python provides automatic memory management. Instead of using garbage collection, the CPython imlementation uses reference counting, with garbage collection for cyclic data structures. This reduces the need to pause the entire system to reclaim memory and may have better performance for certain memory usage patterns.

Now, let us look at how Python matches up against the criteria we defined in the introduction.

Performance

Since CPython is a byte code interpreter, without any Just In-Time (JIT) compilation, it does not perform very well compared to native code on compute-intensive tasks. However, this is mitigated somewhat by its extensibility through C/C++. Core functionality such as parsing/serialization can be implemented in C and provided through Python interfaces. In fact, systems for very compute intensive numeric, scientific, and data analysis processing have been implemented successfully in Python by writing key components in C and leaving only the interactive/glue layer in Python. Examples include NumPy, SciPy, and Pandas.

There is also PyPy, another Python implementation, which does Just In-Time compilation. However, it is not completely compatible with Python C extensions, undergoing rapid changes, and is not yet widely adopted.

Expressiveness

Python is an expressive language. By supporting multiple programming styles, the most effective style can be chosen for a given task. For example, one might use modules and class hierarchies for the overall organization of the system, but functions for callbacks and utility code. Syntactic constructs such as list comprehensions and with clauses support concise code that is less error prone.

Dynamic typing saves the need for type annotations. More significantly, since objects are structurally typed, one does not have to create separate interface and implementation type hierarchies. The standard library takes advantage of this: many types of common access patterns are standardized via convention and special methods which have names starting and ending with two underbar ( "_" ) characters (e.g. __getattr__). This allows for a very generic style of programming where many independent classes can be used in the same context, without special language support or complex interface hierarchies. For example, a for loop can iterate over a list, a set, a map, a string, or a stream.

The language provides support for serveral forms of meta-programming: internal methods implementing attribute access can be overridden, decorators allow for the dynamic wrapping of class definitions, and meta-objects can override key parts of the Python's class machinery. Classes in Python are first class - they can be created dynamically and passed as a value to and from function/method calls.

Error handling and correctness

Python provides exception handling and a well-defined exception hierarchy. The standard library uses exceptions consistently and the with clause supports clean resource mangement in presence of exceptions.

The dynamic nature of Python makes it hard to catch certain forms of errors that are found by the compiler in a statically-typed language. I have found error handling code to be the most likely to have problems, as it is not normally executed. There are code analysis tools like PyFlakes and the Python checker embedded in the Eclipse Python module, which help somewhat.

Ecosystem

Python has a good standard library and an extensive ecosystem of third-party modules. The standard library includes a rich set of data structures and clean system interfaces. There is a centralized repository of third-party modules, the Python Package Index (PyPi), which makes it easy to find and install libraries. As a dynamic language, source code is easily available for everything, and licenses tend to be BSD or MIT.

Deployment

For a long time, Python 2 had a complex and fractured deployment story. Eventually, the Python community converged on a defacto standard package manager, pip. This, along with the Python Package Index make it easy to reliably deploy third-party packages.

Unfortunately, Python programs are not assembled into self-contained, redistribable systems. This means that the deployment process must be repeated on every host on which a component is to be run.

Concurrency

Although Python supports multi-threaded programming, only one thread can be executing in the bytecode interpreter at a time. This is due to the lack of thread safety in the interpreter's internal data structures and garbage collection algorithm. For I/O-bound programs, this is not such a problem. However, it reduces the value of threading for more compute-intensive applications.

For applications that are loosely-coupled and can scale well with muti-process concurrency, there is a multiprocessing package in the standard library that does a lot of the coordination work for such applications.

The Python File Walker

Let us now look at the Python implementation of our file tree walker. The full source code is on Github here. We will focus on the primary walk function. Here's the code:

def walk(root, callbacks):
    try:
        root_stats = os.lstat(root)
    except Exception, e:
        callbacks.error(root, e)
        return (0, 0, 1)
    callbacks.directory_attributes(root, root_stats)
    dirs = 1; files = 0; errors = 0
    work_q = deque([root,])
    while len(work_q)>0:
        d = work_q.popleft()
        for (p, s) in get_files_and_stats_for_dir(d):
            if isinstance(s, Exception):
                callbacks.error(p, s)
                errors += 1
            elif S_ISDIR(s.st_mode) and not S_ISLNK(s.st_mode):
                callbacks.directory_attributes(p, s)
                dirs += 1
                work_q.append(p)
            else:
                callbacks.file_attributes(p, s)
                files += 1
    return (dirs, files, errors)

This function takes as parameters a root path and an object which has our callback methods (file_attributes, directory_attributes, and error). Lines 2 through 7 are handling the special case of the root directory: we obtain its file attributes and pass them to the directory callback. Next, we initialize our counts and the working queue. The working queue is just a queue of directory paths to be processed. Lines 10 through 22 are the main loop. As long as the queue has directories in it, we pick one and then get all the entries in that directory along with their file attributes. For each file, we call the file callback. For each directory, we call the directory callback and then add it to the work queue.

A few comments on the coding style encouraged by Python:

The source code is very compact. Python uses indentation to indicate block structure, eliminating extra lines. We do not need a separate type or type hierarchy to define the callbacks parameter: due to Python's dynamic typing, any object that has the needed methods will work. Finally, built-in types like tuples can easily be used (e.g. for the walk function results) as needed without defining extra types.
The Python yield statement is used to good effect in get_files_and_stats_for_dir(). This function iterates through all the entries of a directory, runs stat on them, and returns a (path, stat) pair for each entry. Without yield, we would have to create the entire list in memory before returning from the function or have some kind of complex callback protocol. By combining yield with Python's iteration infrastructure, one can write very natural stream processing code.
Python's standard library has a function os.walk() that could be used to do most of the work for our file walker. Unfortunately, it calls stat for each entry to determine whether it is a subdirectory, but does not return the file attributes.Thus, using os.walk() would require two calls to stat for each file. We instead make the underlying calls directly.
Attempting to parallelize the walk would be problematic. Since the structure of the file hierarchy is not known in advance, the problem is not easily decomposable into independent processes. Unfortunately, multi-threading is going to have limited scalability, due to the CPython's inability to run more than one thread at a time within the interpreter.

Java

Java is a static-typed language with support for an object-oriented programming style. The Java Virtual Machine, which runs compiled Java programs, has seen significant engineering effort over the years. It includes support for concurrent garbage collection and Just In-Time (JIT) compilation.

Java is widely-used in industry and has a huge ecosystem of 3rd party libraries, open source software, and tools. However, it has a reputation for being overly verbose and supporting the over-engineering of software.

Now, let us look at how Java matches up against the criteria we defined in the introduction.

Performance

The combination of static typing, Just-in-Time compilation, and a mature implementation result in the potential for high performance. In practice, this is not always the case, due to over-engineering of code which causes memory bloat and longer code paths. Java is widely used in analytics code due to its performance potential and relative accessibility to programmers compared to a language like C++.

Expressiveness

Java has a deserved reputation for verbosity. It requires everything to reside within a class (no stand-alone functions), it has poor iteration syntax, and very verbose type annotations. Java's use of separate interface classes and lack of powerful reflection capabilities leads to most frameworks having two complete hierarchies: a hierarchy of interfaces and a hierarchy of (implementation) classes.

Java also lacks compact syntax for container literals, such as lists and maps. This discourages the building of lightweight Domain Specific Languages, as common in languages with better literal syntax such as JavaScript and Python.

Finally, due to historical reasons and the targeting of mediocre programmers, many Java programs have unnecessary boilerplate such as getter/setter functions, extremely verbose names, and overuse of design patterns.

Error handling and correctness

Java provides exception-handling constructs and a rich library of exceptions. The exceptions are "checked", meaning that each method must declare all the possible types that might be thrown. Although this feature is unpopular, it can result in more robust programs. However, the exceptions in the standard library and third party applications are not always well-thought out, leading to awkward code.

Java's type system provides generics and can help to catch errors at compile time. However, it is not nearly as powerful as the type system found in languages like OCaml, where the type system can be used to help capture high level invariants of a program.

Ecosystem

Java has an extensive ecosystem of software. In particular, most of the Hadoop world is written in Java. Other languages, such as Scala, Clojure, and Groovy target the JVM and interoperate with Java. However, there is no single global repository of software, as there is for other languages like Python, Ruby, and Perl.

Deployment

Compiled Java libraries are distributed as JAR files. A library can be made available by adding it to an application's class path. However, JAR files do not provide any metadata regarding versioning and dependencies. This makes it very difficult to assemble and maintain applications. External tools, such as Maven, can help with this, but are very complex to use.

Concurrency

Java has good support for multi-threading, with a robust concurrent virtual machine. Multi-process parallelism is certainly possible, but not as natively supported as it is in Python. This is partly due to the language's poor reflection capabilities.

The Java File Walker

Let us now look at the Java implementation of our file tree walker. The full source code is on Github here. We will focus on the primary walk method. Here's the code:

public class JavaWalker {
    ...
    public static Results walk(Path root, Callbacks cb)
        throws InvalidDirectory {
    int files = 0;
    int directories = 1; // include the root
    int errors = 0;
    Path dir = root;
    PosixFileAttributes attrs;
    try {
        attrs = Files.readAttributes(dir, PosixFileAttributes.class);
    } catch (IOException e) {
        System.err.println("Unable to access file attributes for root");
        throw new InvalidDirectory(dir);
    }
    if (!attrs.isDirectory()) {
        System.err.println("Root is not a directory!");
        throw new InvalidDirectory(dir);
    }
    cb.directoryAttributes(dir, attrs);
    ArrayDeque<Path> work_q = new ArrayDeque<Path>();
    work_q.add(dir);
    while (!work_q.isEmpty()) {
        dir = work_q.remove();
        // iterate through the contents of the directory
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(dir)) {
                for (Path file: stream) {
                    try {
            attrs = Files.readAttributes(file,
                                    PosixFileAttributes.class,
                    LinkOption.NOFOLLOW_LINKS);
            } catch (IOException e) {
            cb.error(file, e);
            errors += 1;
            continue;
            }
            if (attrs.isDirectory() && !attrs.isSymbolicLink()) {
            cb.directoryAttributes(file, attrs);
            work_q.add(file);
            directories += 1;
            }
            else {
            cb.fileAttributes(file, attrs);
            files += 1;
            }
        }
        } catch (IOException e) {
        // problem in the directory stream
        cb.error(dir, e);
        errors += 1;
        }
    }
    return new Results(files, directories, errors);
    }
}

The walk method takes as its parameters a Path object pointing to the root directory and a subclass of Callbacks. In lines 10 through 20, we handle the root directory. Lines 23 through 53 are the main loop. The logic of the main loop is pretty similar to the Python version: directories are removed from a work queue and then each entry in the directory is processed. The code for processing each directory entry is inside the main loop rather than in a separate function. Lacking the yield statement of Python, it is not so convenient to put the directory reading logic in a separate function.

Now, a few comments on the coding style encouraged by Java:

As expected for a Java program, the source code is very verbose. In particular, try-catch loops take up a lot of space on the page. Rather than use built-in types, the lack of a good literal syntax in Java encourages the definition of special purpose classes. Compare the Results class in the Java version to just using a literal tuple in the Python version. Furthermore, note how the Java standard library defines many more types in its APIs. For example, the filesystem APIs represent file paths using a Path class rather than just using a standard string. This may enforce correct usage in some cases, but it also makes the API more complex and verbose.
For the longest time, Java did not even provide APIs to get to the real POSIX file attributes, instead using a least-common denominator interface that lost much of the useful data. In Java 7, the java.nio.file package was added, which, among other improvements, provides access to POSIX file attributes.
Java 7 even provides a file walker function: java.nio.file.Files.walkFileTree. Unfortunately, the interface provided by this API is insufficient to implement the specification for our file walker. In particular, the visitor callbacks provide only BasicFileAttributes, rather than PosixFileAttributes. Furthermore, it does not provide sufficient control over error handling.
Attempting to parallelize the file walk in Java using threads should be successful due to the JVM's robust support for concurrency. The concurrency library java.util.concurrent should make this task pretty straightforward.

Next up: Go and OCaml

That's all for Part 2. In Part 3, we will look in more detail at Go and OCaml. To know when the next part is ready, please follow me on Twitter.