Programming Languages for Pipelined Analytics: Go and OCaml
Wed 06 January 2016 by Jeff FischerThis is part 3 of my series on pipelined analytics language choices. In this part, we will look in more detail at Go and OCaml. Here is the outline of the entire series:
- Introduction
- Commentary on the Python and Java implementations
- Commentary on the Go and OCaml implementations (this post)
- Performance results and conclusions
Go
Go is a static-typed, procedural language, with limited object-oriented programming support via interfaces. Go was developed at Google, with a focus on simplicity and fast, modular builds. It has a very C-like style, and is a spiritual successor to Limbo and C. This is not surprising, as two of the original developers on Go were a part of the Bell Labs group that created Unix and C.
Go provides automatic memory management through a concurrent garbage collector. Go has several features that allow for writing programs that look more like those in a dynamic language. These features include convenient literal syntax for maps and arrays, type inference for variable initialization statements, structural subtyping, and an easy-to-use package system.
Now, let us look at how Go matches up against the criteria we defined in the introduction.
Performance
Much like C, Go provides a simple programming model with the goal of predictable performance. By compiling directly to native code, it avoids the overheads associated with a virtual machine. The compiler is relatively immature, so it does not implement all the optimizations of a mature Java (or C/C++) compiler. It should improve over time. One interesting optimization that it does provide: variables that are not referenced outside of a function are allocated on the stack rather than the heap. This is accomplished using an escape analysis to find those variables which definitely have no aliases which could be referenced outside of the function.
Expressiveness
I would rank Go as more expressive than Java, but less expressive than OCaml or Python. Go avoids much of the boilerplate of Java through a combination of support for non-Object Oriented styles, flexible program layout (e.g. no single-class per file organization), better literal notations for containers, and a simpler type system (no generics). However, the limited static type system requires variable declarations and run-time casts, making programs more verbose than those in OCaml (with type inference) and Python (with dynamic types). The Go language has some quirks that can limit expressiveness. For looping, Go has a very anemic for statement, which is barely better than the error-prone construct provided in the C language. Go does not provide a unified iterator framework as seen in Python (or even a comparable approach to the more heavy-handed Java style). Go also does not provide exception support. The error handling approach it does provide is more limited (see below). Overall, Go feels like it lives in a pre-C++ and pre-Java world (but definitely higher-level than C).
Error handling and correctness
Unlike Python, Java, or OCaml, Go does not provide an exception handling construct. Instead Go provides an error interface type. This has one method, Error(), which returns an error message string. By convention, functions that could fail return a tuple with the last element being an error variable named err. After the call completes, an if statement is needed to check the value of err. A value of nil indicates that the call was successful. Otherwise, the error value can be examined and (manually) propagated up the return stack. Here is a short example of error handling in Go from the file walker code:
files, err = dirfile.Readdirnames(-1) if err != nil { cb.error(dir, err) errcnt += 1 dirfile.Close() continue }
This style of coding makes relatively easy to miss the handling of an error. In languages like Java with checked exceptions, the compiler checks that you do indeed handle every exception case (except for runtime errors such as dereferencing a null pointer). In languages like Python and OCaml with unchecked exceptions, the compiler does not provide this guarantee, but at least an error situation must be caught further up the stack or it will stop the process. In the old C-style of error handling, errors can be unintentionally dropped, with no way of discovering this in production. Go is slightly better than C, in that the err return value must at least be assigned to a variable and thus cannot be completely ignored.
Go's error handling also makes it inconvenient to unconditionally release resources upon exiting a code block. In languages with exceptions, the finally block usually provides this capability. Go's defer statement can be used to achieve this, but it only works at a function scope and it not inherently tied to the error handling code.
Ecosystem
As a newer language, Go's software ecosystem is not nearly developed as the ecosystems surrounding Java and Python. Go's standard library is clean and well-documented (although much smaller than those provided by the aforementioned mature languages). The library interfaces and source code are all browsable on the web, which makes it easy to understand and use the code.
There is no centralized repository of third party packages for Go. The Go compiler does provide a get command to download and build an external package from source.
Deployment
Go provides a very simple (or simplistic) deployment model: programs are compiled into self-contained native executables, which can then be copied and run in production environments. This makes it easy to set up a an application. However, Go does not support any form of dynamic package loading. Thus, the entire application and all dependencies must be available at compile time and cannot be extended at runtime without changing the executable. This prevents the creation of dynamically extensible systems common in the Java and Python worlds.
Concurrency
Go has the strongest concurrency support of the four languages we are evaluating. It provides goroutines which are very lightweight threads. Unlike operating system threads, goroutines have a growable stack and are scheduled by the Go language runtime. The Go scheduler can make use of multiple OS threads, allowing the system to take advantage of multi-core systems while avoiding the need of an OS context switch for each thread scheduling event.
Go programs do not have the concurrency limitations of Python or OCaml programs. Since there is no interpreter or virtual machine, there is no scaling limitation imposed by a global interpreter lock. Also, the garbage collection system has been designed with concurrency in mind.
Furthermore, as a result of their implementation, goroutines scale much better than threads. This enables programs to be written in a more nature style when the natural unit of concurrency is much larger than the number of threads (e.g. a web server).
Goroutines can communicate using shared state and locking, much like multi-threaded programs. Go also provides channels, which are typed message-passing objects. Channels can be buffered (creating a queue) or unbuffered (creating a synchronization point between the sender and receiver). Special syntax is provided to send on a channel and to receive from one of several channels.
The Go File Walker
Let us now look at the Go implementation of our file tree walker. The full source code is on Github here. We will focus on the primary walk function. Here's the code:
func walk(root string, cb Filer) (int, int, int) { filecnt := 0 dircnt := 1 errcnt := 0 var stats syscall.Stat_t err := syscall.Lstat(root, &stats) if err != nil { cb.error(root, err); return 0, 0, 1 } cb.directory_attributes(root, stats); var work_q []string var dir string var files []string work_q = make([]string, 1, 10) work_q[0] = root for len(work_q) > 0 { dir, work_q = work_q[len(work_q)-1], work_q[:len(work_q)-1] dirfile, err := os.Open(dir) if err != nil { cb.error(dir, err) errcnt += 1 continue } // Would like to use defer here, but it only works // in a separate function! //defer dirfile.Close() files, err = dirfile.Readdirnames(-1) if err != nil { cb.error(dir, err) errcnt+= 1 dirfile.Close() continue } for _, filename := range files { var path = filepath.Join(dir, filename) err = syscall.Lstat(path, &stats) if err != nil { cb.error(path, err) errcnt += 1 continue } if ((stats.Mode & syscall.S_IFDIR) != 0) && ((stats.Mode & syscall.S_IFLNK)==0) { cb.directory_attributes(path, stats) work_q = append(work_q, path) dircnt += 1 } else { cb.file_attributes(path, stats) filecnt += 1 } } dirfile.Close() } return filecnt, dircnt, errcnt }
This function takes as parameters a root path (just a string) and an object conforming to the Filer interface. Note that, due to Go's structural subtyping, this object does not have to explicitly declare that it implements the Filer interface. Lines 2 through 11 are handling the special case of the root directory: we obtain its file attributes and pass them to the directory callback. Because of Go's error handling style, we must explicitly check for an error in line 7. Next, we initialize our counts and the working queue. The working queue is just a queue of directory paths to be processed. Lines 18 through 54 are the main loop. As long as the queue has directories in it, we pick one and then get all the entries in that directory along with their file attributes. For each file, we call the file callback. For each directory, we call the directory callback and then add it to the work queue.
A few comments on the coding style:
- The source code is fairly verbose. Although declarations are terser than those in Java, the verbose error handling control flow and the formatting style cause the code to take up a lot of space on the page. The whitespace layout of Go programs is standardized through a formatting utility called gofmt. This avoids religious wars over minor format issues and ensures consistency across code bases.
- In line 20, we Open the current directory, returning a File object. This needs to be closed, even if there is an error. Because Go's defer statement is limited to function scope (it only executes the deferred statement when the surrouding function returns), we cannot use it here. Instead, we must be careful to close the file along all control flow paths (lines 33 and 54).
- It is quite straightforward to parallelize the walk. Due to the low overhead of goroutines, we can spawn a goroutine for each directory encountered. This approach is not feasible in any of the other three languages being considered.
OCaml
OCaml is a static-typed functional language from the ML family. It has some support for procedural programming (mutable variables) and support for object-oriented programming (the "O" in OCaml stands for Object). OCaml provides type inference, where types of variables can be determined automatically, without any type declarations. This makes programs significantly more concise (particularly when compared to Java).
OCaml has a syntax that, while similar to other languages in the ML family, is quite different from Java, Go, and Python. Most procedural and object-oriented languages have two primary syntactic constructs: expressions that return a value (e.g. x + y) and statements that do not (e.g. z = x + y). OCaml, like most functional languages, only has expressions. OCaml also represents functional calls differently. Most imperative languages use the notation fn(arg1, ..., argN). OCaml represents that call as fn arg1 ... argN.
Like the other languages we have considered, OCaml uses automatic memory management, specifically garbage collection.
Now, let us look at how OCaml matches up against the criteria we defined in the introduction.
Performance
OCaml provides two compiler backends: ocamlc, which compiles to a byte code interpreter, and ocamlopt, which compiles to native code. The native code compiler has seen extensive development work and produces very fast code. Since OCaml encourages an immutable programming style and can tell when individual variables are mutable, it has more opportunities for optimization than an standard procedural language.
Expressiveness
The combination of a powerful type system, compose-able data types, and pattern matching results in a very expressive language. OCaml's match statement for pattern matching is like a switch statement on steroids. In addition to branching on different values, you can extract components out of a compound data structure and bind them to variables.
The OCaml language guarantees a tail recursive optimization — certain recursive calls are rewritten as jumps. This allows for a programming style where recursion is used in place of explicit looping. Overall, I find a functional programming style to be very powerful and compact. However, at times, it can be too compact, resulting in hard-to-read code.
Although OCaml has support for objects, the type inference requires certain limitations on the object support, making them less useful. Furthermore, the standard library does not use objects, even in natural places like containers. Thus, I have found that OCaml encourages an almost exclusively functional programming style. Other languages, such as Scala and F#, do a better job of combining functional programming and objects (both were influenced by OCaml).
Error handling and correctness
OCaml provides unchecked exceptions and a syntax similar to pattern matching for handling them. OCaml exceptions are not a part of a class hierarchy (unlike those in Java and Python), making exception definition and handling a little less flexible. [STRIKEOUT:I did find it a little odd that exceptions and pattern matching are not unified into a single concept in the language.] [See update below]
OCaml's type system is very powerful and allows the encoding of properties that would have to be captured dynamically in languages with weaker types systems. Furthermore, pattern matching includes a compile-time test for exhaustiveness — verifying that all possible cases of a match are indeed covered.
Ecosystem
OCaml's standard library is not nearly as complete as those provided by Python and Java. For example, it does not include any functions for file path manipulation, so I had to include a third-party library to do this. OCaml was the only language of the four I evaluated where the standard library was not sufficient for the file walker example.
To address this issue, multiple third party libraries attempt to provide a more complete base of common functionality. The most popular of those are Jane Street Capital's Core and the community-developed Batteries Included. For more specialized libraries, there is a packaging utility, OPAM, and an associated package repository.
Deployment
OCaml programs are typically compiled to and distributed as native applications. Unlike Go, OCaml does provide a mechanism for dynamic linking to shared libraries. In practice, I have found the process of building and linking OCaml applications to be very much a black art, with cryptic error messages and unexplained failures.
Concurrency
OCaml has support for lightweight threads. There can interleave the execution of multiple threads of control, but do not use multiple threads at the operating system level. This limits OCaml programs to a single core in multicore systems. Thus, the OCaml concurrency story is even worst than Python. There is some hope for the future — the Multicore OCaml project aims to bring true multicore support to OCaml, along with deep language integration.
The OCaml File Walker
Let us now look at the OCaml implementation of our file tree walker. The full source code is on Github here. We will focus on the primary walk function. Here's the code:
let walk root dir_cb file_cb error_cb = let isdir st = st.st_kind == S_DIR and add_dir counts = match counts with (dirs, files, errors) -> (dirs+1, files, errors) and add_file counts = match counts with (dirs, files, errors) -> (dirs, files+1, errors) and add_error counts = match counts with (dirs, files, errors) -> (dirs, files, errors+1) in let rec iterate_directory handle dname work_q counts = (* Go through all the entries of one directory that's been opened. We call the callbacks, add to the counts, and add directories to the work q. *) try match readdir handle with "." | ".." -> iterate_directory handle dname work_q counts | file -> let p = dname ^ "/" ^ file in let (work_q, counts) = try let st = lstat p in if isdir st then (dir_cb p st; (p::work_q, add_dir counts)) else (file_cb p st; (work_q, add_file counts)) with Unix_error (code, fname, param) -> error_cb dname (error_message code); (work_q, add_error counts) in iterate_directory handle dname work_q counts with End_of_file -> closedir handle; (work_q, counts) | Unix_error (code, fname, param) -> (* If we get this exception here, it means that the readdir had a problem. We will exit our loop and attribute the error to the parent directory.*) closedir handle; error_cb dname (error_message code); (work_q, add_error counts) in let rec walk_dirs work_q counts = (* Pick a directory out of the work queue and process the * files/entries. Then call walk_dirs recursively to * process the rest of the queue. *) match work_q with dname::rest -> let work_q, counts = try let handle = opendir dname in iterate_directory handle dname rest counts with Unix_error (code, fname, param) -> (* Unable to open dname as a directory *) error_cb dname (error_message code); (rest, add_error counts) in walk_dirs work_q counts | [] -> counts in let (st:stats) = handle_unix_error lstat root in dir_cb root st; walk_dirs (root::[]) (1, 0, 0) ;;
The walk function takes as its parameters a root path (of type string) and three callbacks for directories, files, and errors.
The structure of the OCaml walk function is quite different from the other versions we have considered. You will notice the frequent use of the let statement to bind variables to (immutable) values. After defining some count updating helper functions in lines 7 through 14, the key functionality is defined through nested functions: iterate_directory (lines 10 through 36) and walk_dirs (lines 37 through 53).
The actual walk is then initiated in lines 54 through 56. Line 54 makes an lstat call on the root directory. This result is then passed to the directory callback in line 56, followed by a call to walk_dirs with a work queue containing only root and an initial result count accounting for the root directory.
Now, a few comments on the coding style:
- iterate_directory and walk_dirs are both tail recursive functions. This is done instead of explicit looping. The actual recursive function calls can be optimized away by the OCaml compiler. Note that walk_dirs still needs to use a queue: if it called itself directly for each subdirectory, it would no longer be tail recursive.
- We see examples of both pattern matching (the match statements starting on lines 15 and 42) and exception handling (the try statements starting on lines 14 and 45). [STRIKEOUT:In this code, I would almost prefer to have the Unix system call functions return an error value (more like Go) rather than throw an exception. Ironically, I think that error values make more sense in an expression-based, pattern matching language like OCaml instead of Go. Even so, exceptions are still useful in other cases, such as handling errors further up the stack or early exits from recursive iteration.] [See update below]
- Attempting to parallelize the file walk in OCaml using threads will likely yield very little improvement.
Next up: Performance Results and Conclusions
That's all for Part 3. In Part 4, we will benchmark our four file walker implementations and then conclude this series of blog posts. To know when the next part is ready, please follow me on Twitter.
Update (Jan 13)
In the original article, I expressed my wish for OCaml to unify pattern matching and exception handling. @EdgarArout pointed out that OCaml does indeed do this: it is new in version 4.02. Unfortunately, my Ubuntu 14.04 install comes with OCaml 4.01. I'll have to update! Here is a good summary of the new feature. Thanks Edgar!