Thursday, December 27, 2007

How I Came to Erlang

I've been learning Erlang, the 20-year-old programming language from Ericsson that seems poised to become the new darling of discriminating developers. There's a fine new book out on it, for one thing, and quite a number of programming blogs are discussing it. Not to mention the fact that the new Amazon database web service, SimpleDB, was built with it. But it doesn't feel like I'm jumping on a bandwagon. I've known about Erlang for several years, and I read the earlier book on the language with some interest. I just never had the opportunity to use it. A confluence of circumstances brought me to consider doing so.

First, I love programming languages. That is, I love the field, and I love languages that are beautiful, clean and powerful.

Second, I'm a working programmer. I need to build code that is reliable and performs well. And I've always had a pragmatic bent. I also have a family. So I devote little time to languages that are interesting in theory but do not have strong implementations.

Third, I have a specific problem. I built a distributed system at work using Python. The system works fine for small loads but suffers under large ones. It needs to scale better. I've tried a few things, and will try more, but I'm hedging my bets. What have I tried? I threw psyco at it. Psyco compiles Python on the fly, and is delightfully seamless. It can provide significant speedups for some programs, but it didn't help mine much. I redid my communication protocol, switching to binary and even rewriting the deserializer in C++. That helped, but not enough.

I'm guessing the problem is that I'm using threads. The server has about 200 clients, each serviced by its own thread. I believe Python threads are inefficient. For one thing, they compete for a single lock that protects the interpreter data structures, known as the Global Interpreter Lock, or GIL. A single Python process will use only one CPU, regardless of how many are in the box. Furthermore, my application uses a single lock to synchronize the threads. It's a complicated state machine, and the idea of making the locking more fine-grained is daunting.

Python developers will tell you that the solution is to embrace an event-driven or continuation-based style of programming, in which an incoming chunk of data triggers a function call. The Python server Twisted exemplifies this programming style. Your program has only one thread. It receives a chunk of data, calls the associated function, and then waits for the next chunk of data. There is no need for locking, and no context-switching overhead.

I find this style of programming highly unpleasant, for two reasons. First, you must be very careful in your callback function to do nothing that could cause a significant delay: if your callback blocks, your server blocks. Even simple file I/O, like writing a message to a log file, is problematic. The second reason is that control state, conveniently and implicitly stored on the stack in a thread-based architecture, must be managed manually. For example, say that incoming messages have a recursive structure, like a parenthesized expression or XML, and further assume that the message length is unknown. To read such an expression recursively is trivial. But you can't do that in an event-driven architecture: you must encode into a data structure where you are in the simulated recursion when the current chunk of input runs out, so you can resume when the next chunk arrives. This is a real pain.

So if I can't lick my Python performance problems and I don't want to write event-driven code, I must rewrite in a different language. What are my choices?
  • C++. This language is part of my workplace's culture, and unfortunately I must use it often for its number-crunching speed. But I don't need that kind of speed for my distributed application. The language's numerous flaws need no elaboration. On the whole, I strive to reduce the total amount of C++ in the world.
  • Ruby. A beautiful language, but I don't expect it to perform any better than Python.
  • Java. This would be a good choice. I've used Java extensively, and I believe it to be quite a well-designed language. It has a mature, stable and fast VM. I'm confident a thread-based server would scale. But I would still have my locking problems.
  • C#. See Java. Also, my system must run on Linux and I don't believe Mono has enough mileage yet.
  • Haskell, OCaml, etc. There is a raft of fascinating and elegant modern languages, but their implementations come out of academia and they have little or no real-world mileage. I can't trust such an implementation in a production environment.
  • Erlang. It's a language with more than its fair share of warts, but it has two wonderful things going for it: a model of concurrent computation that does away with locks and their attendant problems, and an extremely stable and mature VM.
Much is written about Erlang's process-based concurrency model, but the insanely reliable VM is at least as big a selling point. One has to keep that in mind whenever one starts to daydream about improving the language or finding a better one. It would not be too difficult to design a language that was a bit more programmer-friendly, a little more consistent, had a couple more modern features. It is utterly out of scope to build a VM to match Erlang's. It is not so much the technical achievement of the thing as the sheer mileage it has on it. A large company with a guaranteed installed base, like Microsoft, could do it in a couple of years (e.g. .NET), but it would take an open-source or academic project quite a while.

So I've been slowly rewriting my application in Erlang, a little at a time.