Thursday, November 13, 2008

JMagLev: Terracotta based MagLev for Ruby

In one of my previous posts I wrote about clustering solutions for Ruby. One of them was MagLev, based on Gemstone's Smalltalk virtual machine; the other was based on Terracotta. The latter seems more appealing, since Terracotta is open source and lets you run your own experiments (like this, or this, or even this). Unfortunately, all the solutions I have seen so far (including my own) for JVM-based languages other than Java depended on a special library that interfaced with Terracotta through com.tc.object.bytecode.ManagerUtil, and were not transparent to the programmer. Until now.
Recently, Fabio Kung went a step further and patched JRuby to behave like MagLev, transparently sharing objects across multiple JRuby runtimes. It seems that MagLev has gained a strong competitor before it even managed to hit the market. Have a look at the demo; it looks just as impressive as what Avi Bryant from Gemstone showed at RailsConf 2008.

Thursday, November 6, 2008

Distributed processing with JScheme and Terracotta

In my post about clustering JScheme with Terracotta I presented how to share Java objects through Terracotta using JScheme. But clustering is not only about sharing data. It is also about parallel (and possibly distributed) task processing.

One of the fundamental assumptions of Scheme (and other Lisps) is that code is data. Programs are just collections of evaluable expressions and thus you can represent data as code, and vice versa. This is exactly what you need when you want to distribute code over a system designed primarily to share data.

I prepared a simple system based on JScheme that allows distributed, concurrent code processing over a Terracotta cluster. It uses the concept of independent nodes (borrowed from Erlang). Each node polls a shared list to find a new expression to evaluate. Once it finds a job to be done, it evaluates the expression using a JScheme evaluator (an instance of the jscheme.JScheme class) and writes the result back to the list. The client process that initiated the task then reads the result back and returns it.
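The node loop described above can be sketched in plain Java. This is a rough single-JVM approximation of the mechanism, with names of my own invention (Node, step, evaluate); in the real system the job queue and result map are roots shared through Terracotta, and evaluation goes through a jscheme.JScheme instance rather than the stub used here:

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of a node's polling loop. In the real system the queue and
// result map would be Terracotta-shared roots, and evaluate() would
// hand the expression string to a jscheme.JScheme evaluator.
public class Node implements Runnable {
    static final Queue<String[]> jobs = new ConcurrentLinkedQueue<>();  // {jobId, expression}
    static final Map<String, Object> results = new ConcurrentHashMap<>();

    // One poll: take a job if there is one, evaluate it, publish the result.
    static boolean step() {
        String[] job = jobs.poll();
        if (job == null) return false;
        results.put(job[0], evaluate(job[1]));
        return true;
    }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            if (!step()) {
                try { Thread.sleep(20); }  // idle poll interval, cf. JSTC_EVAL_DELAY
                catch (InterruptedException e) { return; }
            }
        }
    }

    // Stand-in for the JScheme evaluator: here it just parses an integer.
    static Object evaluate(String expr) {
        return Integer.parseInt(expr.trim());
    }
}
```

The 20 ms sleep on an empty queue is what keeps an idle node from spinning at full CPU.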
Since all nodes are independent entities, you can start as many of them as you need and use them concurrently. In most cases, though, the optimal number of nodes equals the number of CPUs (or CPU cores) you want to use on each machine connected to Terracotta. So if you have 2 computers with a quad-core CPU each and want to use only half of their power, you can start 2 nodes on each of them. If you want to use them to the full, you should use 8 nodes, 4 per machine, and so on. You can start the nodes on a single machine using one or several JScheme shells; it's up to you. For me, a single Scheme REPL seems the most convenient option.
Client jobs are started through tc-map, a function similar to the standard map except that it takes an additional argument: a list of nodes to use for the job. Unfortunately, the system is not fault tolerant, so if one of the nodes dies while doing its job, the whole processing task hangs. The only way out then is either to restart the dead node or to evaluate the whole tc-map expression again.
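Conceptually, tc-map behaves like an ordinary map whose work is farmed out to the listed nodes, with results collected back in input order. Here is a rough single-JVM approximation using a thread pool (TcMapSketch and tcMap are my own names; this ignores the Terracotta-shared list and the string-expression plumbing the real library uses):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class TcMapSketch {
    // Like map, but with an extra argument: the number of "nodes"
    // (worker threads here) to spread the work over. Results come
    // back in input order, as with the real tc-map.
    static <A, B> List<B> tcMap(int nodes, Function<A, B> f, List<A> xs)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nodes);
        List<Future<B>> futures = new ArrayList<>();
        for (A x : xs) futures.add(pool.submit(() -> f.apply(x)));
        List<B> out = new ArrayList<>();
        for (Future<B> fu : futures) out.add(fu.get()); // blocks; a dead worker would hang here
        pool.shutdown();
        return out;
    }

    static long fib(long n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }
}
```

The blocking fu.get() also mirrors the fault-tolerance caveat: if a worker never delivers its result, the whole call hangs.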

OK, enough of the theory; let's do some real work. First you need to download the library. The online folder contains the library itself (jstc.scm), a sample Terracotta configuration (tc-config.xml) and some tests. After you get the library and the configuration file, you should start the Terracotta server (I described the whole procedure in detail previously). If you run the Terracotta server on a remote machine, you should also edit the tc-config.xml file and change the server host to the IP of the Terracotta host. Now you can start JScheme through the Terracotta bootstrap:
java -Xbootclasspath/p:[terracotta boot jar] -Dtc.config=tc-config.xml -Dtc.install-root=[terracotta install dir] -jar jscheme.jar
You can find the boot jar in the lib/dso-boot folder of your Terracotta installation directory. If it isn't there, you can generate it with Terracotta's make-boot-jar.sh script.
Now you can load the library with:
(load "jstc.scm")
and start playing with it. For starters, let's run two nodes:
(start-node "test1")
(start-node "test2")
and define a helper function to generate a list of integers from a specified range:
(define (range min max)
  (let loop ((x max) (l '()))
    (if (< x min) l (loop (- x 1) (cons x l)))))
Now we need to "teach" the running evaluators the Fibonacci function:
(tc-load (list "test1" "test2") "(define (fib n) (cond ((= n 0) 0) ((= n 1) 1) (else (+ (fib (- n 1)) (fib (- n 2))))))" )
and we are ready to spread a test job across the nodes:
(tc-map (list "test1" "test2") "fib" (range 1 20))
After a few seconds you should receive the following list:
(1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765)
You can use the time function to compare computation times with those obtained using a single node or a regular, sequential map:
(time (tc-map (list "test1" "test2") "fib" (range 1 20)))
(time (tc-map (list "test1") "fib" (range 1 20)))
(define (fib n) (cond ((= n 0) 0) ((= n 1) 1) (else (+ (fib (- n 1)) (fib (- n 2))))))
(time (map fib (range 1 20)))
Note that a function passed to map is a regular Scheme expression, while with tc-map it must be given as a string.

It is quite possible that you will see no gain over sequential processing with fewer than 3 nodes running on 3 cores. The main reason is that Terracotta itself introduces some overhead. The second is that nodes poll Terracotta for new jobs at 20 ms intervals. The intervals are necessary so that the polling loops don't consume all the CPU power and leave none for the jobs; you can adjust them by changing the value of JSTC_EVAL_DELAY.
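The poll interval is not just background cost; it also puts a latency floor under every job, since each expression can wait up to one interval before a node picks it up. A back-of-the-envelope bound (my own arithmetic, not part of the library):

```java
public class PollOverhead {
    // Worst-case extra latency when `jobs` expressions are processed
    // one after another by a single node that polls every `delayMs`
    // milliseconds: up to one full interval of waiting per job.
    static long worstCaseExtraMs(int jobs, long delayMs) {
        return jobs * delayMs;
    }
}
```

For the 20-element fib list above on one node, that is up to 400 ms of pure polling latency, which is one reason short jobs show little gain.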

I did some tests and I must say that the results surprised me. On my home laptop (Core2 Duo T5450 1.66 GHz, 2 cores, 2GB RAM) the results looked like this:
(time (map fib (range 1 25))) - 17234 msec
(time (tc-map (list "test1") "fib" (range 1 25))) - 29020 msec
(time (tc-map (list "test1" "test2") "fib" (range 1 25))) - 19620 msec
while on three servers (dual Xeon E5430 2.66 GHz, 8 cores, 8GB RAM each):
(time (map fib (range 1 25))) - 12687 msec
(time (tc-map (list "test1") "fib" (range 1 25))) - 22502 msec
but:
(time (tc-map (list "test1" "test2") "fib" (range 1 25))) - 25256 msec
(time (tc-map (list "test1" "test2" "test3") "fib" (range 1 20))) - 22355 msec
when I ran the tests on a single machine, and:
(time (tc-map (list "test1" "test2") "fib" (range 1 25))) - 14216 msec
(time (tc-map (list "test1" "test2" "test3") "fib" (range 1 20))) - 11538 msec
when each node was on a different machine.
On my laptop I got an almost 1.5x speedup by using two Terracotta nodes instead of one, but on a server machine two local nodes actually slowed the tasks down. I could get faster job processing only by spreading the nodes across different machines. Adding further nodes to the machines seemed to have no impact on the results, so I couldn't get past a roughly 2.5x speedup.
Weird, isn't it?
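For the record, the laptop speedup figure follows directly from the measured times: 29020 ms on one node versus 19620 ms on two. A trivial check (Speedup is my own helper, not part of the library):

```java
public class Speedup {
    // Speedup factor = time on one node / time on N nodes.
    static double speedup(double singleNodeMs, double nNodesMs) {
        return singleNodeMs / nNodesMs;
    }
}
```

29020 / 19620 comes out at about 1.48, and the two-node distributed server run (22502 ms vs 14216 ms) at about 1.58.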