[distcc] Discussion of distcc load balancing ... was re: Multiple people running distcc at once, using one DISTCC_DIR

Scott Lystig Fritchie nospam at snookles.com
Sat Oct 2 18:53:01 GMT 2004


Good afternoon.  I've been following the discussion by Dan, Victor,
Jean, and others regarding load balancing of distcc jobs.  Sorry I've
been lurking and haven't said anything until now.

Actually, I have said something on the subject, but it was quite a
while ago.  Martin probably still remembers the "tcpbalance" distcc
load balancing proxy[1] that I'd written specifically for solving the
kinds of load balancing problems that Dan & Victor have been trying to
solve.

Tcpbalance does the following:

    * Chooses the fastest available distcc server at the moment.
      Servers are configured in a list, sorted from fastest CPUs to
      slowest.  So it's possible to use hand-me-down machines with
      slower CPUs: they'll be used only if there's enough demand for
      tcpbalance to walk that far down the list.  Machines with
      multiple CPUs can be configured to accept multiple simultaneous
      jobs from the proxy. If all servers are busy, the proxy waits
      until a server becomes available.

      At first, I'd used TCP load balancing proxies designed for HTTP
      use.  They work fine as distcc proxies, but their round-robin
      scheduling was suboptimal, especially when M developers with N
      simultaneous jobs approached or exceeded the number of CPUs
      available.

    * Supports builds initiated on multiple machines and by multiple
      developers.

    * Does not rely on NFS or any other mechanism for sharing state
      information.

    * Requires no changes to distcc: it's just a dumb TCP proxy (once
      the distcc server is chosen).  All developers have a
      DISTCC_HOSTS environment variable that is simply
      "tcpbalance_proxy_host/N" where N is the same number that the
      developer uses for "make -jN".

    * Automatically detects if a distcc server is down and stops
      sending jobs to down servers. 

    * Has an embedded HTTP server to show the status of all distcc
      servers: # of jobs currently active, total # of jobs since the
      proxy started, total # of seconds used by all jobs since the
      proxy started, ...

    * Well-tested in a cross-compilation environment.  I'd written
      tcpbalance while I was at Caspian Networks.  Although I'd left
      Caspian before tcpbalance could be deployed across the entire
      company[2], we did have several developers using the same
      tcpbalance proxy quite happily.

    * Allows sysadmins to change the up/down state of existing
      servers as well as add & remove machines from the server list.

    * Supports on-the-fly code upgrades, so I could fix bugs without
      taking the proxy down even for 1 second.

    * Crash-proof: if a bug caused the server to crash, the crashed
      thread(s) would automatically be restarted by a "supervisor"
      component.  Tcpbalance is written in Erlang[3], which is a
      "concurrency-oriented programming language"[4], and such
      "micro-reboots" are extremely easy to do.  If you've heard of
      the trend of Recovery-Oriented Computing[5], the concept is
      quite similar (but much less heavyweight than when done in other
      programming languages).

Things that tcpbalance does not do:

    * Tcpbalance doesn't keep track of other activities on backend
      machines.  It assumes that they're dedicated to distcc use.  If
      they're used for other purposes (e.g. desktops, running other
      CPU-intensive jobs), then a status monitor would have to be
      added to change the server status based on that external
      activity.  Not hard to do, but I didn't need to do it.

To be honest, I don't know why tcpbalance isn't used more often.  It's
probably a combination of factors.[6]

    * I haven't done much to promote tcpbalance's existence.

    * Building and installing the Erlang VM, and its accompanying s/w
      packages (stuff like a complete CORBA development environment),
      is too much of a hassle: too much disk space for the
      installation (40-50MB), too much time, too steep a learning
      curve, overly verbose documentation (when I bothered to write
      any documentation at all), ...

    * People were worried about performance: the proxy is a single
      point where all distcc traffic goes in and out.  In reality, a
      single-CPU 800MHz proxy machine (100Mbit/sec Ethernet) handled
      20+ simultaneous jobs without a problem, and that was good
      enough at the time.

    * Not that many people need to solve this particular problem.

Dan's host randomization hack is quite cool for its simplicity.  It
solves the scheduling problem without any shared state.  The solution
isn't optimal, but it's good enough for some environments.
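
I haven't seen Dan's actual code, so take the following as my guess
at the general idea rather than as his implementation: each client
shuffles its own copy of the host list before handing it to distcc,
so uncoordinated developers tend to hit the servers in different
orders.  A minimal Tcl sketch:

    # Hypothetical sketch only -- not Dan's code.  Shuffle the
    # DISTCC_HOSTS list so that concurrent, uncoordinated clients
    # spread their first jobs across different servers.
    proc shuffle {l} {
        set shuffled [list]
        while {[llength $l] > 0} {
            set i [expr {int(rand() * [llength $l])}]
            lappend shuffled [lindex $l $i]
            set l [lreplace $l $i $i]
        }
        return $shuffled
    }

    set hosts [split $env(DISTCC_HOSTS)]
    set env(DISTCC_HOSTS) [join [shuffle $hosts]]
    # ... then exec distcc, much as bal-wrap.tcl (attached below) does.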

Discussion of solving the distributed scheduling problem has popped up
again in the last few weeks.  I've had an idea kicking around for a
while to solve the biggest "problems" of the tcpbalance solution:

    * Use a more "conventional" (otherwise known as "popular")
      language.

    * Avoid the perceived performance "problem" of the
      man-in-the-middle proxy.  (Such a solution will probably be more
      scalable: fewer worries about CPU utilization, network
      utilization, etc.)

So, this weekend, I decided to set aside a bit of time to cook up such
a "better" solution.  It's in Tcl (run on UNIX, Windows, Mac, ...),
it's quite simple (simpler than "icecream"[7], as far as I can tell),
has most of the features that tcpbalance does, and it wouldn't be very
hard to add some other features that other environments need
(e.g. status monitor on multi-use distcc servers (load average, screen
saver status, whatever)).

It would require a small change to the distcc client in order to be
effective.  Attached is a very short Tcl script that demonstrates that
it's possible to just change "distcc"'s view of the DISTCC_HOSTS
environment variable, and everything else can remain as-is.

The Tcl programs that I'm attaching to this message are:

    1. balance-daemon.tcl.  It listens on 3 TCP ports:

       5656: tells the distcc client which distcc server to send the
       job to.  The client should keep the TCP connection open until
       the job is done: that's how the daemon knows that the
       client is done and the distcc server is idle & ready for
       another job.

       5657: sends back a summary of the daemon's state.  It's
       moderately human-readable.  It's also parsable by
       balance-status-daemon.tcl, which uses it to get the list of
       distcc servers.

       5658: used by balance-status-daemon.tcl to change the
       up/down state of distcc servers.

    2. balance-status-daemon.tcl.  Queries the balance-daemon on port
       5657, then makes a simple TCP connection to each distcc server,
       then sends the status update back to the balance-daemon on port
       5658.  (It was simpler Tcl code to use these separate TCP
       ports, no other reason.)  Because it opens a lot of TCP
       connections to distccd and then immediately closes them,
       distccd will spew a lot of complaints to syslog; feel free to
       ignore them.

       It was easier to perform this status-checking function outside
       of the balance-daemon due to Tcl event loop reasons (or rather
       my understanding of how to (ab)use the Tcl event loop).  But it
       does set the precedent of using a simple little protocol to
       let outside agents change the state of the balance-daemon (see
       the sketch after this list).

    3. bal-wrap.tcl.  Use it as "bal-wrap.tcl distcc gcc ...".  It
       has hardcoded values for how to talk to the balance-daemon,
       and it has some other flaws, but it demonstrates what would
       need to change in the distcc client in order to use
       balance-daemon.
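
To make that "outside agents" idea concrete, here's a minimal,
untested sketch of a monitor that could run on a multi-use distcc
server and mark its own host up or down via port 5658 based on its
load average.  The host names and the load threshold below are
hypothetical:

    # Hypothetical outside agent: report this host's up/down status
    # to the balance-daemon, based on the local load average.
    set MyServerInfo "snookles:3632" ;# must match the daemon's ServerList
    set LbHost balancer              ;# the balance-daemon's host
    set MaxLoad 2.0

    while {1} {
        # Parse the 1-minute load average out of "uptime" output.
        set load 0
        regexp {load averages?: +([0-9.]+)} [exec uptime] dummy load
        set status [expr {$load > $MaxLoad ? "down" : "up"}]
        # Same one-line protocol that balance-status-daemon.tcl uses.
        set sock [socket $LbHost 5658]
        puts $sock "$MyServerInfo $status"
        close $sock
        after 10000
    }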

Distcc has been a wonderful, time-saving tool.  I hope that these
tools help other people solve their problems.  If the techniques
and/or source get folded into the distcc distribution, great.  If not,
c'est la vie.  Enjoy!

-Scott

[1] Martin has mentioned Tcpbalance on the list a couple of times
since my posting, IIRC.  If you're interested, the source code is at
http://www.snookles.com/erlang/tcpbalance/.

[2] Actually, it would be 1 proxy per build cluster, and each
development office had its own build cluster (to avoid cross-continent
distcc traffic ... although running distcc in California and using a
tcpbalance proxy & distcc servers in Minnesota was almost as good).

[3] See http://www.erlang.org/.

[4] See http://www.sics.se/~joe/talks/ll2_2002.pdf for the slides that
Joe Armstrong used for a presentation he made to the 2002 MIT
Lightweight Languages Conference.  A recording of his presentation is
available at http://ll2.ai.mit.edu/.  One of his examples is an HTTP
server written in Erlang that handles 80,000 concurrent HTTP requests
on a modest single CPU Linux box.

Interestingly, at the same conference, Todd Proebsting of Microsoft
Research (IIRC) said in his presentation ("Disruptive Programming
Language Technologies") that the concurrency & isolation features of
Erlang are extremely useful.

[5] Ask Mr. Google to look for the following terms & names:
Recovery-oriented computing, crash-only computing, micro-reboots,
Armando Fox, George Candea.  They use UNIX processes to keep a fault
in one software component from affecting other components
(e.g. running components in separate Java VMs).

Shameless plug: Erlang's threading model provides very strict
isolation between threads within the same VM, so separate UNIX
processes aren't required.  And Erlang has been providing ROC-like
"process supervisor" hierarchies for over a decade.

[6]   I guess I'm in an enumerative mood.  :-)

[7] http://wiki.kde.org/tiki-index.php?page=icecream

--- balance-daemon.tcl ---

#!/bin/sh
# Trick to get next line ignored by tclsh \
    exec /usr/bin/tclsh "$0" ${1+"$@"}

###
### Distcc load balancing daemon.
### Copyright (c) Scott Lystig Fritchie, 2004.
### All rights reserved.
### 
### Permission is hereby granted, free of charge, to any person obtaining a
### copy of this software and associated documentation files (the "Software"),
### to deal in the Software without restriction, including without limitation
### the rights to use, copy, modify, merge, publish, distribute, sublicense,
### and/or sell copies of the Software, and to permit persons to whom the
### Software is furnished to do so, subject to the following conditions:
### 
### The above copyright notice and this permission notice shall be included
### in all copies or substantial portions of the Software.
### 
### THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
### IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
### FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
### THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
### OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
### ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
### OTHER DEALINGS IN THE SOFTWARE.
###

###
### Global variables
###

# The TCP port number that we listen to for balancing queries
set BalanceTcpPort 5656

# The TCP port number that we listen to for status queries
set StatusTcpPort 5657

# The TCP port number that we listen to for distcc server status updates.
set StatusUpdateTcpPort 5658

# The distcc server list.  The inside lists are pairs of:
# 1. "hostname/IP address : distcc server TCP port number", just like
#    the DISTCC_HOSTS syntax
# 2. The maximum number of simultaneous distcc jobs (usually 1 job
#    per CPU).
#
# The hosts should be listed in order of fastest to slowest CPUs.
# We assume all CPUs on a host are the same speed.
set ServerList [list \
		    [list "snookles:3632" 1] \
		    [list "newbb:3632" 1] \
		    [list "10.1.1.143" 1] \
	       ]
set IdleServers 0;			# Will be set correctly later
set DownServers 0

# This is the waiting queue of sockets waiting for an idle server.
# Initialize to be an empty list.
set WaitQueue [list]
set WaitQueueLen 0

# ServerStatus is an array that will be initialized later.
# ServerUpState is an array that will be initialized later.
# ClientStatus is an array that will be initialized later.

# Verbose debugging output
set Verbose 1

###
### Procs
###

proc error_msg {msg} {
    puts stderr $msg
}

proc verbose {msg} {
    global Verbose

    if {$Verbose} {
	puts $msg
    }
}

proc add_to_wait_queue {tuple} {
    global WaitQueue WaitQueueLen
    
    lappend WaitQueue $tuple
    incr WaitQueueLen
}

proc remove_from_wait_queue {} {
    global WaitQueue WaitQueueLen
    
    if {$WaitQueueLen == 0} {
	return ""
    } else {
	incr WaitQueueLen -1
	set tuple [lindex $WaitQueue 0]
	set WaitQueue [lreplace $WaitQueue 0 0]
	return $tuple
    }
}

proc get_server_tuple {server_info_wanted} {
    global ServerList

    foreach pair $ServerList {
	set server_info [lindex $pair 0]
	if {$server_info == $server_info_wanted} {
	    return $pair
	}
    }
    return ""
}

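# Scheduling policy: scan ServerList in its configured fastest-first
# order and hand out the first idle slot on an "up" server.  Note that
# IdleServers counts idle slots on down servers too, which is why the
# availability test subtracts DownServers.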
proc get_idle_server {} {
    global ServerList IdleServers DownServers ServerStatus
    global ServerUpState

    if {$IdleServers - $DownServers <= 0} {
	return ""
    }
    foreach pair $ServerList {
	set server_info [lindex $pair 0]
	if {$ServerStatus($server_info) > 0 &&
	    $ServerUpState($server_info) == "up"} {
	    incr ServerStatus($server_info) -1
	    incr IdleServers -1
	    return $server_info
	}
    }
    error "get_idle_server: should never happen"
}

proc put_idle_server {server_info} {
    global ServerList IdleServers ServerStatus

    verbose "put_idle_server: $server_info is now idle"
    incr ServerStatus($server_info)
    incr IdleServers
}

proc save_client_state {client_s client_addr client_port server_info} {
    global ClientStatus

    set ClientStatus($client_s) [list $client_addr $client_port $server_info]
}

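# The client's open TCP connection is its "lease" on a server slot:
# EOF from the client means the job is finished, so release the slot
# and try to dispatch anyone who is waiting.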
proc sock_readable {sock} {
    global ClientStatus

    if {[eof $sock]} {
	set client_addr [lindex $ClientStatus($sock) 0]
	set client_port [lindex $ClientStatus($sock) 1]
	set server_info [lindex $ClientStatus($sock) 2]
	verbose "Sock $sock has closed: client_s = client_addr = $client_addr, client_port = $client_port, server_info = $server_info"

	unset ClientStatus($sock)
	put_idle_server $server_info

	catch {
	    fileevent $sock readable ""
	    close $sock
	}
	# Main loop only runs if no other clients are connected.
	# So, we usually need to trigger a new dispatch here.
	dispatch_waiting_clients
    } else {
	while {[read $sock 1024] != ""} {
	    verbose "Sock $sock sent us something..."
	}
    }
}

proc handle_new_client {client_s client_addr client_port} {
    verbose "client_s = $client_s, client_addr = $client_addr, client_port = $client_port"

    set server_info [get_idle_server]
    if {$server_info == ""} {
	verbose "No idle servers, append to wait queue"
	add_to_wait_queue [list $client_s $client_addr $client_port]
	return
    }
    save_client_state $client_s $client_addr $client_port $server_info

    catch {
	puts $client_s $server_info
	flush $client_s
    }
    
    fconfigure $client_s -blocking 0 -buffering none -translation binary
    fileevent $client_s readable [list sock_readable $client_s]

    verbose "handle_new_client: end"
}

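# Drain the wait queue: hand queued clients to servers until we run
# out of usable idle slots or out of waiting clients.  Called when a
# job completes and when a down server comes back up.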
proc dispatch_waiting_clients {} {
    global IdleServers DownServers

    verbose "dispatch_waiting_clients: top"
    while {$IdleServers - $DownServers > 0} {
	verbose "dispatch_waiting_clients: top of loop"
	set tuple [remove_from_wait_queue]
	if {$tuple == ""} {
	    break
	}
	set client_s [lindex $tuple 0]
	set client_addr [lindex $tuple 1]
	set client_port [lindex $tuple 2]
	verbose "dispatch_waiting_client: client_s = $client_s"
	handle_new_client $client_s $client_addr $client_port
    }
    verbose "dispatch_waiting_clients: end"
}

proc handle_new_status_client {client_s client_addr client_port} {
    global BalanceTcpPort StatusTcpPort
    global ServerList IdleServers DownServers
    global WaitQueue WaitQueueLen
    global ServerStatus ClientStatus
    global ServerUpState

    catch {
	puts $client_s "BalanceTcpPort = $BalanceTcpPort, StatusTcpPort = $StatusTcpPort"
	puts $client_s "ServerList = $ServerList"
	puts $client_s "IdleServers = $IdleServers"
	puts $client_s "DownServers = $DownServers"
	puts $client_s "WaitQueue = $WaitQueue"
	puts $client_s "WaitQueueLen = $WaitQueueLen"
	puts $client_s "ServerStatus (available distcc servers) = [array get ServerStatus]"
	puts $client_s "ServerUpState (distcc server up/down status) = [array get ServerUpState]"
	puts $client_s "ClientStatus (who's connected to what) = [array get ClientStatus]"
	flush $client_s
	close $client_s
    }
}

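# Status-update protocol: one line per TCP connection, of the form
# "server_info status", where status is "up" or "down".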
proc handle_new_statusupdate_client {client_s client_addr client_port} {
    global ServerUpState
    global DownServers

    set line [gets $client_s]
    set l [split $line]
    set server_info [lindex $l 0]
    set status [lindex $l 1]
    # Ignore updates for hosts that aren't in ServerList; an unguarded
    # lookup of ServerUpState would raise a background error.
    if {![info exists ServerUpState($server_info)]} {
	close $client_s
	return
    }
    if {$ServerUpState($server_info) == "down" && $status == "up"} {
	set tuple [get_server_tuple $server_info]
	if {$tuple != ""} {
	    verbose "State change: $server_info was down, now up"
	    set ServerUpState($server_info) "up"
	    set max [lindex $tuple 1]
	    incr DownServers "-$max"
	    dispatch_waiting_clients
	}
    }
    if {$ServerUpState($server_info) == "up" && $status == "down"} {
	set tuple [get_server_tuple $server_info]
	if {$tuple != ""} {
	    verbose "State change: $server_info was up, now down"
	    set ServerUpState($server_info) "down"
	    set max [lindex $tuple 1]
	    incr DownServers $max
	}
    }
    close $client_s
}

###
### Argument handling and startup tasks
###

if {[catch {set Balance_s [socket -server handle_new_client $BalanceTcpPort]} cv]} {
    error_msg "Fatal error initializing TCP port $BalanceTcpPort: $cv"
    exit 1
}
if {[catch {set Status_s [socket -server handle_new_status_client $StatusTcpPort]} cv]} {
    error_msg "Fatal error initializing TCP port $StatusTcpPort: $cv"
    exit 1
}
if {[catch {set Update_s [socket -server handle_new_statusupdate_client $StatusUpdateTcpPort]} cv]} {
    error_msg "Fatal error initializing TCP port $StatusUpdateTcpPort: $cv"
    exit 1
}

###
### Initialize the ServerStatus array.
###

set sum 0
foreach pair $ServerList {
    set server_info [lindex $pair 0]
    set max [lindex $pair 1]
    set ServerStatus($server_info) $max
    set ServerUpState($server_info) up	;# External source gives host down info
    incr sum $max
}
set IdleServers $sum

###
### Start running the Tcl event loop
###

vwait forever
error "NOT REACHED"

--- balance-status-daemon.tcl ---

#!/bin/sh
# Trick to get next line ignored by tclsh \
    exec /usr/bin/tclsh "$0" ${1+"$@"}

###
### Distcc load balancing status daemon.
### Copyright (c) Scott Lystig Fritchie, 2004.
### All rights reserved.
### 
### Permission is hereby granted, free of charge, to any person obtaining a
### copy of this software and associated documentation files (the "Software"),
### to deal in the Software without restriction, including without limitation
### the rights to use, copy, modify, merge, publish, distribute, sublicense,
### and/or sell copies of the Software, and to permit persons to whom the
### Software is furnished to do so, subject to the following conditions:
### 
### The above copyright notice and this permission notice shall be included
### in all copies or substantial portions of the Software.
### 
### THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
### IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
### FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
### THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
### OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
### ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
### OTHER DEALINGS IN THE SOFTWARE.
###

set LbHost [lindex $argv 0]
set LbStatusPort [lindex $argv 1]
set LbStatusUpdatePort [lindex $argv 2]
set TestInterval 10
set Verbose 1

proc verbose {msg} {
    global Verbose

    if {$Verbose} {
	puts $msg
    }
}


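# Forever: fetch the server list from the balance-daemon's status
# port, probe each distcc server with a bare TCP connect, and report
# up/down results back on the status-update port.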
while {1} {
    if {[catch {
	set sock [socket $LbHost $LbStatusPort]
	set stuff [read $sock]
	close $sock
	regexp {ServerList = ([^\n]*)\n} $stuff dummy server_list
	foreach server $server_list {
	    set server_info [lindex $server 0]
	    set si [split $server_info ":"]
	    if {[lindex $si 1] == ""} {
		set port 3632
	    } else {
		set port [lindex $si 1]
	    }
	    if {[catch {
		set sock [socket [lindex $si 0] $port]
		close $sock
	    }]} {
		verbose "$server_info is down"
		set status down
	    } else {
		verbose "$server_info is up"
		set status up
	    }
	    set sock [socket $LbHost $LbStatusUpdatePort]
	    puts $sock "$server_info $status"
	    close $sock
	}
    } cv]} {
	puts stderr "Catch: $cv"
    }
    after [expr {$TestInterval * 1000}]
}

--- bal-wrap.tcl ---

#!/bin/sh
# Trick to get next line ignored by tclsh \
    exec /usr/bin/tclsh "$0" ${1+"$@"}

set LbHost localhost
set LbPort 5656

#
# There are three things that we don't do well:
#
# 1. We mix the compiler's stderr into stdout
# 2. We don't preserve the exit status of the compiler.
# 3. We assume that the load balancing service is always available.
#
# However, this is meant to be a demo only, so I
# (at least) can live with the limitations.
#

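# Ask the balance-daemon for a server assignment.  Keep this socket
# open for the entire compile: closing it is what tells the daemon
# that the server's slot is free again.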
set sock [socket $LbHost $LbPort]
set server_info [gets $sock]
set env(DISTCC_HOSTS) $server_info

set cmd $argv
catch {set out [eval "exec $cmd 2>@ stdout"] ; puts $out}
catch {close $sock}
exit 0


