[clug] [OT] Virgin Blue outage: All things come back to Open Source?

Stephen Walsh steve at nerdvana.org.au
Mon Sep 27 21:37:37 MDT 2010

  On 09/28/2010 11:44 AM, steve jenkin wrote:
> A Good write up of the recent Virgin Blue outage.
> Wonder if we'll ever hear the full facts?:
> <http://www.zdnet.com/blog/projectfailures/cloud-based-it-failure-halts-virgin-flights/11061>
> VB had contracted for service to be restored within 3 hours.
> I don't think them having clunky manual systems was unreasonable...

The biggest failure here was the support people deciding to "fix it"
instead of "flip it". If you've lost a system or hardware, migrate
services to working hardware/systems first, then try to fix it,
especially when it's disk related. Sure, 9 times out of 10 it's a matter
of running an fsck and it's back up long enough to be someone else's
problem when it fails next, but that 10th time is the catastrophic
failure that strands 50,000 customers.
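
A rough sketch of "flip it, then fix it", just to make the ordering
concrete. The Host class, flip_then_fix() and the repair callback are
all made-up names for illustration; none of this reflects how the actual
Virgin Blue systems were run.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Host:
    # Hypothetical stand-in for a server and the services it carries.
    name: str
    healthy: bool = True
    services: List[str] = field(default_factory=list)


def flip_then_fix(failed: Host, standby: Host,
                  repair: Callable[[Host], bool]) -> bool:
    """Move services to the standby first, then attempt the repair."""
    if not standby.healthy:
        raise RuntimeError("no healthy standby to flip to")
    # Flip: services are back on working hardware before any repair starts.
    standby.services.extend(failed.services)
    failed.services.clear()
    # Fix: the repair may still fail, but nothing depends on this host now.
    failed.healthy = repair(failed)
    return failed.healthy


primary = Host("primary", healthy=False, services=["checkin", "bookings"])
standby = Host("standby")
# Simulate the unlucky 10th time: the repair (say, an fsck) does not work.
repaired = flip_then_fix(primary, standby, lambda host: False)
```

Even when the repair fails, the services are already running on the
standby, so the failed repair costs time on one box and nothing else.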

> Searching for another piece, it seems MSFT had a little similar trouble
> just recently:
> <http://news.techworld.com/data-centre/3239053/microsoft-apologises-for-cloud-service-outage/?>

The first issue was a broken network config more than anything else. The
follow-on issues are the usual ones for a distributed system, and MSFT is
well known for running the latest code on their hosted platforms (cf.
Live@edu vs exchangelabs for tertiary email).

> There was this major outage on 09-Oct-2009: T-Mobile and Sidekick
> <http://en.wikipedia.org/wiki/Microsoft_data_loss_2009>
> <http://en.wikipedia.org/wiki/Danger_Hiptop#Data_service_outage_2009>

This was caused by MSFT not making an acquisition's backend compatible
with theirs and, in the process of trying to do so, killing not only the
primary production database but also the primary backup database. Another
victim of the "we didn't make it, it ain't broke, but how hard can it
be...oops" method of handling service assimilation.

> It seems the operations are:
>   - too complex for 'mere admins' and
>   - the system designers aren't building in robustness

I don't think it's that. The first and last ones come down to pure human
traits (ego/laziness); the MSFT '10 one is partly another human trait
(ego) and partly a side effect of eating your own dog food.
