Thoughts on pre-production testing

Wed Oct 23 22:38:18 GMT 2002

Please take what I am about to say as a comment made by one computing
professional to other professionals, not as personal remarks. These are my
personal reflections from 30+ years of experience and not necessarily the
opinions of my employer. Obviously my thoughts are motivated by today's
discussions, but I hope to discuss this at a more abstract level.  I
participate in a number of open-source teams, and I have intimate software
maintenance experience through my work, and I believe that the Samba team is
the equal of any software development team anywhere. After factoring in the
complexity of the problem that Samba tries to solve, the Samba team is
better than any other team that I know about.

As computer professionals, we owe it to our clients to thoroughly test new
computing environments before releasing them to our users.  I've had the
privilege of working for computer vendors since 1973, and during that time I
have visited or spoken with hundreds of customers in a wide variety of
enterprises.  I've served on, or led, many crisis-resolution teams.  I say
all of this so that you know that I have extensive experience in this area
and am not making up ideas out of whole cloth.

When a new version of a system or application is put into production and
fails in some manner, the manager of that system has no acceptable excuse
except "I screwed up."  I have heard all too many customers say "We are
unable to test the system to the level required by the production system."
This is a feeble excuse and shows a lack of training or experience. When I
hear one of our customers say it, I always challenge their statement. I
share stories about how other customers test their complex production
systems. I have rarely seen a case where a production system cannot be
tested in advance.  In almost all cases, the truth is closer to the
statement "We willingly risk failure on the production system to save the
expense of adequate testing."  People make all kinds of excuses: We can't
simulate the traffic; the system is too complex; we don't have the staff,
equipment, etc., etc., etc.  Balderdash.  Any incoming data flow can be
simulated or recorded and replayed.  Production systems can be broken down
into subsystems, and the subsystems can be thoroughly tested.  Backup
systems can be used to test new versions of production software or
configurations.  You can purchase extra equipment just for testing.  In some
cases you can temporarily bring up new versions on the production system
during off-peak hours.  With the appropriate level of effort, any system can
be tested.**

However, if you truly cannot test an application to the level
required by production, then it is incumbent upon you to design a roll-out
program that mitigates the risks when the inevitable failure occurs.  For
example, instead of a big-bang cut-over, bring up a small number of users
and slowly add more.  Pay end-users to come in during off-hours and bang on
a test version running on the real network.  Make up synthetic loads.  Break
down the conversion into smaller steps that can be done independently.  Of
course, regardless of how you conduct your testing, there is always a
nonzero probability of failure; we are human beings, after all; we do
occasionally screw up. Therefore, design (and test!) a fall-back plan so
that you have criteria and a procedure for reverting to the previous
version.  The least you can do is to gracefully and quickly revert to the
older version.
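
The gradual roll-out and fall-back plan described above can be sketched in a
few lines of Python.  Treat this as an illustration under stated assumptions,
not a recipe: the names staged_rollout, error_rate_at, and revert are
hypothetical, and the error-rate threshold is just one possible fall-back
criterion.  A real roll-out would measure whatever matters to the business.

```python
from typing import Callable, Sequence


def staged_rollout(stages: Sequence[int],
                   error_rate_at: Callable[[int], float],
                   max_error_rate: float,
                   revert: Callable[[], None]) -> bool:
    """Bring up the new version for progressively larger groups of users.

    Returns True if every stage met the criterion, False if we reverted.
    All parameters are illustrative assumptions, not part of any real tool.
    """
    for users in stages:
        # Hypothetical probe: run this stage and report the observed error rate.
        observed = error_rate_at(users)
        # Fall-back criterion, agreed on before the roll-out starts.
        if observed > max_error_rate:
            # Previously tested procedure for restoring the old version.
            revert()
            return False
    return True


if __name__ == "__main__":
    actions = []
    ok = staged_rollout(
        stages=[10, 100, 1000],
        # Made-up measurements: the new version misbehaves at 1000 users.
        error_rate_at=lambda users: 0.001 if users < 1000 else 0.05,
        max_error_rate=0.01,
        revert=lambda: actions.append("reverted to previous version"),
    )
    print(ok, actions)
```

The point of the sketch is that both the criterion and the revert procedure
exist, and have been exercised, before the first user is moved.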

If you have not tested your system to the maximum load level, how on earth
do you know where it will top out?  How do you know that it won't break
completely if it gets just a little busier?  These are digital systems,
after all, and can work perfectly at 10 transactions per second (tps), and
die horribly at 11 tps.  Applications and operating systems are
chock-a-block full of queues and hard limits, any one of which can be a
bottleneck or source of failure.  My experience as an operating system
engineer is that we find all sorts of interesting problems at maximum
loading.  In my opinion, if you haven't driven your system as hard as you
can, you aren't a computing professional.
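
To make the 10 tps versus 11 tps point concrete, here is a toy model in
Python.  Everything in it is an assumption chosen for illustration: a server
that completes a fixed number of transactions per second, a bounded queue, and
the hypothetical names simulate and find_breaking_point.  It shows how a hard
limit produces a cliff rather than a gentle slope, and how ramping the load
finds that cliff before production does.

```python
from typing import Callable, Optional


def simulate(rate_tps: int, capacity_tps: int = 10,
             queue_limit: int = 50, seconds: int = 60) -> bool:
    """Return True if the toy system survives `seconds` at `rate_tps`."""
    queued = 0
    for _ in range(seconds):
        queued += rate_tps                    # arrivals during this second
        queued -= min(queued, capacity_tps)   # work completed this second
        if queued > queue_limit:              # hard limit exceeded: failure
            return False
    return True


def find_breaking_point(survives: Callable[[int], bool],
                        max_rate_tps: int) -> Optional[int]:
    """Ramp the offered load 1 tps at a time; return the first rate that fails."""
    for rate in range(1, max_rate_tps + 1):
        if not survives(rate):
            return rate
    return None  # no failure found up to max_rate_tps


if __name__ == "__main__":
    print(simulate(10))                        # steady state, queue stays empty
    print(simulate(11))                        # backlog grows until the limit is hit
    print(find_breaking_point(simulate, 100))
```

In this model the backlog at 11 tps grows by one transaction per second, so a
run shorter than the time needed to fill the queue would pass.  That is one
reason to drive a load test both hard and long.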

People who follow good change control procedures, who thoroughly test their
software and systems before deployment, and who do maximum load testing,
rarely encounter business-threatening (and job-threatening) outages.  These
people are true professionals.  People who do not perform these steps should
not be surprised when they encounter problems.

** In a large enterprise, the department responsible for testing frequently
does not bear the economic impact of failure.  Due to a phenomenon that I
have dubbed "micro-optimization", the testing department is often denied the
resources to perform its work to a level appropriate to the task.  In my
view, this reflects a major managerial and organizational failure of the
entire enterprise. As computing professionals, we need to forge closer
working relationships with our end-users to build a case for adequate
testing.  It is still our fault, even if the cause is budgetary rather than
technological.  Many of the testing techniques that I outline above are
still of value, even in this situation.

Thanks
PG
--
Paul Green, Senior Technical Consultant, Stratus Technologies.
Voice: +1 (978) 461-7557; FAX: +1 (978) 461-3610
Speaking from Stratus not for Stratus