On software quality and engineering

Michael Bennett mr_b at tpg.com.au
Sat Nov 2 04:18:30 EST 2002


On Saturday 02 Nov 2002 9:55 am, Brad Hards wrote:
> A concept that has been missed in all of this is that all things are
> designed to meet acceptable risk. The Pinto example is where they
> mis-defined what level of risk was acceptable.
>
> Aircraft software is not designed to be bug free. It is designed to
> contribute less than an acceptable component of aircraft crashes. Typical
> sort of numbers are "1x10^-9 per hour for safety critical items". You can't
> test to that sort of number, so you go with a "process" approach.
>
> However most of the software on an aircraft isn't safety critical (RTCA
> DO-178B level A). So you design to a lower level of reliability (eg. the
> intercom is probably level C, and the in-flight entertainment system is
> probably level E - so it doesn't have any software process requirements).

I think the difference between safety criticality and reliability has been 
missed. If the software in the intercom fails and the intercom simply stops 
working, you may still have a safe, uneventful flight. That is reliability. 
If the software in the intercom fails in a way that causes the radio to 
transmit while you are refueling, and a spark blows the whole thing up, then 
you have a problem. That is safety criticality. In an aircraft everything is 
safety critical, which is why an aircraft won't take off until the 
documentation weighs the same as the aircraft.

The intercom would then be designed to a lower level of reliability, but be 
designed to fail safe.
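
To make the distinction concrete, here is a rough sketch of what "fail safe" 
can mean in code (the names are entirely hypothetical, not taken from any real 
intercom). When the software hits a fault it drives the radio to a known-safe 
state (transmitter released) rather than carrying on in an unknown one; the 
feature is lost, which is a reliability failure, but the hazard is avoided.

class Radio:
    """Hypothetical radio with a single keyed/unkeyed transmitter state."""
    def __init__(self):
        self.transmitting = False

    def key_transmitter(self):
        self.transmitting = True

    def release_transmitter(self):
        self.transmitting = False


def intercom_cycle(radio, audio_frame):
    """Send one intercom audio frame, never leaving the transmitter keyed."""
    radio.key_transmitter()
    try:
        if not audio_frame:
            raise ValueError("bad audio frame")  # simulated software fault
        # ... encode and transmit the frame here ...
    finally:
        # Fail safe: no matter what went wrong above, release the transmitter.
        radio.release_transmitter()


radio = Radio()
try:
    intercom_cycle(radio, b"")          # triggers the simulated fault
except ValueError:
    pass
print(radio.transmitting)               # False: the intercom failed, but safely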

> It is unrealistic to expect that complex systems will not fail. It is only
> realistic that a system fails at (or below) an acceptable level. Normally
> the risks are defined in terms of probability of failure (or partial
> performance) and the consequences of failure (or partial performance).

The reality is that physical systems fail. Their failure modes are well known 
and can be planned for.

Software systems do exactly what you tell them to do. The problem is that most 
people don't know what they want the software to do and just guess, which 
comes down to requirements and specifications. There are formal specification 
languages that can be used to mathematically prove properties of the 
specification. Most people don't use them because they take too much time and 
effort when they could be programming. I know some companies now use them for 
all software projects, as they can produce software with zero defects.
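
As a rough illustration (my own, not a description of how those companies 
work), the first step is simply writing the specification down as explicit 
pre- and post-conditions. Formal specification languages such as Z, VDM or B 
state these mathematically and allow the properties to be proved; the Python 
assertions below only check them at run time, but they show what even a 
trivial function's specification looks like.

def integer_sqrt(n):
    """Return the largest r such that r*r <= n, for n >= 0."""
    assert n >= 0, "precondition: n must be non-negative"
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    # Postcondition taken directly from the specification above.
    assert r * r <= n < (r + 1) * (r + 1), "postcondition violated"
    return r

print(integer_sqrt(17))  # 4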

Software has one failure mode: it is implemented in hardware. If the hardware 
malfunctions (or some radiation causes a bit to flip) then the software may 
not work as desired. However, this can be designed for.
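
One common way this gets designed for, sketched below (my own illustration, 
not a technique claimed anywhere in this thread), is to keep redundant copies 
of a critical value and take a majority vote before using it. A single 
corrupted copy is outvoted, and the absence of a clear majority can trigger a 
fall-back to a known-safe state.

from collections import Counter

def majority_vote(copies):
    """Return the value held by a majority of the redundant copies.

    Raises an error when no value has a clear majority, so the caller can
    fall back to a safe state instead of acting on corrupt data.
    """
    value, count = Counter(copies).most_common(1)[0]
    if count * 2 <= len(copies):
        raise RuntimeError("no majority - fall back to a safe state")
    return value

stored = [0x2A, 0x2A, 0x6A]   # third copy has a single flipped bit
print(majority_vote(stored))  # 42 (0x2A), the uncorrupted value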

> If the risk is low (not much chance of things going wrong, and it doesn't
> matter much if it does), then you don't apply as much rigour. If risk is
> high (either things have a good chance of failing, or the consequences of
> failure are serious), then you get people with appropriate qualifications,
> training and experience, and you set up a rigorous process environment.
>
> Does it really matter if your game crashes twice a week? Annoying - yes,
> important - no.

It depends on whom you ask.

> In the defence aviation process, the engineers get used for the up-front
> definition of requirements (specification), the risk assessment (judgement
> of significance) and the design review part on significant designs. You
> don't need a design engineer to conduct a simple fastener substitution.

You do need a design engineer to certify the substitute part. This does bring 
up the subject of configuration management.

> This might make a decent topic for November CLUG. Not sure if I'll be there
> at this stage, but I'm willing to present on this.
>
> Brad

Michael Bennett.


