## Thursday, May 28, 2015

### Six sigma and the Higgs Boson: a convoluted way of expressing unlikeliness

A few years ago IBM asked me to help them calculate "sigma levels" for some of their business processes. Sigma levels are part of the "Six Sigma" approach to monitoring and improving business quality developed by Motorola in 1986, and since used by numerous consultants right across the world to package well-known techniques in order to con money out of gullible businesses.

The name, of course, was an important factor in helping the Six Sigma doctrine to catch on. It is mysterious, with a hint of Greek, both of which suggest powerful, but incomprehensible, maths, for which the help of expensive consultants is obviously needed.

Sigma is the Greek letter "s", which stands for the standard deviation - a statistical measure of the variability of a group of numerical measurements. Sigma levels are a way of relating the number of defects produced by a business process to the variability of the output of the process. The details are irrelevant for my present purposes except in so far as the relationship is complicated, involves an arbitrary input, and in my view is meaningless. (If you know about the statistics of the normal distribution and its relation to the standard deviation you will probably be able to reconstruct part, but only part, of the argument. You should also remember that it is very unlikely that the output measurements will follow the normal distribution.)

The relationship between sigma levels and defect rates can be expressed as a mathematical formula which gives just one sigma level for each percent defective, and vice versa. Some examples are given in the table below, which is based on the Wikipedia article as it stood on 25 April 2015 - where you will be able to find an explanation of the rationale.

(An Excel formula for converting percent defective to sigma levels is =NORMSINV(100%-pdef)+1.5, and for converting sigma levels to percent defective is =1-NORMDIST(siglev-1.5,0,1,TRUE) where pdef is the percent defective and siglev is the sigma level. The arbitrary input is the number 1.5 in these formulae. So, for example, if you want to know the sigma level corresponding to a percent defective of 5%, simply replace pdef with 5% and put the whole of the first formula including the = sign into a cell in Excel. Excel will probably format the answer as a percentage, so you need to reformat it as an ordinary number. The sigma level you should get is 3.14.)
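
For anyone without Excel, the same two conversions can be sketched in Python. This is only an illustration of the formulae above: the function names `sigma_level` and `fraction_defective` are made up for the purpose, the percent defective is passed as a fraction (0.05 for 5%), and it assumes Python 3.8 or later, where `statistics.NormalDist` is available.

```python
from statistics import NormalDist

SHIFT = 1.5  # the arbitrary input from the Excel formulae

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def sigma_level(pdef, shift=SHIFT):
    """Counterpart of =NORMSINV(100%-pdef)+1.5, with pdef given as a fraction."""
    return std_normal.inv_cdf(1 - pdef) + shift


def fraction_defective(siglev, shift=SHIFT):
    """Counterpart of =1-NORMDIST(siglev-1.5,0,1,TRUE)."""
    return 1 - std_normal.cdf(siglev - shift)


# 5% defective should come out at a sigma level of about 3.14
print(round(sigma_level(0.05), 2))

# A sigma level of 6, expressed as defectives per million opportunities
print(fraction_defective(6) * 1_000_000)
```

Passing `shift=0` to either function drops the arbitrary 1.5 and leaves the plain normal distribution calculation.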

| Sigma level | Percent defective | Defectives per million opportunities |
|---|---|---|
| 1 | 69.1462461274% | 691462.4613 |
| 2 | 30.8537538726% | 308537.5387 |
| 3 | 6.6807201269% | 66807.20127 |
| 4 | 0.6209665326% | 6209.665326 |
| 5 | 0.0232629079% | 232.629079 |
| 6 | 0.0003397673% | 3.397673134 |
| 7 | 0.0000018990% | 0.018989562 |
| 2.781552 | 10% | 100000 |
| 3.826348 | 1% | 10000 |
| 4.590232 | 0.10% | 1000 |
| 5.219016 | 0.01% | 100 |
| 5.764891 | 0.0010000000% | 10 |
| 6.253424 | 0.0001000000% | 1 |
| 6.699338 | 0.0000100000% | 0.1 |

But what, you may wonder, is the point in all this? In mathematics, you normally start with something that is difficult to understand, and then try to find something equivalent which is easier to understand. For example, if we apply Newton's law of gravity to the problem of calculating how far (in meters, ignoring the effect of air resistance) a stone will fall in five seconds, we get the expression:

$$\int_0^5 9.8\,t\,dt$$

If you know the appropriate mathematics, you can easily work out that this is equal to 122.5. The original expression is just a complicated way of saying 122.5.
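
Reading the expression as the integral of 9.8t over the first five seconds (an assumption about the intended integrand, chosen because it reproduces the 122.5 quoted above), the intermediate steps would be something like:

$$\int_0^5 9.8\,t\,dt = \left[4.9\,t^2\right]_0^5 = 4.9 \times 25 = 122.5$$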

The curious thing about sigma levels is that we are doing just the opposite: going from something that is easy to understand (percent defective) to something that is difficult to understand (sigma levels), and arguably makes little sense anyway.

In defence of sigma levels you might say that defect levels are typically very small, and it is easy to get confused about very small numbers. The numbers 0.0001% and 0.001% may look similar, but one is ten times as big as the other: if the defect in question leads to the death of a patient, for example, the second figure implies ten times as many deaths as the first. Which does matter. But the obvious way round this is to use something like the defectives per million opportunities (DPMO) as in the above table - the comparison then is between 1 defective and 10 defectives. In sigma levels the comparison is between 6.25 and 5.76 - but there is no easy interpretation of this except that the first number is larger than the second, implying that the first represents a greater unlikelihood than the second. There is no way of seeing that deaths are ten times as likely in the second scenario, which the DPMO figures make very clear.
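
The contrast can be made concrete with a few lines of Python. This is a rough sketch rather than anything official: `sigma_level` is a hypothetical helper that follows the Excel formula with its 1.5 shift, and `statistics.NormalDist` needs Python 3.8 or later.

```python
from statistics import NormalDist


def sigma_level(pdef, shift=1.5):
    """Sigma level for a defect rate given as a fraction (hypothetical helper)."""
    return NormalDist().inv_cdf(1 - pdef) + shift


rates = [0.000001, 0.00001]  # 0.0001% and 0.001% written as fractions

dpmo = [p * 1_000_000 for p in rates]     # defectives per million opportunities
levels = [sigma_level(p) for p in rates]  # the corresponding sigma levels

dpmo_ratio = dpmo[1] / dpmo[0]    # how many times worse the second rate is
level_gap = levels[0] - levels[1]  # all that the sigma levels show

print(f"DPMO: {dpmo[0]:.0f} versus {dpmo[1]:.0f} (ratio {dpmo_ratio:.0f})")
print(f"Sigma levels: {levels[0]:.2f} versus {levels[1]:.2f} (gap {level_gap:.2f})")
```

The DPMO ratio comes out at ten, whereas the sigma levels differ by a little under 0.5, a gap that says nothing obvious about the tenfold difference in deaths.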

So why sigma levels? The charitable explanation is that it's the legacy of many years of calculating probabilities by working with sigmas (standard deviations), so that the two concepts have become inseparable. Except, of course, that for non-statisticians they aren't connected at all: one is obviously meaningful and the other is gibberish.

The less charitable explanation is that it's a plot to mystify the uninitiated and keep them dependent on expensive experts.

Is it stupidity or a deliberate plot? Cock-up or conspiracy? In general I think I favour the cock-up theory, partly because it isn't only the peddlers of the Six Sigma doctrine who are wedded to sigma mystification. The traditional way of expressing quality levels is the capability index Cpk - this is another convoluted way of converting something which is obvious into something which is far from obvious. The rot had set in long before Six Sigma.

And it's not just quality control. When the Higgs Boson particle was finally detected by physics researchers at CERN, the announcement was accompanied by a sigma level to express their confidence that the results could not be put down to chance alone:
"...with a statistical significance of five standard deviations (5 sigma) above background expectations. The probability of the background alone fluctuating up by this amount or more is about one in three million" (from the CERN website in April 2015. The sigma level here does not involve the arbitrary input of 1.5 in the Excel formulae above: this should be replaced by 0 to get the CERN results.)

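The CERN figure can be checked in the same way. The sketch below assumes the 5 sigma refers to a one-sided tail of the standard normal distribution with no 1.5 shift, as the note above suggests; `tail_probability` is a made-up name, and `statistics.NormalDist` needs Python 3.8 or later.

```python
from statistics import NormalDist


def tail_probability(siglev, shift=0.0):
    """Chance of a standard normal value exceeding siglev - shift (hypothetical helper)."""
    return 1 - NormalDist().cdf(siglev - shift)


p = tail_probability(5)  # shift of 0, as for the CERN result
one_in = 1 / p           # the same probability as "one in so many"

print(f"probability {p:.3g}, or about one in {one_in:,.0f}")
```

This gives a little under one in 3.5 million, in line with the "about one in three million" quoted. Calling `tail_probability(5, shift=1.5)` instead reproduces the 0.023% in the Six Sigma table.
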
Why bother with the sigma level? The one in three million figure surely expresses it far more simply and far more clearly.