SPECIAL SUMMER READING EDITION: LIES, DAMN LIES, AND STATISTICS
In the past months F2C has published three blog articles helping readers to understand the junk science behind many public policies.
Public policy formulation and implementation may often be based on flawed research methods and skewed analysis. Rather than following a sound scientific approach and giving sound evidence, this type of research seeks to please its various funders, grab headlines, and back up a favoured policy or a particular line.
Very few people, let alone politicians and decision-makers, understand statistical methods: in particular, the mathematics behind calculating risk and probability. If you read these articles it will help you to identify junk science.
The three articles below are well worth reading in depth. The last two require some concentration if you are not already a mathematician but are very logical.
Scientists in general and epidemiologists in particular are on trial. At stake are their reputations and their good names. In fact it is far more serious than that: the reputation of higher education is also at risk. The fallout from junk science will undermine and discredit both scientists and universities worldwide.
Second Hand Smoke, also known as Environmental Tobacco Smoke (ETS), is proving to be very lucrative for scientists and universities, but not for the working people whose livelihoods depended on bars staying open, and not for taxpayers who find themselves paying ever-increasing tax bills to support an ever-growing unemployment line.
So who says Second Hand Smoke science is fraudulent? It seems to be the view of the Brussels Declaration whose conclusion in part states:
“No epidemiological study has ever measured actual lifetime doses of ETS, nor lifetime exposures to ETS. No study has determined the recall bias of people with lung cancer. No study could guarantee that some self-declared non-smokers were in fact or had been smokers. No study could exclude the possibility that the lung cancers observed might have been caused by many known lung cancer risks and thus not by ETS. Plausible publication biases were not accounted for. Most studies did not report statistical differences of risk, and some implied a reduction of risk. In a nutshell, the primary data, their statistical analyses, and the claimed lung cancer risks of epidemiological studies of ETS are illusory, and by extension the ETS risks claimed by the SRG are equally illusory.”
Yet the Tobacco Control Industry has forced through legislation based on such studies!
In opening we said that both scientists and universities could lose their reputations and their good name. The case we want to draw attention to is that of Enstrom v University of California. At the heart of this is the dismissal of Professor James Enstrom and the forfeiture of his research grants. As the case is currently sub judice it would be improper to refer to the actual proceedings, although Carl Phillips has published some background information.
Professor Enstrom was half of the team (the co-writer being Professor Geoffrey C. Kabat) who wrote an epidemiological study published in the British Medical Journal in July 2003. The paper - which stated that there was actually no harm from Second Hand Smoke - was criticised by Stanton Glantz of Action on Smoking and Health (and also a member of the faculty at the University of California), who attempted but failed to have the paper removed from the BMJ. The full story of the attempts to censor the Enstrom/Kabat paper is to be found here.
For further reading, see Enstrom's website The Scientific Integrity Institute.
(Photo from http://photobucket.com/images/numerology)
From Wikipedia: “In epidemiology, a risk factor is a variable associated with an increased risk of disease or infection.”
Generally it is a disease or infection which results in death, lung cancer for example. A risk factor is determined by comparing two otherwise identical population samples, where one is exposed to the 'risk factor' and the other is not.
Suppose that we have one sample exposed to a risk factor (we'll call it F1), and that 200 deaths occur in that sample compared to 100 deaths in the non-exposed sample. Assuming both samples are the same size, that represents a relative risk (RR) of 2 (twice as many deaths).
Correlation does not imply causation, but if we accept that the relationship is causal then the extra 100 deaths can be attributed to risk factor F1: that is, 50% of the deaths in the exposed sample.
Such attribution sounds eminently reasonable and that is generally how epidemiologists and the media report them.
Next, suppose that our valiant epidemiologists conduct a similar trial but on another risk factor (we'll call that F2) and find the same RR. So in this new study, 50% of deaths are attributed to F2.
So consider what happens in a population that is exposed to both F1 and F2. Taking the above figures at face value, we can attribute 50% of deaths to F1 and 50% to F2. So we've accounted for 100% of deaths.
With 3 risk factors we'll end up accounting for 150% of all deaths. And the more risk factors that you add in, the crazier it gets.
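The arithmetic above can be checked in a few lines. This is only a sketch of the naive calculation being criticised, not a recommended method: it assumes every factor has the same RR of 2 and uses the simple share (RR - 1) / RR, which gives the 50% figure from the F1 example. The function name is made up for illustration.

```python
def attributable_fraction(rr: float) -> float:
    """Share of deaths in the exposed group blamed on the factor: (RR - 1) / RR."""
    return (rr - 1.0) / rr


# Figures from the F1 example: 200 deaths exposed, 100 deaths non-exposed,
# with both samples assumed to be the same size.
rr = 200 / 100
print(f"RR = {rr}, share attributed to F1 = {attributable_fraction(rr):.0%}")

# Naively add up the shares for several factors that each have RR = 2.
for n_factors in (1, 2, 3, 10):
    total = n_factors * attributable_fraction(2.0)
    print(f"{n_factors} risk factor(s): {total:.0%} of deaths 'accounted for'")
```

With ten such factors the naive total reaches 500% of deaths, which is the absurdity the text describes.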
The number of possible risk factors is to all intents and purposes infinite, some identified, some not. Epidemiologists have already identified over 100 for lung cancer and over 500 for heart disease.
There is a formula for 'attributable risk' that tries to avoid these pitfalls. However, this makes several assumptions, notably that ALL risk factors and the interactions between them have been identified. This is of course impossible.
The eminent psychologist Hans Eysenck described this as 'Alice in Wonderland Arithmetic' (on pages 8 and 9 of his book).
This was also brought up with an additional twist in the McTear vs ITL court case:
[5.147] ..." There is a tendency to think that the sum of the fractions of disease attributable to each of the causes of disease should be 100%. For example, in their widely cited work, The Causes of Cancer, Doll and Peto (1981: Table 20) created a table giving their estimates of the fraction of all cancers caused by various agents; the total for the fractions was nearly 100%. Although they acknowledged that any case could be caused by more than one agent (which would mean that the attributable fractions would not sum to 100%), they referred to this situation as a 'difficulty' and an 'anomaly'. It is, however, neither a difficulty nor an anomaly, but simply a consequence of allowing for the fact that no event has a single agent as the cause. The fraction of disease that can be attributed to each of the causes of disease in all the causal mechanisms has no upper limit: For cancer or any disease, the upper limit for the total of the fraction of disease attributable to all the component causes of all the causal mechanisms that produce it is not 100% but infinity. Only the fraction of disease attributable to a single component cause cannot exceed 100%."
Here is table 20 from 'The Causes of Cancer' by Doll and Peto (1981) - Oxford Medical Publications. Reproduced here for public interest and fair comment.
They also included a further figure (table 21) showing published figures by other researchers:
Notice how in each case the figures total to 100%. Also how 'infection' is either very low or non-existent in these tables.
This is where the oft-quoted claim that 50-70% of cancers are caused by lifestyle and are preventable comes from. CRUK still peddle this.
The fact that the main cause of cervical cancer, and (probably) of a high proportion of head and neck cancers, is now known to be the human papillomavirus (HPV) seems to have passed them by.
(Sigma is the Greek letter σ)
Medical science and especially epidemiology treat 2-sigma as a kind of 'gold standard'. This article attempts to explain what that means and why it is a scandal.
There has always been a magical appeal in the notion of extracting information from apparently random events. Early examples include the 'I Ching'. However, in recent centuries, the development of probability theory has provided a rigorous mathematical understanding of the nature of chance, so providing a basis from which to develop methods for analysing statistical data.
Amongst others, Sir Ronald Fisher, otherwise known as the 'father of modern statistics', developed a method for designing statistical tests and arriving at a result which would provide a level of 'confidence' in the conclusion. The calculations involved in these tests were complex and he envisioned them being deployed by professional, qualified statisticians equipped with log tables and slide rules. He suggested that a confidence level of 95% might be a useful benchmark for a one-off randomised controlled trial. Without going into detail, this equates to 2-sigma (1.96 to be precise).
There are many different types of statistical tests that can use these methods. However, the most common one in junk science is a measure called 'relative risk' (RR). Note that some studies calculate an 'odds ratio' (OR) instead. The OR is very similar to an RR but will generally produce a more extreme figure (larger when the RR is above 1).
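To see how the two measures differ, here is a small sketch using the standard 2x2 table definitions. The counts are invented purely for illustration and do not come from any study.

```python
def relative_risk(a: int, b: int, c: int, d: int) -> float:
    """a, b = exposed cases / non-cases; c, d = non-exposed cases / non-cases."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds of being a case when exposed, divided by the odds when not exposed."""
    return (a * d) / (b * c)


# Hypothetical counts: a common outcome (20% vs 10%) and a rare one (0.2% vs 0.1%).
common = (200, 800, 100, 900)
rare = (20, 9980, 10, 9990)

for label, table in (("common outcome", common), ("rare outcome", rare)):
    print(f"{label}: RR = {relative_risk(*table):.3f}, OR = {odds_ratio(*table):.3f}")
```

Both tables have an RR of 2, but the OR is noticeably larger for the common outcome and almost identical for the rare one.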
The idea is that you compare the incidence of something (e.g. disease or death) in a group exposed to a substance under test, with a group that is not exposed (the control group). The resulting ratio is the RR. This is expressed as a point estimate, along with a range of values within which there is confidence that the real value lies. This is called a confidence interval, normally written as “CI = (n to m) at 95%”.
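As a rough illustration, the point estimate and its interval might be calculated as below. The text does not say which formula is used, so this sketch assumes the common textbook approach of working on the logarithm of the RR; the counts and the function name are hypothetical.

```python
import math


def rr_with_ci(cases_exposed: int, n_exposed: int,
               cases_control: int, n_control: int, z: float = 1.96):
    """Return (RR, lower, upper) using the usual log-scale approximation."""
    rr = (cases_exposed / n_exposed) / (cases_control / n_control)
    # Approximate standard error of log(RR) for two independent groups.
    se_log = math.sqrt(1 / cases_exposed - 1 / n_exposed
                       + 1 / cases_control - 1 / n_control)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Hypothetical trial: 30 cases in 1000 exposed, 20 cases in 1000 controls.
rr, lower, upper = rr_with_ci(30, 1000, 20, 1000)
print(f"RR = {rr:.2f}, CI = ({lower:.2f} to {upper:.2f}) at 95%")
```

In this made-up example the interval includes 1, so although the point estimate is 1.5 the result would not count as significant at 95%.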
To run a statistical test it is critical that other factors are excluded. Fisher laid great emphasis on this. The only way to exclude other factors is to run a 'randomised controlled double blind trial'. 'Randomised/controlled' means that the test population is split into an 'exposed' group and a 'non-exposed' group (the control group), and that the selection of each group is not biased by any other factors, known or unknown, hence 'random'. 'Double blind' refers to the important issue that neither the researcher nor the subject knows who is in which group, at least until the trial ends.
It is also vital that the trial is fully defined before it starts. For example if a trial is specified as running for 1 year then it must be run for one year and neither cut short nor extended.
To understand the importance of this, consider a drunken walk. The drunk randomly takes a step to either right or left. So a trial to see if he is biased to the right or left might consist of looking at say 100 steps and reporting based on where he ended up. If the researchers have the option of earlier or later termination then they could simply wait until he was a long way off centre and report that as their result. Obviously he might well veer a long way off centre during the trial but that is not a valid test. The only thing that is valid is his final position at the pre-specified end.
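The drunken walk is easy to simulate. The sketch below assumes an unbiased drunk taking unit steps, and compares where he ends up with the furthest point he reached along the way; the function name and the number of repetitions are arbitrary choices for illustration.

```python
import random


def drunken_walk(steps: int = 100, seed=None):
    """Return (final position, furthest position reached) for an unbiased walk."""
    rng = random.Random(seed)
    position = 0
    furthest = 0
    for _ in range(steps):
        position += rng.choice((-1, 1))  # one step left or right, equally likely
        if abs(position) > abs(furthest):
            furthest = position
    return position, furthest


# Average over many walks: distance at the pre-specified end versus the most
# extreme distance a researcher could have picked by stopping when it suited.
walks = [drunken_walk(100, seed=i) for i in range(1000)]
mean_final = sum(abs(final) for final, _ in walks) / len(walks)
mean_furthest = sum(abs(far) for _, far in walks) / len(walks)
print(f"mean distance at step 100: {mean_final:.1f}")
print(f"mean furthest distance:    {mean_furthest:.1f}")
```

The furthest excursion is never smaller than the final distance, so a researcher free to choose the stopping point will always be able to report at least as large an apparent bias.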
The trouble is that, with the advent of computers, anyone can perform such a test. All you need is a statistical package or even a humble spreadsheet (see below). What's more, epidemiologists don't bother with 'randomised controlled trials' and instead just observe data from the real world with all its inherent complexity. This introduces confounding factors and, potentially, researcher bias.
So whereas Fisher anticipated a small number of experienced statisticians performing occasional tests and interpreting the results with care, we now have thousands of epidemiologists, with little or no understanding of the statistical limitations, performing vast numbers of tests at the click of a button. It is no wonder that there is so much junk science around! At best, it is to be expected that 5% of their results are in error; in reality, many more are. This is partly because they can decide which results to publish and partly for many other reasons, including premature termination (examined below).
There is a very simple improvement that could be made though. Particle physicists insist on at least 3, 4 or 5-sigma tests. The difference in confidence is enormous. A 3-sigma test gives a confidence level of 99.7%, 4-sigma 99.99% and 5-sigma 99.9999%.
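These levels can be reproduced from the normal distribution. The sketch below assumes the usual two-sided convention (the chance that a normally distributed value lies within k sigma of the mean) and uses only the standard library.

```python
import math


def two_sided_confidence(sigma: float) -> float:
    """Probability that a normal variable falls within +/- sigma of its mean."""
    return math.erf(sigma / math.sqrt(2))


for k in (1.96, 2, 3, 4, 5):
    confidence = two_sided_confidence(k)
    print(f"{k}-sigma: confidence {confidence:.5%}, "
          f"chance of a fluke about 1 in {1 / (1 - confidence):,.0f}")
```

Reading the results as '1 in N' makes the gap plainer: roughly 1 in 20 at 1.96-sigma against well over 1 in a million at 5-sigma.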
So how to design such tests?
All you need to do is to increase the sample size. If you have seen the results of a trial that is at 2-sigma (95%) then, treating that as suggestive, you simply try your own version but with roughly double the sample size. Then, if there was any substance in the original, you should achieve a significant result at 3-sigma (99.7%).
Specifically, to replicate a 2-sigma test at 3-sigma, just multiply the sample size by 9/4 (3 squared divided by 2 squared). To move from 2 to 4 the factor is 16/4 and so forth.
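The rule of thumb is simple enough to put in a helper. This sketch assumes the true effect is the same size as in the original trial and that the standard error shrinks with the square root of the sample size; the names and the example sample size are hypothetical.

```python
import math


def sample_size_factor(old_sigma: float, new_sigma: float) -> float:
    """Multiplier for the sample size when re-testing at a stricter sigma level."""
    return (new_sigma / old_sigma) ** 2


def scaled_sample_size(n: int, old_sigma: float, new_sigma: float) -> int:
    """Sample size needed at new_sigma, rounded up to a whole subject."""
    return math.ceil(n * sample_size_factor(old_sigma, new_sigma))


# A hypothetical trial of 1000 subjects that just reached 2-sigma.
for target in (3, 4, 5):
    print(f"2 -> {target} sigma: factor {sample_size_factor(2, target):.2f}, "
          f"sample size {scaled_sample_size(1000, 2, target)}")
```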
For those who wish to verify this for themselves, here is a rather scruffy spreadsheet. It includes a little detail on the intermediate steps to calculating confidence intervals and RR/OR values, and also a little toolbox for calculating sample sizes so as to re-test at different confidence (sigma) levels.
Note that medics almost exclusively use 1.96 sigma. So there are hundreds of passive smoking studies all attempting to reach this magic number, when what should have been done was to treat the first few as indicative and then run a better test with a larger sample.
Pharmaceutical companies use these same flawed 'standards' for drug trials too. And there is no (advance) registry of drug trials to constrain their abuse of method.
See also: medcalc
Finally, please remember that all these tests can do is to find, or fail to find, correlations and correlation does not imply causation.
One of many flaws in the medical use of statistics is that researchers often have the freedom to terminate a trial early. That may not sound like such a big deal but in principle it could reduce the confidence level from 95% to more like 50%. Even at 3-sigma this is a problem because it could reduce the actual confidence level to around 95%. Of course, in the real world, researchers would not have quite as much flexibility as to when to stop, so the above figures should be seen as a purely theoretical worst-case scenario.
To verify this for yourself, try this rather rough and ready programme here. You will need to have 'java' enabled in your browser. Please feel free to examine, save and edit the code for your own use by viewing the html source on your browser ('view-source' in IE or 'tools-developer-page source' in Firefox), copy/pasting into a text document using windows notepad (or better still notepad++) and saving as e.g. myprog.html. Then you can edit it and test by opening the text file with your browser.
The programme runs 2000 trials of 5000 coin tosses each. Note that with more coin tosses, the confidence level decreases even further. The earliest termination allowed is after 20 tosses as it is only at that point that an approximation to the normal distribution can be said to exist (not that medics would worry about such niceties).
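For readers who would rather not dig through the page source, here is an independent sketch of the same experiment in Python. It is not the original programme, only a reconstruction from the description above: a fair coin, a check after every toss from the 20th onwards, and a normal approximation for the test. All names are made up for illustration.

```python
import math
import random


def looks_significant_early(n_tosses, min_tosses, z_crit, rng):
    """True if a fair coin ever appears biased at any allowed stopping point."""
    heads = 0
    for n in range(1, n_tosses + 1):
        if rng.random() < 0.5:
            heads += 1
        if n >= min_tosses:
            # Heads has mean n/2 and standard deviation sqrt(n)/2 for a fair coin.
            z = abs(2 * heads - n) / math.sqrt(n)
            if z >= z_crit:
                return True
    return False


def looks_significant_at_end(n_tosses, z_crit, rng):
    """True if a fair coin appears biased at the pre-specified end only."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return abs(2 * heads - n_tosses) / math.sqrt(n_tosses) >= z_crit


def false_positive_rates(n_trials=2000, n_tosses=5000, min_tosses=20,
                         z_crit=1.96, seed=0):
    """Return (rate with early stopping allowed, rate with a fixed end point)."""
    rng = random.Random(seed)
    early = sum(looks_significant_early(n_tosses, min_tosses, z_crit, rng)
                for _ in range(n_trials))
    fixed = sum(looks_significant_at_end(n_tosses, z_crit, rng)
                for _ in range(n_trials))
    return early / n_trials, fixed / n_trials


if __name__ == "__main__":
    early, fixed = false_positive_rates()
    print(f"false positives, stopping whenever it suits: {early:.1%}")
    print(f"false positives, fixed end point:            {fixed:.1%}")
```

The full run may take a few seconds in plain Python. Raising z_crit to 3 shows the effect on a 3-sigma test, and smaller values of n_trials or n_tosses give a quicker, rougher answer.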