Hors-Série IA 2020

Watch out, statistics! How can the mistrust be explained?

Statistics confused

Benjamin Disraeli, Great Britain’s prime minister during the second half of the 19th century, is credited with saying “There are three kinds of lies: lies, damned lies, and statistics”. The discipline was already deeply mistrusted at the time. Yet its role keeps growing in every field of activity.
Today, statistics are used for everything. They obsess our politicians daily (opinion polls) and help guide and justify their actions (road safety, employment…). They allow sociologists to analyze human groups and decode their behavior. They play a fundamental role in the sciences, where they help practitioners validate hypotheses and draw sound conclusions (medicine, physics…). They also help companies refine their offers, steer their strategies and target their markets better through market research.
Despite this central role, statistics face daily criticism and provoke debate. Every official statistic published sets off endless arguments and controversy (cf. recent figures on working time in France or on crime). Political polls are contested and mocked every day, especially by the people most directly concerned and especially when the results are not in their favor. The use of statistics in the human sciences is also very controversial. And even scientific statistics are now a subject of debate.
How can this mistrust be explained? Can we still trust survey statistics to guide choices and decisions in companies?

I describe, you deduce…

Statistics (from the Latin “status”, meaning state or condition) is divided into two distinct but complementary branches, descriptive statistics and inferential statistics:

- Descriptive methods try to build the most accurate picture possible of the population under study from a large number of observed elements. Usual indicators such as the mean, the standard deviation and the variance belong to descriptive statistics. But this branch also includes more sophisticated methods such as factor analysis.
- Inferential statistics aims instead at assessing the validity of hypotheses, detecting possible links between variables and generalizing from the observations analyzed. This branch includes hypothesis tests, analysis of variance, regression…
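To make the usual descriptive indicators concrete, here is a minimal sketch using Python's standard `statistics` module. The sample of ages is hypothetical and invented for illustration; it does not come from the article.

```python
import statistics

# Hypothetical sample: ages of ten survey respondents (invented for illustration).
ages = [23, 35, 41, 29, 52, 38, 47, 31, 26, 44]

mean_age = statistics.mean(ages)       # arithmetic mean
variance = statistics.pvariance(ages)  # population variance
std_dev = statistics.pstdev(ages)      # population standard deviation

print(f"mean={mean_age:.1f}, variance={variance:.1f}, std dev={std_dev:.1f}")
```

The population versions (`pvariance`, `pstdev`) are used here on the assumption that the ten values are the whole group being described. If they were a sample drawn from a larger population, `statistics.variance` and `statistics.stdev` would be the usual choice.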

Descriptive statistics try to summarize the characteristics of the populations studied, whereas inferential statistics aim to uncover hidden characteristics of these populations and the rules that can be derived from them.
All these methods rest on solid mathematical rules. However…

Bikini statistics

George Gallup, the famous American statistician regarded as the father of opinion polling, claimed “I could prove God statistically”. Aaron Levenstein, in an equally well-known line, said “Statistics are like a bikini. What they reveal is interesting. But what they hide is vital”. Indeed, statistics have always had the reputation of being malleable, able to say whatever we want them to say. Manipulation is clearly easy in this field, if only by omission. For instance, we can state that the average salary in a 200-person company is €3,200 even though 80% of the employees earn only €1,500 (the other 20% earning about €10,000). We can highlight a strong increase in the value of a product's sales while the company's market share is actually falling, because the market itself is growing fast.
Official statistics are the most heavily suspected. This is the case in pre-election periods, but it is also true more generally. Employment, crime, reoffending, poverty and price statistics are all regularly disputed.
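The salary example can be checked in a few lines. This is a minimal sketch using the article's figures (200 employees, 80% at €1,500, 20% at €10,000). The median is not mentioned in the article; it is added here only as one commonly used indicator that the average alone leaves out.

```python
import statistics

# Figures from the article: 200 employees, 80% earn 1,500 euros, 20% earn 10,000 euros.
salaries = [1500] * 160 + [10000] * 40

mean_salary = statistics.mean(salaries)      # the figure that gets announced
median_salary = statistics.median(salaries)  # what the "middle" employee earns (added for comparison)

print(f"mean:   {mean_salary:.0f} euros")
print(f"median: {median_salary:.0f} euros")
```

The mean is arithmetically correct, but a reader who is given only the mean has no way of knowing that four employees out of five earn less than half of it.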

But suspicion extends well beyond official figures and reaches the social sciences. Some sociologists deny the legitimacy of using statistics when dealing with human groups. They consider that the classifications and categorizations imposed by a statistical approach to phenomena introduce subjectivity and hinder the understanding of reality. They follow in the footsteps of the American ethnomethodologist Aaron Cicourel, who rejected crime statistics in the United States as early as the 1960s, arguing that they reflected the activity of police services rather than actual criminal activity.
According to Alain Desrosières (a French specialist in the history of statistics and a member of the corps of INSEE administrators), the statistical apparatus develops in step with a system of institutions. “This investment, comparable to the investment in a road network or a rail network, produces categories that become unavoidable”. Consequently, researchers' scope for action and their ability to render social realities tend to be limited. Just as a text is not merely a string of words and an image is not merely a succession of colored dots, social phenomena cannot be subdivided indefinitely in order to be better grasped. The detractors of statistics criticize its simplifying tendencies, which in their view harm our ability to grasp and understand our environment as a whole.

Paradox - Brainwashing

Beyond partial or partisan approaches, the statistical method holds a number of traps into which even experienced users can fall.

The British statistician Edward Simpson described one of them in 1951. According to his famous “Simpson's paradox”, a result observed in several separate groups can be reversed when the groups are combined. Here is an example: a company hires 60 men and only 16 women in one year. Is it a sexist company that discriminates, given that 79% of the hires went to men and only 21% to women?
Let us dig deeper. The company received 244 applications from men and 84 from women. 25% of the men were hired, against 19% of the women. We can state that women were statistically about 20% less likely to be hired, which may seem abnormal.
Let us dig deeper still. The company actually recruited in two rounds.

- In the first round, 190 men applied and 56 were hired (29%). 40 women applied and 12 were hired (30%).
- In the second round, 54 men and 44 women applied. The company hired 4 men (7%) and 4 women (9%).

In each round the company hired a higher percentage of women than of men. Yet the overall statistic shows the opposite result.
Surprised, or not sure you followed? Just do the calculation and you will see the trap into which many scientists, sociologists and survey managers can easily fall.
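For readers who would rather let a machine do the calculation, here is a minimal sketch that recomputes the hiring rates from the raw counts given in the example. The variable and function names are ours, chosen for illustration.

```python
# Figures taken from the article: (applicants, hired) for each round and sex.
rounds = {
    "round 1": {"men": (190, 56), "women": (40, 12)},
    "round 2": {"men": (54, 4), "women": (44, 4)},
}

def hiring_rate(applicants, hired):
    """Share of applicants who were hired."""
    return hired / applicants

# Hiring rate within each round.
round_rates = {
    name: {sex: hiring_rate(*counts) for sex, counts in groups.items()}
    for name, groups in rounds.items()
}

# Hiring rate once both rounds are pooled.
pooled_rates = {}
for sex in ("men", "women"):
    applicants = sum(groups[sex][0] for groups in rounds.values())
    hired = sum(groups[sex][1] for groups in rounds.values())
    pooled_rates[sex] = hiring_rate(applicants, hired)

for name, rates in round_rates.items():
    print(f"{name}: men {rates['men']:.1%}, women {rates['women']:.1%}")
print(f"pooled:  men {pooled_rates['men']:.1%}, women {pooled_rates['women']:.1%}")
```

The reversal comes from the unequal split of applicants: 190 of the 244 men applied in the first round, where hiring rates were high, while 44 of the 84 women applied in the second round, where almost nobody was hired. Pooling the rounds mixes these two very different contexts.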

Premonition with sobriety

Logical reasoning can sometimes be misleading and lead to wrong deductions. This can be illustrated with the famous taxi example used by Daniel Kahneman (Israeli-American psychologist and winner of the Nobel Prize in economics) and his colleague Amos Tversky (an expert in mathematical psychology). Kahneman and Tversky imagine a city where 85% of cabs are red and 15% are blue. A taxi knocks down a pedestrian and does not stop. According to a witness who saw the accident, the cab was blue.

Before searching every blue cab in the city, we run an experiment under similar conditions. It shows that 20% of witnesses (placed in the same situation) get the color wrong. We might quickly conclude that the witness has an 80% chance of being right. But a closer examination of the situation and the use of the famous Bayes' theorem show that this figure must be almost halved: the probability that the cab was blue is in fact only 41%. The guilty cab actually has a 59% chance of being red.
Here is the calculation: a priori, the probability that the cab is blue is 15%. Given the reliability rate measured in the experiment, the probability that the witness correctly identifies a genuinely blue cab as blue is 80%. Conversely, the probability that the witness reports a red cab as blue is 20%.
A posteriori, the probability that the cab really is blue, as the witness stated, is 41%, according to the following formula: (15% × 80%) / [(15% × 80%) + (85% × 20%)] = 12% / 29% ≈ 41%
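The same Bayes calculation can be written as a short function. This is a minimal sketch; the function name is ours, and it assumes, as the example implicitly does, that the witness is equally reliable for both colors.

```python
def posterior_blue(prior_blue, witness_accuracy):
    """Probability that the cab is blue, given that the witness says 'blue'.

    Assumes the witness has the same accuracy for red and blue cabs.
    """
    prior_red = 1 - prior_blue
    blue_and_called_blue = prior_blue * witness_accuracy
    red_and_called_blue = prior_red * (1 - witness_accuracy)
    return blue_and_called_blue / (blue_and_called_blue + red_and_called_blue)

# Figures from the article: 15% of cabs are blue, witnesses are right 80% of the time.
p_blue = posterior_blue(prior_blue=0.15, witness_accuracy=0.80)

print(f"P(blue | witness says blue) = {p_blue:.0%}")      # 41%
print(f"P(red  | witness says blue) = {1 - p_blue:.0%}")  # 59%
```

Changing `prior_blue` shows how much the answer depends on the share of blue cabs in the city: with as many blue cabs as red ones (`prior_blue=0.5`), the posterior is simply the witness accuracy, 80%.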

A link is not a cause

Errors in statistical reasoning or conclusions are harmful in every sector. But it is in science and medicine that they can have the most serious consequences. Yet according to a survey published in the United States, more than 50% of scientific publications involving statistics contain errors of reasoning or interpretation. One of the most common consists in drawing unwarranted conclusions about cause and effect between elements for which a link has been found. Some seem to believe that two correlated elements are necessarily tied by a strong relationship and mutual influence. This is not true. For instance, real estate prices in Paris have risen steadily over the last few years. So has the age of any sample of people (except Benjamin Button). Yet it would be risky to conclude that one of these two phenomena influences the other (we can hope for a small drop in prices, but unfortunately not for rejuvenation!).
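The real estate and age example is easy to reproduce. In this minimal sketch the price series is entirely hypothetical (invented numbers that simply trend upward, not actual Paris prices), and `pearson` is a small helper written here for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical price per square metre over ten consecutive years (invented figures).
price_per_m2 = [7000, 7800, 8300, 8200, 8100, 8000, 8400, 9000, 9600, 10200]
# Age of one person over the same ten years: it can only go up by one each year.
person_age = [30 + year for year in range(10)]

r = pearson(price_per_m2, person_age)
print(f"correlation between price and age: {r:.2f}")
```

Any two series that both drift upward over time will show a strong correlation of this kind. The coefficient measures how closely they move together, and says nothing about whether one drives the other.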

The hidden factor

Correlation calculations hold other major traps. Two factors can be genuinely linked because they stem from a common source, even though neither depends on the other. The American psychotherapist and sociologist Paul Watzlawick gives a striking example in his book “Pragmatics of Human Communication”: in the early 1950s, a link was found between beer consumption on the west coast of the USA and infant mortality in Japan. In fact, both had a common cause: a major heat wave over the Pacific, which caused serious health problems in Japan and increased consumption of cold drinks in the United States.
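The hidden-factor mechanism can be illustrated with a small simulation. This is a sketch on purely synthetic data: the three series are random numbers generated so that "heat" drives the other two, and none of the values are real measurements. The helper functions are ours.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def residuals(ys, xs):
    """What is left of ys after removing its straight-line dependence on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]

random.seed(0)  # fixed seed so the run is repeatable
# Synthetic common cause, plus two variables that each depend only on it.
heat = [random.gauss(0, 1) for _ in range(2000)]
beer_sales = [h + random.gauss(0, 0.5) for h in heat]
infant_mortality = [h + random.gauss(0, 0.5) for h in heat]

r_raw = pearson(beer_sales, infant_mortality)
r_controlled = pearson(residuals(beer_sales, heat),
                       residuals(infant_mortality, heat))

print(f"raw correlation:              {r_raw:.2f}")
print(f"after removing heat's effect: {r_controlled:.2f}")
```

The raw correlation is strong even though, by construction, neither variable influences the other. Once the effect of the common cause is removed from both, the link essentially disappears. In real studies the difficulty is that the hidden factor is often not measured, or not even suspected.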

Many scientific studies fall into this trap. Correlated factors can be found in many sectors, and often these correlations reflect nothing more than a common cause. Some manufacturers and communicators use such correlation calculations to promote conclusions that benefit their products. This is the case in the food industry, which regularly presents us with new truths about the supposed virtues of its products for health, longevity, or protection against cancer and cardiovascular disease. These errors lead us to doubt the contradictory information we receive from the scientific community.

Statistics or not?

What conclusion should we draw?

Applied strictly, the famous precautionary principle would lead us to reject statistics because of the risk of error. That would call into question the founding principles of our sciences.
In reality, statistics, like other techniques, need to be handled carefully and rigorously. Those who produce statistics (scientists, researchers or market research managers) must control the risks of error mentioned in this text, so as to produce rigorous reasoning and conclusions that respect the rules of the discipline and common sense.
The end users of the results (politicians, journalists, marketing professionals and other economic decision-makers) must use the data with caution, keeping in mind that zero risk does not exist.