How much of a difference does culture make ?
Olivier's blog
Written by Olivier Morin   
Monday, 31 August 2009 00:00

In my latest post, I mentioned a very nice study that looked at differences in face-processing between East Asians and Westerners. Though it made a couple of fascinating points, the study also claimed that Asian culture strongly hindered Asians from understanding Western emotions. In fact, their statistically significant result was much too weak to warrant that conclusion. A recent pamphlet has been looking, among other things, at what makes scientists confound the statistical significance of an effect with its importance. The debate over the significance of significance has precedents in cross-cultural psychology.

I just finished Stephen Ziliak and Deirdre McCloskey's vital pamphlet, The cult of statistical significance: how the standard error is costing us jobs, justice, and lives. These two economic historians did an excellent job of convincing me that everyone in the human sciences (medicine included) should heed their advice and read the book. It's quite a nice read, too - except when the authors let the gossip and the anecdotes run loose, which makes you feel like you're squeezed at a conference buffet between two economics professors talking shop and lambasting some unknown colleague. Anyway, the book's message, old and banal as it may be in certain circles, is a crucial one.

Ziliak and McCloskey

It is about null-hypothesis significance testing. Before you stop reading, please remember that it is the tool you use to prove to reviewers that your data are worth publishing. It is the mandatory p < 0.05 threshold above which there is no publishable truth. It has become the de facto gold standard of scientific validity.

This kind of significance testing has been under attack for many years in various fields, including psychology.

Many problems come from the fact that we in the soft sciences are eager to use statistical tests as a badge of scientificity and as a way of getting published; yet most of us (and that includes yours truly) are really bad at handling them. Basic mistakes are routine. The most widespread are: using the word "significant" ambiguously to imply that an effect is important or big; assuming that a significant result rules out the null hypothesis or even proves our own hypothesis; and various minor sins like unwarranted assumptions of normality, outlandish null hypotheses, etc. Most significance tests are run with half-understood prepackaged software. Yet these problems are not specific to null-hypothesis significance testing (would we be any better at handling Bayesian tools? I very much doubt it).

The great strength of Ziliak and McCloskey's approach is its radicalism. Their target is not so much the many misuses of the test - although they have a lot of clever things to say about them. They target the very idea of using such tests. By choosing significance testing over any other measure of error, they claim, scientists have traded size for precision. Precision means being able to recognize an effect against the surrounding noise. Size is how much of a difference an effect makes in the world. Significant effects are precise, but they do not necessarily make a big difference to the state of the world. With a large enough sample, you can easily get clearly recognizable effects that are nevertheless too small to make a difference. With smaller samples, however, significance testing will tell you to ignore important effects that are just too noisy to get under the p < 0.05 threshold, but nevertheless make a huge (albeit a little chaotic) difference to the state of the world.
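To make the size-versus-precision trade-off concrete, here is a minimal sketch in Python. It is my own illustration, not something taken from the book. It assumes two groups of equal size with a known standard deviation of 1, an observed difference exactly equal to the standardized effect d, and a simple two-sided z-test rather than whatever test a real study would use. The effect sizes and sample sizes are made up.

```python
import math


def two_sided_p(d, n_per_group):
    """Two-sided p-value of a z-test for a difference between two group means.

    Assumptions (illustrative only): both groups have n_per_group subjects,
    the standard deviation is known and equal to 1 in both groups, and the
    observed difference between the means is exactly d.
    """
    # Standard error of the difference between two means is sqrt(2 / n).
    z = d * math.sqrt(n_per_group / 2)
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical scenarios: a tiny effect in a huge sample,
# and a large effect in a small sample.
scenarios = [
    ("tiny effect, huge sample", 0.02, 100_000),
    ("large effect, small sample", 0.8, 10),
]

for label, d, n in scenarios:
    p = two_sided_p(d, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{label}: d = {d}, n = {n} per group, p = {p:.4g} ({verdict})")
```

Under these assumptions the tiny effect clears the p < 0.05 bar and the large one does not, which is the pattern described above: the test rewards precision and says nothing about size.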

This reminded me of a debate that took place in 2001 among cross-cultural psychologists. David Matsumoto and two colleagues re-analyzed several studies published in the Journal of Cross-Cultural Psychology (paper here). For each of the studies they examined (including one from their own research group), they showed that significant differences, most of the time, do not indicate substantive or important differences between individuals. In other words, statistically significant differences were insignificant by any other standard. For example, in a famous paper by Matsumoto and Ekman (1989), American subjects scored higher than Japanese subjects in a disgust-recognition task. But the probability that an American subject would score higher than a Japanese subject was only slightly above chance (0.56). There was a great overlap between Japanese and American results. In the recent study on disgust-perception by Asians and Westerners, another negligible-yet-significant difference was exaggerated by the authors - who claimed that the phenomenon had "critical consequences for cross-cultural communication and globalization".
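For readers who want a feel for what a 0.56 probability of scoring higher amounts to, here is a small sketch in Python. It is not Matsumoto et al.'s computation. It assumes that scores in both groups are normally distributed with the same standard deviation, in which case the probability that a random member of one group outscores a random member of the other is Phi(d / sqrt(2)), where d is the standardized difference between the means. The function names are mine.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def d_from_superiority(prob):
    """Standardized mean difference implied by a probability of superiority.

    Assumes two normal distributions with equal variance, so that
    P(X > Y) = Phi(d / sqrt(2)), and inverts that relation.
    """
    return math.sqrt(2) * STD_NORMAL.inv_cdf(prob)


def overlap_from_d(d):
    """Shared area of two unit-variance normal curves whose means differ by d."""
    return 2 * STD_NORMAL.cdf(-abs(d) / 2)


d = d_from_superiority(0.56)
print(f"implied standardized difference: {d:.2f}")
print(f"overlap between the two score distributions: {overlap_from_d(d):.0%}")
```

Under the equal-variance normal assumption, a 0.56 probability corresponds to a standardized difference of roughly 0.2, with the two distributions sharing about nine-tenths of their area. That is the "great overlap" in numbers.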

As usual, the rhetorical twist is implicit and not acknowledged by the authors. As Matsumoto et al. write about the articles they analyzed, "Although most researchers are aware of these limitations of significance testing, most reports rely solely on them, largely ignoring the practical significance of the results." Ziliak and McCloskey came to the same conclusion in their analysis of articles in the American Economic Review. They would probably concur with Matsumoto et al., who write:

"Interpretations of cultural differences between people based on statistically significant findings may be based on practically insignificant differences between means. Just pause to consider the wealth of knowledge concerning cultural differences in any area of cross-cultural comparison that some may assume to be important or large on the level of individuals because previous research has documented statistically significant differences between culture means. How many of these are actually reflective of meaningful differences on the level of individuals? Unfortunately, the answer is unknown, unless tests of group differences in means are accompanied by measures of cultural effect size such as those presented here. If theories, research, and practical work that are supposedly applicable to individuals are based on such limited group difference comparisons, theories, research, and applied programs based on these cultural differences may be based on a house of cards."

But there is more to Ziliak and McCloskey's point than a lesson in methodology. Their deeper, and even more worrying point, is that a few decades ago, the quantitative sciences lost sight of the quantitative. We should reclaim it. Few things in life are worth the pain of learning some stats, but this seems like one.

Comments
Additional pitfalls -  Simon Barthelme 31-August-2009

I haven't read Ziliak and McCloskey's article but it sounds like the sort of criticism Bayesians have been firing at significance testing for decades. Berger's classic book on Statistical Decision Theory has a whole list of pathologies, if you don't mind the equations. Andrew Gelman (who works in political science and stats) keeps pointing out that the null hypotheses that are set up are usually false. For example, if the null is that East Asians and Westerners do not differ on a particular performance measure, it is trivially false for all sorts of uninteresting reasons (maybe one population is a bit more motivated than the other, maybe one has more experience with video games, etc.). So if you want to find "significant" differences between cultures, in the statistical sense, all you need is a large enough sample size. Without a prior theory predicting at least the direction of the effect, for instance that East Asians should do better in a certain task, the intercultural differences we find in experiments may be entirely trivial.
Even if we had null hypotheses that made sense, and significance testing was actually what we wanted (i.e., we really wanted to control the false positive rate), it would be rendered useless by the actual practices of experimenters. One of them is to try different kinds of tests until your data come out significant. Another is to run subjects until your test comes out significant (discussed here). Finally, on an issue like intercultural differences I suspect a huge file-drawer effect: no one is interested in null results. All in all, the best thing to do when reading an experimental psychology paper is to ignore the p-values and look at the graphs.

Pascal Boyer 02-September-2009

Ziliak and McCloskey's book is indeed remarkable - it is also, more surprisingly, remarkably funny. In particular, the authors quite cheerfully admit that their cogent, meticulous arguments about the follies of significance testing and the cult of P will almost certainly have no effect whatsoever on current (mal)practice, especially in psychology and the social sciences. Indeed, similar points had been made by Cohen in the 60's, by Gerd Gigerenzer in the 80's, to no avail. These authors, and Ziliak and McCloskey after them, recommended paying attention to effect sizes rather than p values, reporting power, reporting confidence intervals. Yet the same emphasis on p values is characteristic of most research papers published in our fields.
If I may mention a pet peeve of mine, the practice of not just trusting p values, but actually ranking them is particularly offensive. It is rife in political science, in which one commonly finds results adorned with star-ratings (* for .05, ** for .01, *** for .001), as if these ranks denoted different degrees of support for the hypotheses, which is of course not the case.
Another interesting aspect of this phenomenon is that almost everyone in the fields concerned (well, especially in psychology) will recognize that p values are useful but potentially misleading and that their current use as a standard of confirmation, disregarding power and effect sizes, is insane. Yet the social practice goes on! Another interesting case for the epidemiology of scientific norms.
As for the issue of cultural differences - Simon Barthelme and Olivier Morin are of course right. It is almost impossible to test two culturally different samples without finding differences (at p < .05). This is like the good-bad old days of cognitive neuroscience, when all you had to do to get a study published was to plonk people in the scanner and get them to do two different things... with the guaranteed result that activations would be significantly different.
The problem for our field is not just that the null hypothesis in many cases is absurd or trivially false, it is also that "culture" is considered an independent variable in most of psychology. So when you find that e.g. Chinese and US students do not react the same way, you can publish this as evidence that "culture" matters in the domain of behavior you tested. The idea of having to explain why this cultural difference arose or is maintained is not even considered. After all, psychologists study the effects of aging but do not think they should have to explain aging!
So it is our job to persuade everyone that "culture" is there to be explained but that in itself it explains nothing. Back to work!


Creative Commons License
All the content and downloads are published under Creative Commons license