Finding Her Voice

Have you ever heard of a cool scientific paper, gone out to find yourself a copy, and been frustrated to find no trace of it? I’ve been there for years with one particular paper, until I got lucky.

Cutler, Anne, and Donia R. Scott. “Speaker Sex and Perceived Apportionment of Talk.” Applied Psycholinguistics 11, no. 03 (1990): 253–272.
I’m sure you’ve heard the stereotype that women talk excessively. A number of studies have actually sat down and counted total talking time, only to find that men tend to be the blabbermouths. What gives?
An alternative suggestion is more complex and may rely on a difference in content between men’s and women’s speech. Kramer (1975) and Spender (1980) suggested that women are undervalued in society, and as a consequence women’s speech is undervalued – female contributions to conversation are overestimated because they are held to have gone on “too long” relative to what female speakers are held to deserve. Preisler (1986) similarly argued that evaluation of women’s speech is a function of (under)evaluation of the social roles most usually fulfilled by women.
The former explanation suggests that overestimation of women’s conversational contributions is a perceptual bias effect that should be reproducible in the laboratory simply by asking listeners to judge amount of talk produced by male and female speakers, even if content of the talk is controlled. [pg. 255]
So Anne Cutler and her co-author tested that by having the standard reference human listen to excerpts from plays, where both speaking roles said about the same number of words. The sex was varied, of course.
In single-sex conversations, female and male first speakers received almost identical ratings (49.5% and 50%, respectively), but in mixed-sex conversations, female speakers were judged to be talking more (55.2%), male speakers to be talking less (47.8%). Although the number of words spoken was identical for each column, listeners believed that in mixed-sex conversations, females spoke more and males spoke less.

In fact, three of these mean ratings are actually underestimates, since the true mean first speaker contribution across all four dialogues was 53.7%. ….

The interaction of speaker sex with whether the dialogue was mixed- or single-sex was significant in both analyses … There was also a main effect of speaker sex, with female speakers’ contributions being overestimated, but male speakers’ contributions being underestimated relative to the actual number of words spoken. [pg. 259-260]
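To make the arithmetic in those excerpts concrete, here is a minimal sketch that compares each quoted mean rating with the quoted true mean first-speaker share of 53.7%. The variable and function names are my own, and comparing every cell against a single overall mean is a simplifying assumption rather than the analysis the authors ran.

```python
# Mean ratings of the first speaker's share of talk, in percent,
# copied from the excerpt above.
ratings = {
    ("female", "single-sex"): 49.5,
    ("male", "single-sex"): 50.0,
    ("female", "mixed-sex"): 55.2,
    ("male", "mixed-sex"): 47.8,
}

# True mean first-speaker contribution across the four dialogues, as quoted.
TRUE_MEAN_SHARE = 53.7


def deviations(cells, true_share):
    """Return each rating minus the true share, in percentage points."""
    return {key: round(value - true_share, 1) for key, value in cells.items()}


devs = deviations(ratings, TRUE_MEAN_SHARE)
for (sex, setting), diff in devs.items():
    label = "overestimate" if diff > 0 else "underestimate"
    print(f"{sex:6} {setting:10} {diff:+.1f} points ({label})")

# Gap between female and male first speakers in mixed-sex dialogues.
mixed_gap = round(
    ratings[("female", "mixed-sex")] - ratings[("male", "mixed-sex")], 1
)
print(f"mixed-sex gap: {mixed_gap} points")
```

Three of the four cells fall below the true share, matching the excerpt, and only the female speakers in mixed-sex dialogues land above it.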
What’s interesting is that when people were asked to guess the sex of each role, handed nothing more than the script, men and women sometimes differed.
When a part was not particularly sex-marked (Dialogue 1), females speaking it were judged to have said more than males speaking it. When a part was marked as female for male and for female subjects alike (Dialogue 2), the same effect was found. When, however, a part was marked as female for male subjects only (Dialogue 3), only male subjects showed the effect; and when a part was marked as female for female subjects only (Dialogue 4), only female subjects showed any effect. [pg. 268]
Unfortunately, this muddied up the conclusions a bit. And I do have other issues with the paper, primarily in their use of p-values, but I think the findings rise above it. They also fit nicely into the existing body of work on sexism and speech.
These behaviors, the interrupting and the over-talking, also happen as the result of difference in status, but gender rules. For example, male doctors invariably interrupt patients when they speak, especially female patients but patients rarely interrupt doctors in return. Unless the doctor is a woman. When that is the case, she interrupts far less and is herself interrupted more. This is also true of senior managers in the workplace. Male bosses are not frequently talked over or stopped by those working for them, especially if they are women; however, female bosses are routinely interrupted by their male subordinates.

What can we do to raise women’s voices? Maybe technology can help.

Gender Timer is the app that measures the talk times between the sexes. It is used to raise awareness and generate discussion about how airtime looks in practice. The aim is to ultimately develop your organization and its meeting culture.

Available on Android and iPhone.

Proof from Fine-Tuning (1)

You are alive.

This may not seem like much, but consider the other options. Our planet could have been a little closer to our sun, near enough to boil off all the oceans and prevent life from forming. Alternatively, it could have been too far away for liquid water to persist and thus a lifeless ice-ball.

It could have orbited a more massive star, which would have been quicker to swell into a red giant. From there the star could have either expanded and consumed our planet with its scorching plasma, or gone supernova and blasted all life on this planet from existence. Smaller stars are actually worse; all stars throw a tantrum during their early years, frying all but the most distant comets with radiation, and the less a star weighs the longer this rebellious phase lasts. We could have been closer to our galaxy’s black hole (death by radiation and random debris), or further away (lack of material and more intergalactic cosmic rays). We could have been in a binary system, like many of the stars out there, and been hurled into the frigid darkness by the resulting gravitational ballet.

And what if gravity wasn’t as strong? Suns would not have collapsed enough to properly create the higher elements, preventing life from forming. Too strong, and stars would go from “crunch” to “bang” too fast for life to grab a tentacle-hold. Or the entire universe would have reversed its expansion and popped back out of existence before an eye could have blinked.

There are far, far fewer ways to exist than to not exist. And yet here we are, nicely placed around a well-behaved star in a quiet galaxy, in a universe that’s comfortable and gives us a great view.

The odds against our very existence make even the most well-documented miracle seem inevitable. This is clearly one of god’s fingerprints.

Coming Out of Retirement

As the references to biology imply, the Fine-Tuning proof is closely related to the proof from Design. The difference is largely one of emphasis; Design concerns itself with the patterns in biology, while Fine-Tuning points to the patterns within the world that permit biology.

Aristotle’s First Mover is an early example. While Plato’s “Form of the Good” is defined in terms of justice and intelligence, and implied to take an interest in human affairs, the First Mover does not meddle. It merely acts like the pendulum or battery of a clock, ensuring the entire machine runs smoothly. Some other deity does the more personal things. The Roman philosopher Cicero puts it better:

When you see a sundial or a water-clock, you see that it tells the time by design and not by chance. How then can you imagine that the universe as a whole is devoid of purpose and intelligence, when it embraces everything, including these artifacts themselves and their artificers?

(De Natura Deorum, ii. 34)

When Nicolaus Copernicus published De Revolutionibus Orbium Coelestium from his deathbed, nearly two thousand years after Aristotle, a new line of argument began to open up. The planets and stars did not circle the Earth, as everyone thought, but instead twirled around the Sun. It didn’t take long for other thinkers to examine the evidence, and reluctantly agree Copernicus was on to something. The structure of the heavens was starting to disagree with the assertions of the ancients and the religious, which prompted questions of how much their elders really knew. The same thing was happening to biology; De Humani Corporis Fabrica, by Andreas Vesalius, was one of the first books to honestly take apart the human body, and revealed there weren’t as many magical bits as previously thought. Instead, our bodies were just messy hunks of meat, with few signs of clear design.

But while biology was getting much more complicated, cosmology was cleaning up its act. The old Ptolemaic system of the heavens was a messy and arbitrary collection of circles moving in circles. The new system, hinted at by Johannes Kepler and fully fleshed out by Isaac Newton, had the universe running according to fewer and much simpler rules. Everything ticked along smoothly, like a precision clock, and there was far more everything than we could comprehend.

In contrast, the simple, perfect body designs we saw from a distance were becoming ridiculously complex plumbing. Even when the simple laws finally showed up, all they described was the overall pattern instead of the detail.

Heading into the 1900s, the smart money was giving up on biology, and looking to the heavens for evidence of design.

The bet quickly paid off. Scientists were still going along with Plato and Aristotle’s assertion of an infinite, eternal universe at that time. Ironically, the rationale most gave was that it dodged around the infinite chain of the Cosmological proof that was pushed by both ancient Greeks. Most religions, in contrast, were very clear that the universe had a beginning and weren’t willing to give that up.[161] They didn’t have to; in 1929 Edwin Hubble and Milton Humason peered at distant galaxies, did some math, and discovered the universe was running away from us. This matched up with the work of Alexander Friedmann and Georges Lemaître, who noted that General Relativity implied the universe would collapse unless it was expanding. Scientists were grudgingly forced to agree with the old creation stories; the universe had a beginning, after all.

Stay Tuned

But before we can properly begin this chapter, I need to answer one simple question: could the universe be tuned at all?

If there’s only one way to create a universe, there’s no way for the gods to tune it. Try pulling the knobs off a radio then changing the station, if you doubt me. And the only way to know if the universe can be tuned is to know exactly how it began and how it developed.

Surprisingly, the only people who fess up to having this information are the religious. Astronomers have been formally studying the universe for centuries, informally for at least ten millennia, and it’s only in recent times that they’ve made progress. When the first fictional spacecraft entered the public’s mind around 1860, astronomers had measured the distance to a second star and the speed of light, and discovered a new element from a hundred million kilometres away. By the time we’d tossed a human to the moon in such a capsule in the late 1960s, we knew how the elements were formed, how stars and the universe “evolved,” and suspected that most of the matter in the universe is invisible. These are all basic facts about how the universe is structured, without which we couldn’t even start to guess how it was formed, let alone if it could be formed another way.

Thanks to particle accelerators, cosmologists can create hotter temperatures than we’ve ever observed in the universe, which simulate the universe a fraction of a second after it began, and the theories we have extend to within a hair’s breadth of the beginning.[162] Neither can reach time zero, though. Any tip that could get physicists that last step would be rewarded with a Nobel Prize and a permanent spot in physics textbooks.

I find it surprising that theologians, who claim to hold a key piece of the puzzle about the early universe, would stay silent and turn down these riches. Still, I have to either give them the benefit of the doubt, or end the chapter right here.

But if I grant that one exception, I’m quickly forced to concede more. It doesn’t matter if the universe can be tuned, if all the possible tunings are equally comfortable for life. For instance, there are about 26 settings related to sub-atomic particles that could be tweaked, but most of them control exotic forms of matter that we never deal with outside of a particle accelerator. Little to nothing would change if they were. To answer the question, we also need to know what settings can be tuned, and by how much.

This proof isn’t just asking for a universe that’s compatible with life, though. It’s claiming the current settings are the best possible. To properly show that, you have to explore all other possible tunings and demonstrate that they don’t cut the mustard. There may be an infinite number of tunings to select from, and there may be radically different kinds of life in wildly different universes that need to be judged against our own; without answering all the questions I’ve just asked, we’ll never be sure.

Few theists have even started this, and yet the Fine-Tuning proof depends on their answers. As mentioned in the introduction, you need evidence to have a proof. Since Fine-Tuning has no evidence, it’s more of a hope than a proof. Believers and critics alike are forced to invent the missing pieces to make it a proof at all:

More specifically, the values of the various forces of nature appear to be fine-tuned for the existence of intelligent life. The world is conditioned principally by the values of the fundamental constants α (the fine structure constant, or electromagnetic interaction), mn/me (proton to electron mass ratio), αG (gravitation), αw (the weak force), and αs (the strong force). When one mentally assigns different values to these constants or forces, one discovers that in fact the number of observable universes, that is to say, universes capable of supporting intelligent life, is very small. Just a slight variation in any one of these values would render life impossible.

(“THE TELEOLOGICAL ARGUMENT AND THE ANTHROPIC PRINCIPLE,” Dr. William Lane Craig, retrieved May 29, 2011)

[…] the most important quantities that determine stellar properties — and are allowed to vary — are the gravitational constant G, the fine structure constant α, and a composite parameter C that determines nuclear reaction rates. Working within this model, we delineate the portion of parameter space that allows for the existence of stars. Our main finding is that a sizable fraction of the parameter space (roughly one fourth) provides the values necessary for stellar objects to operate through sustained nuclear fusion. As a result, the set of parameters necessary to support stars are not particularly rare.

(“Stars In Other Universes: Stellar structure with different fundamental constants”, Fred C. Adams, Journal of Cosmology and Astroparticle Physics, August 2008)

[161]  Notable exceptions include Jainism, which proposes an eternal universe, and some branches of Hinduism, which propose an eternal cycle of universe creation.

[162] Actually, that’s not a fair analogy. A hair is about a hundred microns across, or 10^-4 metres, while our theories are helpless only in the Planck epoch, which ranged from time zero to 10^-43 seconds after the Big Bang.

Daryl Bem and the Replication Crisis

I’m disappointed I don’t see more recognition of this.

If one had to choose a single moment that set off the “replication crisis” in psychology—an event that nudged the discipline into its present and anarchic state, where even textbook findings have been cast in doubt—this might be it: the publication, in early 2011, of Daryl Bem’s experiments on second sight.

I’ve actually done a long blog post series on the topic, but in brief: Daryl Bem was convinced that precognition existed. To put these beliefs to the test, he had subjects try to predict an image that was randomly generated by a computer. In eight of nine experiments, he found that they could indeed do better than chance. You might think that Bem is a kook, and you’d be right.

But Bem is also a scientist.

Now he would return to JPSP [the Journal of Personality and Social Psychology] with the most amazing research he’d ever done—that anyone had ever done, perhaps. It would be the capstone to what had already been a historic 50-year career.

Having served for a time as an associate editor of JPSP, Bem knew his methods would be up to snuff. With about 100 subjects in each experiment, his sample sizes were large. He’d used only the most conventional statistical analyses. He’d double- and triple-checked to make sure there were no glitches in the randomization of his stimuli. Even with all that extra care, Bem would not have dared to send in such a controversial finding had he not been able to replicate the results in his lab, and replicate them again, and then replicate them five more times. His finished paper lists nine separate ministudies of ESP. Eight of those returned the same effect.

One way to attack an argument is to merely follow its logic. If you can find it leads to an absurd conclusion, the argument must have been flawed even if you cannot find the flaw. Bem had inadvertently discovered a “reductio ad absurdum” argument against contemporary scientific practice: if proper scientific procedure can prove ESP exists, proper scientific procedure must be broken.

Meanwhile, at the conference in Berlin, [E.J.] Wagenmakers finally managed to get through Bem’s paper. “I was shocked,” he says. “The paper made it clear that just by doing things the regular way, you could find just about anything.”

On the train back to Amsterdam, Wagenmakers drafted a rebuttal, to be published in JPSP alongside the original research. The problems he saw in Bem’s paper were not particular to paranormal research. “Something is deeply wrong with the way experimental psychologists design their studies and report their statistical results,” Wagenmakers wrote. “We hope the Bem article will become a signpost for change, a writing on the wall: Psychologists must change the way they analyze their data.”

Slate has a long read up on the current replication crisis, and how it links to Bem. It’s aimed at a lay audience and highly readable; I recommend giving it a click.

So You Wanna Falsify Gender Studies?

How would a skeptic determine whether or not an area of study was legit? The obvious route would be to study up on the core premises of that field, recording citations as you go; map out how they are connected to one another and supported by the evidence, looking for weak spots; then write a series of articles sharing those findings.

What they wouldn’t do is generate a fake paper purporting to be from that field of study but deliberately mangling the terminology, submit it to a low-ranked and obscure journal for peer review, have it rejected from that journal, submit it based on feedback to a second journal that was semi-shady and even more obscure, have it published, then parade that around as if it meant something.

Alas, it seems the Skeptic movement has no idea how basic skepticism works. Self-proclaimed “skeptics” Peter Boghossian and James Lindsay took the second route, and were cheered on by Michael Shermer, Richard Dawkins, Jerry Coyne, Steven Pinker, and other people calling themselves skeptics. A million other people have pointed and laughed at them, so I won’t bother joining in.

But no-one seems to have brought up the first route. Let’s do a sketch of actual skepticism, then, and see how well gender studies holds up.

What’s Claimed?

Right off the bat, we hit a problem: most researchers or advocates in gender studies do not have a consensus sex or gender model.

The Genderbread Person, version 3.3. From

This is one of the more popular explainers for gender floating out on the web. Rather than focus on the details, however, I’d like you to note this graphic is labeled “version 3.3”. In other words, Sam Killermann has tweaked and revised it three times over. It also conflicts with the Gender Unicorn, which has a categorical approach to “biological sex” and adds “other genders,” and it no longer embraces the idea of a spectrum, thus contradicting a lot of other models. Confront Killermann on this, and I bet they’d shrug their shoulders and start crafting another model.

The model isn’t all that important. Instead, gender studies has reached a consensus on an axiom and a corollary: the two-sex, two-gender model is an oversimplification, and sex and gender are complicated. That is why models of sex or gender continually fail: the complexity almost guarantees exceptions to your rules.

There’s a strong parallel here to agnostic atheism’s “lack of belief” posture, as this flips the burden of proof. Critiquing the consensus of gender studies means asserting a positive statement, that the binarist model is correct, while the defense merely needs to swat down those arguments without advancing any of its own.

Nothing Fails Like Binarism

A single counter-example is sufficient to refute a universal rule. To take a classic example, I can show “all swans are white” is a false statement by finding a single black swan. If someone came along and said “well yeah, but most swans are white, so we can still say that all swans are white,” you’d think of them as delusional or in denial.
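The logic of the swan example can be sketched in a few lines. This is only an illustration: the list of observations is hypothetical, and the helper names are my own.

```python
def holds_for_all(rule, cases):
    """A universal rule holds only if every single case satisfies it."""
    return all(rule(case) for case in cases)


def counterexamples(rule, cases):
    """Collect the cases that break the rule."""
    return [case for case in cases if not rule(case)]


def is_white(colour):
    return colour == "white"


# Hypothetical swan sightings: mostly white, with one exception.
swans = ["white", "white", "white", "black", "white"]

share_white = sum(is_white(s) for s in swans) / len(swans)

print(holds_for_all(is_white, swans))    # the universal claim fails
print(counterexamples(is_white, swans))  # the one swan that refutes it
print(share_white)                       # "most" is a different, weaker claim
```

Note that the share of white swans never enters into `holds_for_all`; a rule that is true of most cases and a rule that is true of all cases are separate claims, and only the second is refuted by one exception.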

Well, I can point to four people who do not fit into the two-sex, two-gender model. Ergo, that model cannot be true in all cases, and the critique of gender studies fails after a thirty-second Google search.

When most people are confronted with this, they invoke a three-sex model (male, female, and “other/defective”) but call it two-sex in order to preserve their delusion. That so few people notice the contradiction is a testament to how hard the binary model is hammered into us.

But Where’s the SCIENCE?!

Another popular dodge is to argue that merely saying you don’t fit into the binary isn’t enough; if it wasn’t in peer-reviewed research, it can’t be true. This is no less silly. Do I need to publish a paper about the continent of Africa to say it exists? Or my computer? Nor does peer review guarantee truth; if you doubt me, browse Retraction Watch for a spell.

Once you’ve come back, go look at the peer-reviewed research which suggests gender is more complicated than a simple binary.

At times, the prevailing answers were almost as simple as Gray’s suggestion that the sexes come from different planets. At other times, and increasingly so today, the answers concerning the why of men’s and women’s experiences and actions have involved complex multifaceted frameworks.

Ashmore, Richard D., and Andrea D. Sewell. “Sex/Gender and the Individual.” In Advanced Personality, edited by David F. Barone, Michel Hersen, and Vincent B. Van Hasselt, 377–408. The Plenum Series in Social/Clinical Psychology. Springer US, 1998. doi:10.1007/978-1-4419-8580-4_16.

Correlational findings with the three scales (self-ratings) suggest that sex-specific behaviors tend to be mutually exclusive while male- and female-valued behaviors form a dualism and are actually positively rather than negatively correlated. Additional analyses showed that individuals with nontraditional sex role attitudes or personality trait organization (especially cross-sex typing) were somewhat less conventionally sex typed in their behaviors and interests than were those with traditional attitudes or sex-typed personality traits. However, these relationships tended to be small, suggesting a general independence of sex role traits, attitudes, and behaviors.

Orlofsky, Jacob L. “Relationship between Sex Role Attitudes and Personality Traits and the Sex Role Behavior Scale-1: A New Measure of Masculine and Feminine Role Behaviors and Interests.” Journal of Personality and Social Psychology 40, no. 5 (May 1981): 927–40.

Women’s scores on the BSRI-M and PAQ-M (masculine) scales have increased steadily over time (r’s = .74 and .43, respectively). Women’s BSRI-F and PAQ-F (feminine) scale  scores do not correlate with year. Men’s BSRI-M scores show a weaker positive relationship with year of administration (r = .47). The effect size for sex differences on the BSRI-M has also changed over time, showing a significant decrease over the twenty-year period. The results suggest that cultural change and environment may affect individual personalities; these changes in BSRI and PAQ means demonstrate women’s increased endorsement of masculine-stereotyped traits and men’s continued nonendorsement of feminine-stereotyped traits.

Twenge, Jean M. “Changes in Masculine and Feminine Traits over Time: A Meta-Analysis.” Sex Roles 36, no. 5–6 (March 1, 1997): 305–25. doi:10.1007/BF02766650.

Male (n = 95) and female (n = 221) college students were given 2 measures of gender-related personality traits, the Bem Sex-Role Inventory (BSRI) and the Personal Attributes Questionnaire, and 3 measures of sex role attitudes. Correlations between the personality and the attitude measures were traced to responses to the pair of negatively correlated BSRI items, masculine and feminine, thus confirming a multifactorial approach to gender, as opposed to a unifactorial gender schema theory.

Spence, Janet T. “Gender-Related Traits and Gender Ideology: Evidence for a Multifactorial Theory.” Journal of Personality and Social Psychology 64, no. 4 (1993): 624.

Oh sorry, you didn’t know that gender studies has been a science for over four decades? You thought it was just an invention of Tumblr, rather than a mad scramble by scientists to catch up with philosophers? Tsk, that’s what you get for pretending to be a skeptic instead of doing your homework.

I Hate Reading

One final objection is that field-specific jargon is hard to understand. Boghossian and Lindsay seem to think it follows that the jargon is therefore meaningless bafflegab. I’d hate to see what they’d think of a modern physics paper; jargon offers precise definitions and less typing to communicate your ideas, and while it can quickly become opaque to lay people, jargon is a necessity for serious science.

But let’s roll with the punch, and look outside of journals for evidence that’s aimed at a lay reader.

In Sexing the Body, Gender Politics and the Construction of Sexuality Fausto-Sterling attempts to answer two questions: How is knowledge about the body gendered? And, how gender and sexuality become somatic facts? In other words, she passionately and with impressive intellectual clarity demonstrates how in regards to human sexuality the social becomes material. She takes a broad, interdisciplinary perspective in examining this process of gender embodiment. Her goal is to demonstrate not only how the categories (men/women) humans use to describe other humans become embodied in those to whom they refer, but also how these categories are not reflected in reality. She argues that labeling someone a man or a woman is solely a social decision. «We may use scientific knowledge to help us make the decision, but only our beliefs about gender – not science – can define our sex» (p. 3) and consistently throughout the book she shows how gender beliefs affect what kinds of knowledge are produced about sex, sexual behaviors, and ultimately gender.

Gober, Greta. “Sexing the Body Gender Politics and the Construction of Sexuality.” Humana.Mente Journal of Philosophical Studies, 2012, Vol. 22, 175–187

Making Sex is an ambitious investigation of Western scientific conceptions of sexual difference. A historian by profession, Laqueur locates the major conceptual divide in the late eighteenth century when, as he puts it, “a biology of cosmic hierarchy gave way to a biology of incommensurability, anchored in the body, in which the relationship of men to women, like that of apples to oranges, was not given as one of equality or inequality but rather of difference” (207). He claims that the ancients and their immediate heirs—unlike us—saw sexual difference as a set of relatively unimportant differences of degree within “the one-sex body.” According to this model, female sexual organs were perfectly homologous to male ones, only inside out; and bodily fluids—semen, blood, milk—were mostly “fungible” and composed of the same basic matter. The model didn’t imply equality; woman was a lesser man, just not a thing wholly different in kind.

Altman, Meryl, and Keith Nightenhelser. “Making Sex (Review).” Postmodern Culture 2, no. 3 (January 5, 1992). doi:10.1353/pmc.1992.0027.

In Delusions of Gender the psychologist Cordelia Fine exposes the bad science, the ridiculous arguments and the persistent biases that blind us to the ways we ourselves enforce the gender stereotypes we think we are trying to overcome. […]

Most studies about people’s ways of thinking and behaving find no differences between men and women, but these fail to spark the interest of publishers and languish in the file drawer. The oversimplified models of gender and genes that then prevail allow gender culture to be passed down from generation to generation, as though it were all in the genes. Gender, however, is in the mind, fixed in place by the way we store information.

Mental schema organise complex experiences into types of things so that we can process data efficiently, allowing us, for example, to recognise something as a chair without having to notice every detail. This efficiency comes at a cost, because when we automatically categorise experience we fail to question our assumptions. Fine draws together research that shows people who pride themselves on their lack of bias persist in making stereotypical associations just below the threshold of consciousness.

Everyone works together to re-inforce social and cultural environments that soft-wire the circuits of the brain as male or female, so that we have no idea what men and women might become if we were truly free from bias.

Apter, Terri. “Delusions of Gender: The Real Science Behind Sex Differences by Cordelia Fine.” The Guardian, October 11, 2010, sec. Books.

Have At ‘r, “Skeptics”

You want to refute the field of gender studies? I’ve just sketched out the challenges you face on a philosophical level, and pointed you to the studies and books you need to refute. Have fun! If you need me I’ll be over here, laughing.

[HJH 2017-05-21: Added more links, minor grammar tweaks.]

[HJH 2017-05-22: Missed Steven Pinker’s Tweet. Also, this Skeptic fail may have gone mainstream:

Boghossian and Lindsay likely did damage to the cultural movements that they have helped to build, namely “new atheism” and the skeptic community. As far as I can tell, neither of them knows much about gender studies, despite their confident and even haughty claims about the deep theoretical flaws of that discipline. As a skeptic myself, I am cautious about the constellation of cognitive biases to which our evolved brains are perpetually susceptible, including motivated reasoning, confirmation bias, disconfirmation bias, overconfidence and belief perseverance. That is partly why, as a general rule, if one wants to criticize a topic X, one should at the very least know enough about X to convince true experts in the relevant field that one is competent about X. This gets at what Bryan Caplan calls the “ideological Turing test.” If you can’t pass this test, there’s a good chance you don’t know enough about the topic to offer a serious, one might even say cogent, critique.

Boghossian and Lindsay pretty clearly don’t pass that test. Their main claim to relevant knowledge in gender studies seems to be citations from Wikipedia and mockingly retweeting abstracts that they, as non-experts, find funny — which is rather like Sarah Palin’s mocking of scientists for studying fruit flies or claiming that Obamacare would entail “death panels.” This kind of unscholarly engagement has rather predictably led to a sizable backlash from serious scholars on social media who have noted that the skeptic community can sometimes be anything but skeptical about its own ignorance and ideological commitments.

When the scientists you claim to worship are saying your behavior is unscientific, maaaaybe you should take a hard look at yourself.]

Proof from Popularity (2)

Over-active Pattern Matching

One key element is our phenomenal ability to find patterns. Scientists have recently started to capitalize on this.

The “Rosetta” project followed the same pattern as many other distributed computing projects, at least at the start. David Baker and his colleagues were dealing with the difficult problem of protein folding. These little molecules are the workhorses of all living things, and do everything from speeding up chemical reactions to transmitting signals in your brain… yet they’ve been remarkably difficult to study. The problem is not with our understanding of molecular forces, or the blueprints to make any protein, but the sheer number of computations required to combine both into a folded protein. On a single computer, crunching through the numbers can take years. [149]

Baker decided to solve that problem by distributing the work; his group created a software program that could do protein folding, then gave it away to anyone interested. As of this writing, about 30,000 people are running Rosetta on at least one computer, [150] and speeding up the process by at least that many times. Every single one of them is doing this voluntarily, with their only compensation being a pretty picture of the folding process in action.

Soon, however, Baker’s group was fielding emails about those pretty pictures. The people running Rosetta noticed the software would get “stuck” in places, or waste time on solutions that even a non-expert could tell would go nowhere. They wondered if there was any way to “nudge” the software with hints.

Baker decided to take the hint himself, and hired Zoran Popović and David Salesin to turn his science program into a game called “foldit.” Users could now do much more than help their computer along; they could team up with others to solve “puzzles,” compete with others to earn a high score, or even write their own scripts to help remove some of the grunt work.

Adding humans into the mix paid off. One user named “Vertex” came up with the “Blue Fuse” helper script, and with refinements developed by others it rapidly became the most popular such helper on “foldit.” [151] When Baker had a look at the code, however, he was astonished to find it was a near-duplicate of an algorithm his lab was privately testing. Some comparisons between the two revealed that in seven months, a community of novices had managed to best years of research by experts in protein folding. [152] Other citizen science projects have found the same pattern; pooling together the opinions of non-experts gives results equal to what an expert could churn out, in a fraction of the time.

This ability to suss out patterns comes at a cost, however. As mentioned in the Witness proof, we’re also prone to finding patterns that don’t exist. It takes a minor stretch of someone else’s imagination to start assigning a language to random sounds, or guess there’s a mind behind mindless processes. From that, gods can be formed.


Back in the Morality proof, I used a little game theory to describe how morality could have evolved as an instinct. In the process, I also showed how classism[153] could be bred into our bones.

Henri Tajfel made a career out of studying this, in fact. One fascinating study asked people to judge the length of lines. The participants were split into two groups, one of which was given the lines without any labels. The other group had their lines labelled by length, either “A” if the line was shorter than average or “B” if the line was longer. The second group automatically lumped lines by label, consistently guessing longer lengths for short lines and shorter lengths for longer ones to match up with their expectations for the two categories. The first group didn’t show any bias. [154]

What applies to lines also applies to people. Here’s an example from Tajfel himself:

The boys, who knew each other well, were divided into groups defined by flimsy and unimportant criteria [in this case, they were told it was how well they could count groups of dots; in reality, the researchers randomly assigned groups]. Their own individual interests were not affected by their choices, since they always assigned points to two other people and no one could know what any other boy’s choices were. The amount of money were not trivial for them: each boy left the experiment with the equivalent of about a dollar. Inasmuch as they could not know who was in their group and who was in the other group, they could have adopted either of two reasonable strategies. They could have chosen the maximum-joint-profit point of the matrices, which would mean that the boys as a total group would get the most money out of the experimenters, or they could choose the point of maximum fairness. Indeed, they did tend to choose the second alternative when their choices did not involve a distinction between ingroup and outgroup. As soon as this differentiation was involved, however, they discriminated in favour of the ingroup. The only thing we needed to do to achieve this result was to associate their judgements of numbers of dots with the use of the terms “your group,” and “the other group” in the instructions […]

        Tajfel, Henri. “Experiments in intergroup discrimination.” Scientific American 223.5 (1970): 96-102.

It’s a sobering thought. If we can start favouring an ingroup and punishing an outgroup, along lines as arbitrary as how good we are at counting dots, what are the odds of us carving lines in the sand over skin colour, or genitalia?

Or for that matter, beliefs and rituals?

There are two key differences, however. You can’t change genitalia or skin colour, and gradations and variety are guaranteed. [155] Behaviour, however, is easily changed, allowing anyone to hop from outgroup to ingroup. It’s quite possible then for a behaviour-based ingroup to grow in size and dominate over all their outgroups. At a magical tipping point, the special treatment enjoyed within the ingroup outweighs any harm that could be inflicted by an outgroup, and the reverse is true from any outgroup’s point of view. There’s a strong incentive to switch, which only grows as more people give in. The eventual result is the disappearance of all outgroups, and the people that remain are nicer and more trusting to one another than they would have been in a non-classist situation.

In theory, of course. There are a number of practical barriers to this utopia.

For one, the ease with which we divide ourselves means that in-groups are prone to splintering along the most trivial lines. This becomes a problem when the in-group cannot provide the benefits it once used to, or there is no real out-group, as there’s very little enforcing group cohesion. If only there were some way to invent an out-group, either by spreading tales of extreme debauchery and mayhem about a group that doesn’t live nearby, or simply making up an all-powerful one out of, say, folklore or leftover gods. If only.

I alluded to another in the Morality proof. Classism is easy prey for cheaters, who will happily wave around the in-group symbols but refuse to act as nice; the obvious counter is to add costly symbols and rituals, such as piercings and other body modifications. A less obvious one is to invoke an always-present, always watching police-thing to ensure everyone toes the line.


While the above two components are enough to get a religion off the ground, they don’t explain why religions have such staying power. For that, you need one more ingredient: evolution.

As I discussed in my chapter on the Design/Teleological proof, evolution applies to far more than biology. It’s a general-purpose feedback loop that works equally well with culture and ideas. Religions are no exception, as they have all the basic requirements.

Traditions and rituals are easily passed from person to person, forming new copies of themselves. African-Americans brought over to the United States for the slave trade readily absorbed the religion of their captors, to the point that they are more likely to be Christian than those of European descent. [156]

The Christianity they adopted differs in important ways, however. African-Americans placed far more emphasis on music; while their European counterparts specialized in boring chants of ancient lyrics, they formed lively choirs of freshly-minted words and created an entire musical genre known as “gospel.” [157] While their fellow European citizens became used to dealing with a distant, aloof church system, African-Americans made theirs keenly interested in social justice and humanitarian causes. [158] By changing Christianity to suit them, they made it more suitable and thus tougher to walk away from.

This hasn’t escaped the notice of non-Africans. Faced with emptying pews, church leaders elsewhere have started adopting the innovations of African-Americans to woo churchgoers back. The use of popular music is on the uptick, [159] as well as an emphasis on charity and improving the lot of your fellow human. [160]

Self-replication, variation, limited environment, and feedback. Every aspect is there, creating a feedback loop of self-preserving replication.

Social Attachment

While evolution is the key ingredient, that doesn’t rule out some spices to help seal the deal.

We are social creatures, at heart. Like dogs, bats, prairie dogs, and our fellow apes, we rely on teamwork to survive. Not surprisingly, the process of evolution has strengthened that by planting various rewards within us.

[TODO: friendship bonus]

[TODO: parent bonus]

These same rewards could be redirected to other ends, however. Making friends with an imaginary being would convey some of the same rewards granted by hanging around with a real-life friend, only this imaginary being will never talk back to you. Having an imaginary being as a parent would provide a sense of security that a real-life parent could never provide.

Fear of Death

[TODO. But see “Terror Management Theory”]

How Religion Started

All merged to form religion. Hunter-gatherer society had good punishment for misbehavior, in the form of ostracism, but as population grew it became less useful. No police around to enforce rules, so who could? Religion evolved a solution with divine punishment via afterlife and a central authority, which made social organization of large groups much easier. Can see this in tribal spirituality vs. early religions.

Thus: religion is social structure that benefits members by policing group behavior via a supernatural justice system. Evidence:

  • As countries get wealthier and more secure, religiosity drops off dramatically
  • More likely in less secure nations, such as US (high health bankruptcies, high prison population, low feelings of security)
  • Religion is strongly correlated with large groups; smaller tribes just don’t need it.
  • Belief in god isn’t important, playing along with group is. EG:
      • limited grasp of important religious precepts, ignorance of holy texts
      • ease of ignoring basic codes when impractical (churchgoing stats in US, “believing in belief”)
      • afterlife more common than god
      • emphasis on community and communal ritual, instead of private prayer
      • the highly religious are treated with disdain, like the non-believers
  • wait, what role do true believers play? They make it easier to accept tribal markers, but also raise the bar for the rest; thus a love-hate relationship (“I wish I felt what they did”).





[153]  Early drafts used the word “tribalism,” but I found a lot of people took me to task for promoting discrimination against “less advanced” people. I struggled to think of a better name, but even “classism” carried an implication of discrimination. Then it hit me: there is no name for this which is free of discrimination, because by its very nature it is group-based discrimination, no more or less. It isn’t fair to say we were born and bred to discriminate, but it is fair to say all of us have that capacity built-in at the lowest possible level.
[HJH of the FUTURE: I’ve since seen “groupiness” tossed around in the scientific literature.]

[154]  “Human groups and social categories: studies in social psychology,” pg. 91-104, Henri Tajfel, 1981.

[155]  24 different genes have been associated with skin colour, and genitalia come in even more varieties; see here for illustrations:

[156]  Pew Forum, “U.S. Religious Landscape Survey, 2007.”


[158]  Lincoln, C. Eric, and Lawrence H. Mamiya. The Black church in the African American experience. Duke University Press Books, 1990.

[159]  TODO

[160]  TODO

The Most Hacked President

The current US President is a beginner’s class in hacking. Let’s rewind back to the end of January.

Lost amid the swirling insanity of the Trump administration’s first week, are the reports of the President’s continued insistence on using his Android phone (a Galaxy S3 or perhaps S4). This is, to put it bluntly, asking for a disaster. President Trump’s continued use of a dangerously insecure, out-of-date Android device should cause real panic. And in a normal White House, it would.

A Galaxy S3 does not meet the security requirements of the average teenager, let alone the purported leader of the free world. The best available Android OS on this phone (4.4) is a woefully out-of-date and unsupported. The S4, running 5.0.1, is only marginally better. Without exaggerating, hacking a Galaxy S3 or S4 is the type of project I would assign as homework for my advanced undergraduate classes.

I know, that one’s a bit old, but it nicely bookends more recent reporting.

We also visited two of President Donald Trump’s other family-run retreats, the Trump International Hotel in Washington, D.C., and a golf club in Sterling, Va. Our inspections found weak and open Wi-Fi networks, wireless printers without passwords, servers with outdated and vulnerable software, and unencrypted login pages to back-end databases containing sensitive information.

The risks posed by the lax security, experts say, go well beyond simple digital snooping. Sophisticated attackers could take advantage of vulnerabilities in the Wi-Fi networks to take over devices like computers or smart phones and use them to record conversations involving anyone on the premises.

“Those networks all have to be crawling with foreign intruders, not just [Gizmodo and] ProPublica,” said Dave Aitel, chief executive officer of Immunity, Inc., a digital security company, when we told him what we found.

Worried that your Pringles can will rat you out? Not to worry, planting a pineapple is easy-peasy.

At the White House, visitors must undergo a rigorous background screening before they’re let in the door. Agents scan every visitor’s full name, birth date, Social Security number, city of residence and country of birth.

But at Mar-a-Lago, gaining entry doesn’t require that degree of disclosure. Guests entering the club go through multiple security checkpoints staffed by the Secret Service looking for weapons or other immediate threats. But there’s only one requirement to produce a photo ID, and the club itself does not ask guests to provide their names or other information when they enter through the main wrought-iron gated door.

The club also serves as a venue for ticketed public events. Hosts for the slate of political and charity dinners booked at the president’s part-time home from now to the end of the club’s season in May told POLITICO the only request for information about attendees has come from the club itself. And all they’re asked to provide is a name, not additional information that can be used for Secret Service background checks in the event the president is in residence.

I’m a bit shocked no-one has tried blackmailing Trump yet. Maybe there are too many people jockeying for that honor? The WiFi networks probably look like a Spy vs. Spy comic by now.

Abolish Gender, unless it’s convenient for us

I was mulling over a post on Meghan Murphy, someone I’d heard about via Bill C-16, when I noticed Shiv beat me to it and did a much better job than I could. She even makes the same point I would have reached for:

… socialization cannot be both something that is possible to reject–as these feminists do with feminine gender roles–and also inevitable destiny. These are obviously mutually exclusive states. That women buck against the subordination expected of them by patriarchs is plain evidence that these socialized experiences are not fixed points of references but experiences that can be continuously and willfully re-contextualized. And if that’s the case, so-called “male socialization”–the standard idea of which does not map neatly to trans women’s experiences–is not as useful if one’s intention is to drive a wedge between cis and trans womanhood. That this observation is seldom accounted for in the TERF mythology speaks to its importance in these kinds of narratives.

This bugged me when I first learned of TERFs; I found it bizarre that they simultaneously argued gender is fluid like water, yet sticks to you like superglue.

… if anatomy is so strongly associated with a tendency to violence, how can you hope to improve things by destroying the concept of “gender?” …  I have yet to see a single TERF with a self-coherent view of sex/gender. That’s because their “criticism” isn’t actually a critique, based on solid evidence and analysis, but a fig leaf to disguise their bigotry.

I prefer Shiv’s phrasing, though, and her post covers a lot more than one note. Give it a boo.

Proof from Popularity (1)

Proof from Popularity

Some things never go out of style.

The Sun always rises in the East and sets in the West. The seasons come and go in an orderly manner. Tides rise and fall; there’s never a miscommunication.

We always seem to have a god around. The vast majority of human beings, living or dead, believe or believed in one or more gods. The details differ, of course, but not the desire.

Nothing else in our cultures has been as permanent. Traditions get created, changed, lost, and revived all the time. In the United States, an ancient fertility festival has become an excuse to eat chocolate. In Japan the tradition of Seppuku, or ritual suicide by slicing open one’s stomach, has died out. Norway has largely given up blót, which consisted of hanging various animals (including humans) and creating a feast from their flesh. Fondue was revived by Swiss wine and cheese producers, to encourage people to buy more wine and cheese. Something similar happened in the United States in the 1930s; diamond producers had an excess of diamonds, so they hired marketers to create more demand by linking marriage proposals to the gift of a diamond ring.

Doesn’t the continuous popularity of religion speak to the existence of a higher power?

Bridge Jumping

Many of us were taught at a young age that just because something is popular doesn’t mean it’s right. Humans, like other social creatures, tend to form packs or tribes with a hierarchy of power. We reinforce these groupings through shared behaviour, by grooming one another or parcelling out food.

So if a high-ranking member does something notable, like harass someone not in the clan, there’s an incredible amount of pressure to imitate them. Our culture has decided that this instinct should be resisted,[146] so we try to teach children to think of the greater good instead. Wrong is wrong, no matter how popular it is.

This idea persists into adulthood. Think of the people you consider moral heroes. I’m willing to bet that while their neighbours cried yes, they said no. Oskar Schindler is praised for saving a thousand Jews while his peers were hunting them down. From the opposite end, the Nuremberg trials sent out a clear message that “I’m just following orders” is not an excuse; if a superior commands you to do something immoral, or everyone else in your unit is committing vile acts, you must refuse to go with the crowd. Otherwise, you are as guilty as them.

Therefore, we try not to judge the truth of things based on popularity. Adding a special exemption for religion is a poor idea. The non-religious[147] are currently the third most popular “religion,” after Islam and Christianity, and have never held a larger proportion of the world’s population. Does this mean a god is less likely to exist as time goes on? If Europe were hit by a giant meteor, wiping out a large chunk of the non-religious, does this mean religion is now more truthful?

Judgement Day

So if we can’t judge religion to be useful by how popular it is, how can we judge it?

No, wait, we have another question to answer first: can we judge religion? The religious claim to be above the fray, after all, pulling from a divine mandate of some sort that secular people lack. Doesn’t this make them impossible to judge?

I’d be more swayed by this argument if there were only one religion in the world. Instead we find thousands of religions, many of them splintered into various sects. How will you decide which religion to follow, without judging one against the other? If you dodge that by saying you worship all faiths, even though you don’t follow all of their must-follow rules, then I have some bad news:

Strive against the disbelievers and the hypocrites! Be harsh with them. Their ultimate abode is hell, a hapless journey’s end.

(Quran, verse 9:73)

He that sacrificeth unto any god, save unto Jehovah only, shall be utterly destroyed.

(Old Testament, Exodus 22:20, American Standard translation)

If you worship any religion other than Islam, you will suffer eternally. If you worship any religion other than Judaism,[148] you’ll be killed. Worship both Islam and Judaism, or neither religion, and you’ll have both fates. Ignore one or both of these lines and you’ve placed your moral judgement above god’s, since both of these sources are divinely-inspired words from a god.

Before reading this book, you were forced to make a judgement on religion. Since I presume you’re still alive and in reasonably good health, I think that signals it’s A-OK to judge religion in general.

What criterion should we use for judgement? I’d argue that the best way is through behaviour. All religions tell their adherents how to live a moral, just life. We should expect the religious to live better than their godless counterparts, perhaps by having to deal with less crime or consistently coming up happier in surveys.

In the Pragmatic Argument I consider this, and reject it.

The Ascent of Religion

If you agree with my assessment in Pragmatic, though, we’re left with an unsettling conclusion. If religion is not that useful, why do so many people insist on being religious? Couldn’t that imply we must believe in something, or that we’re being compelled to join religion by something external?

I think I can answer this by sharing my theory of how religion got started in the first place.

I’m not the first to come up with a theory, not by a long shot: for instance, Edward Burnett Tylor had a reasonable one back in 1871. Modern theories tend to fall into a few categories, such as those that invoke evolution:

Perhaps the most basic question is whether the trait is an adaptation that evolved by a process of selection. Does a given element of religion exist because it helps an entity (such as an individual or a group) survive and reproduce better than competing entities? If so, then we need to determine the relevant entity. Does the given element of religion increase the fitness of whole groups, compared to other groups (between-group selection), or by increasing the fitness of individuals compared to other individuals within the same group (within-group selection)? With cultural evolution there is an interesting third possibility. A cultural trait can spread by benefiting whole groups or individuals within groups, but it can also spread by enhancing its own transmission at the expense of human individuals and groups, as if it were a parasitic organism in its own right (Dawkins 2006, Dennett 2006). The concept of religion as a disease is highly novel against the background of traditional religious scholarship.

If a trait is not an adaptation, it can nevertheless persist in the population for a variety of reasons. Perhaps it was adaptive in the past but no longer in the present. For example, our eating habits make excellent sense in a world of food scarcity but have become a major cause of death in modern fast-food environments. Perhaps some elements of religion are like obesity—adaptive in the tiny social groups of our ancestral past, but not in modern mega-societies (Alexander 1987).

Alternatively, a trait can be a non-adaptive byproduct of another trait. An architectural example made famous by Stephen Jay Gould and Richard Lewontin (1979) is a spandrel, the triangular space that inevitably forms when two arches are placed next to each other. Arches have a function but spandrels do not, although they can acquire a secondary function such as a decorative space. As a biological example, moths use celestial light sources to navigate (an adaptation) but this causes them to spiral inward toward earthly light sources such as a streetlamp or flame—a highly destructive byproduct. Perhaps some elements of religion are like a moth to flame (Dawkins 2006).

Finally, a trait can have no effect whatsoever on survival and reproduction and simply drift into the population. Many genetic mutations are selectively neutral, enabling them to be used as a molecular “clock” for measuring the amount of time that species have been genetically isolated from each other. Some elements of religion might similarly have no rhyme or reason, other than the vagaries of chance.

        (“Evolutionary Religious Studies (ERS): A Beginner’s Guide,” David Sloan Wilson and William Scott Green, draft copy dated September 12th, 2007)

Others point to psychology. Religion could be a cultural system to ease our fears, or a proto-science that satisfied our curiosity and need for explanations before we thought up science proper.

Religion is primarily a search for security and not a search for truth. Religion is what we so often use to bank the fires of our anxiety. That is why religion tends toward becoming excessive, neurotic, controlling and even evil. That is why a religious government is always a cruel government. People need to understand that questioning and doubting are healthy, human activities to be encouraged not to be feared. Certainty is a vice not a virtue. Insecurity is something to be grasped and treasured. A true and healthy religious system will encourage each of these activities. A sick and fearful religious system will seek to remove them.

(“Q&A on biblical criticism,” John Shelby Spong, a weekly mailing dated June 15th, 2005)

The idea that religion is an early form of science is found in many Enlightenment authors, usually with the implication that it has now been replaced by science. Moderate versions of this thesis are found in Auguste Comte and Émile Durkheim. Primarily, however, it was the British anthropologists of religion Edward B. Tylor and James Frazer who defended this view. On the basis of a cognitively oriented associationist psychology, they identified religion with early forms of rational and, especially, scientific thought. For them, religion represented an insufficient answer to cognitive problems such as the explanation of dreams or death. Religion and magic were related in the same way as theory and practice or science and technology. This tradition is represented today by anthropologists such as Robin Horton, who maintains that “primitive” religion is primarily a rational attempt to interpret the world.

(“The promise of salvation: a theory of religion,” pg 56-57, Martin Riesebrodt and Steven Rendal, 2010)

My own theory is primarily evolutionary, but borrows freely from both branches. Religion likely emerged from five separate elements, two of which are optional.

[146]  I agree. We have other, less destructive ways to define and foster groups. Our tendency to live close to each other and our toolmaking skills can fan small flares into big fires.

[147]  This category lumps people who are “spiritual but not religious” in with atheists and agnostics. If you only consider the latter two to be truly non-religious, then the atheist/agnostic stance becomes the fifth most popular “religion” in the world.

[148]  Christianity includes the Old Testament in its bible. Does this mean Christians would kill Jews for refusing to worship the same god, even though they wrote that rule?

A Trump Controversy, in Tweets

Donald Trump:
Crooked Hillary Clinton and her team “were extremely careless in their handling of very sensitive, highly classified information.” Not fit!

Washington Post:
President Trump’s disclosures jeopardized a critical source of intelligence on the Islamic State, officials said

CBS News:
“Highly damaging”: Ex-CIA deputy director on WaPo report that Pres. Trump revealed classified info to Russians

Think about this… Lavrov & Kislyak given classified info from #Trump bc his need for their approval is stronger than his loyalty to U.S

Matthew Chapman:
Lavrov will share the classified info Trump gave him with the Syrians and the Iranians. Americans fighting in the region are going to die.

Ricky Davila:
Just to be clear, Reuters, NYT, & Buzzfeed have all confirmed the #WaPo’s report about trump giving highly classified info to the Russians.

Adrian Carrasquillo:
Per @TreyYingst, Bannon, Mike Dubke, Sarah Sanders and Spicer walked into cabinet room just now. They did not look happy.
Can now hear yelling coming from room where officials are.
WH comms staffers just put the TVs on super loud after we could hear yelling coming from room w/ Bannon, Spicer, Sanders

Hayley Byrd:
Dianne Feinstein exits Senate subway and is surrounded by reporters. “Oh my goodness. What’s happened?” (She hasn’t seen the WaPo story.)
Lindsey Graham tells us the WaPo report is “troubling” if true. I ask him if it’s only troubling. “Yeah, because I don’t know if it’s true.”
I wonder how many GOP senators will say they’re troubled before calling for more information.

Thomas Burr:
Asked whether @jasoninthehouse still trusts Trump with classified info, Chaffetz says, “Of Course.”

Scott Wong:
.@SpeakerRyan spox on WaPo story: “The speaker hopes for a full explanation of the facts from the administration.”

Alice Ollstein:
.@SenatorRisch defends Trump revealing classified info to the Russians: “It’s no longer classified the minute he utters it.”

Hannity right now: “Clinton Email Server Scandal”

Kurt Schlichter:
So: HR McMaster, author of Dereliction of Duty, sat back as Trump disgorged critical classified info, then went outside and lied about it?

The Baxter Bean:
Self-serving Republicans ignoring Trump gave highly classified info to foreign adversaries in the WH, but here’s what they said about email

Tony Posnanski:
“He defended Trump when he gave the Russians classified security info!” – The opening line to everyone running against GOP in 2018

Al Weaver:
MCCONNELL react to Wapo story: “We could do with a little less drama from the White House.”
Full quote. [this is worth clicking through, trust me – HJH]

Norah O’Donnell:
“We had lengthy interactions w/ White House all day yesterday. McMaster never said it was false until after it was published” @gregpmiller

Donald Trump:
As President I wanted to share with Russia (at an openly scheduled W.H. meeting) which I have the absolute right to do, facts pertaining….
…to terrorism and airline flight safety. Humanitarian reasons, plus I want Russia to greatly step up their fight against ISIS & terrorism.
I have been asking Director Comey & others, from the beginning of my administration, to find the LEAKERS in the intelligence community…..

P-hacking is No Big Deal?

Possibly not. simine vazire argued the case over at “sometimes i’m wrong.”

The basic idea is as follows: if we use shady statistical techniques to indirectly adjust the p-value cutoff in Null Hypothesis Significance Testing or NHST, we’ll up the rate of false positives. Just to put some numbers to this, a p-value cutoff of 0.05 means that when the null hypothesis is true, we’ll get a bad sample about 5% of the time and wrongly conclude it’s false. If we use p-hacking to get an effective cutoff of 0.1, however, then that number jumps up to 10%.
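
If you’d rather see those 5% and 10% figures than take them on faith, here is a minimal sketch using only Python’s standard library. It is a stand-in, not the simulation described later: it assumes a two-group z-test with a known standard deviation of one instead of an ANOVA, and the function names, trial count and group size are made up for illustration.

```python
import random
from statistics import NormalDist, fmean

def p_value(group_a, group_b):
    """Two-sided z-test p-value, assuming both groups have a known SD of 1."""
    n_a, n_b = len(group_a), len(group_b)
    standard_error = (1 / n_a + 1 / n_b) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(cutoff, trials=4000, n=20, seed=1):
    """Fraction of null-is-true experiments that still come out 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both groups come from the same distribution, so the null is true.
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        if p_value(a, b) < cutoff:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    for cutoff in (0.05, 0.10):
        print(cutoff, false_positive_rate(cutoff))
```

Each printed rate should land close to its cutoff, give or take sampling noise.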

However, p-hacking will also raise the number of true positives we get. How much higher it gets can be tricky to calculate, but this blog post by Erika Salomon gives out some great numbers. During one simulation run, a completely honest test of a false null hypothesis would return a true positive 12% of the time; when p-hacking was introduced, that skyrocketed to 74%.

If the increase in false positives is balanced out by the increase in true positives, then p-hacking makes no difference in the long run. The number of false positives in the literature would be entirely dependent on the power of studies, which is abysmally low, and our focus should be on improving that. Or, if we’re really lucky, the true positives increase faster than the false positives and we actually get a better scientific record via cheating!
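To make that balancing act concrete, here is a minimal Python sketch of the arithmetic. The function name and the 50/50 split between null and real effects are assumptions made purely for illustration. Pairing the 12% and 74% figures with cutoffs of 0.05 and 0.1 is also an assumption, since those numbers come from different places in the discussion above.

```python
def false_positive_share(null_fraction, alpha, power):
    """Fraction of significant results that come from true null hypotheses."""
    false_pos = null_fraction * alpha          # nulls that slip under the cutoff
    true_pos = (1.0 - null_fraction) * power   # real effects that are detected
    return false_pos / (false_pos + true_pos)

# Hypothetical scenario: half of all tested hypotheses are null.
honest = false_positive_share(0.5, alpha=0.05, power=0.12)
hacked = false_positive_share(0.5, alpha=0.10, power=0.74)

print(round(honest, 3))  # 0.294
print(round(hacked, 3))  # 0.119
```

The share only drops when power grows by a larger factor than the effective cutoff does. Whether any real p-hacking method can deliver such a pairing is exactly the open question.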

We don’t really know which scenario will play out, however, and vazire calls for someone to code up a simulation.

Allow me.

My methodology will be to divide studies up into two categories: null results that are never published, and possibly-true results that are. I’ll be using a one-way ANOVA to check whether the averages of two groups drawn from Gaussian distributions differ. I debated switching to a Student’s t-test, but comparing two random draws seems more realistic than comparing one random draw to a fixed mean of zero.

I need a model of effect and sample sizes. This one is pretty tricky; just because a study is unpublished doesn’t mean the effect size is zero, and vice-versa. Making inferences about unpublished studies is tough, for obvious reasons. I’ll take the naive route here, and assume unpublished studies have an effect size of zero while published studies have effect sizes on the same order as those of actual published studies. Both published and unpublished will have sample sizes typical of what’s published.

I have a handy cheat for that: the Open Science Collaboration published a giant replication of 100 psychology studies back in 2015, and being Open they shared the raw data online in a spreadsheet. The effect sizes are in correlation coefficients, which are easy to convert to Cohen’s d, and when paired with a standard deviation of one that gives us the mean of the treatment group. The control group’s mean is fixed at zero but shares the same standard deviation. Sample sizes are drawn from said spreadsheet, and represent the total number of samples and not the number of samples per group. In fact, it gives me two datasets in one: the original study effect and sample size, plus the replication’s effect and sample size. Unless I say otherwise, I’ll stick with the originals.
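For the curious, the conversion can be sketched in a few lines of Python. The post does not say which formula was used, so the textbook conversion d = 2r / sqrt(1 - r^2), which assumes equal group sizes, is a guess. Both function names and the even split of the total sample are likewise assumptions for illustration.

```python
import math

def r_to_d(r):
    """Convert a correlation coefficient to Cohen's d (equal group sizes assumed)."""
    return 2.0 * r / math.sqrt(1.0 - r * r)

def group_sizes(total_n):
    """Split a total sample size into control and treatment groups as evenly as possible."""
    control = total_n // 2
    return control, total_n - control

# With a standard deviation of one, Cohen's d doubles as the treatment group's mean.
treatment_mean = r_to_d(0.6)   # r = 0.6 works out to d = 1.5
control_n, treatment_n = group_sizes(41)
```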

P-hacking can be accomplished in a number of ways: switching between the number of tests in the analysis and iteratively doing significance tests are but two of the more common. To simplify things I’ll just assume the effective p-value is a fixed number, but explore a range of values to get an idea of how a variable p-hacking effect would behave.

For some initial values, let’s say unpublished studies constitute 70% of all studies, and p-hacking can cause a p-value threshold of 0.05 to act like a threshold of 0.08.

Octave shall be my programming language of choice. Let’s have at it!

(Template: OSC 2015 originals)
With a 30.00% success rate and a straight p <= 0.050000, the false positive rate is 12.3654% (333 f.p, 2360 t.p)
Whereas if p-hacking lets slip p <= 0.080000, the false positive rate is 18.2911% (548 f.p, 2448 t.p)

(Template: OSC 2015 replications)
With a 30.00% success rate and a straight p <= 0.050000, the false positive rate is 19.2810% (354 f.p, 1482 t.p)
Whereas if p-hacking lets slip p <= 0.080000, the false positive rate is 26.2273% (577 f.p, 1623 t.p)
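The numbers above came from the Octave run against the OSC spreadsheet. For readers who want to poke at the idea without that data, here is a rough Python sketch under loudly stated assumptions: every real effect is a hypothetical d = 0.5 with 30 samples per group, rather than values drawn from the spreadsheet. It also swaps the one-way ANOVA for a two-sided z-test with a known standard deviation of one, so that only the standard library is needed. Its output will not match the figures above.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()

def two_group_p_value(rng, effect, n_per_group):
    """Two-sided z-test p-value for two Gaussian groups with a known SD of one."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treatment = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
    z = (fmean(treatment) - fmean(control)) / (2.0 / n_per_group) ** 0.5
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))

def simulate(null_fraction, effect, n_per_group, n_studies, seed=0):
    """Return p-values for the null studies and for the real-effect studies."""
    rng = random.Random(seed)
    n_null = round(null_fraction * n_studies)
    nulls = [two_group_p_value(rng, 0.0, n_per_group) for _ in range(n_null)]
    reals = [two_group_p_value(rng, effect, n_per_group)
             for _ in range(n_studies - n_null)]
    return nulls, reals

def share_at_cutoff(nulls, reals, cutoff):
    """Share of significant results that are false positives, plus raw counts."""
    fp = sum(p <= cutoff for p in nulls)
    tp = sum(p <= cutoff for p in reals)
    return fp / (fp + tp), fp, tp

# Hypothetical parameters: 70% nulls, d = 0.5, 30 per group, 4000 studies.
nulls, reals = simulate(null_fraction=0.7, effect=0.5, n_per_group=30, n_studies=4000)
for cutoff in (0.05, 0.08):
    share, fp, tp = share_at_cutoff(nulls, reals, cutoff)
    print(f"p <= {cutoff:.2f}: {100 * share:.2f}% false positives ({fp} f.p., {tp} t.p.)")
```

Reusing the same simulated p-values for both cutoffs mirrors the fixed-threshold model of p-hacking: loosening the cutoff can only add results, never remove them.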

Ouch, our false positive rate went up. That seems strange, especially as the true positives (“t.p.”) and false positives (“f.p.”) went up by about the same amount. Maybe I got lucky with the parameter values, though; let’s scan a range of unpublished study rates from 0% to 100%, and effective p-values from 0.05 to 0.2. The actual p-value cutoff will remain fixed at 0.05. So we can fit it all in one chart, I’ll take the p-hacked false positive rate and subtract the vanilla false positive rate from it, so that areas where the false positive rate goes down after hacking are negative.

How varying the proportion of unpublished/false studies and the p-hacking amount changes the false positive rate.
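A scan of this kind can be approximated without any random draws. The Python sketch below replaces the simulation with the closed-form power of a two-sided z-test, and assumes every real effect shares one hypothetical noncentrality of 2.0 (for example, d = 0.5 with 32 samples per group). It is a stand-in for the chart, not a reproduction of it.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
HONEST_CUTOFF = 0.05
NONCENTRALITY = 2.0  # hypothetical: d * sqrt(n_per_group / 2), e.g. d = 0.5, n = 32

def z_test_power(alpha, noncentrality=NONCENTRALITY):
    """Chance a two-sided z-test at cutoff alpha detects a real effect."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    upper_tail = 1.0 - STD_NORMAL.cdf(z_crit - noncentrality)
    lower_tail = STD_NORMAL.cdf(-z_crit - noncentrality)
    return upper_tail + lower_tail

def share(alpha, null_fraction):
    """Share of significant results that come from null effects."""
    fp = null_fraction * alpha
    tp = (1.0 - null_fraction) * z_test_power(alpha)
    return fp / (fp + tp)

null_fractions = [i / 10 for i in range(11)]            # 0% to 100% null studies
hacked_cutoffs = [0.05 + 0.025 * j for j in range(7)]   # 0.05 to 0.20

# Each cell is the p-hacked share minus the honest share for that null fraction.
grid = [[share(c, f) - share(HONEST_CUTOFF, f) for c in hacked_cutoffs]
        for f in null_fractions]
smallest = min(min(row) for row in grid)
print(f"smallest change in false positive share: {smallest:.6f}")
```

A negative cell would mean p-hacking lowered the false positive share somewhere on the grid.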

There are no values less than zero?! How can that be? The math behind these curves is complex, but I think I can give an intuitive explanation.

Drawing the distribution of p-values when the result is null vs. the results from the OSC originals.

The diagonal is the distribution of p-values when the effect size is zero; the curve is what you get when it’s greater than zero. As there are more or fewer values in each category, the graphs are stretched or squashed horizontally. The p-value threshold is a horizontal line, and everything below that line is statistically significant. The proportion of false to true results is equal to the ratio between two lengths along that horizontal line: from the origin to the diagonal, and from the origin to the curve.

P-hacking is the equivalent of nudging that line upwards. The proportions change according to the slope of the curve. The steeper it is, the less it changes. It follows that if you want to increase the proportion of true results, you need to find a pair of horizontal lines where the horizontal distance increases as fast or faster in proportion to the increase along that diagonal. Putting this geometrically, imagine drawing a line starting at the origin but at an arbitrary slope. Your job is to find a slope such that the line pierces the non-zero effect curve twice.

Slight problem: that non-zero effect curve has negative curvature everywhere. The slope is guaranteed to get steeper as you step up the curve, which means it will curve up and away from where the line crosses it. Translating that back into math, it’s guaranteed that the non-zero effect curve will not increase in proportion with the diagonal. The false positive rate will always increase as you up the effective p-value threshold.
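The "pierces the curve twice" condition has a simple numeric counterpart: two different cutoffs would need the same ratio of true positives to cutoff. Here is a hedged Python sketch that checks this ratio for a two-sided z-test. That test is an assumption standing in for the ANOVA, and the noncentrality values are hypothetical rather than taken from the OSC data.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def z_test_power(alpha, noncentrality):
    """Chance a two-sided z-test at cutoff alpha detects a real effect."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    upper_tail = 1.0 - STD_NORMAL.cdf(z_crit - noncentrality)
    lower_tail = STD_NORMAL.cdf(-z_crit - noncentrality)
    return upper_tail + lower_tail

cutoffs = [0.01 * i for i in range(1, 21)]  # 0.01 to 0.20

# For each hypothetical effect strength, record whether power / cutoff
# shrinks at every step up the list of cutoffs.
results = {}
for noncentrality in (0.5, 1.0, 2.0, 3.0):
    ratios = [z_test_power(a, noncentrality) / a for a in cutoffs]
    results[noncentrality] = all(b < a for a, b in zip(ratios, ratios[1:]))

for noncentrality, shrinking in results.items():
    print(f"noncentrality {noncentrality}: ratio always shrinking = {shrinking}")
```

If the ratio shrinks at every step, no two cutoffs share it, so no line through the origin can cross the curve twice. Under these assumptions the ratio falls steadily for every effect strength tried, which matches the geometric argument.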

And thus, p-hacking is always a deal.