The Sinmantyx Posts

It started off somewhere around here.

Richard Dawkins: you’re wrong. Deeply, profoundly, fundamentally wrong. Your understanding of feminism is flawed and misinformed, and further, you keep returning to the same poisonous wells of misinformation. It’s like watching creationists try to rebut evolution by citing Kent Hovind; do you not understand that that is not a trustworthy source? It’s a form of motivated reasoning, in which you keep returning to those who provide the comfortable reassurances that your biases are actually correct, rather than challenging yourself with new perspectives.

Just for your information, Christina Hoff Sommers is an anti-feminist. She’s spent her entire career inventing false distinctions and spinning fairy tales about feminism.

In the span of a month, big names in the atheo-skeptic community like Dawkins, Sam Harris, and D.J. Grothe lined up to endorse Christina Hoff Sommers as a feminist. At about the same time, Ayaan Hirsi Ali declared “We must reclaim and retake feminism from our fellow idiotic women,” and the same people cheered her on. Acquaintances of mine who should have known better defended Sommers and Ali, and I found myself arguing against brick walls. Enraged that I was surrounded by the blind, I did what I always do in these situations.

I researched. I wrote.

The results were modest and never widely circulated, but they caught the eye of M.A. Melby. She offered me a guest post at her blog, and I promised to append more to what I had written. And append I did.

After that was said and done, Melby left me a set of keys and said I could get comfortable. I was officially a co-blogger. I started pumping out blog posts, and never really looked back. Well, almost; out of all that I wrote over at Sinmantyx, that first Christina Hoff Sommers piece has consistently been the most popular.
I’ll do the same thing here as with my Sinmantyx statistics posts: keep the originals intact and in place, and create an archive over here.

The Sinmantyx Statistics Posts

Some of my fondest childhood memories were of reading Discover Magazine and National Geographic in my grandfather’s basement. He more than anyone cultivated my interest in science, and having an encyclopedia for a dad didn’t hurt either. This led to a casual interest in statistics, which popped up time and again as the bedrock of science.

Jumping ahead a few years, writing Proof of God led me towards the field of epistemology, or how we know what we know. This fit neatly next to my love of algorithms and computers, and I spent many a fun afternoon trying to assess and break down knowledge systems. I forget exactly how I was introduced to Bayesian statistics; I suspect I may have stumbled across a few articles by chance, but it’s also possible Richard Carrier’s cheerleading was my first introduction. Either way, I began studying the subject with gusto.

By the time I’d started blogging over at Sinmantyx, I had a little experience with the subject and I was dying to flex it. And so Bayesian statistics became a major theme of my blog posts, to the point that I think it deserves its own section.

Speaking of which, I’ve decided to back-date any and all Sinmantyx posts that I re-post over here. There was never any real “publication date” for Proof of God, as it was never published and I constantly went back and revised it over the years I spent writing it, so I feel free to assign any date I want to those posts. The opposite is true of my Sinmantyx work, so I’ll defer to the original publication dates. This does create a problem in finding these posts, as more than likely they’ll never make the RSS feed. Not to worry: I’ll use this blog post to catalog them, so just bookmark this or look for it along my blog header.


Replication Isn’t Enough

I bang on about statistical power because low power indirectly raises the odds of a false positive. In brief, low power forces you to run more tests to reach a statistical conclusion, stuffing the file drawer and thus making published results appear more certain than they are. In detail, see John Borghi or Ioannidis (2005). In comic, see Maki Naro.
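
To make that concrete, here’s a minimal sketch of the arithmetic, along the lines of Ioannidis (2005). The prior fraction of true hypotheses is a number I made up for illustration; nothing here comes from a specific study.

```python
# Sketch: how statistical power changes the false discovery rate among
# "significant" findings, assuming only significant results get published.
# The prior (fraction of tested hypotheses that are actually true) is a
# made-up but plausible number, purely for illustration.

alpha = 0.05          # false positive rate per test of a true null
prior_true = 0.1      # assumed fraction of tested hypotheses that are true

for power in (0.8, 0.5, 0.2):
    true_hits = prior_true * power               # true effects detected
    false_hits = (1 - prior_true) * alpha        # null effects "detected"
    fdr = false_hits / (true_hits + false_hits)  # share of significant results that are false
    print(f"power={power:.1f}: {fdr:.0%} of significant findings are false positives")

# power=0.8: ~36% of significant findings are false; power=0.2: ~69%.
# Lower power means a larger share of the published record is noise.
```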

The concept of statistical power has been known since 1928, the wasteful consequences of low power since 1962, and yet there’s no sign that scientists are upping their power levels. This is a representative result:

Our results indicate that the average statistical power of studies in the field of neuroscience is probably no more than between ~8% and ~31%, on the basis of evidence from diverse subfields within neuroscience. If the low average power we observed across these studies is typical of the neuroscience literature as a whole, this has profound implications for the field. A major implication is that the likelihood that any nominally significant finding actually reflects a true effect is small.

Button, Katherine S., et al. “Power failure: why small sample size undermines the reliability of neuroscience.” Nature Reviews Neuroscience 14.5 (2013): 365-376.

The most obvious consequence of low power is a failure to replicate. If you rarely try to replicate studies, you’ll be blissfully unaware of the problem; once you take replications seriously, though, you’ll suddenly find yourself in a “replication crisis.”

You’d think this would result in calls for increased statistical power, with the occasional call for a switch in methodology to a system that automatically incorporates power. But it’s also led to calls for more replications.

As a condition of receiving their PhD from any accredited institution, graduate students in psychology should be required to conduct, write up, and submit for publication a high-quality replication attempt of at least one key finding from the literature, focusing on the area of their doctoral research.
Everett, Jim AC, and Brian D. Earp. “A tragedy of the (academic) commons: interpreting the replication crisis in psychology as a social dilemma for early-career researchers.” Frontiers in psychology 6 (2015).


Much has been made of preregistration, publication of null results, and Bayesian statistics as important changes to how we do business. But my view is that there is relatively little value in appending these modifications to a scientific practice that is still about one-off findings; and applying them mechanistically to a more careful, cumulative practice is likely to be more of a hindrance than a help. So what do we do? …

Cumulative study sets with internal replication.

If I had to advocate for a single change to practice, this would be it.

There’s an intuitive logic to this: currently less than one in a hundred papers are replications of prior work, so there’s plenty of room for expansion; key figures like Ronald Fisher and Jerzy Neyman emphasized the necessity of replication; it doesn’t require any modification of technique; and the “replication crisis” is, after all, primarily about replications. It sounds like an easy, feel-good solution to the problem.

But then I read this paper:

Smaldino, Paul E., and Richard McElreath. “The Natural Selection of Bad Science.” arXiv preprint arXiv:1605.09511 (2016).

It starts off with a meta-analysis of meta-analyses of power, and comes to the same conclusion as above.

We collected all papers that contained reviews of statistical power from published papers in the social, behavioural and biological sciences, and found 19 studies from 16 papers published between 1992 and 2014. … We focus on the statistical power to detect small effects of the order d=0.2, the kind most commonly found in social science research. …. Statistical power is quite low, with a mean of only 0.24, meaning that tests will fail to detect small effects when present three times out of four. More importantly, statistical power shows no sign of increase over six decades …. The data are far from a complete picture of any given field or of the social and behavioural sciences more generally, but they help explain why false discoveries appear to be common. Indeed, our methods may overestimate statistical power because we draw only on published results, which were by necessity sufficiently powered to pass through peer review, usually by detecting a non-null effect.
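
As a sanity check on numbers like that, here’s a quick simulation of my own (not the paper’s code): with a modest sample per group and a true effect of d = 0.2, a two-sample t-test detects the effect only a small fraction of the time. The sample size is an arbitrary but typical choice.

```python
# Sketch: simulate the power of a two-sample t-test for a small effect
# (Cohen's d = 0.2). Sample size per group is an arbitrary, typical-ish
# choice; this is not the paper's own code.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d, n_per_group, n_sims = 0.2, 30, 20_000

hits = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(d, 1.0, n_per_group)   # true effect of d standard deviations
    _, p = stats.ttest_ind(treated, control)
    hits += (p < 0.05)

print(f"Estimated power at n={n_per_group}/group, d={d}: {hits / n_sims:.2f}")
# Roughly 0.12 -- consistent with the low averages reported above.
```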

Rather than leave it at that, though, the researchers decided to simulate the pursuit of science. They set up various “labs” that exerted different levels of effort to maintain methodological rigor, killed off labs that didn’t publish much and replaced them with mutations of labs that published more, and set the simulation spinning.
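
Here’s a drastically simplified toy version of that dynamic, so you can see the selection pressure in action. This is my own sketch with invented parameters, not Smaldino and McElreath’s actual model: labs that put in less effort get more “publishable” positives, and once you select on publication counts, effort drifts toward the floor.

```python
# Toy sketch of the "natural selection of bad science" dynamic, loosely
# inspired by Smaldino & McElreath (2016). All parameters are invented
# for illustration; this is not their model.
import random

random.seed(0)
N_LABS, GENERATIONS, HYPOTHESES_PER_ROUND = 100, 200, 20
BASE_RATE = 0.1          # fraction of tested hypotheses that are actually true

def false_positive_rate(effort):
    # Less effort -> sloppier methods -> more false positives.
    return 0.05 + 0.45 * (1 - effort)

def publications(effort):
    pubs = 0
    for _ in range(HYPOTHESES_PER_ROUND):
        if random.random() < BASE_RATE:
            pubs += random.random() < 0.8                    # true effect, decent chance of detection
        else:
            pubs += random.random() < false_positive_rate(effort)
    return pubs

labs = [random.uniform(0.5, 1.0) for _ in range(N_LABS)]     # effort levels

for gen in range(GENERATIONS):
    scores = [(publications(e), e) for e in labs]
    scores.sort(reverse=True)
    survivors = [e for _, e in scores[: N_LABS // 2]]        # keep the most "productive" half
    # Dead labs are replaced by mutated copies of survivors.
    children = [min(1.0, max(0.0, random.choice(survivors) + random.gauss(0, 0.02)))
                for _ in range(N_LABS - len(survivors))]
    labs = survivors + children

print(f"Mean effort after {GENERATIONS} generations: {sum(labs)/len(labs):.2f}")
# Effort collapses toward its floor: cutting corners wins when only
# publication counts are rewarded.
```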

We ran simulations in which power was held constant but in which effort could evolve (μw=0, μe=0.01). Here selection favoured labs who put in less effort towards ensuring quality work, which increased publication rates at the cost of more false discoveries … . When the focus is on the production of novel results and negative findings are difficult to publish, institutional incentives for publication quantity select for the continued degradation of scientific practices.

That’s not surprising. But then they started tinkering with replication rates. To begin with, replications were done 1% of the time, were guaranteed to be published, and having one of your results fail to replicate would exact a terrible toll.

We found that the mean rate of replication evolved slowly but steadily to around 0.08. Replication was weakly selected for, because although publication of a replication was worth only half as much as publication of a novel result, it was also guaranteed to be published. On the other hand, allowing replication to evolve could not stave off the evolution of low effort, because low effort increased the false-positive rate to such high levels that novel hypotheses became more likely than not to yield positive results … . As such, increasing one’s replication rate became less lucrative than reducing effort and pursuing novel hypotheses.

So it was time for extreme measures: force the replication rate to high levels, to the point that 50% of all studies were replications. All that happened was that it took longer for the overall methodological effort to drop and false positives to bloom.

Replication is not sufficient to curb the natural selection of bad science because the top performing labs will always be those who are able to cut corners. Replication allows those labs with poor methods to be penalized, but unless all published studies are replicated several times (an ideal but implausible scenario), some labs will avoid being caught. In a system such as modern science, with finite career opportunities and high network connectivity, the marginal return for being in the top tier of publications may be orders of magnitude higher than an otherwise respectable publication record.

Replication isn’t enough. The field of science needs to incorporate more radical reforms that encourage high methodological rigor and greater power.

Veritasium on the Reproducibility Crisis

Veritasium’s video on the reproducibility crisis is a great summary, going into much more depth than most. I really like how Muller brought out a concrete example of publication bias, and found an example of p-hacking in a branch of science that’s usually resistant to it, physics.

But I’m not completely happy with it. Some of this comes from being a Bayesian fanboi who didn’t hear the topic mentioned, but Muller also takes a weird turn at the end: he argues that, as bad as the flaws in science may be, they are far worse in all our other systems of learning about the world.

Slight problem: there are no other systems. Even “I feel it’s true” is based on an evidential claim, evaluated for plausibility against other competing hypotheses. The weighting procedure may be hopelessly skewed, but so too are p-values and the publication process.

Muller could have strengthened his point by bringing up an example, yet did not. We’re left taking his word that science isn’t the sole methodology we have for exploring the world, and that those alternate methodologies aren’t as rigorous. Meanwhile, he explicitly points out that only a small fraction of “landmark cancer trials” could be replicated; this implies that cancer treatments, and by extension the well-being of millions of cancer patients, are being harmed by poor methodology in science. Even if you disagree with my assertion that all epistemologies are scientific in some fashion, it’s tough to find a counter-example that affects 40% of us and will kill a quarter.

My hope doesn’t come from a blind assurance that other methodologies are worse than science, it comes from the news that scientists have recognized the flaws in their trade, and are working to correct them. To be fair to Muller, he’d probably agree.

What is False?

John Oliver weighed in on the replication crisis, and I think he did a great job. I’d have liked a bit more on university press departments, who can write misleading press releases that journalists jump on, but he did have to simplify things for a lay audience.

It got me thinking about what “false” means, though. “True” is usually defined as “in line with reality,” so “false” should mean “not in line with reality,” the precise complement.

But don’t think about it in terms of a single claim; think of multiple data points applied to a specific hypothesis. Suppose we analyze that data, and find that all but a few datapoints are predicted by the hypothesis we’re testing. Does this mean the hypothesis is false, since it isn’t in line with reality in all cases, or true, because it’s more in line with reality than not? Falsification argues that it is false, and exploits that to come up with this epistemology:

  1. Gather data.
  2. Is that data predicted by the hypothesis? If so, repeat step 1.
  3. If not, replace this hypothesis with another that predicts all the data we’ve seen so far, and repeat step 1.

That’s what I had in mind when I said that frequentism works on streams of hypotheses, hopping from one “best” hypothesis to the next. The addition of time changes the original definitions slightly, so that “true” really means “in line with reality in all instances” while “false” means “in at least one instance, it is not in line with reality.”

Notice the asymmetry, though. A hypothesis has to reach a pretty high bar to be considered “true,” while “false” hypotheses range from “in line with reality, with one exception” to “never in line with reality.” Some of those “false” hypotheses are actually quite valuable to us, as John Oliver’s segment demonstrates. He never explains what “statistical significance” means, for instance, but later on uses “significance” in the “effect size” sense. This will mislead most of the audience away from the reality of the situation, and taken absolutely it makes his segment “false.” Nonetheless, that segment was a net positive at getting people to understand and care about the replication crisis, so labeling it “false” is a disservice.

We need something fuzzier than the strict binary of falsification. What if we didn’t complement “true” in the set-theory sense, but in the definitional sense? Let “true” remain “in line with reality in all instances,” but change “false” from “in at least one instance, it is not in line with reality” to “never in line with reality.” This creates a gap, though: that hypothesis from earlier is neither “true” nor “false,” as it isn’t true in all cases nor false in all. It must be in a third category, as part of some sort of paraconsistent logic.

This is where the Bayesian interpretation of statistics comes from: it deliberately disclaims an absolute “true” or “false” label for descriptions of the world, instead holding them up as two ends of a continuum. Every hypothesis sits in the third category in between, hoping that future data will reveal it’s closer to one end of the continuum or the other.
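
Here’s a tiny numerical sketch of that continuum, with numbers I’ve invented purely for illustration: a hypothesis that predicts most but not all observations never gets a hard “true” or “false” stamp, just a degree of belief that shifts as data accumulate.

```python
# Sketch: degree of belief as a continuum rather than a true/false binary.
# H predicts each observation with probability 0.9; the alternative is a
# coin-flip model. Numbers are invented purely for illustration.
p_h, p_alt = 0.5, 0.5            # start agnostic between the two models

observations = [True] * 18 + [False] * 2    # 18 predicted hits, 2 misses

for predicted in observations:
    like_h = 0.9 if predicted else 0.1      # P(datum | H)
    like_alt = 0.5                          # P(datum | alternative)
    p_h, p_alt = p_h * like_h, p_alt * like_alt
    total = p_h + p_alt
    p_h, p_alt = p_h / total, p_alt / total # renormalise to a posterior

print(f"P(H | data) = {p_h:.3f}")
# About 0.999: two failed predictions don't "falsify" H, they just
# nudge its plausibility relative to the alternative.
```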

I think it’s a neat way to view the Bayesian/frequentist debate: as a mere disagreement over what “false” means.

Index Post: P-values

Over the months, I’ve managed to accumulate a LOT of papers discussing p-values and their application. Rather than have them rot on my hard drive, I figured it was time for another index post.

Full disclosure: I’m not in favour of them. But I came to that by reading these papers, and seeing no effective counter-argument. So while this collection is biased against p-values, that’s no more a problem than a bias against the luminiferous aether or the theory of humours. And don’t worry, I’ll include a few defenders of p-values as well.

What’s a p-value?

It’s frequently used in “null hypothesis significance testing,” or NHST to its friends. A null hypothesis is one you hope to refute, preferably a fairly established one that other people accept as true. That hypothesis will predict a range of observations, some more likely than others. A p-value is simply the probability of some observed event happening, plus the probability of all events more extreme, assuming the null hypothesis is true. You can then plug that value into the following logic:

  1. Event E, or an event more extreme, is unlikely to occur under the null hypothesis.
  2. Event E occurred.
  3. Ergo, the null hypothesis is false.
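
A minimal worked example, using a coin-flip experiment I’ve invented for illustration: the null hypothesis is a fair coin, the observed event is 16 heads in 20 flips, and the p-value adds up the probability of that result and of everything more extreme in either direction.

```python
# Sketch: a two-sided p-value for 16 heads in 20 flips of a supposedly
# fair coin, computed directly from the binomial distribution.
from scipy.stats import binom

n, k, p_null = 20, 16, 0.5

# Probability of a result at least as extreme as k heads, in either tail.
p_upper = binom.sf(k - 1, n, p_null)      # P(X >= 16)
p_lower = binom.cdf(n - k, n, p_null)     # P(X <= 4), the mirror-image tail
p_value = p_upper + p_lower

print(f"p = {p_value:.4f}")    # about 0.012
# Under NHST logic: 0.012 < 0.05, so we'd "reject" the fair-coin null.
```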

They seem like a weird thing to get worked up about.

Significance testing is a cornerstone of modern science, and NHST is the most common form of it. A quick check of Google Scholar turns up “p-value” 3.8 million times, while its primary competitor, “Bayes factor,” shows up 250,000 times. At the same time, it’s poorly understood.

The P value is probably the most ubiquitous and at the same time, misunderstood, misinterpreted, and occasionally miscalculated index in all of biomedical research. In a recent survey of medical residents published in JAMA, 88% expressed fair to complete confidence in interpreting P values, yet only 62% of these could answer an elementary P-value interpretation question correctly. However, it is not just those statistics that testify to the difficulty in interpreting P values. In an exquisite irony, none of the answers offered for the P-value question was correct, as is explained later in this chapter.

Goodman, Steven. “A Dirty Dozen: Twelve P-Value Misconceptions.” In Seminars in Hematology, 45:135–40. Elsevier, 2008. http://www.sciencedirect.com/science/article/pii/S0037196308000620.

The consequence is an abundance of false positives in the scientific literature, leading to many failed replications and wasted resources.

Gotcha. So what do scientists think is wrong with them?

Well, th-

And make it quick, I don’t have a lot of time.

Right right, here’s the top three papers I can recommend:

Null hypothesis significance testing (NHST) is arguably the most widely used approach to hypothesis evaluation among behavioral and social scientists. It is also very controversial. A major concern expressed by critics is that such testing is misunderstood by many of those who use it. Several other objections to its use have also been raised. In this article the author reviews and comments on the claimed misunderstandings as well as on other criticisms of the approach, and he notes arguments that have been advanced in support of NHST. Alternatives and supplements to NHST are considered, as are several related recommendations regarding the interpretation of experimental data. The concluding opinion is that NHST is easily misunderstood and misused but that when applied with good judgment it can be an effective aid to the interpretation of experimental data.

Nickerson, Raymond S. “Null Hypothesis Significance Testing: A Review of an Old and Continuing Controversy.” Psychological Methods 5, no. 2 (2000): 241.

After 4 decades of severe criticism, the ritual of null hypothesis significance testing (mechanical dichotomous decisions around a sacred .05 criterion) still persists. This article reviews the problems with this practice, including near universal misinterpretation of p as the probability that H₀ is false, the misinterpretation that its complement is the probability of successful replication, and the mistaken assumption that if one rejects H₀ one thereby affirms the theory that led to the test.

Cohen, Jacob. “The Earth Is Round (p < .05).” American Psychologist 49, no. 12 (1994): 997–1003. doi:10.1037/0003-066X.49.12.997.

This chapter examines eight of the most commonly voiced objections to reform of data analysis practices and shows each of them to be erroneous. The objections are: (a) Without significance tests we would not know whether a finding is real or just due to chance; (b) hypothesis testing would not be possible without significance tests; (c) the problem is not significance tests but failure to develop a tradition of replicating studies; (d) when studies have a large number of relationships, we need significance tests to identify those that are real and not just due to chance; (e) confidence intervals are themselves significance tests; (f) significance testing ensures objectivity in the interpretation of research data; (g) it is the misuse, not the use, of significance testing that is the problem; and (h) it is futile to reform data analysis methods, so why try?

Schmidt, Frank L., and J. E. Hunter. “Eight Common but False Objections to the Discontinuation of Significance Testing in the Analysis of Research Data.” What If There Were No Significance Tests, 1997, 37–64.

OK, I have a bit more time now. What else do you have?

Using a Bayesian significance test for a normal mean, James Berger and Thomas Sellke (1987, pp. 112–113) showed that for p values of .05, .01, and .001, respectively, the posterior probabilities of the null, Pr(H₀ | x), for n = 50 are .52, .22, and .034. For n = 100 the corresponding figures are .60, .27, and .045. Clearly these discrepancies between p and Pr(H₀ | x) are pronounced, and cast serious doubt on the use of p values as reasonable measures of evidence. In fact, Berger and Sellke (1987) demonstrated that data yielding a p value of .05 in testing a normal mean nevertheless resulted in a posterior probability of the null hypothesis of at least .30 for any objective (symmetric priors with equal prior weight given to H₀ and HA ) prior distribution.

Hubbard, R., and R. M. Lindsay. “Why P Values Are Not a Useful Measure of Evidence in Statistical Significance Testing.” Theory & Psychology 18, no. 1 (February 1, 2008): 69–88. doi:10.1177/0959354307086923.
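
A related result, the Sellke-Bayarri-Berger calibration, makes the same point with one line of arithmetic: −e·p·ln(p) is a lower bound on the Bayes factor in favour of the null (for p < 1/e), and with the 50/50 prior odds assumed in the quote above it converts a p-value into a minimum posterior probability for the null. The code is my own sketch, not from the cited paper.

```python
# Sketch: the Sellke-Bayarri-Berger calibration, a lower bound on how
# much evidence a p-value can carry against a point null (valid for p < 1/e).
# With 50/50 prior odds this converts a p-value into a minimum posterior
# probability that the null is true.
import math

for p in (0.05, 0.01, 0.001):
    bf_null = -math.e * p * math.log(p)          # lower bound on BF in favour of H0
    post_null = bf_null / (1.0 + bf_null)        # assumes prior odds of 1:1
    print(f"p = {p:<5}  ->  P(H0 | data) >= {post_null:.2f}")

# p = 0.05 still leaves at least a ~29% chance the null is true,
# roughly the ".30" figure quoted above.
```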

Because p-values dominate statistical analysis in psychology, it is important to ask what p says about replication. The answer to this question is ‘‘Surprisingly little.’’ In one simulation of 25 repetitions of a typical experiment, p varied from .44. Remarkably, the interval—termed a p interval —is this wide however large the sample size. p is so unreliable and gives such dramatically vague information that it is a poor basis for inference.

Cumming, Geoff. “Replication and p Intervals: p Values Predict the Future Only Vaguely, but Confidence Intervals Do Much Better.” Perspectives on Psychological Science 3, no. 4 (July 2008): 286–300. doi:10.1111/j.1745-6924.2008.00079.x.

Simulations of repeated t-tests also illustrate the tendency of small samples to exaggerate effects. This can be shown by adding an additional dimension to the presentation of the data. It is clear how small samples are less likely to be sufficiently representative of the two tested populations to genuinely reflect the small but real difference between them. Those samples that are less representative may, by chance, result in a low P value. When a test has low power, a low P value will occur only when the sample drawn is relatively extreme. Drawing such a sample is unlikely, and such extreme values give an exaggerated impression of the difference between the original populations. This phenomenon, known as the ‘winner’s curse’, has been emphasized by others. If statistical power is augmented by taking more observations, the estimate of the difference between the populations becomes closer to, and centered on, the theoretical value of the effect size.

Halsey, Lewis G., Douglas Curran-Everett, Sarah L. Vowler, and Gordon B. Drummond. “The Fickle P Value Generates Irreproducible Results.” Nature Methods 12, no. 3 (March 2015): 179–85. doi:10.1038/nmeth.3288.

If you use p=0.05 to suggest that you have made a discovery, you will be wrong at least 30% of the time. If, as is often the case, experiments are underpowered, you will be wrong most of the time. This conclusion is demonstrated from several points of view. First, tree diagrams which show the close analogy with the screening test problem. Similar conclusions are drawn by repeated simulations of t-tests. These mimic what is done in real life, which makes the results more persuasive. The simulation method is used also to evaluate the extent to which effect sizes are over-estimated, especially in underpowered experiments. A script is supplied to allow the reader to do simulations themselves, with numbers appropriate for their own work. It is concluded that if you wish to keep your false discovery rate below 5%, you need to use a three-sigma rule, or to insist on p≤0.001. And never use the word ‘significant’.

Colquhoun, David. “An Investigation of the False Discovery Rate and the Misinterpretation of P-Values.” Royal Society Open Science 1, no. 3 (November 1, 2014): 140216. doi:10.1098/rsos.140216.
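
The three quotes above all lean on the same kind of simulation: run the “same” modestly-powered experiment many times, and watch what the p-values and effect estimates do. Here’s a small sketch of my own along those lines; the effect size, sample size, and replication count are invented, not taken from any of the cited papers.

```python
# Sketch: repeat an underpowered two-sample experiment many times and
# look at (a) how wildly p bounces around, and (b) how much the
# "significant" runs exaggerate the true effect (the winner's curse).
# All parameters are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_d, n, reps = 0.3, 20, 5_000

p_values, estimates = [], []
for _ in range(reps):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_d, 1.0, n)
    _, p = stats.ttest_ind(b, a)
    p_values.append(p)
    estimates.append(b.mean() - a.mean())

p_values = np.array(p_values)
estimates = np.array(estimates)
sig = p_values < 0.05

print(f"p-value range across replications: {p_values.min():.4f} to {p_values.max():.2f}")
print(f"power (share of runs with p < 0.05): {sig.mean():.2f}")
print(f"true effect: {true_d}, mean estimate among significant runs: {estimates[sig].mean():.2f}")
# Typical output: p spans nearly the whole 0-1 range, only ~15% of runs
# are "significant", and those runs overestimate the effect by a factor
# of two or more.
```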

I was hoping for something more philosophical.

The idea that the P value can play both of these roles is based on a fallacy: that an event can be viewed simultaneously both from a long-run and a short-run perspective. In the long-run perspective, which is error-based and deductive, we group the observed result together with other outcomes that might have occurred in hypothetical repetitions of the experiment. In the “short run” perspective, which is evidential and inductive, we try to evaluate the meaning of the observed result from a single experiment. If we could combine these perspectives, it would mean that inductive ends (drawing scientific conclusions) could be served with purely deductive methods (objective probability calculations).

Goodman, Steven N. “Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy.” Annals of Internal Medicine 130, no. 12 (1999): 995–1004.

Overemphasis on hypothesis testing–and the use of P values to dichotomise significant or non-significant results–has detracted from more useful approaches to interpreting study results, such as estimation and confidence intervals. In medical studies investigators are usually interested in determining the size of difference of a measured outcome between groups, rather than a simple indication of whether or not it is statistically significant. Confidence intervals present a range of values, on the basis of the sample data, in which the population value for such a difference may lie. Some methods of calculating confidence intervals for means and differences between means are given, with similar information for proportions. The paper also gives suggestions for graphical display. Confidence intervals, if appropriate to the type of study, should be used for major findings in both the main text of a paper and its abstract.

Gardner, Martin J., and Douglas G. Altman. “Confidence Intervals rather than P Values: Estimation rather than Hypothesis Testing.” BMJ 292, no. 6522 (1986): 746–50.
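
For contrast with the p-value recipe earlier, here’s a minimal sketch of the estimation approach Gardner and Altman advocate: report the size of the difference and a 95% confidence interval around it, rather than a bare significant/non-significant verdict. The data are invented for illustration.

```python
# Sketch: a 95% confidence interval for a difference in means,
# the estimation-first style Gardner & Altman recommend.
# The two samples below are invented numbers for illustration.
import numpy as np
from scipy import stats

treatment = np.array([5.1, 6.3, 4.8, 7.0, 5.9, 6.4, 5.5, 6.8])
control   = np.array([4.2, 5.0, 4.6, 5.3, 4.9, 4.4, 5.1, 4.7])

diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))
dof = len(treatment) + len(control) - 2          # simple approximation
t_crit = stats.t.ppf(0.975, dof)

low, high = diff - t_crit * se, diff + t_crit * se
print(f"difference = {diff:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
# The interval tells you both the plausible size of the effect and its
# uncertainty -- information a lone p-value throws away.
```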

What’s this “Neyman-Pearson” thing?

P-values were part of a method proposed by Ronald Fisher, as a means of assessing evidence. While the ink was barely dry on it, other people started poking holes in his work. Jerzy Neyman and Egon Pearson took some of Fisher’s ideas and came up with a new method, based on controlling error rates over the long run. Their method is superior, IMO, but rather than replacing Fisher’s approach it instead wound up being blended with it, ditching all the advantages to preserve the faults. This citation covers the historical background:

Huberty, Carl J. “Historical Origins of Statistical Testing Practices: The Treatment of Fisher versus Neyman-Pearson Views in Textbooks.” The Journal of Experimental Education 61, no. 4 (1993): 317–33.

The remainder describe the differences between the two methods, and possible ways to “fix” their shortcomings.

The distinction between evidence (p’s) and error (α’s) is not trivial. Instead, it reflects the fundamental differences between Fisher’s ideas on significance testing and inductive inference, and Neyman-Pearson’s views on hypothesis testing and inductive behavior. The emphasis of the article is to expose this incompatibility, but we also briefly note a possible reconciliation.

Hubbard, Raymond, and M. J. Bayarri. “Confusion Over Measures of Evidence (p’s) Versus Errors (α’s) in Classical Statistical Testing.” The American Statistician 57, no. 3 (August 2003): 171–78. doi:10.1198/0003130031856.

The basic differences are these: Fisher attached an epistemic interpretation to a significant result, which referred to a particular experiment. Neyman rejected this view as inconsistent and attached a behavioral meaning to a significant result that did not refer to a particular experiment, but to repeated experiments. (Pearson found himself somewhere in between.)

Gigerenzer, Gerd. “The Superego, the Ego, and the Id in Statistical Reasoning.” A Handbook for Data Analysis in the Behavioral Sciences: Methodological Issues, 1993, 311–39.

This article presents a simple example designed to clarify many of the issues in these controversies. Along the way many of the fundamental ideas of testing from all three perspectives are illustrated. The conclusion is that Fisherian testing is not a competitor to Neyman-Pearson (NP) or Bayesian testing because it examines a different problem. As with Berger and Wolpert (1984), I conclude that Bayesian testing is preferable to NP testing as a procedure for deciding between alternative hypotheses.

Christensen, Ronald. “Testing Fisher, Neyman, Pearson, and Bayes.” The American Statistician 59, no. 2 (2005): 121–26.
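
To make the distinction concrete, here’s a small sketch of my own, not taken from the cited papers: the Neyman-Pearson workflow fixes α and a desired power up front and uses them to choose the sample size before any data are collected, while the Fisherian workflow computes an exact p-value afterwards and reads it as a graded measure of evidence. The effect size and data below are invented.

```python
# Sketch: Neyman-Pearson style planning (fix alpha and power, then choose n)
# versus Fisher-style evidence (compute a p-value after the fact).
# Effect size and data are invented for illustration.
import numpy as np
from scipy import stats

# --- Neyman-Pearson: design the test before seeing any data ---
alpha, power, d = 0.05, 0.80, 0.5
z_alpha = stats.norm.ppf(1 - alpha / 2)
z_power = stats.norm.ppf(power)
n_per_group = 2 * (z_alpha + z_power) ** 2 / d ** 2   # normal approximation
print(f"N-P design: about {n_per_group:.0f} subjects per group for alpha={alpha}, power={power}")

# --- Fisher: run the experiment, report the exact p-value as evidence ---
rng = np.random.default_rng(7)
a = rng.normal(0.0, 1.0, 64)
b = rng.normal(0.5, 1.0, 64)
_, p = stats.ttest_ind(b, a)
print(f"Fisher-style report: p = {p:.3f}")
# N-P only cares whether p crosses the pre-set alpha (reject / don't reject);
# Fisher reads the exact value as a continuous measure of evidence.
```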

C’mon, there aren’t any people defending the p-value?

Sure there are. They fall into two camps: “deniers,” a small group that insists there’s nothing wrong with p-values, and the much more common “fixers,” who propose making up for the shortcomings by augmenting NHST. Since a number of fixers have already been cited, I’ll just focus on the deniers here.

On the other hand, the propensity to misuse or misunderstand a tool should not necessarily lead us to prohibit its use. The theory of estimation is also often misunderstood. How many epidemiologists can explain the meaning of their 95% confidence interval? There are other simple concepts susceptible to fuzzy thinking. I once quizzed a class of epidemiology students and discovered that most had only a foggy notion of what is meant by the word “bias.” Should we then abandon all discussion of bias, and dumb down the field to the point where no subtleties need trouble us?

Weinberg, Clarice R. “It’s Time to Rehabilitate the P-Value.” Epidemiology 12, no. 3 (2001): 288–90.

The solution is simple and practiced quietly by many researchers—use P values descriptively, as one of many considerations to assess the meaning and value of epidemiologic research findings. We consider the full range of information provided by P values, from 0 to 1, recognizing that 0.04 and 0.06 are essentially the same, but that 0.20 and 0.80 are not. There are no discontinuities in the evidence at 0.05 or 0.01 or 0.001 and no good reason to dichotomize a continuous measure. We recognize that in the majority of reasonably large observational studies, systematic biases are of greater concern than random error as the leading obstacle to causal interpretation.

Savitz, David A. “Commentary: Reconciling Theory and Practice.” Epidemiology 24, no. 2 (March 2013): 212–14. doi:10.1097/EDE.0b013e318281e856.

The null hypothesis can be true because it is the hypothesis that errors are randomly distributed in data. Moreover, the null hypothesis is never used as a categorical proposition. Statistical significance means only that chance influences can be excluded as an explanation of data; it does not identify the nonchance factor responsible. The experimental conclusion is drawn with the inductive principle underlying the experimental design. A chain of deductive arguments gives rise to the theoretical conclusion via the experimental conclusion. The anomalous relationship between statistical significance and the effect size often used to criticize NHSTP is more apparent than real.

Hunter, John E. “Testing Significance Testing: A Flawed Defense.” Behavioral and Brain Sciences 21, no. 02 (April 1998): 204–204. doi:10.1017/S0140525X98331167.

A Statistical Analysis of a Sexual Assault Case: Part One

[statistics for the people, and of the people]

I just can’t seem to escape sexual assault. For the span of six months I analysed the Stollznow/Radford case, then finished an examination of Carol Tavris’ talk at TAM2014, so the topic never wandered far from my mind. I’ve bounced my thoughts off other people, sometimes finding support, other times running into confusion or rejection. It’s the latter case that most fascinates me, so I hope you don’t mind if I write my way through the confusion.

The most persistent objection I’ve received goes something like this: I cannot take population statistics and apply them to a specific person. That’s over-generalizing, and I cannot possibly get to a firm conclusion by doing it.

It makes sense on some level. Human beings are wildly different, and can be extremely unpredictable because of that. The field of psychology is scattered with the remains of attempts to bring order to the chaos. However, I’ve had to struggle greatly to reach even that poor level of intellectual empathy, as the argument runs contrary to our every moment of existence. This may be a classic example of talking to fish about water; our unrelenting leaps from the population to the individual seem rare and strange when consciously considered, because these leaps are almost never conscious.

Don’t believe me? Here’s a familiar example.

P1. That object looks like a chair.
P2. Based on prior experience, objects that look like chairs can support my weight.
C1. Therefore, that object can support my weight.

Yep, the Problem of Induction is a classic example of applying the general to the specific. I may have sat on hundreds of chairs in my lifetime, without incident, but that does not prove the next chair I sit on will remain firm. I can even point to instances where a chair did collapse… and yet, if there’s any hesitation when I sit down, it’s because I’m worried about whether something’s stuck to the seat. The worry of the chair collapsing never enters my mind.

Once you’ve had the water pointed out to you, it appears everywhere. Indeed, you cannot do any action without jumping from population to specific.

P1. A brick could spontaneously fly at my head.
P2. Based on prior experience, no brick has ever spontaneously flown at my head.
C1. Therefore, no brick will spontaneously fly at my head.

P1. I’m typing symbols on a page.
P2. Based on prior experience, other people have been able to decode those symbols.
C1. Therefore, other people will decode those symbols.

P1. I want to raise my arm.
P2. Based on prior experience, triggering a specific set of nerve impulses will raise my arm.
C1. Therefore, I trigger those nerve impulses and assume it’ll raise my arm.

“Action” includes the acts of science, too.

P1. I take a measurement with a specific device and a specific calibration.
P2. Based on prior experience, measurements with that device and calibration were reliable.
C1. Therefore, this measurement will be reliable.

Philosophers may view the Problem of Induction as a canyon of infinite width, but it’s a millimetre crack in our day-to-day lives. Not all instances are legitimate, though. Here’s a subtle failure:

P1. This vaccine contains mercury.
P2. Based on prior experience, mercury is a toxic substance with strong neurological effects.
C1. Therefore, this vaccine is a toxic substance with strong neurological effects.

Sure, your past experience may have included horror stories of what happens after chronic exposure to high levels of mercury… but unbeknownst to you, it also included chronic exposure to very low levels of mercury compounds, of varying toxicity, which had no effect on you or anyone else. There’s a stealth premise here: this argument asserts that dosage is irrelevant, something that’s not true but easy to overlook. It’s not hard to come up with similarly flawed examples that are either more subtle (“Therefore, I will not die today”) or less (“Therefore, all black people are dangerous thugs”).

Hmm, maybe this type of argument is unsound when applied to people? Let’s see:

P1. This is a living person.
P2. Based on prior experience, living persons have beating hearts.
C1. Therefore, this living person has a beating heart.

Was that a bit cheap? I’ll try again:

P1. This is a person living in Canada.
P2. Based on prior experience, people living in Canada speak English.
C1. Therefore, this person will speak English.

Now I’m skating onto thin ice. According to StatCan, only 85% of Canadians can speak English, so this is only correct most of the time. Let’s improve on that:

P1. This is a person living in Canada.
P2. Based on prior experience, about 85% of people living in Canada speak English.
C1. Therefore, there’s an 85% chance this person will speak English.

Much better. In fact, it’s much better than anything I’ve presented so far, as it was gathered by professionals in controlled conditions, an immense improvement over my ad-hoc, poorly-recorded personal experience. It also quantifies and puts implicit error bars around what it is arguing. Don’t see how? Consider this version instead:

P1. This is a person living in Canada.
P2. Based on prior experience, about 84.965% of people living in Canada speak English.
C1. Therefore, there’s an 84.965% chance this person will speak English.

The numeric precision sets the implicit error bounds; “about 85%” translates into “from 84.5 to 85.5%.”

Having said all that, it wouldn’t take much effort to track down a remote village in Quebec where few people could talk to me, and the places where I hang out are well above 85% English-speaking. But notice that both are a sub-population of Canada, while the above talks only of Canada as a whole. It’s a solid argument over the domain it covers, but adding more details can change that.

Ready for the next step? It’s a bit scary.

P1. This is a man.
P2. Based on prior experience, between 6 and 62% of men have raped or attempted it.
C1. Therefore, the chance of that man having raped or attempted rape is between 6 and 62%.

Hopefully you can see this is nothing but probability theory at work. The error bars are pretty huge there, but as with the language statistic we can add more details.

P1. This is a male student at a mid-sized, urban commuter university in the United States with a diverse student body.
P2. Based on prior experience, about 6% of such students have raped or attempted it.
C1. Therefore, the chance of that male student having raped or attempted rape is about 6%.

We can do much better, though, by continuing to pile on the evidence we have and watching how the probabilities shift around. Interestingly, we don’t even need to be that precise with our numbers; if there’s sufficient evidence, they’ll converge on an answer. One flip of a coin tells you almost nothing about how fair the process is, while a thousand flips taken together tells you quite a lot (and it isn’t pretty). Even if the numbers don’t come to a solid conclusion, that still might be OK; you wouldn’t do much if there was a 30% chance your ice cream cone started melting before you could lick it, but you would take immediate action if there was a 30% chance of a meteor hitting your house. Fuzzy answers can still justify action, if the consequences are harsh enough and outweigh the cost of getting it wrong.
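
Here’s a quick sketch of that convergence, using the coin example; the coin’s true bias and the flip counts are invented for illustration.

```python
# Sketch: how evidence piles up. A Beta(1, 1) prior over a coin's bias is
# updated after 1, 10, 100, and 1000 flips of a coin that actually comes
# up heads 60% of the time. Numbers are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_bias = 0.6
flips = rng.random(1000) < true_bias

for n in (1, 10, 100, 1000):
    heads = int(flips[:n].sum())
    posterior = stats.beta(1 + heads, 1 + (n - heads))   # conjugate update
    low, high = posterior.ppf(0.025), posterior.ppf(0.975)
    print(f"after {n:>4} flips: 95% credible interval for bias = ({low:.2f}, {high:.2f})")

# One flip leaves the bias almost anywhere in (0, 1); a thousand flips
# pins it down to a narrow band around the true value.
```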

So why not see what answers we can draw from a sexual assault case? Well, maybe because discussing sexual assault is a great way to get sued, especially when the accused in question is rumoured to be very litigious.

So instead, let’s discuss birds.

[HJH 2015-07-19: Changed a link to point to the correct spot.]

When Secularism Is A Lie

In 1990, Gregg Cunningham thought the anti-choice movement was losing the battle for reproductive rights. In response, he formed the Center for Bioethical Reform, then spent years brainstorming how he could reinvent the movement. His answer: secularize it. This allowed anti-choice messaging to dodge past religious disagreement over abortion (Christian denominations are evenly divided over support for abortion) by pretending to be above it all, and get into places a religious approach was barred from entering.

… this is very carefully targeted. When we do this on a university campus there is actually an enormous amount of preparation, and we do a great deal of follow-up. We start pro-life organizations on the campus where none had existed previously, we greatly strengthen currently existing pro-life groups by increasing the size of their membership, by donating to them all kinds of educational resources they can use, we help recruit students to volunteer at the local crisis pregnancy centers. We do a myriad of things of that sort. The same is true of churches. […]

The Genocide Awareness Project is one of a myriad of projects which we are doing, but they are all aimed at the same thing: how can we engage a reluctant culture and educate it over its own objections? It all starts with a willingness to take the heat. We lack moral authority if we are not willing to take the heat.

It signaled that lies and half-truths were perfectly acceptable, since Cunningham’s organization was secular in name only.

We are a secular organization, we’re not a Christian organization, but we are an organization comprised of Christians, and the thing that motivates us personally is the Gospel of Jesus Christ.

While Cunningham is an extremist, his ideas have been very influential. The moderates in the anti-choice movement have since noted the failure of religious arguments, and have embraced trojan secularism. Emphasis mine:

the strenuous efforts of abolitionists have yielded very little in terms of measurable progress in reducing abortion, so it’s time to try a more fruitful strategy.

I have my own beliefs about the sanctity and rights of an unborn baby, but I don’t think we’ll change many minds by arguing about that. The proliferation of 3D ultrasound machines, new research about fetal awareness and pain, and the increasing viability of extremely premature babies will continue to make an impression on some people, but for those who are heavily invested in the moral neutrality of abortion on demand, and who see the concession of any status to the fetus as in direct conflict with the rights of the mother, this won’t make a lot of difference.

We need more discussion, then, of abortion as a women’s issue. Abortion damages women. It does them physical and psychological harm, which is multiplied by the fact that very few women seeking abortions give their informed consent (meaning consent even after being advised of the risks.) Those of us who take such things seriously tend to agree that it does them spiritual harm. More broadly, a culture in which abortion is seen as essentially harmless wreaks profound changes to our collective understanding of motherhood, sexuality, the obligations of mothers and fathers to each other and their children, and adulthood.

It’s been embraced so thoroughly, by extremists and moderates alike, that Kelly Gordon found only 1.9% of anti-choice messages contained a religious element.[1]

The latest variation of this that I’ve heard of comes from Crisis Pregnancy Centres. Cunningham called them “Ministries,” which is more accurate than I realized.

In a conference room at the Embassy Suites in Charleston, South Carolina, Laurie Steinfeld stood behind a podium speaking to an audience of about 50 people. Steinfeld is a counselor at a pregnancy center in Mission Hills, California, and she was leading a session at the annual Heartbeat International conference, a gathering of roughly 1,000 crisis pregnancy center staff and anti-abortion leaders from across the country. Her talk focused on how to help women seeking abortions understand Jesus’s plan for them and their babies, and she described how her center’s signage attracts women.

“Right across the street from us is Planned Parenthood,” she said. “We’re across the street and it [their sign] says ‘Pregnancy Counseling Center,’ but these girls aren’t — they just look and see ‘Pregnancy’ and think, Oh, that’s it! So some of them coming in thinking they’re going to their abortion appointments.” […]

In her workshop, “How to Reach and Inspire the Heart of a Client,” Steinfeld told her audience about her mission to convert clients: “If you hear nothing today, I want you to hear this one thing,” she said. “We might be the very first face of Christ that these girls ever see.”

When someone’s salvation is on the line, anything is justified. Exploiting the desperation of someone in order to bring them into a relationship with Christ is completely justified, so long as you don’t use the word “exploit.”

Multiple women told me it was their job to protect women from abortion as “an adult tells a child not to touch a hot stove.” Another oft-repeated catchphrase was, “Save the mother, save the baby,” shorthand for many pregnancy center workers’ belief that the most effective way to prevent abortion is to convert women. In keeping with Evangelicalism’s central tenets, many pregnancy center staff believe that those living “without Christ”— including Christians having premarital sex — must accept Christ to be born again, redeem their sins, and escape spiritual pain. Carrying a pregnancy to term “redeems” a “broken” woman, multiple staff people told me.

And here again, we find they deliberately avoid the “G” or “J” words until they’ve sealed a connection.

The website for Heartbeat International’s call center, Option Line, offers to connect women with a pregnancy center that “provides many services for free.” It encourages women who are curious about emergency contraception to call its hotline to speak to a representative about “information on all your options.” On the Option Line website, there is no mention of Christ, no religious imagery, no talk of being saved. But visit the website of Heartbeat itself and you’ll find very different language. “Heartbeat International does promote God’s Plan for our sexuality: marriage between one man and one woman, sexual intimacy, children, unconditional/unselfish love, and relationship with God must go together,” it says. […]

In her session, “Do I Really Need Two Sites?” Chenoweth explained that, yes, in fact, pregnancy centers do. She recommended that centers operate one that describes an anti-abortion mission to secure donors and another that lists medical information to attract women seeking contraception, counseling, or abortion. […]

Johnson … emphasized that waiting rooms should feel like “professional environments” instead of “grandma’s house,” and discouraged crucifixes, fake flowers, and mauve paint before showing slides of Planned Parenthood waiting rooms and encouraging staff to make their centers look just as “beautiful and up-to-date,” especially if they have a “medical model,” meaning they offer sonograms and other medical services. Johnson also said pregnancy center staff should mirror Planned Parenthood’s language.

Lies are an integral part of the anti-choice movement. Lies about what abortion does to you, and lies about what they stand for and believe in. Anyone hoping to promote secularism and humanist values should be wary of religion in secular clothing.

 

[1] Gordon, Kelly. “‘Think About the Women!’: The New Anti-Abortion Discourse in English Canada,” 2011. pg. 42.

Index Post: Rape Myth Acceptance

Apologies for going silent, but I’ve been in crunch mode over a lecture on rape culture. The crunch is over, thankfully, and said lecture has been released in video, transcript, and footnote form.

But one strange thing about it is that I never go into depth on the rape myth acceptance literature. There’s actually a good reason why: after thirty years of research, modern papers don’t even bother with 101 level stuff like “why is this a myth?” or even “how many people believe myth X?”, because it’s been done and covered and consensus has been reached. My intended audience was below the 101 level and hostile to the very notion of “rape culture,” rendering much of the literature useless.

But there is soooooo much literature that it feels like a grave injustice not to talk about it. So, let’s try something special: this will be an index post to said literature. It’ll give you the bare minimum of preamble you need to jump in, and offer a little curation. This will evolve and change over time, too, so check back periodically.

[section on comment policy deleted, for obvious reasons]

What is a “Rape Myth”?

A “rape myth” is pretty self-explanatory: it is a false belief about sexual assault, typically shared by more than one person. Martha Burt’s foundational paper of 1980 includes these, for instance:

“One reason that women falsely report a rape is that they frequently have a need to call attention to themselves.”
“Any healthy woman can successfully resist a rapist if she really wants to.”
“Many women have an unconscious wish to be raped, and may then unconsciously set up a situation in which they are likely to be attacked.”
“If a woman gets drunk at a party and has intercourse with a man she’s just met there, she should be considered “fair game” to other males at the party who want to have sex with her too, whether she wants to or not.”

Other myths include “men cannot be raped” and “if you orgasm, it can’t be rape” (we’re meat machines, and at some point low-level physiology will override high-level cognition).

What papers should I prioritize?

As mentioned, there’s Burt’s 1980 contribution, which goes into great detail about validity and correlations with environmental factors, and developed a questionnaire that became foundational for the field.

The present research, therefore, constitutes a first effort to provide an empirical foundation for a combination of social psychological and feminist theoretical analysis of rape attitudes and their antecedents.

The results reported here have two major implications. First, many Americans do indeed believe many rape myths. Second, their rape attitudes are strongly connected to other deeply held and pervasive attitudes such as sex role stereotyping, distrust of the opposite sex (adversarial sexual beliefs), and acceptance of interpersonal violence. When over half of the sampled individuals agree with statements such as “A woman who goes to the home or apartment of a man on the first date implies she is willing to have sex” and “In the majority of rapes, the victim was promiscuous or had a bad reputation,” and when the same number think that 50% or more of reported rapes are reported as rape only because the woman was trying to get back at a man she was angry with or was trying to cover up an illegitimate pregnancy, the world is indeed not a safe place for rape victims.
Burt, Martha R. “Cultural Myths and Supports for Rape.” Journal of Personality and Social Psychology 38, no. 2 (1980): 217.
http://www.excellenceforchildandyouth.ca/sites/default/files/meas_attach/burt_1980.pdf

But there’s also the Illinois Rape Myth Acceptance Scale, developed twenty years later and benefiting greatly from the intervening research.

First, we set out to systematically elucidate the domain and structure of the rape myth construct through reviewing the pertinent literature, discussion with experts, and empirical investigation. Second, we developed two scales, the 45-item IRMA and its 20-item short form (IRMA-SF), designed to reflect the articulated domain and structure of the rape myth construct, as well as to possess good psychometric properties. Finally, whereas content validity was determined by scale development procedures, construct validity of the IRMA and IRMA-SF was examined in a series of three studies, all using different samples, methodologies, and analytic strategies. […]

This work revealed seven stable and interpretable components of rape myth acceptance labeled (1) She asked for it; (2) It wasn’t really rape; (3) He didn’t mean to; (4) She wanted it; (5) She lied; (6) Rape is a trivial event; and (7) Rape is a deviant event. […]

individuals with higher scores on the IRMA and IRMA-SF were also more likely to (1) hold more traditional sex role stereotypes, (2) endorse the notion that the relation of the sexes is adversarial in nature, (3) express hostile attitudes toward women, and (4) be relatively accepting of both interpersonal violence and violence more generally.
Payne, Diana L., Kimberly A. Lonsway, and Louise F. Fitzgerald. “Rape Myth Acceptance: Exploration of Its Structure and Its Measurement Using the Illinois Rape Myth Acceptance Scale.” Journal of Research in Personality 33, no. 1 (March 1999): 27–68. doi:10.1006/jrpe.1998.2238.

What else is interesting?

There was marked variability (…) among studies in their reported relationships between RMA and attitudinal factors related with gender and sexuality (…). Not surprisingly, however, large overall effect sizes with a positive direction were found with oppressive and adversarial attitudes against women, such as attitudes toward women (…), combined measures of sexism (…), victim-blaming attitudes (…), acceptance of interpersonal violence (…), low feminist identity (…), and adversarial sexual beliefs (…). Decision latency (i.e., estimated time for a woman to say no to sexual advances), hostility toward women, male sexuality, prostitution myth, therapists’ acceptance of rape victim scale, sexual conservatism, vengeance, and sociosexuality (i.e., openness to multiple sexual partners) were examined in one study each, and their effect sizes ranged between medium to large and were all significantly larger than zero. Homophobia had a significant moderate effect size (…) as well as male-dominance attitude (…), acceptance of rape (…), and violence (…). However, profeminist beliefs (…), having sexual submission fantasies (…), and male hostility (…) were negatively related to RMA.
Suarez, E., and T. M. Gadalla. “Stop Blaming the Victim: A Meta-Analysis on Rape Myths.” Journal of Interpersonal Violence 25, no. 11 (November 1, 2010): 2010–35. doi:10.1177/0886260509354503.
http://474miranairresearchpaper.wmwikis.net/file/view/metaanalysisstopblamingvictim.pdf

Results of a multiple regression analysis indicated that sexism, ageism, classism, and religious intolerance each were significant predictors of rape myth acceptance (all p < 0.01; … ). Racism and homophobia, however, failed to enter the model. Sexism, ageism, classism, and religious intolerance accounted for almost one-half (45%) of the variance in rape myth acceptance for the present sample. Sexism accounted for the greatest proportion of the variance (35%). The other intolerant beliefs accounted for relatively smaller amounts of variance beyond that of sexism: classism (2%), ageism (2%), and religious intolerance (1%).
Aosved, Allison C., and Patricia J. Long. “Co-Occurrence of Rape Myth Acceptance, Sexism, Racism, Homophobia, Ageism, Classism, and Religious Intolerance.” Sex Roles 55, no. 7–8 (November 28, 2006): 481–92. doi:10.1007/s11199-006-9101-4.
http://www.researchgate.net/publication/226582617_Co-occurrence_of_Rape_Myth_Acceptance_Sexism_Racism_Homophobia_Ageism_Classism_and_Religious_Intolerance/file/72e7e52bd021d8bc72.pdf

We did not find any effect of participant’s gender on rape attributions. Our results confirm those obtained by other authors (Check & Malamuth, 1983; Johnson & Russ, 1989; Krahe, 1988) who haven’t found significant gender effects on rape perception when situational factors were manipulated. Our results also contradict the general finding that men hold more rape myths than women do (Anderson et al., 1997). Our data indicate that it is not the observer’s gender that determines rape attributions but his or her preconceptions about rape. Thus, the influence of gender on rape attributions might be mediated by RMA, which then might explain why some studies reveal a significant gender effect (Monson et al., 1996; Stormo et al., 1997).
Frese, Bettina, Miguel Moya, and Jesús L. Megías. “Social Perception of Rape How Rape Myth Acceptance Modulates the Influence of Situational Factors.” Journal of Interpersonal Violence 19, no. 2 (February 1, 2004): 143–61. doi:10.1177/0886260503260245.
http://www.d.umn.edu/cla/faculty/jhamlin/3925/4925HomeComputer/Rape%20myths/Social%20Perception.pdf

The current research further corroborates the role of rape myths as a factor facilitating sexual aggression. Taken together, our findings suggest that salient ingroup norms may be important determinants of the professed willingness to engage in sexually aggressive behavior. Our studies go beyond quasi-experimental and correlational work that had shown a close relationship between RMA and rape proclivity [RP] as well as our own previous experimental studies, which have shown individuals’ RMA to causally affect RP. They demonstrate that salient information about others’ RMA may cause differences in men’s self-reported proclivity to exert sexual violence.
Frese, Bettina, Miguel Moya, and Jesús L. Megías. “Social Perception of Rape How Rape Myth Acceptance Modulates the Influence of Situational Factors.” Journal of Interpersonal Violence 19, no. 2 (February 1, 2004): 143–61. doi:10.1177/0886260503260245.
http://www.d.umn.edu/cla/faculty/jhamlin/3925/4925HomeComputer/Rape%20myths/Social%20Norms.pdf

Rape myth acceptance and time of initial resistance appeared to be determining factors in the assignment of blame and perception of avoidability of a sexual assault for both men and women. Consistent with the literature, women in this study obtained a lower mean rape myth acceptance score than men. As hypothesized, men and women with low rape myth acceptance attributed significantly less blame to the victim and situation, more blame to the perpetrator, and were less likely to believe the assault could have been avoided. Likewise, when time of initial resistance occurred early in the encounter, men and women attributed significantly less blame to the victim and situation, more blame to the perpetrator, and were less likely to believe the sexual assault could have been avoided.

The hypothesis that traditional gender-role types (masculine and feminine) would be more likely to blame the victim following an acquaintance rape than nontraditional gender-role types (androgynous and undifferentiated) was unsupported.
Kopper, Beverly A. “Gender, Gender Identity, Rape Myth Acceptance, and Time of Initial Resistance on the Perception of Acquaintance Rape Blame and Avoidability.” Sex Roles 34, no. 1–2 (January 1, 1996): 81–93. doi:10.1007/BF01544797.
http://www.researchgate.net/profile/Allison_Aosved/publication/226582617_Co-occurrence_of_Rape_Myth_Acceptance_Sexism_Racism_Homophobia_Ageism_Classism_and_Religious_Intolerance/links/02e7e52bd021d8bc72000000.pdf

Given that callous sexual attitudes permit violence and consider women as passive sexual objects, it follows that for men who endorse these, sexual aggression becomes an appropriate and accepted expression of masculinity. In this sense, using force to obtain intercourse does not become an act of rape, but rather an expression of hypermasculinity, which may be thought of as a desirable disposition in certain subcultures. Taken together, these research findings suggest that an expression of hypermasculinity through callous sexual attitudes may relate to an inclination to endorse a behavioral description (i.e., using force to hold an individual down) versus referring to a sexually aggressive act as rape. Hence, we hypothesize that the construct of callous sexual attitudes will be found at the highest levels in those men who endorse intentions to force a woman to sexual acts but deny intentions to rape.
Edwards, Sarah R., Kathryn A. Bradshaw, and Verlin B. Hinsz. “Denying Rape but Endorsing Forceful Intercourse: Exploring Differences among Responders.” Violence and Gender 1, no. 4 (2014): 188–93.

The majority of participants were classified as either sexually coercive (51.4%) or sexually aggressive (19.7%) based on the most severe form of sexual perpetration self-reported on the SEQ or indicated in criminal history information obtained from institutional files. Approximately one third (33.5%) of coercers and three fourths (76%) of aggressors endorsed the use of two or more tactics for obtaining unwanted sexual contact on the SEQ. Although 63.4% of sexually aggressive men were classified based on their self-reported behavior on the SEQ alone, another 31% were classified on the basis of criminal history information indicating a prior sexual offense conviction involving an adult female, or on the agreement of both sources (5.6%). Notably, 90.1% of sexually aggressive men also reported engaging in lower level sexually coercive behaviors.
DeGue, S., D. DiLillo, and M. Scalora. “Are All Perpetrators Alike? Comparing Risk Factors for Sexual Coercion and Aggression.” Sexual Abuse: A Journal of Research and Treatment 22, no. 4 (December 1, 2010): 402–26. doi:10.1177/1079063210372140.

The tactics category reported most frequently was sexual arousal, with 65% of all participants being subjected to at least one experience. Within this category, persistent kissing and touching was the most cited tactic (62% of all participants). Emotional manipulation and deception was the next most frequently reported category, with 60% of participants being subjected to at least one experience. Within this category, participants cited the specific tactics of repeated requests (54%) and telling lies (34%) most often. Intoxication was the third most frequently reported category, with 38% of all participants being subjected to at least one tactic. More participants reported being taken advantage of while already intoxicated (37%) than being purposely intoxicated (19%). The category with the lowest frequency of reports was physical force and harm, with 28% of participants being subjected to at least one tactic.
Struckman-Johnson, Cindy, David Struckman-Johnson, and Peter B. Anderson. “Tactics of Sexual Coercion: When Men and Women Won’t Take No for an Answer.” The Journal of Sex Research 40, no. 1 (February 1, 2003): 76–86.

HJH 2015-02-08: Bolded comment policy, to increase the chance of it being read.
HJH 2015-10-31: Added a few more papers, relating to sexual coercion and hostility.