
MAGIC in reporting statistical results
Notes on Robert Abelson’s Statistics as Principled Argument

MAGIC is an acronym for a set of principles in reporting and evaluating results.

Magnitude
Refers to the size of the effect and its practical importance. Reporting effect sizes is something we should do routinely, in addition to (or maybe even instead of?) reporting statistical significance. A variety of effect sizes are available, and many can be converted from one to another (which helps a lot with meta-analysis).
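As a concrete illustration of converting between effect size metrics, here is a minimal sketch (function names are mine) of the standard Cohen's d ↔ correlation r conversion, assuming equal group sizes:

```python
import math

def d_to_r(d):
    """Convert Cohen's d to a point-biserial r, assuming equal group sizes."""
    return d / math.sqrt(d**2 + 4)

def r_to_d(r):
    """Convert r back to Cohen's d (inverse of the formula above)."""
    return 2 * r / math.sqrt(1 - r**2)

# A "medium" d of 0.5 corresponds to roughly r = .24
print(round(d_to_r(0.5), 2))             # 0.24
print(round(r_to_d(d_to_r(0.5)), 2))     # 0.5 (round trip)
```

With unequal group sizes the conversion needs a correction term, but the equal-n formula is the one most often used in meta-analytic work.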

Relying on the significance level alone can be dangerous, because the p value depends both on the degree of departure from the null hypothesis and on the sample size. A very large sample can almost always yield 'significant' results. Besides probabilities (p values) and effect sizes, a third criterion for assessing magnitude is confidence limits.
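A quick sketch makes the point about sample size and confidence limits concrete. The code below (my own illustration, using the standard Fisher z approximation for a correlation's confidence interval) shows the same tiny effect, r = .05, being "nonsignificant" at n = 100 but "significant" at n = 100,000 -- the interval shrinks until it excludes zero, while the effect stays just as small:

```python
import math

def r_confidence_interval(r, n, z_crit=1.96):
    """95% CI for a correlation via the Fisher z transform (an approximation)."""
    zr = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(zr - z_crit * se), math.tanh(zr + z_crit * se)

# Same effect size, very different "significance":
for n in (100, 100_000):
    lo, hi = r_confidence_interval(0.05, n)
    print(f"n={n}: 95% CI = ({lo:.3f}, {hi:.3f})")
# n=100 includes zero; n=100,000 does not -- yet r = .05 either way.
```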

Cohen's rubric for small, medium, and large effect sizes is a start, but there also are alternate effect sizes worth considering. Evidence-Based Medicine developed the Number Needed to Treat (NNT), Number Needed to Harm (NNH), and similar effect sizes for categorical outcomes. Rosenthal’s Binomial Effect Size Display (BESD) and Cohen’s non-overlap measures are other examples of ways of re-expressing effect sizes.
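To make two of these re-expressions concrete, here is a minimal sketch (function names and example rates are my own) of NNT and Rosenthal's BESD:

```python
def nnt(control_event_rate, treated_event_rate):
    """Number Needed to Treat: patients treated per one bad outcome prevented."""
    arr = control_event_rate - treated_event_rate  # absolute risk reduction
    return 1 / arr

def besd(r):
    """Rosenthal's Binomial Effect Size Display: re-express r as two
    hypothetical 'success rates' centered on 50%."""
    return 0.50 + r / 2, 0.50 - r / 2

# Hypothetical rates: 10% bad outcomes untreated vs. 8% treated -> NNT of 50
print(round(nnt(0.10, 0.08)))   # 50
# An r of .30 displays as a 65% vs. 35% success rate
print(besd(0.30))
```

The BESD is deliberately simple: it re-frames an abstract correlation as the difference in "success" between two groups, which is often easier for a lay audience to grasp.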

The other key point is practical significance. If the outcome is life or death, then even small effects have clinical relevance. The protective effect of low-dose aspirin against heart attack is a frequently cited example: expressed as a point-biserial correlation, r is only about .03 (an r² < .001), but it is statistically significant, and extrapolated across millions of people it translates into hundreds or thousands of lives saved.

Articulation
How detailed are the analyses? Is there sufficient supporting detail to contextualize the findings? Did the investigator look for possible interactions (e.g., with sex, or within other subgroups)? A good degree of articulation adds depth and nuance without undermining the narrative. The opposite would be “slicing the salami” thinly and publishing a series of papers that each present a narrow subset of analyses or subgroups -- the “least publishable unit.”

Articulation of research is related to two concepts: “ticks,” detailed statements of distinct research results, usually stemming from the rejection of a null hypothesis (p < .05); and “buts,” statements that qualify or constrain ticks.


 * For example, a statement combining a significant main effect with an interaction: “There is a significant main effect of X (a tick), but its magnitude is significantly greater for group A than for group B (a but).”

Generalizability
Are the results likely to replicate? Do they apply to a general population? Or are they likely to be limited to a particular sample? A common critique of a lot of psychology research (especially social psychology and personality work) is that it has been the study of undergraduate psychology majors. Ouch.

When the researchers go to the effort of getting a sample that is more representative of the population that they want to make inferences about, that enhances generalizability (here also sometimes referred to as "external validity"). If one wants to talk about how personality relates to mood disorders, which sample would be more representative? College undergraduates taking a psychology course, or people coming to a clinic seeking counseling and therapy?

Generalizability also improves when researchers do "replications and extensions" in samples that differ in some way from those used in previous work. For example, if most projects have been done in the United States, then doing a similar study (the "replication") in a different country (the "extension") would be a way of testing generalizability.

Generalization from a single study can be rather weak, but other study designs can increase generalizability. For example, meta-analysis is a way to generalize results across studies.

Interest
Is the topic intrinsically interesting? Is there a practical application? Do the authors do a good job of engaging the reader and conveying the value of the work? Randy Olson’s ABT (“And, But, Therefore”) narrative template is one technique for clarifying the message. Abelson’s idea of “interest” also focuses on the practical importance of the finding.

When research is "interesting" it can generate conversations, which can lead to further research and greater discoveries. Many aspects go into creating interesting research, including theoretical interest, surprisingness, and importance. Statistics is not generally thought of as interesting, but with the right research hypothesis and argument, an important story can be conveyed.

One way to make research more surprising, and therefore more interesting, is to focus research efforts on neglected topics or on results that go against our intuitions or against what seems logical. Abelson gives the example of the Milgram obedience experiments.

Another aspect of interest, as outlined in Chapter 8 of Abelson, is importance, often gauged by the number of connections a topic of study has to other relevant issues. The “illusion of importance” demonstrates that importance is subjective: scholars of a specific topic may believe its importance is large, while a nonspecialist in the same subject area may see less importance, because the knowledge gained does not change their understanding of other topics. Subjective importance is warranted for those who know an awful lot about a specific subject area!

The key question when assessing the importance of a result is, "What can I learn from this about other things that are also important?"

Credibility
Does the work look trustworthy? Or is it “fishy”? Abelson describes a continuum from conservative to liberal approaches to analysis. The conservative extreme would be to write down the hypothesis and the analytic plan ahead of time (a priori), test only one primary outcome (or use a conservative adjustment for multiple testing, such as a Bonferroni correction or an alpha < .01), and not run or report any additional analyses. The pre-registration movement (e.g., ClinicalTrials.gov, and the replication projects at OSF) exemplifies the conservative approach.
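The Bonferroni correction mentioned above is simple enough to sketch directly (function names are mine). It divides the alpha across the number of tests, or equivalently multiplies each p value by the number of tests, to control the family-wise error rate:

```python
def bonferroni_alpha(alpha, n_tests):
    """Adjusted per-test significance threshold for n_tests comparisons."""
    return alpha / n_tests

def bonferroni_adjust(p_values):
    """Equivalently: inflate each p value by the number of tests (capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Ten outcomes tested at an overall alpha of .05 -> each must clear p < .005
print(bonferroni_alpha(0.05, 10))
# p = .03 survives alone, but not once we admit it was one of three tests
print(bonferroni_adjust([0.004, 0.03, 0.20]))
```

The correction is conservative (it assumes nothing about dependence among the tests), which is exactly why Abelson places it at the conservative end of the continuum.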

The liberal approach is to do lots of exploration, sensitivity and subgroup analyses, and report all of it. If framed as “exploratory analysis” or the “context of discovery,” that can be okay. Machine learning algorithms and “data mining” usually take this more liberal approach.

Where it gets sketchy is when people run lots of analyses (liberal methodology) but report them as if they were conservative analyses. Reporting only the most significant results as if they were the a priori hypotheses, or failing to disclose how many analyses were run, is when “liberal” turns into p-hacking.
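The arithmetic behind why undisclosed multiple testing is so misleading is worth spelling out. Assuming independent tests, the chance of at least one false positive across m tests at level alpha is 1 - (1 - alpha)^m; the sketch below (my own illustration) shows how fast that grows:

```python
def familywise_error(alpha, n_tests):
    """Probability of at least one false positive across n_tests
    independent tests, each run at the given alpha, when every null is true."""
    return 1 - (1 - alpha) ** n_tests

# Twenty undisclosed "exploratory" tests at p < .05:
print(round(familywise_error(0.05, 20), 2))   # 0.64
# Nearly a two-in-three chance of a spurious "significant" result to report.
```

So a single reported p < .05, silently selected from twenty analyses, carries almost no evidential weight -- which is exactly the misrepresentation described above.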

Some reflections
It is probably not possible to score high on all of the MAGIC facets with a single study. There are limitations to any study, whether sample size, representativeness of the sample, quality of measurement, or adequacy of the analyses and presentation, among many other possibilities. Many of the principles are in tension with each other. More rigorous methodology drives up the cost per participant and usually leads to a smaller sample size, whereas very large samples often must use less rigorous methodology out of necessity. This can be expressed as a trade-off between the reliability and validity of measurement versus sample size, both of which affect the precision of estimates and thus the statistical power to detect effects. There also is a risk of sacrificing credibility and conservative approaches to analysis in a rush to hype the "interest" of findings.

Abelson was writing before the replication crisis, and before OSF existed. The possibility of sharing code, sharing data, and publishing the analyses makes it possible to adopt a more liberal approach to analysis, yet do it responsibly, with transparency and the opportunity for others to work through the analysis and see whether they would arrive at similar conclusions.

Machine learning methods are another development in the field, where the computer automates running a wide range of models, and then looks at both the fit (described as accuracy of prediction, or reducing “bias” in prediction) and the stability of the model across resamplings (often described as the variance of the estimates across k-fold cross-validation) as a way of trying to balance “discovery” with reproducibility.
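The fit-versus-stability idea can be sketched in a few lines. The toy example below (my own illustration, deliberately using a trivial mean-predictor rather than any real learning algorithm) splits data into k folds, scores each held-out fold, and then summarizes both the average error (fit) and its spread across folds (stability):

```python
import random
import statistics

def kfold_indices(n, k, seed=0):
    """Shuffle indices reproducibly and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_scores(y, k=5):
    """Mean squared error of a trivial mean-predictor on each held-out fold."""
    scores = []
    for fold in kfold_indices(len(y), k):
        holdout = set(fold)
        train_mean = statistics.fmean(y[i] for i in range(len(y)) if i not in holdout)
        scores.append(statistics.fmean((y[i] - train_mean) ** 2 for i in fold))
    return scores

rng = random.Random(1)
y = [rng.gauss(0, 1) for _ in range(100)]
scores = cv_scores(y)
# Fit = average error across folds; stability = spread of that error
print(f"mean error: {statistics.fmean(scores):.2f}, spread: {statistics.stdev(scores):.2f}")
```

A model with low average error but high spread across folds is "discovering" patterns that do not reproduce, which is precisely the failure mode cross-validation is meant to expose.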