New PNAS paper: Income is a poor way to improve well-being

I have a hobby. I like citing papers that provide strong evidence that is directly in opposition to what the authors claim. Lucky me, because PNAS just published a paper that fits snuggly into this category. It’s titled “Experienced well-being rises with income, even above $75,000 per year”. An even better title would have been “Income is a poor gauge for well-being; don’t bother”.

This is the central figure:

Whoa! The paper doesn’t fail to mention (multiple times) how this is virtually a perfect linear relationship between well-being and log(income). The authors conclude that people with higher income are more satisfied with life and experience more well-being!

The real-life version of this graph

Let’s see what this graph looks like with a linear x-axis and with simulated raw data (see code below). The black dots are simulated individual participants from the study, assuming that statistical assumptions are met. Their exact distribution along the x-axis is not known (the author did not publish raw data), but that doesn’t matter. It only matters how the dots are dispersed for a given income.

You are one of the black dots. You could be one of the other black dots. But income does virtually nothing to lower or improve people’s well-being or life satisfaction. Income explained 1.5% and 4% of the variance in Well-Being and Life Satisfaction respectively (see code below).

Income explained 1.5% and 4% of the variance in Well-Being and Life Satisfaction respectively

Extreme income rise = extreme happiness?

Let’s study the most extreme case: you have an annual income of just $15,000. But now you get to jump into the life of a rich person all the way to the other side of the (linear) graph with an annual income of $480,000. The result of this 32-fold increase in income is expected to be:

  • Life Satisfaction: 62% used to rate higher than you. Now only 35% does! But there’s a 24% chance that you will be less happy than before.
  • Experienced Well-Being: 58% used to rate higher than you. Now only 40% does! But there’s a 33% chance that you will be less happy than before.

Hmm, hardly the extreme happiness rush that one might expect would expect from reading the paper and the ensuing media coverage. It’s a vanishingly small effect, considering the extremeness of this example.


I think it’s fairer to say that we’ve now learned that this paper is solid evidence that income is a really poor gauge for happiness. Someone once wrote a song about this:

money can't buy me love the beatles gif


Computing quantiles and probability of deterioration is simple. The z-score is your score relative to the whole population in units of standard deviations. So just the results for Life Satisfaction were obtained by reading off the extremes of the graph and computing the associated quantiles in R: pnorm(-0.3) and pnorm(0.4). Probability of negative income change is then just pnorm(-0.3, 0.4).

The plots are fairly straightforward too. First, let’s simulate the raw data:

# Data
N = 33391  # Number of participants
incomes = exp(seq(log(15000), log(480000), length.out = N))  # x-axis of Figure 1
trend_wb = seq(-0.2, 0.25, length.out = N)  # Linear, read off Figure 1
trend_ls = seq(-0.3, 0.4, length.out = N)   # Linear, read off Figure 1
z_wb = rnorm(N, trend_wb)  # z-score is standard normal
z_ls = rnorm(N, trend_ls)  # z-score is standard normal

# Merge the data in long format
df = data.frame(
  z = c(rnorm(N, z_wb), rnorm(N, z_ls)),
  outcome = c(rep("Experienced Well-Being", N), rep("Life Satisfaction", N))

Now let’s plot it:

ggplot(df, aes(x = incomes, y = z)) +
  geom_point(alpha = 0.1, size = 0.3) + 
  geom_line(aes(y = c(trend_wb, trend_ls), color = outcome), lwd = 2) + 
  geom_vline(xintercept = 75000, lty = 2, lwd = 1, color = "#555555") + 
  facet_wrap(~outcome) + 
  #scale_x_continuous(trans = "log", breaks = c(15, 30, 60, 120, 240, 480) * 1000) + 
  scale_x_continuous(breaks = c(15, 30, 60, 120, 240, 480) * 1000, labels = scales::label_number_auto()) + 
  labs(x = "Houshold income", y = "z-scored well-being") + 
  theme_bw(13) + 
  theme(axis.text.x = element_text(angle = 90))

If you want to verify other results, you can fit the linear models to the (simulated) data and see that we can approximately replicate the r-values reported in the paper (r = 0.17 and r = 0.09) and that this correspond to 4% and 1.5% variance explained, respectively:

summary(lm(z_ls ~ log(incomes)))
summary(lm(z_wb ~ log(incomes)))

I leave it as an exercise to the reader to filter out nonlow_incomes = incomes[incomes > 30000] and assess the predictive performance (lm(z_ls ~ log(nonlow_incomes)).

mcp: regression with multiple change points

I’ve made an R package, and I’m happy to report that it is likely to arrive on CRAN before Christmas! The last CRAN review asked me to change a comma, so the outlook is good.

mcp has grown quite ambitious and I think it now qualifies as a general-purpose package to analyze change points, superseding the other change point packages in most aspects. You can read a lot more about mcp in the extensive documentation at the mcp website.

This Twitter thread is a more bite-sized introduction:

Followed by this update:


mcp is the consequence of a long sequence of events:

  1. Looking at Complex Span data from a large project I did with researchers from Aarhus University, I saw strong performance discontinuities at a single-person level.
  2. I then learned that performance discontinuities play an important role in the estimation of (human) working memory capacity.
  3. I then learned that people mostly use ad-hoc methods to extract such points, including eye-balling graphs. That made me nervous.
  4. I then remembered a very small example from the book Bayesian Cognitive Modeling on inferring change points. I remember it because I was puzzled that Gibbs samplers could sample such points effectively.
  5. I then applied this to a Poisson model of our data and it worked beautifully. It outperformed all other models of this data.
  6. Around here was the first time I looked for other R packages to identify change points. There are a lot of them, but none could handle the hierarchical model I needed and most of them yielded point estimates rather than posteriors.
  7. As I began writing a paper, decided to apply the model to subitizing as well. However, while subitizing accuracy has a single change point, subitizing reaction times seem to have two or three.
  8. I implemented a three-change-points model, and immediately saw how easy it would be to extend it to N change points. That took around a day. That was the time of the first tweet above.
  9. Without much forethought, I took a brief look into R packages, pkgdown, etc., and before I knew it, it became a package.
  10. I began to see a roadmap for the package around the fall holiday, and spent most evenings there re-factoring so that the unpacking of the formulas was uncoupled from the generation of the JAGS code.
  11. The new architecture was incredibly easy to extend and then things went really fast. In particular, I looked a lot to brms for inspiration on the API. I wanted to make sure that the design could accommodate virtually endless oportunities.
  12. Here we are!

Why “mcp”?

mcp was the first name I came up with. I like it because it means multiple things to me:

  • Multiple Change Points as in “more than one change point”
  • Multiple Change Points as in “multiple kinds of models and change points”.
  • MC for “Markov Chains” to hint at the underlying MCMC sampling. We could give the package the pet name “Markov Chainge Points”.

I think I would have renamed it to “rcp” or “cpr” (“Regression with change points”) if they weren’t taken. “changepoint” is also a great name that’s already taken.

Can you instruct for far transfer? An RCT on N-back variants.

By Sebastian Bergmann Tillner Jensen, Malene Klarborg Holst, Tine Randrup-Thomsen, Camilla Erfurt Brøchner, & Jonas Kristoffer Lindeløv (supervisor). Project conducted as part of the Psychology bachelor at Aalborg University, Denmark.

Can we frame learning situations in such a way that people actually improve their abilities outside the classroom? We believe that the answer could be yes. Following newer school trends in adopting visible learning, as proposed by Professor of Education John Hattie, we believe that an instruction which motivates the pupil to think outside the box and evaluate the general applicability of their skills might be the key in improving learning. But how do we grab this concept from rich real-life situations and move it into the lab?

Spoiler alert: it involves torturing 52 participants with a quad N back task. The experiment, data, and analysis script is here. Read on…

Where do these crazy thoughts come from?

The holy grail in all of teaching is the transfer of learning, i.e., using a narrow teaching material to achieve generalized effects. Transfer of learning was first introduced by Edward Thorndike, Father of Educational Psychology, in 1901 and after many years of research, it is still a hotly debated topic to what extent humans (and other animals) actually can transfer learning. 

In 2008 Jaeggi and colleagues published an article in one extreme of this debate, suggesting the ultimate far transfer: training an N-back task led to large improvements in fluid intelligence. Not surprisingly, this attracted a lot of attention to the N-back task as a holy grail of transfer research – at least in the four years until several large RCTs failed to replicate the findings (Redick et al., 2013, Chooi & Thompson, 2012; see also recent meta-analyses by Melby-Lervåg, Redick & Hulme, 2016 and Soveri et al., 2017).

Put shortly, during the N-back task the participant is presented with a sequence of stimuli, and the task is to decide for each stimulus whether it is the same as the one presented N trials earlier. Try it out yourself:

Entering the lab

Using different versions of the N-back in a pretest-training-posttest design, we created a setup, which allowed us to utilize improvement as a measure of transfer. We split our participants into two identical groups – except we gave them different instructions before the training session: One group received an instruction, based on visible learning, to use the training to improve on the transfer tasks; and the other was directly instructed to improve as much as possible during the training. 

Before revealing the shocking nature of the quad N-back, we first need to dig deeper into the N-back…

Are all N-backs the same?

If you have ever stumbled upon the N-back before, it has probably been used as a measure of working memory (WM). But does the N-back live up to the task of encapsulating the rich nature of WM? We say no… 

Strong evidence points towards the fact that N-back does not transfer to other WM tasks (i.e. complex span, see Redick & Lindsey, 2013) since it primarily taps into updating and interference control.

But wait! There is more… Evidence points towards the notion that the N-back does not always transfer to other versions of itself!

This image has an empty alt attribute; its file name is ESQOVojCSHjbkKAp62Pwr15jmYcFZ1_aceZmmyvBqZX6toXuGFb5xtOjiWvyn5XMpz5SMgbA-CYYG4v8EA0rRkwQNLAIPpKg06JbpkcxOjQO9JA4UI-u7bugLW9E8XzyT8Pj9QG0
Yep. That was pretty much our reaction…

Research on the holy grail of learning is primarily focusing on 2-backs or higher; meaning that updating and mental reallocation of stimuli is a given. But what if we reduce the amount of N to 1? (see Chen, Mitra & Schlaghecken, 2008).

Creating an N-back transfer spectrum

And now what we have all been waiting for… *drumroll*… The quad N-back! Assuming that transfer varies between different versions of the N-back we can change the number of stimuli used, the type of stimuli used as well as N to create a transfer spectrum ranging from near to far relative to the training task. Disclaimer: This has been done relatively blind-sighted, since we found little literature to lean on, but here we go anyways!

Our shot at a transfer spectrum is visualized on the table below, in which the “farness” is based on the number of stimuli similar to the training task – a dual 2-back with position and audio. Pre- and posttest consisted of five different N-backs as can be seen in the table below.

This image has an empty alt attribute; its file name is nback_transfer_spectrum.png
A visualization of our attempted transfer spectrum. GREEN indicates stimuli identical to the training task. RED indicates new stimuli.

Now you may wonder what a quad 1-back looks like. Lo and behold the actually-not-so-terrifying abomination, which is the quad 1-back. Press the four arrows when the feature is identical to the previous stimuli:

This image has an empty alt attribute; its file name is UzYwmnZ_SKOs3uDetvpL5JnLu4_8V4wl1nAc0MJGIdI74Ee2sZFxkkrA8RENPpPwkDCzcbS0EVokiZH7rKUvZdBS0E-i9jF9P_pJKZW2ROv9R5LUzh2p0Y7w-tAEqu6_jk_kBibY

If you are eager to torture yourself with our 1-hour test, you can download our script uploaded here and run it in PsychoPy.

FUN FACT: Our participants actually performed better on the quad 1-back tasks than on the dual 2-back tasks, indicating that we in the future quite ethically can expose our participants to even more stimuli *maniacal laughter*.

If you’re interested in knowing what our data showed, feel free to read the results-section below, and if you’re really, really interested, you can download our full report (in Danish – Google translate is your friend).


We used the BayesFactor R package with default priors. Using default priors is generally not advisable, but this is a student project with a tight timeframe!

As expected, the improvement was quite small during the 10 minutes of training (Δd’ = 0.30 and 0.31 respectively, joint 95% CI [0.05, 0.59]) as revealed by linear regression on the time x instruction interaction. Surprisingly (to us at least), the evidence favored the model that participants who were instructed to focus on the transfer tasks improved just as much as participants who focused on improving on the training (BF01 = 5.1).

Participants who were instructed to focus on transfer had numerically superior improvements in the “nearer” part of the transfer spectrum, but the instruction x task x time interaction in an RM-ANOVA favored the model that the transfer instruction did not cause a superior (different) transfer effect (BF01 = 23.3).

This image has an empty alt attribute; its file name is _iziROhzq6-UfPEmbgCkPpV-FvOpqYxvYoeUQhWLl-uS-Ba3fOx1W5EWVeGPn-safCLCBvsIW1ZwZkvzF2Shg5rTLuPm8_dRZndblW5YPhpkCEnPwBD3cMs_bR4VP3XtkJUlFl8l
Individual and mean d’ over time during training. There were 250 training trials in total, 50 in each bin on the x-axis.
This image has an empty alt attribute; its file name is I5lBc5dspB-X7k34TjVETniImbm1n6hndLrH9lwm6zaCBEFdzPkEE_eK1zh22XszGjI6B2lhPRMojEJIt2NDtkbDlC3alXdrU4yDcQ7UOLwV40evedOTupsYxuRWwYmrDd9FBExD
Distribution of d-prime scores at pretest and posttest. d’ was greater for 1-back than for 2-back.


We used the same test order in pre- and posttest for all participants, likely imposing order effects (which is basically transfer occurring in between tasks in the pretest). As a solution for future studies we suggest to counterbalance the order of task 1 to 5 in pre- and posttest. Another solution could be to create a baseline score in practice-tasks before continuing to the pretest.

Secondly, our transfer spectrum is not self-evidently correctly ordered. We did not take the difference between 1- and 2-backs into consideration when creating the transfer spectrum, as to why we propose future research to choose between the two. One could focus on only the amount of stimuli instead of both the number of stimuli and N-backs.

Thirdly, we did not have enough participants to establish the desired power level. Further, the training session was probably too short. The obvious solution is to collect more participants and prolong the training.

New guide to reaction time analysis

TL;DR: I just published “Reaction Time Distributions – An Interactive Overview“.

There is a big literature on the analysis of Reaction Time data. Everybody can see that reaction times are not normally distributed but there is little consensus about how they are distributed. This has resulted in many advanced mathematical arguments why one or the other distribution is better. Others propose obscure rules-of-thumb ways of deleting or transforming data so that it looks normally distributed.

Meanwhile, the average researcher is left bewildered, often resorting to some variant of the normal distribution which they know and love from other kinds of data. This is worse than any of the alternatives as would be obvious to anyone who did a proper assessment of their model fit, e.g., using a QQ-plot, a Shapiro-Wilk test, or a posterior predictive check.

I think that we need a guide so accessible that it lowers the bar just enough that people jump aboard trying different RT distributions. Long story short, I ended up writing an overview and presented it in a twitter thread:

The easy part: making it.

I myself have been bewildered about analyzing reaction times. It was only when I discovered the flexible distributional regression in brms, that I took a few alternative distributions for a spin on a dataset I was working on. The superior fit immediately did away with all the problems I had been hacking myself out of.

  • The model fits faster and more reliably.
  • You need not transform the data and you need less outlier removal.
  • The model predictions actually look like the data you are modeling.

Naturally, the fit was superior to a Gaussian. A leave-one-out Cross-Validation found that the log-predictive error was halved! Here is a figure from an upcoming paper:

A few weeks later, I spent an evening checking out `shiny` to make interactive plots. Two evenings later, I had written most of the guide/overview. It was so easy. ` shiny ` made interactivity a breeze and ` brms` (and ` rtdists`) had PDFs for most of the distributions. It was just a matter of connecting the two. Another evening for the cheat sheet, and I was ready to put it online (this was the hard part; see below).

Warning: it’s also an opinion.

As the interactivity allowed me to get intuitions about the distributions myself, I grew quite opinionated because some distributions were really hard to get intuitions about.

Image result for warning opinion

Take for example the very popular ex-gaussian. The added exponential decay completely changes the mean and the standard deviation so that, e.g., μ = 0.4 secs and σ = 0.2 secs does not refer to anything identifiable in the distribution. Furthermore, the exponential parameter (λ) is very hard to understand, and in most analyses, I’ve seen it’s discounted. This means that they essentially say “all long RTs are not really RTs but something else” which I find odd. μ systematically underestimates RTs.

We want one parameter to capture most of what RTs are about, and I ended up going with the term “Difficulty” for that parameter cf. Wagenmakers & Brown (2007). I explain why in the overview, but briefly, it allows you to do univariate regression everybody does anyway. I haven’t seen this argument put forward elsewhere so I guess it counts as an opinion.

My favorites generic distributions are the shifted log-normal and the Decision Diffusion model as is also clear from the cheat sheet. Of course, people should choose on a case-by-case basis.

The hard part: hosting it.

The hard part was getting this to run on the web, which required around 6 evenings and a lot of debugging. is very convenient for RStudio users, but the server shuts down after a document has been in use for 25 hours in a month. I choose to set up a shiny server using Google Cloud because they offered a year’s worth of free computing and pretty low charges after that. I followed this guide.

However, I encountered heaps of issues which I’ve explained in the document’s GitHub repo. The current solution is good enough, but some issues remain.

New tutorial on optimal decisions using Utility Theory

TL;DR, here is the tutorial!

There is a wave towards using data more and more in decision making across all levels of business and society. However, people often use data quite informally: look at an Excel sheet or a graph, then make a decision based on your impression. This often works well, but it can be fragile because of our many deep-skin biases as well as a general poor ability to reason about quantities and complex interactions.

Decision theory to the rescue! By adding a few axioms to the basic axioms of probility theory, we can extend statistical modeling to make decisions which maximize utility – whether that utility is happiness, profit, health, public support, or something else entirely. I’m not saying that we should blindly hand over decisions to algorithms, but seeing their limited-worldview pure-quantitative solution can be a nice decision support to keep some of our biases at bay.

I was motivated to write this tutorial to fill in a gap: there is a lack of practical entry-level guides that scale well to complex problems. The entry-level here is someone who just want to update an Excel sheet with new data, and see how that changes the decision. Here’s the accompanying twitter thread:

I made a first draft of this for an elective course in the fall. Then a second draft for my presentation at the Bayes@Lund 2019 conference. And now I finally got to brush it off. OK, I become overly excited when I get to talk about Bayesian inference AND utility at the same time:

Jonas Kristoffer Lindeløv: extending bayes to make optimal decisions

Do We All Have “Impaired” Awareness of Our Abilities?

This is my poster for Neuroscience Day 2019. It is quite provocative, and there are nuances to this story:

  • This is likely a sort of Simpson’s Paradox in reverse, where there is little sensitivity to objective performance within groups (patients vs. healthy), but some sensitivity between groups.
  • I do not dispute that subjective reports reflect real subjective experiences. As such, measures on Quality of Life, emotional distress, etc. are not to be disregarded. But care should be taken to generalize from, e.g., reports of emotional distress to impacts on real abilities.

I do have a very nice dataset coming up from 124 respondents, where we improved substantially on the methodology, e.g., by asking participants to rate their performance in percentiles rather than on an ordinal scale. I plan to merge all of this in a paper.

I used a poster template and design idea which you can read more about in this Twitter thread. This was very much a last-minute panic. In particular, I’d liked to have worked more on the “ammo bar” to the right, but you only have so much time!

Don’t hesitate to contact me and let me know what you think:

Click to download PDF.

Scanner radiation caused 1% of flight-related cancers.

After extended public anxiety about cancer risks associated with back-scatter scanners, EU and U.S. banned them in 2012 and 2013 respectively. But how many people actually developed cancer from these scans before they were banned?

I have yet to find articles that estimate the world-wide mortality using consensus numbers. Most just state that the risk is “negligible” or “truly trivial”. That vague language is not comforting to a pedantic like me, so let’s look at the actual numbers. See the end of this post for a full list of sources and informative infographics.

Risk per scan: 0.3 millionth of a percent

The risk that an individual develops cancer when exposed to 1 microsievert of radiation is around 0.0000041 % for adults. The risk increases approximately linearly with radiation, so four scans quadruples the chance – not more and not less. The risk is only slightly higher (0.0000057 %) when including children, elderly, and heritable effects.

A typical back-scatter scan exposes you to 0.07 microsieverts of radiation (Multiple sources: see end of post.)

Multiply these two numbers and you end up with an elevated risk-per-scan of 0.07 * 0.0000041 % = 0.00000029 %.

Risk due to other radiation sources on your flight

The effect of a single scan exposed passengers to about as much radiation as being outside on the ground for 10 minutes or inside a flying airplane for 1.5 minutes (around 0.07 microsievert). This is due background radiation (most importantly cosmic radiation, i.e., the bombardment of particles from outer space, which decays into X-rays and other stuff when colliding with our atmosphere).

I leave it as an exercise to the reader to figure out whether scanners or background radiation constitute the largest risk factor for cancer on typical flights.

Scenario: camping in a scanner

If you want to increase the chance of developing cancer at some point in your life by 1% (one in a hundred), you would have to find one of the old back-scatter scanners and camp in it for four months straight, 24 hours a day. Every time you leave for the loo or to stretch yourself, you’d have to go back in and stay longer to compensate. And you’d have to have done it before 2013, because they are really hard to get now. Good luck on your adventure.

Image result for MR scanner sleep
Your home for the next four months.

What if we multiply by a few billion?

Even minuscule effects on individuals add up when they apply to a large number of people. In 2012, there were roughly three billion passengers who spend on average two hours per flight, totaling six billion hours in-flight.

Image result for gif passengers plane
Three billion passengers.

If all of them went through one back-scatter scan per flight, the result is that:

In 2012, 8.3 passengers developed cancer because of airport scanner radiation. 664 passengers developed cancer because of the background radiation in-flight but 95 of these would have developed cancer for the same reason anyway, had they stayed on the ground for the same duration. An additional 601 passengers developed cancer while commuting due to “normal” non-flight-related ageing and risk factors.

Sources: see end of post.

That is, scanners, background radiation, and ageing caused 767 cancers of which scanners make up one percent. The numbers above contains some estimates and could be off by a factor of two, i.e., between 0.5 and 2% of flight-related cancers were due to the scanners.

Risk of cancer after back-scatters were banned

In 2017 there were roughly 4 billion passengers, so while modern scanners should cause virtually zero cancers, 918 passengers developed cancer because of background radiation in-air and 832 simply due to regular ageing.

In comparison, 7.28 million of these 2017-passengers would develop cancer in the same year anyway, regardless of whether they flew or not. Flying only added one in 8,000 to that number.

Morale: statistical illiteracy and a note on war and terror

If you care about cancer, please do not waste your time thinking about airport scanners. Neither the old nor the new. If you felt scared, it is not your fault. We humans are notoriously poor at dealing with risk for rare events. We are statistically illiterate. I am too. There are just so many ways we fail that it is hard to count them. But take a look at the availability heuristic, loss aversion, and base-rate fallacy, as a way to get started.

Image result for i am stupid gif

You know about another rare event? Death by terrorism. In the sea of all sources of human suffering, terrorism makes up but a minuscule fraction of almost any other cause (guns, flu, traffic, etc.). The amount of money and time spent on terror prevention is truly staggering in comparison.

It costs around a million dollars per year to have a US or European soldier in war. It currently cost around 700 dollars on average to save a children’s life through the Deworm the World initiative. 1.400 children’s lives each year or one soldier at war?

Appendix 1: Where I got the numbers from

The risk of cancer onset (not necessarily death!) is around 0.0000041% per microsievert (μSv) for adults. It is around 5.7 * 10-6 % per uSv when including children, elderly, and heritable effects. This according to the a 2007-report (see table 1) by the International Commision on Radiological Protection which almost everyone cites in academia. Read more about radiation-induced cancer on Wikipedia.

Backscatter scanner dose is in the order of 0.07 μSv per scan:

This is about the same exposure as when sleeping next to someone for a night or eating a banana.

Background radiation on earth’s surface is in the order of 0.4 μSv per hour:

Background radiation while in flight is in the order of 2.8 μSv per hour, i.e. seven times that on earth’s surface:

  • 2.40 μSv per hour (0.04 μSv per minute) according to multiple sources cited in Mehta (2011).
  • 2.94  μSv per hour according to Enyinna (2016).
  • Many other sources whith full-text behind paywalls put this in the order of 3 μSv

There were approximately 3 billion passengers in 2013 and 4 billion passengers in 2017:

Each flight is around 2.0 hours in recent years (since 2005 at least) when reading off the chart on page 13 of this Boing-report.

There are 0.182 % chance of being diagnosed with cancer each year according to World Cancer Research Fund National. This translates into 0.0000208% chance every hour.

Appendix 2: Quick ways to learn more

History and future of R formula syntax

The R formula syntax is wonderfully condensed yet instructive. Python has basically given up coming up with its own syntax and now just use the `patsy` module to use R syntax in Python.

However, this particular syntax has no name. During twitter interactions the last few days, people have suggested “symbolic model notation”, “abridged model notation”, “Wilkinson notation”, and a few others. I think none of them did a good job of delineating this exact short notation, so I looked into the historical origins and posted this Twitter thread (click to read it all):

A simpler way of understanding (and teaching!) basic statistics

Last week, I published a cheat sheet and a post on how most common statistical tests are simple linear models. This started out as a hobby project last summer, but a few weeks ago, I realized that this was actually really important. So I spent many evenings polishing, and with my heart pounding, I tweeted:

It got a great reception and gathered more than half a million views on twitter within the first day.

On the bright side, this shows that people care about understanding statistics, and communicating it effectively. On the flip side, it may also reflect the fact that too many statistics courses consist of the rote-learning-rules-of-thumb-and-decision-trees which I seek to combat.

I was particularly excited to see support from notable scholars in statistics, including Russel Poldrack, Andy Field, and many others. However, my personal peak was when Andrew Gelman wrote a post about it! Or as my colleague put it:

I have also been extremely pleased that the community has joined in to improve it even further via the GitHub repository. It has been refined a great deal over the last week as a result. Follow the repository on GitHub if you want to stay up to date. Even better, raise an issue or submit a pull request. That would make me so happy!

Future guides

Much of what I demonstrate in that post has been known, published, and taught here and there for quite a while. I think that my main contribution was to lower the bar for understanding it, believing it, and teaching it.

Image result for easy gif

Naturally, the steady stream of likes and retweets has conditioned me to try more of this. Here’s the plan for future notebooks in the expected order of publication:

  • Update my notebook on Bayes Factors and put it on GitHub.
  • Finish a notebook on Utility Theory in time for the Bayes@Lund conference, where I will be presenting it.
  • Do a new notebook on Repeated Measures as mixed models, including RM-ANOVA, Split-plot ANOVA, McNemar, and Friedman.
  • In some way extending the post/cheat sheet (or making a new one) on how three statistical assumptions play out in all of these models.

A book?

I am also contemplating writing a book. There are a lot of good books out there already, and I don’t intend to compete with them. Rather, the book I have in mind should be mind-blowingly short and applied, covering 90% of a traditional textbook, including model checks and Bayesian inference, in 1/20th of the space.

Image result for small book

A third of those pages would have to be code examples and paper-and-pencil tasks so that it’s easily transferable to the real-world problems people encounter.

Other than everything-is-linear approach, there are a few other general-purpose tricks that can be pulled off to radically simplify statistical modeling. Having this all in a condensed yet accessible format would make it easier for the reader to get the larger picture.

SPSS is dying. It’s time to change.

I predict that R overtakes SPSS in yearly citations by 2020. The implications are clear:

  1. If you use SPSS in your business or research, move to R now rather than later.
  2. Do not ask for SPSS competences in job postings. You will scare away the good candidates.
  3. We are doing students a disservice by teaching SPSS. Switch to JASP for simple one-off analyses and R for complex or repeated analyses. Rstudio Desktop is a highly recommended interface to R.

The numbers

The numbers have been clear for a number of years now that SPSS was on the decline. It was very clearly exposed by Robert A. Muenchen in a comprehensive 2016-analysis of the use of data science software. Robert looked at everything from job postings to online queries to academic citations. I have updated two of these analyses to include data from 2017 and 2018: Google Search Trends and citations in the academic literature.

Here, we need to look at the trends rather than the absolute values for reasons I explain in the end of this post. Although R took a small dip in 2018, it is clear that it is getting traction. It is a good guess that R and SPSS will par citation-wise in 2019 and that R will have overtaken SPSS by 2020.

Let’s see why and what it means.

Why SPSS is dying

A few years ago, I wrote a blog about how a new GUI program, JASP, gets most things right, and how that exposing SPSS’s many shortcomings. SPSS simply feels old and unmaintained. Users have been screaming for simple statistics like Cohen’s d, confidence intervals on correlation coefficients, meta-analysis, etc., which has been a mandatory part of many major publication guidelines since 2000. This is not just some science formalia – these statistics are highly informative for industry as well. Despite repeated requests, SPSS has not implemented these, or many other standard statistical methods. SPSS now plans to change the GUI to match what JASP did four years ago.

SPSS simply feels old and unmaintained.

In addition, both industry and science now require greater reproducibility, transparency, and interaction with data. If you have ever tried using SPSS, you will know that it is fundamentally not fit for these.

Why R is surging

R saves you time. First, it is free, saving you (and your collaborators) time by not having to handle licensing and asking for budget approvals.

As a statistician, most of the time is spent pre-processing data before doing the statistics. Since the advent of tidyr in 2014, this has become incredibly easy to do. Perhaps more importantly, it has become much easier to read the code, which facilitates seamless collaboration and empowers you to learn much quicker from examples online. Pre-processing is often a non-linear process where you go back and forth. R is like editing a recipe in a text editor, and SPSS is like having to dictate the whole recipe on tape every time time you add a pinch of salt. JASP, by the way, is much closer to the recipe than the dictaphone.

[Concerning pre-processing,] R is like editing a recipe in a text editor, and SPSS is like having to dictate the whole recipe on tape every time you add a pinch of salt.

When researchers develop novel analysis methods, they will often publish them in user-friendly R-packages even before they publish the accompanying academic paper. For the most part, if you can think if it, it exists and is only one “install.package()” away. Not stumbling into software limitations saves you time.

Perhaps counter-intuitively, it turns out that students like programming (once they get started!) as it helps them better grasp what they are doing to the data than a point-and-click interface. In R, you can load data, pre-process it, and do a mixed model in just five relatively self-explanatory lines of code. It would take 20+ clicks in SPSS. If you want to do it for multiple datasets, you have to go through all that SPSS-clicking again (remember the dictaphone). It is really easy to miss a click and unknowingly get wrong results. Less repetition and debugging means more time saved in R.

After all, statistics is about the interaction and processing of variables. The same is true of programming. Therefore, programming requires less abstraction than graphical user interfaces. Programming is, of course, overkill for one-off analyses with little pre-processing. JASP, and its sibling Jamovi, are free graphical user interfaces to R that fills in this space.

Statistics is about the interaction and processing of variables. The same is true of programming. Therefore, programming requires less abstraction than graphical user interfaces.

Implications for industry and science

Consider SPSS a liability. Either weakly through taking more person-hours to use. Or strongly, through the increased risk of hard-to-detect errors.

Ask for R competences in job postings. If you ask for SPSS competences, you will select for applicants who are not up to date and filter out those who are, because they will want to avoid SPSS.

Consider SPSS a liability.

We should also stop teaching SPSS. Students spend a disproportionally large amount of learning the interface rather than learning the statistics. When they graduate, the cost of SPSS will incentivize them to avoid stats. JASP may be a good start for undergraduates because of the very shallow learning curve and sensible defaults. Then switch to R in the next semester to further empower the students. I have a few ideas (and a cheat sheet!) about how to improve stats teaching which would be easier in R.

Notes about the graph

The Google Trend values for R are likely inflated by the fact that R has a more technical audience which uses the internet more than SPSS users, who do less advanced analyses based on books. As a reflection of actual usage, we should probably just look at the trends which show that R is on the increase and SPSS is on the decline.

The absolute citation numbers in the graph is a bit misleading since it goes down while we know that the annual publication volume is increasing exponentially. It seems that we simply cite the analysis software less frequently than we used to. However, the relative popularity of software packages is still valid and R is close to overtaking SPSS.

To collect these data, I wrote a Google Scholar Scraper here (in R, of course) and I have posted the datasets here. I used Robert Muenchen’s search terms which I found valid. As a side effect, you can use this scraper to collect time-trends in all Google Scholar searches. I just happened to search for statistical packages.

Note that you need to update the full dataset to compare citations by year. For example, Robert found ~300.000 SPSS citations in 2011 where I found ~375.000 using the same search string. Google Scholar improves and more publications are retrospectively put online.