Posted by Austin Fossey

If you work with assessment statistics or just about any branch of social science, you may be familiar with Simpson’s paradox: the idea that trends observed within subgroups can reverse or disappear when the subgroups are aggregated. There are hundreds of examples of Simpson’s paradox (and I encourage you to search for some on the internet for kicks), but here is a simple example for the sake of illustration.

**Simpson’s Paradox Example**

Let us say that I am looking to get trained as a certified window washer so that I can wash windows on Boston’s skyscrapers. Two schools in my area offer training, and both had 300 students graduate last year. Graduates from School A had an average certification test score of 70.7%, and graduates from School B had an average score of 69.0%. Ignoring for the moment whether these differences are significant, as a student I will likely choose School A due to its higher average test scores.

But here is where the paradox happens. Consider now that I have a crippling fear of heights, which may be a hindrance for my window-washing aspirations. It turns out that School A and School B also track test scores for their graduates based on whether or not they have a fear of heights. The table below reports the average scores for these phobic subgroups.

Notice anything? In School B, the average scores for people with and without a fear of heights are both *higher* than for the same groups in School A. The paradox is that School A has a higher average test score overall, yet School B can boast better average test scores both for students with a fear of heights and for students without one. School B’s overall average is lower simply because it had more students with a fear of heights, and that lower-scoring subgroup carries more weight in its overall average. If we want to test the significance of these differences, we can do so with ANOVA.
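The arithmetic behind the reversal can be sketched in a few lines of Python. The subgroup counts and averages here are hypothetical, invented only so that each school has 300 graduates and the overall averages come out to the 70.7% and 69.0% quoted above. They are not the figures from the schools’ table.

```python
# Hypothetical (number of students, average score) per subgroup.
# These values are illustrative assumptions, not the article's table data.
schools = {
    "School A": {"fear of heights": (60, 57.5), "no fear of heights": (240, 74.0)},
    "School B": {"fear of heights": (150, 60.0), "no fear of heights": (150, 78.0)},
}


def overall_average(subgroups):
    """Average of the subgroup averages, weighted by subgroup size."""
    total_students = sum(n for n, _ in subgroups.values())
    total_points = sum(n * avg for n, avg in subgroups.values())
    return total_points / total_students


overall = {name: overall_average(groups) for name, groups in schools.items()}

for name, groups in schools.items():
    for label, (n, avg) in groups.items():
        print(f"{name} | {label}: n={n}, average={avg:.1f}%")
    print(f"{name} | overall: {overall[name]:.1f}%")
```

With these made-up numbers, School B wins in both subgroups, yet half of its students fall in the lower-scoring group compared with a fifth of School A’s, which is enough to pull its overall average below School A’s.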

**Gaviria and González-Barbera’s Steelyard Graph**

Simpson’s paradox occurs in many different fields, but it is sometimes difficult to explain to stakeholders. Tables (like the one above) are often used to illustrate the subgroup differences, but in the Fall 2014 issue of *Educational Measurement*, José-Luis Gaviria and Coral González-Barbera from the Universidad Complutense de Madrid won the publication’s data visualization contest with their Steelyard Graph, which illustrates Simpson’s paradox with a graph resembling a steelyard balance. The publication’s visual editor, ETS’s Katherine Furgol Castellano, wrote the discussion piece for the Steelyard Graph, praising Gaviria and González-Barbera for the simplicity of the approach and the novel yet astute strategy of representing averages with balanced levers.

The figure below illustrates the same data from the table above using Gaviria and González-Barbera’s Steelyard Graph approach. The size of the squares corresponds to the number of students, the location on the lever indicates the average subgroup score, and the triangular fulcrum represents the school’s overall average score. Notice how clear it is that the subgroups in School B have higher average scores than their counterparts in School A. The example below has only two subgroups, but the same approach can be used for more subgroups.
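The lever metaphor works because a weighted average is exactly the point where a lever balances. As a rough sketch (not Gaviria and González-Barbera’s own code), the layout of such a graph could be computed as follows. The subgroup numbers are hypothetical, and drawing each square with an area proportional to the number of students is my assumption about how “size” is scaled.

```python
import math

# Hypothetical (number of students, average score) per subgroup,
# invented for illustration; not the article's table data.
school_b = {"fear of heights": (150, 60.0), "no fear of heights": (150, 78.0)}
school_a = {"fear of heights": (60, 57.5), "no fear of heights": (240, 74.0)}


def steelyard_layout(subgroups):
    """Return the fulcrum position and each square's lever position and side length."""
    total_students = sum(n for n, _ in subgroups.values())
    # The fulcrum sits at the overall (size-weighted) average score.
    fulcrum = sum(n * avg for n, avg in subgroups.values()) / total_students
    squares = {
        label: {
            "position": avg,       # location on the lever = subgroup average
            "side": math.sqrt(n),  # assumption: square AREA is proportional to n
        }
        for label, (n, avg) in subgroups.items()
    }
    return fulcrum, squares


def net_torque(subgroups, pivot):
    """Sum of weight times signed distance from the pivot; zero means balanced."""
    return sum(n * (avg - pivot) for n, avg in subgroups.values())


for name, groups in (("School A", school_a), ("School B", school_b)):
    fulcrum, squares = steelyard_layout(groups)
    print(f"{name}: fulcrum at {fulcrum:.1f}%, net torque {net_torque(groups, fulcrum):.6f}")
```

The net torque comes out at zero (up to floating-point rounding) only when the pivot is placed at the overall average, which is why a large square far to the left can drag a school’s fulcrum down even when both of its squares sit to the right of the other school’s.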

Example of Gaviria and González-Barbera’s Steelyard Graph to visualize Simpson’s paradox for subgroups’ average test scores.

**Making a Decision when Faced with Simpson’s Paradox**

When one encounters Simpson’s paradox, decision-making can be difficult, especially if there are no theories to explain why the relational pattern is different at a subgroup level. This is why exploratory analysis often must be driven by and interpreted through a lens of theory. One could come up with arbitrary subgroups that reverse the aggregate relationships, even though there is no theoretical grounding for doing so. On the other hand, relevant subgroups may remain unidentified by researchers, though the aggregate relationship may still be sufficient for decision-making.

For example, as a window-washing student seeing the phobic subgroups’ performances, I might decide that School B is the superior school for teaching the trade, regardless of which subgroup a student belongs to. This decision is based on a theory that a fear of heights may impact performance on the certification assessment, in which case School B does a better job at preparing both subgroups for their assessments. If that theory is not tenable, it may be that School A is really the better choice, but as an acrophobic would-be window washer, I will likely choose School B after seeing this graph . . . as long as the classroom is located on the ground floor.