Foundations of Quantitative Research in Political Science
In April 2017, President Donald Trump was interviewed by Reuters to discuss some of his accomplishments in his first 100 days in office. At some point in the interview, President Trump handed the journalists a copy of this map of the 2016 presidential election. The areas shaded red represent areas where more people voted for Trump than for Hillary Clinton, and the areas shaded blue represent areas where more people voted for Hillary Clinton than for Donald Trump. Just by looking at the map, it appears at first glance that President Trump may have won the vote share by an overwhelming margin. Clearly, we can spot more areas shaded red than areas shaded blue. But this map is misleading. It is misleading mainly because it was used by the president and conservative media to represent the citizens who cast a vote for each candidate. But in reality, it does no such thing. The map actually represents territories, or masses of land, where Trump won more votes. In fact, when we look at this chart, it actually shows that Hillary Clinton won the popular vote share by a slight margin. From this example alone, we can see that although charts and graphs can be seductive and persuasive and can tell us something truthful about the world, they can also be misrepresented in serious ways, and without caution, they can be deceptive as well.
In this video, I'm going to show you a few examples of ways in which data and data visualizations may be misleading. This video is by no means meant to be a comprehensive overview of this topic. In the course module, we provide you with several links and citations of resources that you can use to dive a bit deeper. This video will help you identify when data, charts and graphs are being used to mislead and deceive, or when, if you're not careful, they may lead you to some faulty inferences about the data.
By the end of this video, you should be able to begin to understand what to look for when scrutinizing a chart or graph, and to know how to spot some of the lies or how a chart could potentially be misread if you're not cautious. With this in mind, you'll also be able to recognize the truth and insight that data visualizations can bring to bear. By recognizing the features of truthful graphs, you'll be able to design visualizations in your own work that are both smart and informative.
Let's take a look at our first example. In this part of the video, we're going to talk about something that is not necessarily deceitful but can lead you astray if you're not cautious. We're going to consider Simpson's paradox, which has all sorts of implications for how we report statistics, including the survivability of COVID-19. Let's compare China and Italy and focus on the case fatality rate. The case fatality rate is basically the proportion of deaths from a disease compared with the total number of people diagnosed with that disease over a specific time period. This chart here breaks down the case fatality rate by age range, in 10-year brackets. And if you look at any one of those age brackets here on the x-axis, say for 60-year-olds, well, it turns out 60-year-olds in Italy who have COVID-19 are more likely to survive than 60-year-olds with COVID-19 in China. This is also the case for 50-year-olds, 70-year-olds and 80-year-olds. In every single age bracket, according to this chart, you were more likely to survive COVID-19 if you were in Italy than if you were in China. But if we look at the total percentage of people who died of COVID-19, over here to the far right, we see that you were more likely to survive COVID-19 in China than in Italy. So how can this be?
How can it be that for every age bracket people in Italy were more likely to survive COVID-19 than people in China, and yet when we aggregate the total number of people who died of COVID-19, it turns out that you were more likely to survive if you were in China? To make sense of this puzzling relationship, let's look at this chart here, which shows a breakdown of the percentage of people who were simply diagnosed with COVID-19 by age range, in 10-year brackets. And what we see here is that, among people diagnosed with COVID-19, a higher proportion in Italy than in China are older patients, people in their seventies and eighties, rather than younger patients, say people in their twenties and thirties. So why does this matter? This matters because one fact that we know about COVID-19 is that younger people are more likely to survive than older people. So the fact that Italy has a higher proportion of older patients diagnosed with COVID-19 than China does means that its overall survival rate is pulled down, simply because older patients are much less likely to survive than younger patients. In other words, even though Italy is doing a relatively better job than China in terms of the survivability of COVID-19 for any individual age group, as we saw on the first chart, the total percentage of people who die of COVID-19 in Italy is pushed up simply because Italy happens to have a much higher proportion of patients who are more likely to die of COVID-19 than China does. And this gets at the heart of Simpson's paradox. Simpson's paradox is the phenomenon in which a trend that appears in data separated into groups reverses when the data are aggregated together. The lesson here is that in data and data visualization, a fundamental question is what level of aggregation you want to focus on. If you put everything together in one chart, then you may miss the full story, as is the case with Italy and China's handling of the coronavirus.
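To make the arithmetic behind this reversal concrete, here is a minimal sketch in Python. The counts are invented purely for illustration and are not the real Italian or Chinese figures; the names `italy_like`, `china_like`, `group_cfr` and `overall_cfr` are hypothetical. It computes the case fatality rate within each age group and then for all groups combined.

```python
# Hypothetical counts per age group: (diagnosed, deaths).
# These numbers are invented to show the reversal; they are NOT the real
# Italian or Chinese figures.
CASES = {
    "italy_like": {"younger": (100, 1), "older": (900, 90)},
    "china_like": {"younger": (900, 18), "older": (100, 12)},
}


def group_cfr(country, group):
    """Case fatality rate within one age group: deaths / diagnosed."""
    diagnosed, deaths = CASES[country][group]
    return deaths / diagnosed


def overall_cfr(country):
    """Case fatality rate after aggregating all age groups together."""
    diagnosed = sum(d for d, _ in CASES[country].values())
    deaths = sum(k for _, k in CASES[country].values())
    return deaths / diagnosed


# Compare the two countries group by group, then in aggregate.
for group in ("younger", "older"):
    print(group, group_cfr("italy_like", group), group_cfr("china_like", group))
print("overall", overall_cfr("italy_like"), overall_cfr("china_like"))
```

In this made-up data, the Italy-like country has the lower fatality rate in both age groups (1% versus 2%, and 10% versus 12%), yet the higher rate overall (9.1% versus 3.0%), because 90% of its diagnosed patients fall in the older, higher-risk group.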
But if you focus too much on one particular group or grouping and don't aggregate at all, for example, if you focus on individual counties or even individual people, then you can't really say too much about a trend because the data become too messy. So with data and statistics, you want to aggregate to some degree in order to have a meaningful story about the things you're interested in explaining, but at the same time, not aggregate to the point where you lose the bigger story. It's about being careful and knowing the causal relationships well enough to know where in the picture it's appropriate to aggregate and to what extent.
Okay, let's move on to our final example. Here, I wanna talk more specifically about misleading data visualizations. In particular, I'm going to talk about just some of the ways in which data visualizations may cause people to come to faulty inferences. Specifically, I wanna show you what happens when the chart isn't proportional to the data, or when the scales of measurement that the chart uses are incorrect, which can lead to charts that distort information. In September 2015, the U.S. Congress held a hearing with the leadership of Planned Parenthood. During this hearing, Congressman Jason Chaffetz showed this chart of just two of the services that Planned Parenthood provides. Even without reading the tiny numbers printed at the ends of each line, the graph that Congressman Chaffetz displayed appears to show clearly that Planned Parenthood provided fewer cancer screening and prevention services and performed more abortions between 2006 and 2013. But this chart is a distortion of the truth. This is mainly because the chart uses a completely different vertical scale for each variable. To see this, let's zoom in on the numbers. We can see that the cancer screening and prevention services that Planned Parenthood provides did indeed drop by about 50% between 2006 and 2013.
However, the number of abortions that Planned Parenthood performed between 2006 and 2013 increased from roughly 289,000 to 328,000, a rise of only about 13%. If we plot the figures again, but this time using a common scale, the resulting chart looks something like this.
Let's take a quick look at another example to see what I mean by this. In December 2015, the Obama White House tweeted this chart indicating that Americans were graduating high school at far higher rates than ever before, and that the high school graduation rate was at an all-time high. Notice that the graph makes it look like the graduation rate in the 2007 to 2008 school year, which was 75%, is less than half the graduation rate of the 2013 to 2014 academic year. But this is obviously a misrepresentation of the truth. Another way this graphic misleads is by failing to present all available data. In fact, we have information on American high school graduation rates that goes as far back as the 1970s. We can correct for these two distortions, and many other distortions like them, in a couple of ways. One way to do this is by simply putting the baseline at 0% and the upper limit at 100%. Doing so would create a graphic like this one. Since we also know that we have data on high school graduation rates going back farther in time and across many different presidents, we could show that as well. Notice in this graphic, however, that the baseline is not set to 0%. In this case, that's okay, since overall high school graduation rates in the United States do not dip below 70% very often, and the decision not to put the baseline at 0% is transparent. Tweaking the scales or proportions of a chart is just one of many ways that those who wish to conceal the truth do so through charts. The main lesson we should take away from these two examples is to pay close attention to things like scales and legends.
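As a rough way to check both examples by hand, here is a minimal sketch in Python. The function names `percent_change` and `drawn_height_ratio` are hypothetical. The abortion counts and the 75% graduation rate come from the figures quoted above; the 82% rate for 2013 to 2014 and the baseline of 70 are assumptions made for illustration only, since the video does not state them.

```python
def percent_change(old, new):
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100.0


def drawn_height_ratio(smaller, larger, baseline=0.0):
    """Ratio of the drawn heights of two values when the axis starts at baseline.

    With baseline=0 this equals the true ratio of the values; a raised
    baseline shrinks the smaller bar disproportionately.
    """
    if larger <= baseline:
        raise ValueError("baseline must lie below the larger value")
    return (smaller - baseline) / (larger - baseline)


# Abortions performed, 2006 versus 2013 (rounded figures quoted in the video).
print(round(percent_change(289_000, 328_000), 1))

RATE_2008 = 75.0  # graduation rate for 2007-2008, as quoted in the video
RATE_2014 = 82.0  # ASSUMED rate for 2013-2014, for illustration only

# True ratio of the two rates (axis starting at zero).
print(drawn_height_ratio(RATE_2008, RATE_2014))
# Apparent ratio with a hypothetical truncated baseline of 70.
print(drawn_height_ratio(RATE_2008, RATE_2014, baseline=70.0))
```

Under these assumptions, the abortion count rises by about 13.5%, and the earlier graduation rate is about 91% of the later one, yet with the baseline raised to 70 it is drawn at only about 42% of the height, which is how a chart can make 75% look like less than half of the later rate.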
If you would like to know more about how data visualizations, charts and graphs can lead you to potentially faulty conclusions, I strongly suggest you check out some of the resources we've provided in the course module. I also strongly recommend that you take the quiz at the end of this video to strengthen your knowledge of this topic. Thank you.