Foundations of Quantitative Research in Political Science
- Almost all of you are familiar with a university's online system that allows you to register for courses. We can see from this screenshot a table that displays relevant information for registering for a particular class. We can even see that this information is organized in a deliberate way. For example, notice in this table that each of these columns corresponds to a different attribute of a course we might want to register for, such as the time the course takes place, the room location, and the instructor of record. Notice also that each individual row corresponds to a unique time and section of the same class. In the natural and social sciences, we often use datasets that are organized like this in order to answer empirical questions. In this video, we're going to introduce you to something that is conceptually very similar to this table but crucial to master in order to do empirical social science successfully. We are going to focus on three learning objectives: first, defining what a dataset is and how datasets are commonly organized; second, identifying what a codebook is and how to use one; and finally, understanding conceptually and substantively the differences between a unit of analysis and a unit of observation.
But before diving into this, let's first start off with some basic definitions. A dataset is simply a collection of data, usually organized into one document or table. A dataset consists of two key components: first, the units that we're interested in studying, and second, a set of attributes of the units under study, which we call variables. Datasets are organized by rows and columns, where the units of observation correspond to rows and variables correspond to columns. Datasets can have information about a great number of things. For example, datasets can be about different countries, different states, or different people. In a dataset, these things are the unit of analysis: the entity that frames the focus of study.
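To make the row-and-column layout just described a little more concrete, here is a minimal Python sketch. It is only an illustration: the variable names (`respondent_id`, `age`, `party`) and every value are invented and do not come from any dataset shown in this video.

```python
# Hypothetical toy dataset; all names and values are invented for illustration.
# Each dict is one row (a unit of observation); each key is one variable (a column).
rows = [
    {"respondent_id": 1, "age": 34, "party": "Democrat"},
    {"respondent_id": 2, "age": 58, "party": "Republican"},
    {"respondent_id": 3, "age": 41, "party": "Independent"},
]

# The variables are the column names shared by every row.
variables = list(rows[0].keys())

# The number of rows is the number of units of observation.
n_units = len(rows)


def column(dataset, name):
    """Return every unit's value for one variable, i.e. one column."""
    return [row[name] for row in dataset]


print(variables)
print(n_units)
print(column(rows, "age"))
```

Reading across a row gives everything recorded about one unit, while reading down a column gives one variable's values for all units.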
The unit of analysis is the who or what for which information is analyzed and conclusions are drawn. In a dataset about people, such as survey data, the unit of analysis is the individual person. But we distinguish the unit of analysis from the unit of observation. The unit of observation refers to the who or what for which data are measured or collected. To return to our survey data example, the unit of observation is each unique survey respondent. In this case, the rows and data points in survey data correspond to different individuals who participated in the survey.
Let's take a look at our first example of a dataset. Here, we have data from a public opinion survey taken by the Chicago Council on Global Affairs, a think tank that analyzes critical global issues. The Chicago Council survey asks a random sample of adult Americans questions about things like their preferences on defense spending, whether they think voters have too much or too little influence on foreign policy decision-making, or whether they think globalization has been a good or bad thing for the United States. Notice first that this dataset is a survey of individuals, with every single row corresponding to an individual survey respondent. As I mentioned before, this means that our unit of analysis is the individual person and the unit of observation is each individual who took the survey. We can see by scrolling through the dataset that there are a large number of different variables. Many of these variables correspond to a particular question the survey asks the respondent, while others are about the demographics of a particular respondent, such as their gender, race, age, and whether they are affiliated with a particular political party.
How can we know exactly what questions were asked and what the values in the cells correspond to? Every good dataset comes with something called a codebook.
A codebook is simply a manual that gives interested readers information about the dataset, such as what the units of observation are, the variables available for analysis, and how the variables are measured. For example, we can use this codebook to check whether the 2010 Chicago Council survey asks individuals about their preferences on military spending. One way to go about doing this is to scroll through the codebook and check to see if these types of questions are asked. An alternative, and perhaps quicker, way is to hit Ctrl+F if you use a Windows-based computer or Cmd+F if you use macOS and type the word defense to quickly search through questions that mention this word. We can see from this example that the survey did indeed ask respondents about their defense spending preferences. The codebook also tells us how our variables of interest are measured. This particular variable, whether a survey respondent prefers spending on defense to be increased, cut back, or kept about the same, is a categorical variable. We can see here that a respondent receives a 1 in the dataset if they say that they prefer spending on defense to be increased, a 2 if they prefer defense spending to be cut back, a 3 if they prefer defense spending to be kept about the same, and a 4 if they said they were not sure. This variable is categorical because it's not clear where the not sure category should be placed in the list of options that respondents can choose from. We can confirm this when we look at the column in our dataset that corresponds to this particular variable.
Now, let's take a look at a different dataset. Here we have data from the Correlates of War project on world religions. This dataset differs from the one we just examined in a variety of ways. First, notice something about the units of observation. The unit of observation in this case is a country-year and not individual people like in our last example.
This can be seen by looking at two columns here. The column to the far left tells us the particular year the data were recorded in, and this particular column corresponds to particular countries, not individuals. The implication of this is that the data we see in the world religions data are aggregate data. In other words, each cell represents either the number or the percentage of people who identify with some religion in a particular country in a specific year. For example, our codebook tells us that chrstcat is the total number of individuals who self-identify as Christian Catholics in a specific country in a particular year. The cell highlighted here tells us the estimated total number of those who identify as Christian Catholics in the United States in the year 1995.
A second implication of aggregate data is that we can only test hypotheses about particular countries or country-years, not about individuals. The survey data from the last example gave us information about things like an individual's political preferences, their religion, and their personal income. The data under consideration here are not as fine-grained. They give us aggregate indicators of particular religions in specific countries and specific years, not information about individuals themselves. For example, one hypothesis that we can test with the world religions dataset is that countries with higher percentages of people who report being religious have stricter abortion laws than countries with lower percentages of people who report being religious. This is not a hypothesis we could test with individual-level survey data. With individual-level survey data, we could potentially test the hypothesis that people who report being religious are more likely to support stricter abortion laws than those who report being non-religious. The crucial difference is in what we are comparing. In the first example, we can make comparisons between countries, not individuals.
However, with data that have individuals as the unit of analysis, we can make comparisons between individual people, but not between aggregate units like countries.
To briefly recap, the objective of this module is for you to understand what a dataset is and how datasets are commonly organized, to know what a codebook is and how to use one, and to understand the conceptual and substantive differences between a unit of analysis and a unit of observation. Having a firm grasp of the substance we covered in this video is crucial for doing any quantitative social science. Knowing what our unit of analysis and units of observation are allows us to know what kinds of hypotheses we can and cannot test. A dataset's codebook allows us to understand our data in an efficient and transparent way. I encourage you to go back and review anything you didn't quite get the first time and to reinforce your understanding by answering some of the quiz questions that we provided for you at the end of these videos. We have also collected additional sources about datasets that you can refer to for further study. Thanks.
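As a supplement to the codebook discussion in this video, the following Python sketch shows one way the numeric codes for the defense spending question could be translated back into their labels. The four codes follow the coding described in the video (1 increased, 2 cut back, 3 kept about the same, 4 not sure). The name `DEFENSE_SPENDING_CODES` and the list of responses are hypothetical and invented for illustration; they are not taken from the actual survey file.

```python
from collections import Counter

# Code-to-label mapping as described in the video's codebook walkthrough.
# The dictionary name is hypothetical, not the survey's real variable name.
DEFENSE_SPENDING_CODES = {
    1: "Increased",
    2: "Cut back",
    3: "Kept about the same",
    4: "Not sure",
}

# Invented responses: one numeric code per survey respondent (unit of observation).
responses = [1, 3, 2, 3, 4, 1, 3]


def decode(values, codebook):
    """Translate numeric codes into the labels given in the codebook."""
    return [codebook[value] for value in values]


labels = decode(responses, DEFENSE_SPENDING_CODES)

# Tabulate how many respondents chose each category.
counts = Counter(labels)
print(counts)
```

Because the variable is categorical, the numbers act only as labels, so the sketch counts responses per category rather than averaging the codes.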