Political Science Educator: volume 27, issue 1
Reflections
Steven Perry, Rice University
Over the past few years, I have taught a series of undergraduate and graduate courses on policy analysis and methodology, data visualization, and R programming for quantitative analysis. In each of these courses, as an exercise in data exploration and in overcoming the challenges of working with real data, I ask my students to explore a seemingly simple public policy question: What happens to the animals dropped off at municipal animal shelters?
As part of the activity, I provide students a dataset of animal records from the animal shelter of a major Southern US city. The dataset contains records on almost 77,000 animals across a three-year time span, and is the epitome of a challenging, unclean dataset. For example, there is no codebook or uniform coding scheme, multiple variables have incorrect or illogical information (such as a variable on an animal’s duration in the shelter with many negative observations), and many values are missing or are mistyped. To further complicate analysis, many observations are identical duplicates, but some apparent duplicates are records of the same animal entering the shelter more than once.
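The kinds of problems described above can be sketched in a few lines of code. The example below is a minimal illustration in Python, not the code used in the course (which is taught in R). The records, the field names (`animal_id`, `intake_date`, `days_in_shelter`, `outcome`), and the cleaning rules are all hypothetical stand-ins for the real shelter data.

```python
from collections import Counter
from datetime import date

# Hypothetical field names; the real dataset has no codebook.
FIELDS = ("animal_id", "intake_date", "days_in_shelter", "outcome")

# Invented records that mimic the problems described in the text.
records = [
    {"animal_id": "A001", "intake_date": date(2020, 1, 5), "days_in_shelter": 12, "outcome": "Adopted"},
    {"animal_id": "A001", "intake_date": date(2020, 1, 5), "days_in_shelter": 12, "outcome": "Adopted"},   # identical duplicate
    {"animal_id": "A001", "intake_date": date(2021, 3, 9), "days_in_shelter": 4, "outcome": "adopted "},   # same animal, second intake
    {"animal_id": "A002", "intake_date": date(2020, 2, 1), "days_in_shelter": -3, "outcome": "Transfer"},  # illogical duration
    {"animal_id": "A003", "intake_date": date(2020, 2, 7), "days_in_shelter": None, "outcome": ""},        # missing values
]


def clean(rows):
    """Drop identical duplicates, then tidy the surviving rows."""
    seen = set()
    cleaned = []
    for row in rows:
        # Rows count as duplicates only if every field matches, so a
        # repeat intake of the same animal on a new date is kept.
        key = tuple(row[field] for field in FIELDS)
        if key in seen:
            continue
        seen.add(key)

        # Normalize case and stray whitespace; treat empty text as missing.
        outcome = (row["outcome"] or "").strip().lower() or None

        # A negative stay is impossible, so record it as missing
        # rather than guessing at the intended value.
        days = row["days_in_shelter"]
        if days is not None and days < 0:
            days = None

        cleaned.append({**row, "outcome": outcome, "days_in_shelter": days})
    return cleaned


cleaned = clean(records)

# Animals that legitimately appear more than once after de-duplication.
intake_counts = Counter(row["animal_id"] for row in cleaned)
repeat_intakes = {aid: n for aid, n in intake_counts.items() if n > 1}

print(len(records), "raw rows ->", len(cleaned), "cleaned rows")
print("Repeat intakes:", repeat_intakes)
```

Each rule here is a judgment call that students must defend. Treating a negative duration as missing, for instance, discards information that might be recoverable if the intake and outcome dates were simply swapped.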
In small teams, students must wade through the data to uncover insights and identify trends about which features and characteristics affect an animal’s likelihood of being adopted and how long it remains in the shelter. Each group is assigned one specific question to investigate, and by the end of class, the team is responsible for cleaning the specific variables they need for their analysis, generating descriptive or summary statistics, running a model, and creating at least one visual that effectively communicates their results.
While this activity is not focused on traditional methodology-course topics such as statistical modeling, empirical analysis, or model selection, from my perspective learning to overcome the obstacles and challenges of real-world, imperfect data is one of the most valuable lessons we can provide our students. In our statistics, methods, and research design courses, we too often focus on the math and programming sides of the project, to the detriment of the other skills students need to actually carry out a data analysis project.
According to Anaconda’s 2022 State of Data Science survey, data science practitioners report spending almost 40% of their time on data preparation and cleaning, an amount far greater than the time they invest in selecting, training, and implementing models (Anaconda, 2022). This is particularly troubling given that data practitioners routinely report that data cleaning is the area of their work that they enjoy the least and, to some extent, the area in which they received the least formal training. In addition to acquiring and cleaning data, survey respondents also report spending an additional 29% of a project on reporting, presenting, and visualizing their results and data-driven insights. From these results, it is clear that there is a substantial imbalance between the time practitioners devote to these data soft skills and the attention we give these topics in our standard statistics and methodology curriculum.
As we look to how to best prepare our students to be effective data-literate practitioners outside of our classrooms, we cannot overlook the need to expose our students to these vital skills. If we want to prepare our students to meaningfully engage with data in their own research projects, their careers, or in graduate school, we must make sure that the data knowledge and familiarity they gain inside the classroom effectively mirror what they will encounter later, when working on real-world projects. When we focus solely on data modeling, statistical techniques, and other end-of-the-project analysis, we remove a key opportunity for our students to learn from the struggles of locating, acquiring, and cleaning their own data.
When we only prepare our students to work with data that has already been acquired, cleaned, and sanitized of any coding errors, nonsensical values, or mistakes, we do a great disservice to our students and their data literacy skills. Real-world data is messy. It has coding errors and mistakes, is incomplete, has missing observations, and often lacks any semblance of a useful codebook. Learning to work with the realities and limitations of real-world data is a skill no less important than effective research design or empirical modeling. We do our students no favors by focusing solely on the analysis portion of a data project while ignoring the hard work of data collection, cleaning, and transformation that our students will need to undertake in any data-related career. As we work to build our students’ skills in quantitative analysis, coding, and general data literacy, it is critically important that we focus not only on the hard skills of statistical analysis and empirical modeling, but also on the soft skills necessary to successfully carry out a data project: how to find data, how to clean it, and how to communicate its insights to others.
References
Anaconda. 2022. 2022 State of Data Science Report. Retrieved June 13, 2023 (https://www.anaconda.com/resources/whitepapers/state-of-data-science-report-2022/).
—
Steven Perry is a Lecturer in Data Visualization for Rice University’s Program in Writing and Communication, where he teaches courses on data visualization, quantitative analysis, and R programming. His research interests include political psychology, individual decision making, and effective data communication.
Published since 2005, The Political Science Educator is the newsletter of the Political Science Education Section of the American Political Science Association. All issues of The Political Science Educator can be viewed on APSA Connects Civic Education page.
Editors: Colin Brown (Northeastern University), Matt Evans (Northwest Arkansas Community College)
Submissions: editor.PSE.newsletter@gmail.com
APSA Educate has republished The Political Science Educator since 2021. Any questions or corrections to how the newsletter appears on Educate should be addressed to educate@apsanet.org