Data cleaning is the process of verifying and editing data files to address issues of inconsistency and missing information. Errors in data files can appear at any stage of an evaluation, making it difficult to produce reliable data. Data cleaning is a critical step in program evaluation because clients rely on accurate results to inform decisions about their initiatives. Below are six essential steps I include in my data cleaning process to minimize issues during data analysis:

1. Compare the columns of your data file against the columns of your codebook.

Sometimes unexpected columns might appear in your data file or columns of data may be missing. Data collected from providers external to your evaluation team (e.g., school districts) might include sensitive participant information like social security numbers. Failures in software used to collect data can lead to responses not being recorded. For example, if a wireless connection is lost while a file is being downloaded, some information in that file might not appear in the downloaded copy. Unnecessary data columns should be removed before analysis and, if possible, missing data columns should be retrieved.

2. Check your unique identifier column for duplicate values.

An identifier is a unique value used to label a participant and can take the form of a person’s full name or a number assigned by the evaluator. Multiple occurrences of the same identifier in a data file usually indicate an error. Duplicate identifier values can occur when participants complete an instrument more than once or when a participant identifier is mistakenly assigned to multiple records. If participants move between program sites, they might be asked to complete a survey for a second time. Administrators might record a participant’s identifier incorrectly, using a value assigned to another participant. Data collection software can malfunction and duplicate rows of records. Duplicate records should be identified and resolved.

3. Transform categorical data into standard values.

Non-standard data values often appear in data gathered from external data providers. For example, school districts often provide student demographic information but vary in the categorical codes they use. For example, the following table shows a range of values I received from different districts to represent students’ ethnicities:

Hubbard-Graphic

To aid in reporting on participant ethnicities, I transformed these values into the race and ethnicity categories used by the National Center for Education Statistics.

When cleaning your own data, you should decide on standard values to use for categorical data, transform ambiguous data into a standard form, and store these values in a new data column.  OpenRefine is a free tool that facilitates data transformations.

4. Check your data file for missing values.

Missing values occur when participants choose not to answer an item, are absent the day of administration, or skip an item due to survey logic. If missing values are found, apply a code to indicate the reason for the missing data point. For example, 888888 can indicate an instrument was not administered and 999999 can indicate a participant chose not to respond to an item. The use of codes can help data analysts determine how to handle the missing data. Analysts sometimes need to report on the frequency of missing data, use statistical methods to replace the missing data, or remove the missing data before analysis.

5. Check your data file for extra or missing records.

Attrition and recruitment can occur at all stages of an evaluation. Sometimes people who are not participating in the evaluation are allowed to submit data. Check the number of records in your data file against the number of recruited participants for discrepancies. Tracking dates when participants join a project, leave a project, and complete instruments can facilitate this review.

6. Correct erroneous or inconsistent values.

When instruments are completed on paper, participants can enter unexpected values. Online tools may be configured incorrectly and allow illegal values to be submitted. Create a list of validation criteria for each data field and compare all values against this list. de Jonge and van der Loo provide a tutorial for checking invalid data using R.

Data cleaning can be a time-consuming process. These checks can help reduce the time you spend on data cleaning and get results to your clients more quickly.

About the Authors

Aleata Hubbard

Aleata Hubbard box with arrow

Research Associate, WestEd's STEM Program

Aleata Hubbard is a Research Associate in WestEd's STEM program and a doctoral candidate in the Learning Sciences program at Northwestern University. She conducts evaluation and research studies in computer science and mathematics education. Ms. Hubbard also designs and implements data management processes for large-scale efficacy studies of educational interventions. Before joining WestEd, she studied collaborative learning environments and pedagogies to support computer science students and second language learners and spent a year teaching programming skills to aspiring entrepreneurs in Accra, Ghana.

Creative Commons

Except where noted, all content on this website is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Related Blog Posts

Nation Science Foundation Logo EvaluATE is supported by the National Science Foundation under grant number 1841783. Any opinions, findings, and conclusions or recommendations expressed on this site are those of the authors and do not necessarily reflect the views of the National Science Foundation.