This tool can help users to ensure that they have developed a viable plan for collecting all the data necessary to answer each evaluation question and that all data collected will serve a specific, intended purpose.
Data cleaning is the process of verifying and editing data files to address issues of inconsistency and missing information. Errors in data files can appear at any stage of an evaluation, making it difficult to produce reliable data. Data cleaning is a critical step in program evaluation because clients rely on accurate results to inform decisions about their initiatives. Below are six essential steps I include in my data cleaning process to minimize issues during data analysis:
1. Compare the columns of your data file against the columns of your codebook.
Sometimes unexpected columns might appear in your data file or columns of data may be missing. Data collected from providers external to your evaluation team (e.g., school districts) might include sensitive participant information like social security numbers. Failures in software used to collect data can lead to responses not being recorded. For example, if a wireless connection is lost while a file is being downloaded, some information in that file might not appear in the downloaded copy. Unnecessary data columns should be removed before analysis and, if possible, missing data columns should be retrieved.
2. Check your unique identifier column for duplicate values.
An identifier is a unique value used to label a participant and can take the form of a person’s full name or a number assigned by the evaluator. Multiple occurrences of the same identifier in a data file usually indicate an error. Duplicate identifier values can occur when participants complete an instrument more than once or when a participant identifier is mistakenly assigned to multiple records. If participants move between program sites, they might be asked to complete a survey for a second time. Administrators might record a participant’s identifier incorrectly, using a value assigned to another participant. Data collection software can malfunction and duplicate rows of records. Duplicate records should be identified and resolved.
3. Transform categorical data into standard values.
Non-standard data values often appear in data gathered from external data providers. For example, school districts often provide student demographic information but vary in the categorical codes they use. For example, the following table shows a range of values I received from different districts to represent students’ ethnicities:
To aid in reporting on participant ethnicities, I transformed these values into the race and ethnicity categories used by the National Center for Education Statistics.
When cleaning your own data, you should decide on standard values to use for categorical data, transform ambiguous data into a standard form, and store these values in a new data column. OpenRefine is a free tool that facilitates data transformations.
4. Check your data file for missing values.
Missing values occur when participants choose not to answer an item, are absent the day of administration, or skip an item due to survey logic. If missing values are found, apply a code to indicate the reason for the missing data point. For example, 888888 can indicate an instrument was not administered and 999999 can indicate a participant chose not to respond to an item. The use of codes can help data analysts determine how to handle the missing data. Analysts sometimes need to report on the frequency of missing data, use statistical methods to replace the missing data, or remove the missing data before analysis.
5. Check your data file for extra or missing records.
Attrition and recruitment can occur at all stages of an evaluation. Sometimes people who are not participating in the evaluation are allowed to submit data. Check the number of records in your data file against the number of recruited participants for discrepancies. Tracking dates when participants join a project, leave a project, and complete instruments can facilitate this review.
6. Correct erroneous or inconsistent values.
When instruments are completed on paper, participants can enter unexpected values. Online tools may be configured incorrectly and allow illegal values to be submitted. Create a list of validation criteria for each data field and compare all values against this list. de Jonge and van der Loo provide a tutorial for checking invalid data using R.
Data cleaning can be a time-consuming process. These checks can help reduce the time you spend on data cleaning and get results to your clients more quickly.
Metadata and Information Specialist – Internet Scout Research Group
University of Wisconsin-Madison
Director/PI – Internet Scout Research Group
University of Wisconsin-Madison
As most ATE community members are aware, the National Science Foundation requires that all grant applicants provide a one- to two-page data management plan describing how the grantee’s proposal will meet NSF guidelines on the dissemination of grant-funded work. In 2014, NSF added a new requirement to the ATE solicitation mandating that newly funded grantees archive their deliverables with ATE Central.
We were curious to find out more about the materials created within the ATE community. So, when EvaluATE approached us about including questions related to data management planning and archiving in their annual survey of ATE grantees, we jumped at the chance. We had an interest in discovering not only what resources have been created, but also how those resources are disseminated to larger audiences. Additionally, we hoped to discover whether grantees are actively making their materials web accessible to users with disabilities—a practice that ensures access by the broadest possible audience.
The survey responses highlight that the most widely created materials include (not surprisingly) curriculum and professional development materials, with newsletters and journal articles taking up the rear. Other materials created by the ATE community include videos, white papers and reports, data sets, and webinars.
However, although grantees are creating a lot of valuable resources, they may not be sharing them widely and, in some cases, may be unsure of how best to make them available after funding ends. The graphs below illustrate the available of these materials, both currently and after grant funding ends.
Data from the annual survey shows that 65 percent of respondents are aware of accessibility standards—specifically Section 508 of the Rehabilitation Act; however, 35 percent are not. Forty-eight percent of respondents indicated that some or most of their materials are accessible, while another 22 percent reported that all materials generated by their project or center adhere to accessibility standards. Happily, only 1 percent of respondents reported that their materials do not adhere to standards; however, 29 percent are unsure whether their materials adhere to those standards or not.
Many of us struggle with issues related to sustaining our resources, which is part of the reason we are all asked by NSF to create a data management plan. To help PIs plan for long-term access, ATE Central offers an assortment of free services. Specifically, ATE Central supports data management planning efforts, provides sustainability training, and archives materials created by ATE projects and centers, ensuring access to these materials beyond the life of the project or center that created them.
Over the course of analyzing data for a number of projects, I have encountered several situations that can make it hard to work with spreadsheet-based data, especially when working with multiple programs. Today, I would like to share some frustration-saving tips for preparing spreadsheet data for use with programs like Excel, Access, and SPSS.
1) In Excel, use an apostrophe at the beginning of numbers that start with a zero: Excel will usually remove zeros at the beginning of a field, so ZIP Codes or other numbers with leading zeros must be entered with a single apostrophe in front. Do not worry about this with numbers that are not categorical.
2) Minimize the number of columns in your data download: Survey systems like Qualtrics or Survey Monkey usually provide an option to export single-question responses as multiple columns, (i.e., Gender_male, Gender_female, and Gender_other). Using this option can make it difficult to clean and analyze your data later. If the question was a “select all that apply” type question, using multiple columns is appropriate. If the respondent can only select one option, then keep the answers as one column, with distinct codes for each type of response.
3) Use simple column headings when setting up spreadsheets: Both SPSS and Access like to read the first row of spreadsheets as variable names. The second row will always be interpreted as data. To save yourself frustration, instead of putting the question title “GPA” in the first line, and a series of years in the second line, simply have one line of variable names that includes “GPA2014” and “GPA2015.”
4) Avoid special characters: Many computer programs hate special characters (like quotation marks or @ symbols) in variable names or text fields. For example, Access will read quotation marks as delimiters in text fields during some operations, which will trigger errors that can cause the database to fault.
5) Use built-in number formats instead of hand-entering symbols: Avoid hand-entering percent symbols and dollar signs in Excel fields. Instead, enter .xx for percentages and xx.xx for dollars, and then set the number format for the whole column to percentage or currency (drop down menu on the ribbon). Excel will keep the numbers as entered, but will display them properly. Also make sure all cells in a column are set to the same number format.
6) Assign a code to identify missing data: Empty cells are interpreted differently by different programs, depending on the data type. Without going into too much detail as to the reason why, using a number that could never appear in your data (like 99999) is better than using zeros to represent missing data. Use formulas to exclude those codes from any calculations you do in Excel (using the codes above, the Excel formula =averageif(a:a,” <777” would exclude cells in column A that are coded as missing). This is also more upload-friendly to Access or SPSS. Use the word “None” to represent missing qualitative data in text columns. Doing this will speed up error checking and upload/transfer to other programs.
Following these tips can save you time and effort in the long run.
The Formative Assessment Systems for ATE project (FAS4ATE) focuses on assessment practices that serve the ongoing evaluation needs of projects and centers. Determining these information needs and organizing data collection activities is a complex and demanding task, and we’ve used logic models as a way to map them out. Over the next five weeks, we offer a series of blog posts that provide examples and suggestions of how you can make formative assessment part of your ATE efforts. – Arlen Gullickson, PI, FAS4ATE
Week 5 – How am I supposed to keep track of all this information?
I’ve been involved in NSF center and project work for over 17 years now. When it comes to keeping track of information from meetings, having a place to store evaluation data, and tracking project progress, there are a few habits and apps I’ve found particularly useful.
Habit: backing up my files
All the apps I use are cloud-based, so I can access my files anywhere, anytime, with any device. However, I also use Apple’s Time Machine to automatically back up my entire system on a daily basis to an external hard drive. I also have three cloud-based storage accounts (Dropbox, Google Drive, and Amazon Cloud Drive). When the FAS4ATE files on Dropbox were accidentally deleted last year, I could upload my backup copy and we recovered everything with relative ease.
Habit + App: Keeping track of notes with Evernote
I’ve been using Evernote since the beta was released in 2008 and absolutely love it. If you’ve been in a meeting with me and you’ve seen me typing away – I’m not emailing or tweeting – I’m taking meeting notes using Evernote. Notes can include anything: text, pictures, web links, voice memos, etc., and you can attach things like word documents, spreadsheets, etc. Notes are organized in folders and are archived and searchable from any connected devices. There are versions available for all of the popular operating systems and devices, and notes can easily be shared among users. If it’s a weekend and I’m five miles off the coast fishing and you call me about a meeting we had seven years ago, guess what? With a few clicks I can do a search from my phone, find those notes, and send them to you in a matter of seconds. Evernote has both a free, limited version and inexpensive, paid version.
When we first started with the FAS4ATE project, we thought we’d be developing our own cloud-based logic model dashboard-type app. We decided to start by looking at what was out there, so we investigated lots of project management apps like Basecamp. We tried to force Evernote into a logic model format; we liked DoView. However, at this time we’ve decided to go with LucidChart. LucidChart is a web-based diagramming app that runs in a browser and allows multiple users to collaborate and work together in real time. The app allows in-editor chat, comments, and video chat. It is fully integrated with Google Drive and Microsoft Office 2013 and right now appears to be our best option for collaborative (evaluator, PI, etc.) logic model work. You may have seen this short video logic model demonstration.
As we further develop our logic model-based dashboard, we’ll be looking for centers and projects to pilot it. If you are interested in learning more about being a pilot site, contact us by emailing Amy Gullickson, one of our co-PIs, at firstname.lastname@example.org. We’d love to work with you!