Subscribe here for quick access to our latest blog posts. New to RSS feeds? Click here

Blog: Tips for Data Preparation

Posted on June 17, 2015 by  in Blog

EvaluATE Blog Editor, The Evaluation Center at Western Michigan University

Over the course of analyzing data for a number of projects, I have encountered several situations that can make it hard to work with spreadsheet-based data, especially when working with multiple programs. Today, I would like to share some frustration-saving tips for preparing spreadsheet data for use with programs like Excel, Access, and SPSS.

1) In Excel, use an apostrophe at the beginning of numbers that start with a zero: Excel will usually remove zeros at the beginning of a field, so ZIP Codes or other numbers with leading zeros must be entered with a single apostrophe in front. Do not worry about this with numbers that are not categorical.

2) Minimize the number of columns in your data download: Survey systems like Qualtrics or Survey Monkey usually provide an option to export single-question responses as multiple columns, (i.e., Gender_male, Gender_female, and Gender_other). Using this option can make it difficult to clean and analyze your data later. If the question was a “select all that apply” type question, using multiple columns is appropriate. If the respondent can only select one option, then keep the answers as one column, with distinct codes for each type of response.

3) Use simple column headings when setting up spreadsheets: Both SPSS and Access like to read the first row of spreadsheets as variable names. The second row will always be interpreted as data. To save yourself frustration, instead of putting the question title “GPA” in the first line, and a series of years in the second line, simply have one line of variable names that includes “GPA2014” and “GPA2015.”

4) Avoid special characters: Many computer programs hate special characters (like quotation marks or @ symbols) in variable names or text fields. For example, Access will read quotation marks as delimiters in text fields during some operations, which will trigger errors that can cause the database to fault.

5) Use built-in number formats instead of hand-entering symbols: Avoid hand-entering percent symbols and dollar signs in Excel fields. Instead, enter .xx for percentages and xx.xx for dollars, and then set the number format for the whole column to percentage or currency (drop down menu on the ribbon). Excel will keep the numbers as entered, but will display them properly. Also make sure all cells in a column are set to the same number format.

6) Assign a code to identify missing data: Empty cells are interpreted differently by different programs, depending on the data type. Without going into too much detail as to the reason why, using a number that could never appear in your data (like 99999) is better than using zeros to represent missing data. Use formulas to exclude those codes from any calculations you do in Excel (using the codes above, the Excel formula =averageif(a:a,” <777” would exclude cells in column A that are coded as missing). This is also more upload-friendly to Access or SPSS. Use the word “None” to represent missing qualitative data in text columns. Doing this will speed up error checking and upload/transfer to other programs.

Following these tips can save you time and effort in the long run.