Do you know where your data has been?
The saying “rubbish in; rubbish out” really does make sense with data. The tips and hints contained in this eBook will help make sure that your data sets are free from duplications and text errors. You’ll reap the rewards by saving time when you start analysing the data as you will be able to create reports from your clean data set the first time.
In this eBook, we are going to take a look at some tools and tips in Excel which can help you clean up your data to give you a robust starting point for further analysis.
It can be really frustrating to complete some complex analysis of a large data set, only to realise that somewhere along the line you've corrupted the data and the result can no longer be relied on, because the data no longer accurately reflects the original starting point.
To ensure you keep your data ‘clean’ and problem-free we are going to look at the following features in this eBook:
- Spell Check
- Remove duplicate values
- Find and replace text
- Find and replace blank cells
- Check if a cell or part of a cell matches specific text
- Changing the case of text
- Removing spaces and non-printing characters
- Merging and splitting columns
- Reconciling table data by watching key totals
You may find it useful to consider each of these points in turn whenever you are presented with data to analyse.
Before you begin
Before you start to manipulate data, always remember to make a copy of your original dataset and keep it somewhere safe. That way you will always have the absolute raw data to refer back to at any point, and you can check that your current data set still agrees with the original. Make the copy in one of two ways:
- Insert a separate worksheet in the same workbook, mark it as the original dataset, and perform the data cleansing in a different worksheet. A good tip here is to change the tab colour as well as naming it, to make sure you don't stray and complete any work in that worksheet.
- Alternatively, save the original as a separate file and perform all of the data manipulations in a different workbook.
When you’re happy with your data and satisfied that everything matches, you can then remove the original dataset.
Spell Check
One of the key features we tend to overlook when working in Excel is the spell check. Always remember that several different people may have contributed to the data that makes up the dataset, and different people will always input data differently.
Consider how often you rely on spell checks in PowerPoint, Outlook or Word documents. The same attention to detail needs to be used when working in Excel.
To perform a spell check, go to the Review tab.
From within the Proofing grouping, select Spelling.
The spell check looks and works the same as in the rest of the Microsoft Office stack.
An overall spell check is a great place to start to ensure the data in your document is correct.
Remove duplicate values
Once you are satisfied with the spelling of values within your data set, the next step is to ensure that no duplicated information has been included by mistake.
It’s really easy to copy and paste data multiple times into our data set. It’s even easier to make a mistake when combining information and data from a variety of different sources into one giant data set.
There are two main methods for checking for duplicated data.
The first option is to go to the Data tab in the ribbon, then to the Data Tools grouping.
Then select Remove Duplicates.
You will then be asked in which of the available columns you want to check for duplicates. Unless you know you have a specific problem with one field of data, it’s normally advisable to leave all fields ticked so the deduplication tool can check the entire dataset.
Be aware that once you click "OK", the Remove Duplicates feature will automatically delete the duplicate rows and display a message stating how many duplicate items were found and removed.
If no rows contain exactly the same data across all of the selected columns, a message stating that no duplicate values were found will be displayed.
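If you ever script this step instead of using the ribbon, the same exact-row deduplication can be sketched in plain Python (the sample rows here are invented for illustration):

```python
# Remove rows that are exact duplicates across every column,
# keeping the first occurrence - mirroring Excel's Remove Duplicates.
rows = [
    ("PO 123", "Nuts", 7),
    ("PO 124", "Bolts", 5),
    ("PO 123", "Nuts", 7),   # exact duplicate of the first row
]

# dict.fromkeys preserves insertion order, so the first copy survives
deduped = list(dict.fromkeys(rows))

print(f"{len(rows) - len(deduped)} duplicate values found and removed")
```

As in Excel, a row is only removed when every column matches exactly; rows that differ in any one field are kept.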
Another method of finding duplicate values is to use Conditional Formatting from the Styles area in the Home tab.
To use this feature, select a single column from your dataset, then go to Conditional Formatting, select Highlight Cell Rules, then Duplicate Values.
The difference with this approach is that it will not delete duplicate values. Instead, it will highlight items that are duplicate values.
Conditional formatting is useful for highlighting typing or posting errors that need to be corrected, as opposed to removed from the data, because you get to review each item and have a chance to correct it. Once the error is corrected, the item will no longer be highlighted as a duplicate.
Find and replace
After any obvious spelling mistakes and duplicated data items have been removed or corrected, you then need to begin to look for less noticeable errors and inconsistencies.
For example, the spell check will not highlight the inconsistent use of properly spelt words. Your data may contain “sell/sold/sales” for example, all of which may be correct, but you may want to use a consistent description in the data set to provide a more consistent analysis down the line.
A really useful way of finding inconsistencies is to apply Filters to the entire data set. The filter option is found in the “Editing” section in the “Home” tab.
Once a filter is used for the entire data set, you can click on the down arrow for each data field to review the entries that are used within that field and you can quickly see if there’s anything that looks out of the norm.
In the example below, you can see that some purchase orders have been entered using the initials “PO” followed by the number, others have used the full spelling of “Purchase Order”, so there are inconsistencies in that field.
You can also very easily see if some records have been left blank. If every record should have a Purchase Order number assigned to it, you will now know there’s a problem in this column.
There are two ways of finding and replacing incorrect values, using shortcut keys or from the ribbon menu options.
Firstly, copy the value you want to replace using Control C, then use Control F to open the Find and Replace window and Control V to paste it into the "Find what" box.
Go to the “Replace” tab and enter the correct value you would rather use.
Selecting "Replace" will correct instances of the incorrect spelling one at a time. Using "Replace All" will correct all of them in one go, and a window will open to tell you how many replacements have been made.
Find and Replace is an ideal feature when you want to correct values in your data set to ensure consistency.
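For readers who prefer to script this kind of standardisation, a minimal Python equivalent of "Replace All" on one column might look like this (the purchase-order values are invented for illustration):

```python
# Standardise inconsistent wording in one field - the scripted
# equivalent of Find and Replace with "Replace All".
entries = ["PO 123", "Purchase Order 124", "PO 125", "Purchase Order 126"]

standardised = [e.replace("Purchase Order", "PO") for e in entries]

# Count how many entries actually changed, like Excel's summary message
replacements = sum(1 for old, new in zip(entries, standardised) if old != new)
print(f"{replacements} replacements made")
```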
From the Ribbon Menu
If you don’t like using the shortcut keys, you can also access the same feature from the Editing section in the Home tab in the ribbon. The Find and Select menu opens up the same Find and Replace window accessed with the shortcut keys.
Find and replace blank cells
One of the most important things to remember when working with data sets is that Excel does not like blank values. Blank values in rows, or blank column headings, tend to distort the analysis of your data set when creating pivot tables, VLOOKUPs, etc.
Blank cells can be found using the Filter option as shown in an earlier section.
Alternatively, you can select the column you are focusing on and use the shortcut keys Control G to open the Go To window.
Click the Special button and, from the Go To Special window, select Blanks. The blank cells will then be highlighted in the data set. Use Enter to move between the highlighted blank cells and add the missing data as required.
The same menu can be found on the ribbon. After highlighting the column of data you want to reference, from the Home tab, select “Find and Select” from the Editing grouping and then the “Go To Special” option to find the same feature window as above.
Always remember that when Excel is not giving you the answer you expect, it's most likely due to the data. Make sure all of the rows as well as column headings are populated, and ensure there are no empty rows or columns in between data, because as soon as Excel finds an empty column it will treat the information as two separate datasets.
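If you happen to be checking a column outside Excel, the same "find the blanks" idea is easy to sketch in Python (sample values invented):

```python
# Flag blank entries in a column so they can be filled in -
# the scripted equivalent of Go To Special > Blanks.
purchase_orders = ["PO 123", "", "PO 125", None, "PO 126"]

# Treat None, empty strings and whitespace-only strings as blank
blank_rows = [i for i, value in enumerate(purchase_orders)
              if value is None or str(value).strip() == ""]

print(f"Blank cells at positions: {blank_rows}")
```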
Check if a cell or part of a cell matches specific text
When data originating from different sources needs to be combined into one data set for further analysis, you may encounter difficulties in getting the data into a standard or consistent format.
As an example, if you wanted to create a bank reconciliation at the end of the month to analyse spend by expense category, you would first need to access the data from all of the relevant bank accounts. This may mean obtaining data from different banks, different account types (current account/saving account/credit card etc.) and combining that data into one consolidated list before being able to add in expense categories that can be summarised to show where the money is being spent.
Often downloads from different account types from the same bank are presented with different formats, and that’s before you begin to look at the different styles of data presented by different banks. You may find debits and credits in separate columns, debits being provided as either a positive or negative number, transaction dates shown in different date formats (DD/MM/YY or DD MMM YYYY, etc.)
So, especially when working with something like bank statements, you may need to complete a lot of rework before you arrive at a data set that can actually be used for analysis.
Normally, transaction downloads from the bank are provided as a CSV file, which means you will need to separate the different elements of the information into columns. To do this, use the Text to Columns feature.
Text to Columns
Splitting data text into columns can be found in the Data tab, in the Data Tools grouping.
Remember to insert additional blank columns into your spreadsheet before you start to ensure that you don’t overwrite existing data with new data that you split up.
Initially, you will need to highlight the range of data you want to split into new columns.
The first step in splitting data into multiple columns is to work out how your data should be divided. Options are available to use a specific character (such as a comma or space for example) or a specific fixed-width (so the break will be placed after a certain number of characters, regardless of what they are.)
A preview of the selected data will be displayed in the wizard window so you can check that the divide is being placed in the correct spot and that your data will look as you expect. You may need to use the scroll bars to look at a large enough sample of data to ensure that you will achieve the correct result.
In this example, the data is separated by a space, so the data will be presented with the word “Nuts” in one column and “7mm” in the second column.
When you are happy with the new data format, click next and finish. The data will be transferred into the new columns in the format you specified.
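The delimited split that Text to Columns performs has a direct scripted equivalent; here is a small Python sketch using the same space-separated example (the item names are invented):

```python
# Split a single text field into two columns on a space delimiter,
# mirroring Text to Columns with the "Delimited" option chosen.
raw = ["Nuts 7mm", "Bolts 5mm", "Washers 12mm"]

# Split on the first space only; values like "Hex Nuts 7mm"
# would need a different rule, just as they would in Excel
split_rows = [value.split(" ", 1) for value in raw]

for item, size in split_rows:
    print(item, size)
```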
You may find you need to do something similar with different data fields in each of your individual data sets before you can combine them into one large data set for further analysis.
Changing the case of text
If you are required to analyse data for people who are attentive to detail, any inconsistencies in format could mean that the recipient focuses on the lack of consistency in your data rather than on the actual data or trends you are revealing.
Using consistent capitalisation in the text may not seem very important; however, it can provide significant distractions and cause key data results to be overlooked or, at worst, your final report not being trusted.
So if data sets are received using a mixture of upper and lower case text, or include gaps in different places, for example, it’s worthwhile getting to a point where a consistent data format can be used. Taking care of the data that you begin with will ensure a much more professional output.
Using the Upper function
If you want to change the data to have capital letters, you can use the Upper function.
Enter =upper(the cell reference containing the text you want to convert)
Then the text will appear in capital letters.
If you enter the Upper formula at the top of a column of data that is formatted as a table, Excel will automatically apply the same formula to all of the data in the column below.
Using the Lower function
Changing the text to lower case can be achieved by using the =lower(cell ref) formula
Using the Proper function
The Proper formula capitalises the first letter of each word and converts the rest to lowercase, which is useful for names and titles.
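Python's string methods line up closely with these three Excel formulas, which can be handy if you ever clean text outside the spreadsheet (the name is invented):

```python
# upper() ~ Excel UPPER, lower() ~ LOWER, title() ~ PROPER
name = "dInAh MCintosh"

print(name.upper())   # DINAH MCINTOSH
print(name.lower())   # dinah mcintosh
print(name.title())   # Dinah Mcintosh (title() lowercases the rest of each word)
```

Note that, like Excel's Proper, `title()` lowercases everything after the first letter of each word, so names such as "McIntosh" lose their internal capital.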
A good tip when adding a manipulated column into a data set is to highlight the column in a different colour, so it stands out from the rest, and it becomes obvious which columns are the ones that contain the data you want to use going forward.
After you have completed all the data manipulations needed to get your data into the format you want, copy the data into the original data set and paste it as a value. Pasting the reworked data as a value means it can no longer be affected by the formula you have been using, and the value becomes fixed. You can then remove the highlighted column, as it is no longer needed.
Removing unwanted spaces and non-printing characters
One of the sneaky little quirks of Excel is that when we import data from an external data source, we sometimes unintentionally bring through non-printing characters, or spaces where we don't want them.
When reviewing a massive dataset, it can be really difficult to see extra spaces, trailing zeroes, etc. in the text. A clever way to find additional characters or spaces is to insert a pivot table to summarise the data. Differences in the text are generally much easier to spot in a summarised format.
The Pivot table option can be found in the Insert tab, from the Tables grouping.
You need to highlight your data set, choose a location for your pivot table and click “OK.”
If you select each field one by one then the pivot table will summarise all of the existing entries in that field so you can see the different values that exist.
Alternatively, drag your data fields one by one into the "Rows" area and add a count of the same field to the "Values" area, as shown. The count also helps ensure that you manage to find and correct all items.
A pivot table can be used to identify any unusual or inconsistent entries that do not follow the same format as the majority of the data set that you have perhaps been unable to spot with a quick visual check of your data.
You can then go back and use the Find and Replace options we looked at earlier to correct the entries and refresh the Pivot table until your data is completely consistent.
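The "count per distinct value" view that the pivot table gives you is essentially a frequency table, which can also be sketched in Python with the standard library (sample entries invented, including one with a trailing space and one with different casing):

```python
from collections import Counter

# Summarise one field with a count per distinct value - the scripted
# equivalent of a pivot table used to spot stray variants.
entries = ["Nuts 7mm", "Nuts 7mm ", "nuts 7mm", "Bolts 5mm"]

counts = Counter(entries)
for value, count in counts.most_common():
    # repr() makes trailing spaces visible in the output
    print(repr(value), count)
```

Near-duplicates that differ only by a space or by case show up as separate rows with small counts, exactly the pattern you would look for in the pivot table.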
Using the Trim function
The Trim function removes additional spaces before, in between and after text in a data set. This is really useful if you end up with unwanted spaces in your data.
Enter =trim(the cell you want to trim)
So if you have more than a single space in between words, the Trim function will remove the unnecessary spaces. As before, Excel will copy the formula to the bottom of your data. Create a new pivot table on this trimmed data column to ensure you are happy with the new format then copy the new fields and paste over the original ones as a value to ensure you continue to use the corrected data going forward.
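Excel's TRIM behaviour, which strips leading and trailing spaces and collapses repeated internal spaces to a single one, can be reproduced in Python like this:

```python
# Collapse leading, trailing and repeated internal spaces,
# mirroring the behaviour of Excel's TRIM function.
def trim(text: str) -> str:
    # split() discards all whitespace runs; join puts back single spaces
    return " ".join(text.split())

print(trim("  Nuts   7mm "))   # "Nuts 7mm"
```

Note that Python's built-in `str.strip()` only removes spaces at the ends, so the `split`/`join` combination is needed to match TRIM's handling of internal spaces.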
It is especially important to trim data if you plan to use VLOOKUP functions at a later stage, because a VLOOKUP, like a pivot table, will treat values with stray spaces as different entries and may produce incorrect results.
Merging data into one value
Earlier we considered bringing data into Excel from external data sources and how to separate values into different columns. At some point in your data cleanse process, however, and specifically when trying to get two different sets of data into matching formats so you can join them for analysis, you may find that you actually need to do the opposite and combine data into one field.
The Concatenate formula helps you combine different elements into a single entry. You can combine as many elements as you need to achieve the single text string; each element needs to be separated by a comma in the formula.
If you need your combined data string to include spaces or special characters, you need to enclose them in double quotes (" "). You can use direct cell references to identify the data you need to include, as shown below.
Tip: Whenever you use a formula within Excel always refer to the helpful tips that are provided when you create the formula as in some instances Regional settings may mean that some keystrokes may be different for your environment. For example, I need to use a comma in between the different elements of the concatenate formula, but that may not be correct for the specific regional setting you may use.
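The same combination can be sketched in Python, where string concatenation plays the role of the Concatenate formula (the names are invented):

```python
# Combine separate fields into one text string with a space between -
# the scripted equivalent of =concatenate(A2, " ", B2).
first_name = "Dinah"
surname = "McIntosh"

full_name = first_name + " " + surname
print(full_name)   # Dinah McIntosh
```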
Lookup formulas
Lookup formulas are really useful if the key data you need to analyse is spread across different data sets that are in completely different formats and are impractical to combine into one complete set of data.
The lookup formula can help you add specific additional items of data from one set into another set, without bringing across the entire data set. They help you “look for” certain values that are related to an existing value that you already have.
For example, in the data sample shaded grey we are missing a surname, which is included in the data set below, while leave days are only included in the top data set.
For a lookup formula to work, you need an identical field in both data sets. In this example, it is the first name, Dinah. The formula looks for the value Dinah in the first column of the lookup range and then returns the related value it finds in the second column of the same row (the surname). The final element of the formula specifies whether an exact match is required; in this example, the value zero (FALSE) forces an exact match.
Enter =vlookup(known value, data range to look in, column number to return, 0 for an exact match)
Lookup formulas like this can be used across different worksheets or even different Excel files.
A Vlookup formula will look vertically down the first column of the data range you specify to find the common field, before counting horizontally across the row to return the value you request.
A Hlookup formula will look horizontally across the first row of the data range, before counting vertically down the specified number of rows to find the data you need.
A key point to note with the Lookup formula is that the initial common field must match exactly. If a space is present either before or after the name, then the lookup logic won’t find an exact match and the formula won’t work. It’s really important to complete all of your data cleansing techniques before you try to use the lookup formula to find additional data to include in your data set.
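An exact-match lookup behaves much like a dictionary keyed on the common field, and a small Python sketch also demonstrates why stray spaces break it (the names are invented):

```python
# A dict keyed on the common field behaves like an exact-match VLOOKUP:
# look up a known value, return the related value from the other data
# set, and fall back to a default when no exact match exists.
surnames = {
    "Dinah": "McIntosh",
    "Alex": "Smith",
}

print(surnames.get("Dinah", 0))    # McIntosh
print(surnames.get("Dinah ", 0))   # 0 - the trailing space means no exact match
```

This is exactly the failure mode described above: "Dinah " with a trailing space is a different key, so the lookup falls back to the default instead of finding the surname.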
Reconciling table data by watching key totals
Before beginning the analysis of your completed data sets, it is always a good idea to create check totals, such as sums of key values or counts of the data, to ensure you don't lose any data during your analysis.
You can monitor your check totals and counts using the Watch Window. This option can be found in the Formulas tab, and you can create as many watches as you need to keep an eye on these key pieces of information and ensure that your data set remains complete.
Select the cells you have entered your check totals and sums into and create a watch.
This creates an additional window that will appear in front of your existing worksheet that will constantly display the value you want to watch.
This means that no matter what you do to your data, whether cleansing, summarising or analysing it, you can very easily keep an eye on the overall totals and quickly see if something you do affects the balances within your data. Use Control Z to undo your last few changes until the watch total returns to the value you want to maintain.
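The same before-and-after reconciliation is easy to express in a script, with the amounts invented for illustration:

```python
# Record check totals before cleaning, then confirm they still agree
# afterwards - the scripted version of keeping an eye on a watch total.
amounts = [120.50, 75.00, 33.25, 120.50]

total_before = sum(amounts)
count_before = len(amounts)

# ... data cleansing would happen here (this sketch leaves the data as-is) ...

total_after = sum(amounts)
count_after = len(amounts)

assert total_before == total_after, "check total changed - investigate!"
assert count_before == count_after, "row count changed - investigate!"
print("check totals agree")
```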
These are some of the key features that I've used over the years to cleanse data in readiness for further analysis. I find that bank statements are generally the most difficult to combine, because there's just no consistency between account types or banks. But by using the features we've considered in this eBook, it's quite easy to get to a single data set and then actually start analysing the information.
It’s so important to spend the time upfront, ensuring that your data is in the correct, consistent format before beginning to analyse the data and adding graphs, tables, summaries etc. The saying “rubbish in; rubbish out” really does make sense with data. There is nothing worse than spending several hours creating lovely reports, only to find that your initial data has been corrupted or only a portion of it has been included because of some odd data format differences that you didn’t see in the first instance.
One of the other neat tricks that I've learned the hard way is to add totals or field counts to the data set. As long as those totals or counts remain the same all the way through your data analysis, you can be sure that you continue to include all of the original data in your summaries or reports.
I hope this gives you some great tips which will help you rework (not manipulate) your data so that you can move on to produce some great analysis.