
  

DATA 201: Thinking with Data | Data Wrangling
Assignment 3: Data Wrangling 

Screenshot of the raw data (with the empty spaces highlighted)


Omitting


Omitting Irrelevant Attributes

The first thing I did was omit any attributes that were either empty or held no useful information. This included the attributes “name”, “keywords”, “language”, and “currency symbol”. I decided to keep the attributes “occasion” and “currency” because, although they contain very little data, I thought they could still be meaningful to the research being done. This step addresses the data quality issue of missing data. In Excel, I selected the columns I wanted to omit and pressed delete. The same thing could have been done in OpenRefine under “Edit column” by clicking “Remove this column”. Removing unnecessary data and attributes is important because it increases the quality of the dataset, especially since those attributes would have no use in the analysis process.
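
If I were scripting this step instead of clicking through Excel, a rough pandas sketch would look like the following (the file name and the underscore-style column labels are my assumptions, not the exact labels in the raw file):

```python
import pandas as pd

# Load the raw dataset (the file name is an assumption)
df = pd.read_csv("menu_raw.csv")

# Drop the attributes that were empty or had no useful information
df = df.drop(columns=["name", "keywords", "language", "currency_symbol"])
```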



Omitting Useless Entities

After omitting the attributes, I went ahead and omitted any entities that did not provide enough useful information. I kept entities that had values for at least half of their attributes, as I think they could still be useful for analysis. Entities that have no information, or are missing the majority of their attributes, are useless in any form of analysis. The missing information is most likely due to Data Entry Errors and Distillation Errors, and I addressed these errors by removing the unwanted rows/entities.
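
Continuing the same pandas sketch (same df as above), the “at least half of the attributes” rule could be expressed with a non-missing threshold; the exact cut-off is my judgement call:

```python
# Keep only rows with values in at least half of the remaining columns;
# entities missing most of their attributes get dropped
min_filled = len(df.columns) // 2
df = df.dropna(thresh=min_filled)
```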


Formatting


Many of the attributes have formatting issues. A general example, shown above, is the attribute “id”. I’m not quite sure whether it was caused by Excel’s number formatting, but the leading zeros under “id” were missing, so I went through and corrected the issue. Observation alone tells me that every id should be five characters long, so I made that fix. Intuition and observation play an important role in data wrangling.
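
In the same pandas sketch, restoring the dropped leading zeros could look like this (the five-character width comes from my own observation of the raw data):

```python
# Pad every id back out to five characters with leading zeros
df["id"] = df["id"].astype(str).str.zfill(5)
```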


Essentially, formatting issues are factors that lower the quality of data and can make it unusable for analysis. It’s important to remediate these issues so that the research being done is accurate and the information you put forward is accurate.


Side Note: Doubling Data Entry Errors

I went over the id for every entity, and each one is a distinct number, which led me to believe that there were no Doubling Data Entry Errors (though I realized I could still be wrong). So I checked with OpenRefine, using facets/filters to look for duplicate data, and did not find any issues. Duplicate data affects the quality of a dataset because it can skew the analysis process and lead to biased, inaccurate conclusions. Moreover, I don’t believe there’s an easy way of finding Doubling Data Errors with Excel.
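
The same duplicate check could be scripted against the id column in the pandas sketch:

```python
# List any ids that appear more than once; an empty result means no duplicates
duplicates = df[df["id"].duplicated(keep=False)]
print(duplicates)
```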



Proper Casing

As I went over the dataset, I made more formatting corrections for every attribute. The formatting issues stemmed from Data Entry Errors as well as Distillation Errors. For casing, I utilized both “Proper Case” in Excel and the “To titlecase” transform in OpenRefine to get a uniform format for each entity’s inputs. I made this decision because a consistently presented dataset is a higher quality dataset. Analyzing a dataset that’s written in all capitals will tire out and confuse researchers, especially when the dataset is much larger.
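
In the pandas sketch, the same normalisation is a one-liner (using “event” here as an example column; which columns to recase is a judgement call):

```python
# Normalise the casing of a text column to Title Case, similar to Excel's
# PROPER() or OpenRefine's "To titlecase" transform
df["event"] = df["event"].str.title()
```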





Proper Spacing

The spacing issues I had to fix were consecutive whitespace (double spacing) and leading and trailing spaces, which are most likely due to Data Entry Errors (participant errors). What I found really useful here was Excel’s TRIM function, along with Ctrl + H (Find and Replace) to search for double spaces and replace them with a single space. I eventually found it more convenient to use OpenRefine’s common transforms “Trim leading and trailing whitespace” and “Collapse consecutive whitespace”, so I made the switch. This type of error is hard to remediate with the human eye alone, as we struggle to spot spacing issues. Improper spacing affects whether software will regard two inputs as distinct: treating two values as different even though they are not will make the research and analysis inaccurate.
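
Scripted against the same df, both transforms could be applied to every text column at once:

```python
# Trim leading/trailing spaces and collapse runs of whitespace into one space
# across every text column (the pandas counterpart of OpenRefine's transforms)
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.strip().str.replace(r"\s+", " ", regex=True)
```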


Clustering

While I was fixing formatting, my biggest problem with Excel was that it was really hard to make sure categorical data like names were exactly the same, or in a uniform form (free from misspellings, errors, and punctuation differences). With Excel, you rely too much on observation to decide whether inputs are the same, and the human eye can miss so much. I eventually got frustrated and switched over to OpenRefine to tackle this issue. Clustering in OpenRefine can identify whether inputs are saying the same thing using several methods, among them fingerprinting and Metaphone. Fingerprinting strips away unimportant details to identify similarities, and Metaphone identifies similarities by how closely words sound. I used clustering to create a unified name for each category of input. By clustering, I was able to rectify several data quality issues, the biggest being Erroneous Values and Entity Resolution. Having uniform inputs in the dataset makes analysis, aggregation, and gathering statistics much more efficient and reliable.
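
To show the idea behind fingerprint clustering, here is a rough sketch in the spirit of OpenRefine’s key-collision method, applied to the “event” column of the same df (the key function is my simplified approximation, not OpenRefine’s exact implementation):

```python
import re
import unicodedata

def fingerprint(value: str) -> str:
    """Rough sketch of a fingerprint key: normalise, strip punctuation,
    then sort the unique lowercase tokens."""
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()
    value = re.sub(r"[^\w\s]", "", value.lower()).strip()  # drop punctuation
    tokens = sorted(set(value.split()))                     # unique tokens, sorted
    return " ".join(tokens)

# Values sharing a fingerprint are candidates to merge into a single cluster
df["event_key"] = df["event"].fillna("").astype(str).map(fingerprint)
print(df.groupby("event_key")["event"].unique())
```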



Correcting any spelling errors

Spelling errors are human errors (Data Entry Errors). It’s a researcher’s job to identify these errors and correct them to what the participants actually meant; failing to do so can jeopardize the research and analysis, because two categories spelt differently (even though they mean the same thing) are treated as two distinct things, and any conclusion drawn will be inaccurate and unreliable. Checking for errors can be done in Excel through observation and Find and Replace; however, the easiest route is most likely the Metaphone and fingerprinting clustering methods in OpenRefine.



Broadening Categories

Depending on how detailed your research is, you would probably want to skip this process; however, in my judgement, a lot of the inputs under the attribute “event” could be categorized as one, such as the inputs “Cafe Lunch” and “Lunch”. I put these two inputs into the single broad category “Lunch”, since they essentially mean the same thing. Much like clustering, when you are wrangling data you want to group inputs that mean the same thing into one category. This helps with analysis in that quantifying the data becomes easier. Moreover, the statistics are simplified: instead of having both categories “Cafe Lunch” and “Lunch”, you only have to deal with the one variable “Lunch”.
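
In the pandas sketch, this broadening is just a lookup table of replacements (the mapping itself is my judgement call, and more entries can be added as they are spotted):

```python
# Collapse near-duplicate labels into one broader category
event_map = {"Cafe Lunch": "Lunch"}
df["event"] = df["event"].replace(event_map)
```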




Narrowing Data Inputs and Expanding to Other Columns

If the data inputs are too tightly packed into their cells, it may be hard to analyze them. An example is the attribute “physical description”; looking at the inputs within this column, the information is both overwhelming and confusing, and the dataset looked like it needed more attributes. Looking at the cells, I realized they did not all contain the same amount of information, so I used OpenRefine’s “Split into several columns” option to segregate the information. Once the information was segregated, I manually organised it into several attributes in Excel; these included “type”, “visual”, “visual_descriptor”, and “type_size”. Separating the data into its different attributes makes it easier to look at every aspect of the data. Moreover, this process helps with creating statistics and aggregating data, especially since not all the information in the raw “physical description” cells followed the same data type (for example, “CARD;4.5X7.5;”).
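
The split step of the same sketch could be done with a delimiter split in pandas; because the cells do not all carry the same pieces, I would still inspect the result before naming the new attributes:

```python
# Split the packed values (e.g. "CARD;4.5X7.5;") on the semicolon delimiter,
# the pandas counterpart of OpenRefine's "Split into several columns"
parts = df["physical_description"].str.split(";", expand=True)
print(parts.head())  # inspect the pieces before assigning them to new attributes
```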



Combining Data Attributes

I found that the attributes “currency” and “currency symbol” could be one attribute, so I used “apply to all identical cells” in OpenRefine to add the symbol beside each currency type. Although not an important detail, I wanted to stay true to the raw data provided and perhaps help with the analysis process later on, especially since the attribute “currency symbol” was part of the raw data. Moreover, I did not run into any data/attribute integration issues when doing this, since it was a pretty straightforward process.
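
A scripted version of this combination could use a small lookup table in the same pandas sketch; the table here stands in for the manual edits I made, and the “Dollars”/“$” pair is only an illustration:

```python
# Append the matching symbol to each currency value, e.g. "Dollars ($)"
symbols = {"Dollars": "$"}
df["currency"] = df["currency"].map(
    lambda c: f"{c} ({symbols[c]})" if c in symbols else c
)
```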



Getting Rid of Excess Punctuation, Symbols, and Abbreviations

By getting rid of excess punctuation, symbols, and abbreviations, I greatly increased the quality of the dataset. Entity Resolution issues are a serious problem because there’s always the possibility that two inputs state the same thing yet get categorized as different, which makes the analysis and statistics misleading and inaccurate. I tackled this issue mostly through clustering, though Excel’s Find and Replace was also very helpful. Much like the reasoning behind clustering, this process matters so that the inputs are truly uniform and state the same categorical values. If the dataset is uniform and accurate, quantifying and categorizing entities will be easier during analysis.
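
For the punctuation part, a regular expression in the same sketch would do the bulk of the work (using “occasion” as the example column; expanding abbreviations would still need a manual mapping):

```python
# Strip stray punctuation and symbols from a text column, keeping only
# letters, digits, and whitespace
df["occasion"] = df["occasion"].str.replace(r"[^\w\s]", "", regex=True)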



Identifying and Omitting Clear Outliers (inputs that did not make sense)

As I said previously, a lot of data cleaning relies on one’s judgment. It’s technology’s job to flag anomalies, but ultimately it is our decision to make the final call. Looking at the attribute “date”, we find that some of the values make no sense at all. I don’t think a date with the year “191” is viable for analysis, so I removed those entries. Normally, the computer would not catch this, since the value still follows all the parameters of a valid date. Common sense is important in data cleaning because it addresses issues such as this.
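
If I wanted to script this check in the same pandas sketch, I could parse the dates and keep only a plausible year range; the exact cut-off years are my judgement call:

```python
import pandas as pd

# Parse the dates and drop rows whose year is clearly impossible
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df = df[df["date"].dt.year.between(1840, 2025)]
```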



Ensuring the data inputs all follow the same data type

While going back and forth between OpenRefine and Excel, I came across the issue of my numerical data not following Excel’s “Number” format. The green warning arrows along “id”, “page count”, and “dish count” all indicated that the numbers were stored as text. I was able to fix the issue using the VALUE function, and although this wasn’t an issue with the raw data provided, it’s still important to make sure all your data follows the intended data type, because otherwise you can lose information when analyzing the dataset. Another way to fix this in Excel is to go to the Home tab and change the format under the Number group. (Once again, this issue relates to Data Entry Errors.) One way to reduce these kinds of errors before even receiving the raw data is to make sure participants are well informed about the information the research team requires.
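
In the pandas sketch, the type fix could be scripted for the count columns (the underscore-style column names are my assumptions):

```python
import pandas as pd

# Convert the count columns from text to numbers, the scripted equivalent of
# Excel's VALUE(); anything that will not parse becomes NaN for review.
# "id" is left as text so its leading zeros survive.
for col in ["page_count", "dish_count"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
```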

Screenshot sample of the final cleaned dataset


I learned that data wrangling is not a straightforward process that is done in a fixed sequence of steps. It requires a lot of trial and error and, most importantly, repetition. I went back and forth between OpenRefine and Excel to make sure the data was as clean as I could make it.


An extra detail I failed to mention is that I changed the format of the attribute “status” to be all uppercase, strictly for aesthetic reasons.