In today’s blog, we are going to learn about data analysis and the various processes involved in it, using examples taken from my GitHub repository on Exploratory Data Analysis of the Titanic data set, done in a Jupyter Notebook, which can be viewed using this link.

According to Wikipedia, data analysis is “a process of inspecting, cleansing, transforming and modeling data to discover useful information, informing conclusions and supporting decision-making.”

In simplified terms: data analysis is the process of looking into the historical data of an organization and analyzing it with a particular aim in mind, that is, to draw out potential facts and information that support the decision-making process. Whatever decisions we take in our lives, we take by remembering what happened last time; thus, data analysis greatly influences decision-making.

There are mainly five steps involved in the process of data analysis:

1. Asking the right question(s)
2. Data wrangling
3. Data analysis
4. Drawing conclusions
5. Communicating results

STEP 1: ASKING THE RIGHT QUESTION(S)

The first step towards any sort of data analysis is to ask the right question(s) of the given data. Once the objective of the analysis is identified, it becomes easier to decide on the type(s) of data we will need to draw conclusions. The objective behind analyzing the Titanic data set is to find out the factors that contributed to a person’s chance of survival on board the Titanic.

STEP 2: DATA WRANGLING

“Data wrangling, sometimes referred to as data munging or data pre-processing, is the process of gathering, assessing, and cleaning ‘raw’ data into a form suitable for analysis.”

Gathering Data

After identifying the objective behind our analysis, the next step is to collect the data required to draw appropriate conclusions. There are various methods by which we can collect data; for this blog post, we will use the Titanic data set uploaded on kaggle.com. First, let’s import all the libraries and the ‘train.csv’ data set we will be needing throughout our analysis.

Assessing Data

After the data has been gathered, stored in a supported format, and assigned to a variable in Python, it’s time to gain a high-level overview of the type of data we are dealing with. This includes information such as the column names, the number of non-null values, and the data type of each column:
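The notebook’s code cells are not reproduced in this post, so here is a minimal sketch of the gathering and assessing steps, assuming the standard pandas/NumPy/seaborn stack, a DataFrame named df, and Kaggle’s train.csv sitting next to the notebook (the exact code in the original notebook may differ):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Kaggle Titanic training set into a pandas DataFrame
df = pd.read_csv("train.csv")

# High-level overview: column names, non-null counts, and dtypes
df.info()
```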
The above output shows the name of each column, along with the number of non-null values and the data type of each column. From it, it is clear that the Age, Cabin, and Embarked columns have missing values, which we need to deal with in the data cleaning stage.

Let’s look at what information these features (columns) represent:

- Survived: whether the passenger survived (0 = no, 1 = yes)
- Pclass: ticket class (1 = upper, 2 = middle, 3 = lower)
- Name, Sex, Age: the passenger’s name, sex, and age in years
- SibSp: number of siblings/spouses aboard
- Parch: number of parents/children aboard
- Ticket: ticket number
- Fare: passenger fare
- Cabin: cabin number
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Cleaning Data

Data cleaning is the process of detecting and correcting missing or inaccurate records in a data set. In this process, data present in “raw” form (having missing or inaccurate values) is cleaned appropriately so that the output data is free of missing and inaccurate values. Since no two data sets are the same, the method of tackling missing and inaccurate values varies greatly between data sets, but most of the time we either fill in the missing values or remove the feature that cannot be worked upon.

Fun fact: data analysts usually spend about 70% of their time cleaning data.

In the Titanic data set, as noticed before, the Age column has some missing values, which we will now deal with.

Age

The Age column has a mean of 29.69 and a standard deviation of 14.52. Because the standard deviation is so high, simply filling every missing value with the mean would distort the distribution, so we need a workaround: we will generate a list of random numbers, equal in length to the number of missing values, between (mean - standard deviation) and (mean + standard deviation), and then fill the missing values in the DataFrame with them, as sketched below.
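A minimal sketch of this imputation, assuming the DataFrame is named df as above (the original notebook’s exact code may differ):

```python
# Summary statistics of the known ages (roughly 29.69 and 14.52)
mean_age = df["Age"].mean()
std_age = df["Age"].std()
n_missing = df["Age"].isnull().sum()

# Draw one random integer age per missing entry, uniformly from
# (mean - std, mean + std)
random_ages = np.random.randint(int(mean_age - std_age),
                                int(mean_age + std_age),
                                size=n_missing)

# Fill only the missing slots, leaving the known ages untouched
df.loc[df["Age"].isnull(), "Age"] = random_ages
```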
With that, the Age column has been dealt with: all missing values have been replaced by random ages between (mean - standard deviation) and (mean + standard deviation).

STEP 3: DATA ANALYSIS

Once the data is collected, cleaned, and processed, it is ready for analysis. As you manipulate the data, you may find that you have exactly the information you need, or that you need to collect more data. During this phase, you can use data analysis tools and software that help you understand, interpret, and derive conclusions based on the requirements. As the Titanic data set is now cleaned, let’s do some example exploratory data analysis (EDA) on it.

1. Find out whether the sex of a person contributed to their likelihood of survival.
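The original chart is not reproduced here; a countplot along these lines is a plausible sketch of it, assuming the same DataFrame df:

```python
# Survival counts (0 = did not survive, 1 = survived), split by sex
sns.countplot(x="Survived", hue="Sex", data=df)
plt.title("Survival count by sex")
plt.show()
```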
From the above visualization, we can infer that females were given priority during the rescue operation, since their mortality count is much lower than that of males.

2. Find out whether the class of a person contributed to their likelihood of survival.
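Again, a hedged sketch of the kind of plot used, not necessarily the notebook’s exact chart:

```python
# Survival counts split by ticket class (1 = upper, 2 = middle, 3 = lower)
sns.countplot(x="Pclass", hue="Survived", data=df)
plt.title("Survival count by passenger class")
plt.show()
```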
From the above visualization, we can infer that people belonging to the upper class were given the highest priority during the rescue operation, followed by the middle and lower classes; the lower class also had the highest mortality count.

Exploring the data further, the SibSp and Parch columns show the number of relatives a passenger had on board, so combining SibSp and Parch into a single “relatives” feature makes more sense. This step is known as feature engineering, where we modify existing features or construct new ones out of them to better explain our analysis; a sketch follows below.
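A minimal sketch of that feature-engineering step; the column name relatives is the one implied above, and the follow-up plot is an assumption about how the new feature would be explored:

```python
# Combine siblings/spouses (SibSp) and parents/children (Parch)
# into a single count of relatives on board
df["relatives"] = df["SibSp"] + df["Parch"]

# Check how survival varies with the number of relatives
sns.countplot(x="relatives", hue="Survived", data=df)
plt.show()
```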
NOTE: For the full data analysis, please view the Jupyter Notebook file from this link.

STEP 4: DRAWING CONCLUSIONS

After the analysis phase is completed, the next step is to interpret our analysis and draw conclusions from it. As we interpret the data, there are three key questions we should ask:

1. Does the data answer our original question, and how?
2. Does the data help us defend against any objections, and how?
3. Are there any limitations to our conclusions, or angles we have not considered?

From the analysis of the Titanic data set (link), we were able to find out the major factors that contributed to a person’s chance of survival, such as their sex and passenger class.

STEP 5: COMMUNICATING RESULTS

Now that the data has been explored and conclusions have been drawn, it’s time to communicate your findings to the people concerned, or to a wider audience, by employing data storytelling, writing blogs, making presentations, or filing reports. Great communication skills are a plus at this stage, since your findings need to be conveyed properly to other people.

A Fun Fact

The five steps of data analysis are not followed linearly; the process is non-linear in nature. To explain this, let’s consider an example: suppose you have done your analysis and drawn conclusions, and then you suddenly find the possibility of representing a feature in a better way, or of constructing a new feature out of other features present in the data set. You would go back to step 3, perform feature engineering, and run the EDA again with the new features added. Thus, it is not always possible to follow these steps linearly.

CONCLUSION

The amount of data generated by organizations around the world per day is in the range of zettabytes, and much of it remains underutilized. Data analysis can help organizations gain useful insights from their data and influence a better decision-making process.