Many times our dataset comes in a way that does not suit for manipulation. That is why we use various reshaping methods like Pivot, melt, etc. so that our dataset can be converted into a shape that is suitable for analysis. For that, we need to set some standards to organize data values within a dataset.
Principles Of Tidy Data
- Columns represent seprate variables.
- Rows represent individual observations.
- Observational units form tables.
In this dataset, we can see that treatment A and treatment B does not represent two separate variables, they represent two values for a variable treatment. So it should be converted into a tidy table.
The problem we are trying to fix:
- Columns containing values, instead of variables.
We can use pd.melt() function to solve this problem.
Melting data is the process of converting columns of our data into rows of data. For melting dataframes, we use the melt function of pandas module.
To melt our dataset, first of all, we have to specify our dataframe that we want to melt in the frame.
Next, we specify the columns that we want do not want to melt in id_vars. The resulted data is melted dataframe that has a column treatment for treatment name and value for their respective values.
Melt has a parameter value_vars to specify which column you want to melt if we don’t specify this parameter the value_vars will take all the variables that are not specified in the id_vars parameter.
We can change the name of the melted columns by using the value_name and var_name parameters.
Pivoting is just opposite of melting. In melting we turned columns into rows but in pivoting we turn unique values into separate columns.
In Pivoting we take all the unique entries from a column and make a separate column for each unique entry.
Note: – One of the main reasons for pivoting is to turn data from analysis friendly shape to reporting friendly shape. OR
When our dataset violates tidy data principle that is, rows contain observation. OR
When multiple variables stored in the same column.
Parameters For Pivot Method
- index: The column we do not want to change in the pivot.
- columns: The columns we want to pivot into new columns.
- values: Values to fill in the column created from the pivot.
Pivoting Multiple Columns
If we do not specify the values column then the rest of the remaining columns are used as the values.
The resulted dataframe has multi-index columns with id and response specified by gender.
The Pivot method not always work. for example:
In this dataset, we have a duplicate value of ‘tmin’ for ‘2010-2-2’. So if we will use pivot on this dataset we will get ValueError saying that there are duplicate entries in the dataset and cannot be pivoted because when adding values to the ‘tmin’ column in the pivoted dataset compiler will get confused as to what value should be given to the ‘tmin’ for ‘2010-2-2’.
To solve this kind of problem we use the pivot table method instead of the pivot.
It has a parameter which specifies how to deal with duplicate values. for instance, we can aggregate the duplicate values by taking their average.
The pivot table method is similar to the pivot.
The aggfunc parameter tells what should be happened when there are multiple values. Here we are using the mean function from the numpy library. The default value for the aggfunc parameter is mean.
conclusion: While melting takes a set of columns and turns it into a single column, Pivoting will create a new column for each unique value in a specified column.
Beyond Melt And Pivot
Melting and pivoting are the basic tools that we need to reshape our dataset. Now we’ll discuss another common problem which is when columns contain multiple bits of information.
This dataset shows the effect of Tuberculosis disease on each country’s population for a particular year.
Here sex and age group are the two variables stored in the same column that is, mo14 represents male from 0 to 14 years and m1524 represents male 15 to 24 years old.
This is a problem as age and sex are two different variables that are stored in a single variable.
So we will first melt the dataset and then we will parse the string to separate age and sex.
This gives us a new dataset with a new column as sex.
Note: – there was nothing wrong inherently wrong about original dataset but it is inconvenient for data analysis as we would not be able to predict a model where age and sex are independent.
Stacking And Unstacking
we have seen how to pivot and set multiple indexes in a dataframe for example.
The pivot method won’t work with a multilevel index dataframe directly. So for that, we use the 2nd level of the index that is gender in this case and move it to the column using unstack.
here we can see that this is a similar result to the pivot method. The main difference is that now we have hierarchical columns i.e id and response are the parent of two different trees and F, M are their child.
The opposite of unstacking is stacking. With stacking, we can move the level of hierarchical columns to the index and make a hierarchical index.
suppose we want the gender to be the outermost and treatment to be the innermost level. For that, we use swap level method.