Pivoting is just opposite of melting. In melting we turned columns into rows but in pivoting we turn unique values into separate columns.
In Pivoting we take all the unique entries from a column and make a separate column for each unique entry.
Note: – One of the main reasons for pivoting is to turn data from analysis friendly shape to reporting friendly shape. OR
When our dataset violates tidy data principle that is, rows contain observation. OR
When multiple variables stored in the same column.
Parameters For Pivot Method
- index: The column we do not want to change in the pivot.
- columns: The columns we want to pivot into new columns.
- values: Values to fill in the column created from the pivot.
Pivoting Multiple Columns
If we do not specify the values column then the rest of the remaining columns are used as the values.
The resulted dataframe has multi-index columns with id and response specified by gender.
The Pivot method not always work. for example:
In this dataset, we have a duplicate value of ‘tmin’ for ‘2010-2-2’. So if we will use pivot on this dataset we will get ValueError saying that there are duplicate entries in the dataset and cannot be pivoted because when adding values to the ‘tmin’ column in the pivoted dataset compiler will get confused as to what value should be given to the ‘tmin’ for ‘2010-2-2’.
To solve this kind of problem we use the pivot table method instead of the pivot.
It has a parameter which specifies how to deal with duplicate values. for instance, we can aggregate the duplicate values by taking their average.
The pivot table method is similar to the pivot.
The aggfunc parameter tells what should be happened when there are multiple values. Here we are using the mean function from the numpy library. The default value for the aggfunc parameter is mean.
conclusion: While melting takes a set of columns and turns it into a single column, Pivoting will create a new column for each unique value in a specified column.
Beyond Melt And Pivot
Melting and pivoting are the basic tools that we need to reshape our dataset. Now we’ll discuss another common problem which is when columns contain multiple bits of information.
This dataset shows the effect of Tuberculosis disease on each country’s population for a particular year.
Here sex and age group are the two variables stored in the same column that is, mo14 represents male from 0 to 14 years and m1524 represents male 15 to 24 years old.
This is a problem as age and sex are two different variables that are stored in a single variable.
So we will first melt the dataset and then we will parse the string to separate age and sex.
This gives us a new dataset with a new column as sex.
Note: – there was nothing wrong inherently wrong about original dataset but it is inconvenient for data analysis as we would not be able to predict a model where age and sex are independent.