Pandas.Dataframe.groupby( by=None , axis=0 , level=None , as_index=True , sort=True , group_keys=True , squeeze=False)
The groupby function is used to group data by entries in a particular column. This function is mostly used on a categorical column to group the dataframe by the categories in that column.
The groupby function does the same work as Boolean indexing for example.
This is the Titanic dataset which contains the details of the passengers along with a survived column which tells whether the passenger survived or not.
If we apply Boolean indexing on this titanic dataset to get the total number of passengers of pclass 1.
The problem with Boolean indexing is, we need to know that 1 is an entry in the pclasss column. But if we do not know that 1 is in the column or not we could use the groupby function which will return the dataset grouped by the entries in that column. For ex.
As we can see groupby function returned the total number of passengers from every class.
Groupby function involves 3 steps.
In splitting, we split the data set into a group of rows containing values from a particular column.
df.groupby(‘pclass’), this will help to create a groupby object that is, this will give us a split dataset according to all the types of entries in pclass.
As we can see here we didn’t get the dataset but a groupby object.
Out of these, the split is the most straightforward. In fact, in many situations we may wish to split the data set into groups and do something with those groups. In the apply step, we might wish to one of the following.
After obtaining the groupby object we apply some aggregation or transformation or filtration. To tell python what to do with all the groups obtained.
In aggregation, we compute summary statistics for each group. some examples:
- Compute group sums or means
- Compute group sizes/counts
Various aggregation function that we can use are:
- mean(): gives the average of each group.
- std(): gives the standard deviation each of the group.
- sum(): gives the sum of all entries of each group.
- first(): gives the first element of each group.
- last(): gives the last element of each group.
- min(): gives the min of each group.
- max(): gives the max of each group.
Instead of just using one aggregation method we can use several aggregation methods simultaneously by using agg() method which takes a list of aggregation functions as a parameter.
When we are using max the values returned for string variables are the ones which appear for the maximum times. While, when using mean or std functions there is no value return for string variables.
As we can see it returns a multilevel column that is, for each column, there are two sub-columns sums and max.
Instead of aggregation after grouping, we can apply the transform method instead, this changes the specified entry according to the function applied without changing the index.
In transformation, we perform some group-specific computations and return a like-indexed object.
We use the transform method after the grouping to get the transformed result Some examples:
- Standardize data (zscore) within a group.
- Filling NAs within groups with a value derived from each group.
As we can see it returns the zscore (the distance of an individual entry from the mean of its group in terms of standard deviation, it is mainly used to convert the values in the same range or sometimes to find outliers.) of all the numerical columns.
This zscore function will accept a series and returns the zscore value of that series.
Note: – In some cases split by, combine operations do not neatly fold into aggregations or transformation. For those cases, we use apply().
This new transformation is too complicated for transform(), So we use apply instead.
In filtration, we discard some groups, according to a group-wise computation that evaluates True or False. Some examples:
- Discard data that belong to groups with only a few members
- Filter out data based on the group sum or mean.
Groupby function returns a groupby object which has an attribute groups that is a dictionary. Its keys are the elements of the column on which groupby function is applied that is all the groups and values are the corresponding rows from the original dataframe.
So we can iterate over that dictionary and carry out the computation. In the example above we have calculated the average fare of men who traveled in a particular class. So we have filtered our data by applying a condition that we want to calculate the average fare prices of men only.
Multi index with Groupby
We can create a multi index dataframe with the help of groupby, by passing a list of columns as a parameter to the groupby function.
Here we are grouping on the dataset with two columns.
Multi-index dataframe means that now we have 2 levels of the index. This dataframe contains pclass as level 1 index and survived as level 2 index.
Selecting with groupby
We can select a column of our choice in the resultant dataset by passing a list of columns after the groupby function.
Now we can see that only two columns are present in the resultant dataset.
Groupby object attributes
The groups attribute is a dict whose keys are the computed groups and corresponding values being the axis labels belonging to each group.
Calling python len() function on this groups attribute returns the length of the group.
This shows that there are 3 groups in our resultant dataset.
Note: – We can also use get_group method to get a particular group.
We can also use a Boolean condition to group our dataset. In that case the resultant dataset will be divided in True and False as shown in the example below:
Groupby also works with some plotting methods. For example, line plot
with groupby we can easily plot the differences between the individual group.