__Groupby__

__Groupby__

Pandas.Dataframe.**groupby**( by=None , axis=0 , level=None , as_index=True , sort=True , group_keys=True , squeeze=False)

**Groupby** function is used to group data by entries in a particular column. This function is mostly used on a categorical column to group the dataframe by the categories in that column.

**Groupby** function does the same work as Boolean indexing for example.

This is the Titanic dataset which contains the details of the passengers along with a survived column which tells whether the passenger survived or not.

If we apply Boolean indexing on this titanic dataset to get the total number of passengers of pclass 1.

The problem with Boolean indexing is, we need to know that 1 is an entry in the pclasss column. But if we do not know that 1 is in the column or not, we could use the** groupby** function which will return the dataset grouped by the entries in that column. Forex:

As we can see **groupby** function returned the total number of passengers from every class.

**Groupby** function involves 3 steps.

- Splitting
- Applying
- Combining

__Splitting__

__Splitting__

In splitting, we split the data set into a group of rows containing values from a particular column.

df.**groupby**(‘pclass’), this will help to create a **groupby** object that is, this will give us a splited dataset according to all the types of entries in pclass.

As we can see here we didn’t get the dataset but a **groupby** object.

Out of these, the split is the most straightforward. In fact, in many situations, we may wish to split the data set into groups and do something with those groups. In the apply step, we might wish to one of the following.

__Applying__

__Applying__

After obtaining **groupby** object we apply some aggregation or transformation or filtration. To tell python what to do with all the groups obtained.

** Aggregation**

In aggregation, we compute a summary statistics for each group. some examples:

- Compute group sums or means
- Compute group sizes/counts

Various aggregation function that we can use are:

- mean(): gives the average of each group.
- std(): gives the standard deviation each of the group.
- sum(): gives the sum of all entries of each group.
- first(): gives the first element of each group.
- last(): gives the last element of each group.
- min(): gives the min of each group.
- max(): gives the max of each group.

Instead of just using one aggregation method we can use several aggregation methods simultaneously by using agg() method which takes a list of aggregation functions as a parameter.

when we are using max the values returned for string variables are the ones which appear for the maximum times. While, when using mean or std functions there is no value return for string variables.

As we can see it returns a multilevel column that is, for each column, there are two sub-columns sum and max.

** Transformation**

Instead of aggregation after grouping, we can apply the transform method instead, this changes the specified entry according to the function applied without changing the index.

In transformation, we perform some group-specific computations and return a like-indexed object.

We use the transform method after the grouping to get the transformed result Some examples:

- Standardize data (zscore) within a group.
- Filling NANs within groups with a value derived from each group.

As we can see it returns the zscore (the distance of an individual entry from the mean of its group in terms of standard deviation, it is mainly used to convert the values in the same range or sometimes to find outliers.) of all the numerical columns.

This zscore function will accept a series and returns the zscore value of that series.

**Note**: – In some cases split by, combine operations do not neatly fold into aggregations or transformation. For those cases, we use apply().

This new transformation is too complicated for transform(), So we use apply instead.__ __

** Filtration**

In filtration, we discard some groups, according to a group-wise computation that evaluates True or False. Some examples:

- Discard data that belong to groups with only a few members.
- Filter out data based on the group sum or mean.

**Groupby** function returns a **groupby** object which has an attribute groups that is a dictionary. Its keys are the elements of the column on which **groupby **function is applied that is all the groups and values are the corresponding rows from the original dataframe.

So we can iterate over that dictionary and carry out the computation. In the example above we have calculated the average fare of men who traveled in a particular class. So we have filtered our data by applying a condition that we want to calculate the average fare prices of men only.

__Multi-index with Groupby__

__Multi-index with Groupby__

We can create a multi-index dataframe with the help of **groupby**, by passing a list of columns as a parameter to the **groupby** function.

Here we are grouping on a dataset with two columns.

Multi-index dataframe means that now we have 2 levels of the index. This dataframe contains pclass as level 1 index and survived as level 2 index.

__Selecting with groupby__

__Selecting with groupby__

We can select a column of our choice in the resultant dataset by passing a list of columns after the** groupby** function.

Now we can see that only two columns are present in the resultant dataset.

__Groupby object attributes__

__Groupby object attributes__

The groups attribute is a dict whose keys are the computed groups and corresponding values being the axis labels belonging to each group.

Calling python len() function on this groups attribute returns the length of the group.

This shows that there are 3 groups in our resultant dataset.

Note: – We can also use **get_group** method to get a particular group.

__Boolean Groupby__

__Boolean Groupby__

We can also use a Boolean condition to group our dataset. In that case, the resultant dataset will be divided into True and False as shown in the example below:

__Plotting __

__Plotting__

**Groupby** also works with some plotting methods. For example, line plot

with **groupby **we can easily plot the differences between the individual group.

Here we can easily see that the price for class 1 is very high.