Categorical data

Categoricals are a pandas data type for categorical variables. A categorical variable takes on a limited, and usually fixed, number of possible values, for example gender, blood group, or ratings. All values of categorical data are either in the categories or np.nan. Internally, the data structure consists of a categories array and an integer array of codes, which point to the real values in the categories array.

The categorical data type is useful in the following cases −

  • A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
  • The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
  • As a signal to other python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).

Object Creation

We can create a Series of categorical values by specifying dtype="category" when creating the Series.
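
As a minimal sketch of this, the snippet below builds a categorical Series and inspects the categories and integer codes described earlier. The sample values and the variable name s are made up for illustration.

```python
import pandas as pd

# Illustrative values: four distinct strings, so four categories are inferred.
s = pd.Series(["a", "b", "c", "d", "a"], dtype="category")

print(s.dtype)               # the categorical dtype
print(s.cat.categories)      # the unique values, inferred from the data
print(s.cat.codes.tolist())  # integer codes pointing into the categories
```

The codes are positions in the categories array, which is why a column with few distinct strings can use less memory as a categorical.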

Here the number of categories passed is 4 and all are of the object type.

pd.Categorical

We can also create categorical data with the pd.Categorical constructor by passing it the list of values. The resulting Categorical can then be used as a Series or as a DataFrame column.
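
A short sketch of pd.Categorical, assuming made-up sample values. The second call also passes the optional categories and ordered arguments to show that values outside the given categories become NaN.

```python
import pandas as pd

# Categories are inferred from the values when none are given.
cat = pd.Categorical(["a", "b", "c", "a", "b", "c"])
s = pd.Series(cat)

# Explicit categories and order; "d" is not a listed category, so it becomes NaN.
cat2 = pd.Categorical(["a", "b", "c", "d"], categories=["c", "b", "a"], ordered=True)

print(s)
print(cat2)
print(cat2.codes.tolist())  # a missing value is stored as code -1
```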

DataFrame

Similar to the previous section where a single column was converted to categorical, all columns in a DataFrame can be batch converted to categorical either during or after construction.

This can be done during construction by specifying dtype="category" in the DataFrame constructor.
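
The sketch below shows both routes on an illustrative two-column frame: converting during construction with dtype="category", and converting an existing frame afterwards with astype("category"). The column names and values are assumptions.

```python
import pandas as pd

data = {"A": list("abca"), "B": list("bccd")}

# Conversion during construction.
df = pd.DataFrame(data, dtype="category")

# Conversion after construction.
raw = pd.DataFrame(data)
df_after = raw.astype("category")

print(df.dtypes)
print(df_after.dtypes)
```

Each column gets its own set of categories, inferred from that column's values only.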

describe()

The describe() method returns summary information about a categorical column, such as count, unique, top, and freq.
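
For instance, with some made-up values that include one missing entry:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a", "a", None], dtype="category")

# count excludes the missing value; top is the most frequent category.
summary = s.describe()
print(summary)
```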

Properties of Categories

Categorical data has a categories property and an ordered property, which list the possible values and whether the ordering matters. These properties are exposed as s.cat.categories and s.cat.ordered. If you don't manually specify categories and ordering, they are inferred from the passed arguments.

Categories are unordered by default. To make them ordered, pass ordered=True, for example when setting new categories with cat.set_categories().
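
A small sketch of these properties, using illustrative values. It reads categories and ordered on an inferred categorical, then sets an explicit order with set_categories(..., ordered=True) so that min() follows the logical order.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")
print(s.cat.categories)  # inferred from the data
print(s.cat.ordered)     # False: inferred categoricals are unordered

# Set the categories in a chosen order and mark them as ordered.
s_ordered = s.cat.set_categories(["b", "a", "c"], ordered=True)
print(s_ordered.cat.ordered)
print(s_ordered.min())   # smallest value according to the category order
```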

Renaming Categories

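We can rename the categories in two ways −

  • passing the list of new names as an argument to rename_categories().
  • assigning new values to Series.cat.categories. This in-place assignment worked in older pandas releases but has since been removed, so prefer rename_categories().

rename_categories() also accepts a dictionary that maps old names to new names instead of a list.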
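
A brief sketch of rename_categories() with a list and with a dictionary. The new names are invented for illustration.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# A list replaces the categories position by position.
renamed_list = s.cat.rename_categories(["Group A", "Group B", "Group C"])

# A dictionary renames only the categories it mentions.
renamed_dict = s.cat.rename_categories({"a": "x", "c": "z"})

print(renamed_list)
print(renamed_dict)
```

Both calls return a new Series and leave s unchanged.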

Appending and Removing Categories

Two useful methods for appending and removing categories are add_categories() and remove_categories(), respectively. Values whose category is removed become np.nan.
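
As a sketch with illustrative values:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Append a category that no value uses yet.
added = s.cat.add_categories(["d"])

# Remove a category; the values that used it become NaN.
removed = s.cat.remove_categories(["a"])

print(added.cat.categories)
print(removed)
```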

Reordering

We can reorder the categories with the cat.reorder_categories() and cat.set_categories() methods. For cat.reorder_categories(), all old categories must be included in the new categories and no new categories are allowed.
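
The difference between the two methods can be sketched as follows, with made-up values. reorder_categories() rejects a list that drops a category, while set_categories() accepts it and turns the affected values into NaN.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Same categories, new order; sorting then follows that order.
reordered = s.cat.reorder_categories(["c", "b", "a"], ordered=True)
sorted_values = reordered.sort_values().tolist()
print(sorted_values)

# Leaving out an existing category is an error for reorder_categories().
try:
    s.cat.reorder_categories(["c", "b"])
    error_raised = False
except ValueError as err:
    error_raised = True
    print("reorder_categories failed:", err)

# set_categories() allows it; values of the dropped category become NaN.
subset = s.cat.set_categories(["c", "b"])
print(subset)
```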

Comparison

Comparing categorical data with other objects is possible in three cases −

  • comparing equality (== and !=) to a list-like object (list, Series, array, …) of the same length as the categorical data.
  • all comparisons (==, !=, >, >=, <, and <=) of categorical data to another categorical Series, when ordered=True and the categories are the same.
  • all comparisons of categorical data to a scalar.

All other comparisons, especially “non-equality” comparisons of two categoricals with different categories or a categorical with any list-like object, will raise a TypeError.
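
A hedged sketch of these cases, assuming an illustrative ordered dtype with the categories low, medium, and high. The last part compares two categoricals whose categories differ, which is expected to raise a TypeError.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

# Illustrative ordered dtype shared by both Series.
levels = CategoricalDtype(["low", "medium", "high"], ordered=True)
a = pd.Series(["low", "high", "medium"], dtype=levels)
b = pd.Series(["medium", "high", "low"], dtype=levels)

# Equality against a list-like object of the same length.
equal_to_array = (a == np.array(["low", "medium", "medium"])).tolist()

# Ordering comparison between two categoricals with the same ordered categories.
greater_than_b = (a > b).tolist()

# Comparison against a scalar that is one of the categories.
greater_than_low = (a > "low").tolist()

print(equal_to_array, greater_than_b, greater_than_low)

# Categoricals with different categories cannot be compared for order.
other = pd.Series(["low", "high", "low"],
                  dtype=CategoricalDtype(["low", "high"], ordered=True))
try:
    a > other
    type_error_raised = False
except TypeError as err:
    type_error_raised = True
    print("comparison failed:", err)
```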

One Hot Encoding

The basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to the column. This avoids implying an order or magnitude among the categories, which plain integer codes would do.

Many libraries support one-hot encoding, but one of the simplest options is the pandas get_dummies() function.

This function is named this way because it creates dummy/indicator variables (1 or 0). Three arguments matter most here: the first is the DataFrame you want to encode, the second is the columns argument, which specifies the columns to encode, and the third is the prefix argument, which specifies the prefix for the new columns created by the encoding.
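
A minimal sketch of get_dummies(), assuming an illustrative frame with a string column A and a numeric column B. dtype=int is passed so that the indicator columns hold 1 and 0; recent pandas versions return True/False by default.

```python
import pandas as pd

# Illustrative data: column "A" holds the categories to encode.
df = pd.DataFrame({"A": ["a", "b", "c", "a"], "B": [1, 2, 3, 4]})

# Default prefix: new columns are named after the original column, e.g. A_a.
encoded = pd.get_dummies(df, columns=["A"], dtype=int)
print(encoded)

# Custom prefix for the new columns ("grade" is a made-up name).
prefixed = pd.get_dummies(df, columns=["A"], prefix="grade", dtype=int)
print(prefixed.columns.tolist())
```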

Here we have used the default prefix, and as we can see it has created 3 new columns, each with entries of 0 or 1. The values are assigned according to the entries in the original column (before one-hot encoding): the first value was "a", so the A_a column has 1 as its first entry and the other new columns have 0. The same goes for all the other rows.

One-hot encoding is a common preprocessing step when building machine learning models, since many of them require numeric input.
