pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series or a Series of values from the rows or columns of a DataFrame. Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data.
Here we have used the info method to get the details of the DataFrame, and we can see that some columns have missing values. We will now discuss some of the summary statistic methods for a DataFrame and see how they handle missing values.
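As a minimal sketch of that inspection step, the snippet below builds a small made-up DataFrame (the column names pclass, age, and embarked and all values are assumptions for illustration, not the article's real data) and reports its missing values with info and isna.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the article's dataset; all values are made up.
df = pd.DataFrame({
    "pclass": [1, 3, 3, 2, 3, 1, 3, 2],
    "age": [29.0, np.nan, 2.0, 30.0, 25.0, np.nan, 48.0, 39.0],
    "embarked": ["S", "C", "S", None, "Q", "S", "S", "C"],
})

# info() prints to stdout by default; writing to a buffer keeps the text as well.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# A more direct view: number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Columns whose non-null count is lower than the number of rows are the ones with missing values.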
The most basic analysis we can do is to count the entries in our data. The count method returns the number of non-null entries in each column.
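A small sketch of count on made-up data (the columns and values are assumptions for illustration): the result for each column is lower than the row count wherever values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a few missing values.
df = pd.DataFrame({
    "pclass": [1, 3, 3, 2, 3, 1, 3, 2],
    "age": [29.0, np.nan, 2.0, 30.0, 25.0, np.nan, 48.0, 39.0],
    "embarked": ["S", "C", "S", None, "Q", "S", "S", "C"],
})

# count() reports non-null entries per column, not the number of rows.
non_null_counts = df.count()
total_rows = len(df)

print(non_null_counts)
print("rows:", total_rows)
```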
The mean method computes the average of a Series, or the column-wise averages of a DataFrame, ignoring null entries.
By default (skipna=True), all Series and DataFrame statistical methods ignore null entries.
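To illustrate how null entries are skipped, here is a short sketch on made-up values. The skipna parameter is part of the pandas API; the data itself is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
age = pd.Series([29.0, np.nan, 2.0, 30.0, 25.0, np.nan, 48.0, 39.0])

# Default behaviour: NaN entries are skipped, so the mean uses 6 values.
age_mean = age.mean()

# With skipna=False a single NaN makes the whole result NaN.
age_mean_strict = age.mean(skipna=False)

# On a DataFrame, mean() works column by column.
df = pd.DataFrame({"pclass": [1, 3, 3, 2, 3, 1, 3, 2], "age": age})
column_means = df.mean()

print(age_mean, age_mean_strict)
print(column_means)
```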
The std method returns the standard deviation of the non-null entries of the Series. While the mean measures the central tendency of the measurements, the standard deviation measures their spread.
If the mean marks the center of the bell curve, then the standard deviation describes its width.
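The following sketch computes the standard deviation on made-up values. One detail worth knowing: pandas uses the sample standard deviation (ddof=1) by default, whereas NumPy's std defaults to the population version (ddof=0), so the two can differ on the same data.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; the trailing NaN is ignored by std().
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, np.nan])

# Sample standard deviation (divides by n - 1), the pandas default.
sample_std = values.std()

# Population standard deviation (divides by n).
population_std = values.std(ddof=0)

print(sample_std, population_std)
```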
The mean is not the only measure of centrality in statistics. The median also measures centrality.
The median method is used to compute the median.
Here 3.0 tells us that at least half of the pclass values are less than or equal to 3.0 and at least half are greater than or equal to 3.0.
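A minimal sketch of this on made-up pclass values (chosen here so that the median happens to be 3.0; they are not the article's data):

```python
import pandas as pd

# Hypothetical passenger classes; sorted they read 1, 1, 2, 2, 3, 3, 3, 3, 3.
pclass = pd.Series([3, 1, 3, 2, 3, 3, 1, 3, 2])

median_value = pclass.median()

# Share of values at or below, and at or above, the median.
share_at_or_below = (pclass <= median_value).mean()
share_at_or_above = (pclass >= median_value).mean()

print(median_value, share_at_or_below, share_at_or_above)
```

Because the column holds only a few distinct values, many entries equal the median itself, which is why both shares can exceed one half.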
The median is a special case of a quantile. If q is a number between 0 and 1, the q-quantile of the dataset is a numerical value that splits the data into two sets: one containing the fraction q of smaller observations and the other containing the larger observations.
The value is the same as the median, since the median is equal to the 0.5 quantile. Quantiles are called percentiles when they are expressed as percentages between 0 and 100 rather than as fractions between 0 and 1.
The median is the 0.5 quantile, or 50th percentile, of the dataset, as verified using the median() method and the quantile() method with q=0.5.
The quantile method computes the median by default. If given a fraction q, it returns the q-quantile.
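The relationship between median and quantile can be checked with a short sketch on made-up pclass values (an assumption for illustration):

```python
import pandas as pd

# Hypothetical passenger classes.
pclass = pd.Series([3, 1, 3, 2, 3, 3, 1, 3, 2])

median_value = pclass.median()
default_quantile = pclass.quantile()     # q defaults to 0.5
half_quantile = pclass.quantile(0.5)     # explicit 0.5 quantile
first_quartile = pclass.quantile(0.25)   # 25th percentile

print(median_value, default_quantile, half_quantile, first_quartile)
```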
Note: The median is a very useful statistic, especially in the presence of outliers, where it is more robust than the mean.
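A quick sketch of that robustness, using made-up fare values in which one entry is an extreme outlier:

```python
import pandas as pd

# Hypothetical fares; 500.0 is a deliberate outlier.
fares = pd.Series([10.0, 12.0, 11.0, 13.0, 500.0])

mean_fare = fares.mean()      # pulled far upward by the outlier
median_fare = fares.median()  # stays near the typical value

print(mean_fare, median_fare)
```

The single large value drags the mean far above every typical fare, while the median still sits among them.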
IQR (Interquartile Range)
The quantile method also accepts a list or array of values between 0 and 1, for instance 0.25 and 0.75.
It returns the first quartile (0.25 quantile) and the third quartile (0.75 quantile), which bound the interquartile range, or IQR. We see that the IQR for pclass spans 2.0 to 3.0, which means that the middle half of the values lies in this range.
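As a sketch on made-up pclass values (picked so the quartiles come out as 2.0 and 3.0; not the article's data), passing a list to quantile returns a Series indexed by the requested fractions, and the IQR is the difference between the two quartiles:

```python
import pandas as pd

# Hypothetical passenger classes.
pclass = pd.Series([3, 1, 3, 2, 3, 3, 1, 3, 2])

# A list of fractions yields one quantile per fraction.
quartiles = pclass.quantile([0.25, 0.75])
q1 = quartiles.loc[0.25]
q3 = quartiles.loc[0.75]

# Width of the interquartile range.
iqr = q3 - q1

# Share of observations lying between the two quartiles (inclusive).
share_inside = ((pclass >= q1) & (pclass <= q3)).mean()

print(quartiles)
print(iqr, share_inside)
```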
The range is the interval between the smallest and largest observations, which are calculated by the min and max methods.
The two methods can also return a value for a string column, namely the first and the last string of the column when sorted in alphabetical order.
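A short sketch of min and max on made-up numeric and string columns (the names and values are assumptions for illustration). Note that string comparison is lexicographic by character code, so uppercase letters sort before lowercase ones.

```python
import pandas as pd

# Hypothetical numeric and string data.
pclass = pd.Series([3, 1, 3, 2, 3, 3, 1, 3, 2])
ports = pd.Series(["Southampton", "Cherbourg", "Queenstown", "Southampton"])

smallest = pclass.min()
largest = pclass.max()
value_range = largest - smallest

# For strings, min/max give the first and last value in sorted order.
first_port = ports.min()
last_port = ports.max()

print(smallest, largest, value_range)
print(first_port, last_port)
```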
We can draw a box plot for the dataset, which depicts the quantiles, outliers, and median graphically.
This shows the range and inter-quartile range graphically.
We can use the describe method to get the summary statistics of the DataFrame.
- 50% quantile (median)
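The summary that describe produces can be sketched on a small made-up DataFrame (columns and values are assumptions for illustration). By default, only the numeric columns of a mixed DataFrame are summarized.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with missing values and one string column.
df = pd.DataFrame({
    "pclass": [1, 3, 3, 2, 3, 1, 3, 2],
    "age": [29.0, np.nan, 2.0, 30.0, 25.0, np.nan, 48.0, 39.0],
    "embarked": ["S", "C", "S", None, "Q", "S", "S", "C"],
})

# One row per statistic, one column per numeric column.
summary = df.describe()
print(summary)

# The "50%" row is the median of each column.
median_from_summary = summary.loc["50%", "pclass"]
```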
The most basic analysis we can do is to count how many times each unique value occurs in our dataset, and for that we use the value_counts method.
dropna=False indicates that we do not want any entry to be dropped; that is, missing values will also be counted.
The value_counts method gives us the count of every unique entry in the column, in descending order. As shown here, each entry indicates the number of times that value appears.
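Here is a minimal sketch of value_counts with and without dropna=False, on a made-up embarked column (an assumption for illustration):

```python
import pandas as pd

# Hypothetical embarkation ports with one missing entry.
embarked = pd.Series(["S", "C", "S", None, "Q", "S", "S", "C"])

# Default: missing values are left out of the tally.
counts = embarked.value_counts()

# dropna=False adds a row for the missing values as well.
counts_with_missing = embarked.value_counts(dropna=False)

# Number of missing entries, read from the row whose label is null.
missing_count = counts_with_missing[counts_with_missing.index.isna()].iloc[0]

print(counts)
print(counts_with_missing)
```

The difference between the two totals is the number of missing entries in the column.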