DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:
- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame
A DataFrame can also be seen as a collection of Series that share one index. Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments.
Data science involves processing, analyzing, and visualizing data. While some tools like Microsoft Excel allow us to perform basic data science tasks, they’re limited to the functionality built in to the user interface. If you want to work with datasets that aren’t structured like a spreadsheet or create entirely new data visualizations from scratch, you’ll need to become proficient in programming. Instead of using a program written by others that can solve a narrow set of tasks, you can create your own programs that can solve your specific problems.
Programming involves organizing a collection of instructions into a program for a computer to carry out. To express these instructions, we use a programming language such as Python. Python's pandas library provides the DataFrame, a data structure that supports a wide range of data analysis tasks.
Explore data analysis with Python. Pandas DataFrames make manipulating your data easy, from selecting or replacing columns and indices to reshaping your data.
Pandas is a popular Python package for data science, and with good reason: it offers powerful, expressive and flexible data structures that make data manipulation and analysis easy, among many other things. The DataFrame is one of these structures.
One of the main advantages of using a pandas DataFrame instead of a NumPy array is that a DataFrame allows columns of different data types.
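As a small illustration of the point about column types, the sketch below builds a NumPy array and a DataFrame from mixed data. The column names (count, label, ratio) and the values are made up for the example.

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype for all of its elements, so mixing
# numbers and strings forces everything into one common type.
arr = np.array([[1, "a"], [2, "b"]])
print(arr.dtype)

# A DataFrame stores a separate dtype for each column.
df = pd.DataFrame({
    "count": [1, 2],       # integers
    "label": ["a", "b"],   # strings
    "ratio": [0.5, 1.5],   # floats
})
print(df.dtypes)
```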
Basic data analysis can also be done in Excel spreadsheets, but a pandas DataFrame offers many more operations that are easy to perform, as well as more kinds of visualization.
A DataFrame can be created in several ways.
class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
data: The input data. It can be a dict, a list-like object, an ndarray, a Series, or another DataFrame.
index: Index to use for the resulting frame. Defaults to RangeIndex if the input data carries no indexing information and no index is provided.
columns: Column labels to use for the resulting DataFrame. Defaults to RangeIndex (0, 1, 2, 3, 4, …) if no column labels are provided.
dtype: Data type to force. Only a single dtype is allowed.
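The constructor parameters described above can be tried out with a short sketch like the following. The row labels (r1, r2) and column labels (a, b, c) are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3], [4, 5, 6]])

# Explicit row labels, column labels, and a single forced dtype.
df = pd.DataFrame(
    data,
    index=["r1", "r2"],
    columns=["a", "b", "c"],
    dtype="float64",
)
print(df)

# With no index or columns given, both default to a RangeIndex.
default_df = pd.DataFrame(data)
print(default_df.index)
print(default_df.columns)
```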
From Dict of ndarray
A DataFrame can be created from a dict of ndarrays. The keys of the dictionary become the column labels, and the corresponding values become the entries of those columns. All arrays must have the same length.
Missing entries are filled with NaN (Not a Number). This happens when the dict values are Series or dicts whose labels do not all match; arrays or lists of unequal length raise an error instead.
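A minimal sketch of building a DataFrame from a dict follows. The keys (one, two) and row labels are invented for the example, and the last part uses dict values rather than arrays to show where NaN comes from.

```python
import numpy as np
import pandas as pd

data = {
    "one": np.array([1.0, 2.0, 3.0, 4.0]),
    "two": np.array([4.0, 3.0, 2.0, 1.0]),
}

# Dict keys become column labels; the rows get a default RangeIndex.
df = pd.DataFrame(data)
print(df)

# An explicit index must match the length of the arrays.
labeled = pd.DataFrame(data, index=["a", "b", "c", "d"])
print(labeled)

# With dicts as values, a label missing from one column becomes NaN.
sparse = pd.DataFrame({"one": {"a": 1.0, "b": 2.0}, "two": {"a": 4.0}})
print(sparse)
```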
From Dict of Series
With a dict of Series, the same rules apply: the keys become the column labels. The resulting index is the union of the indexes of the individual Series, and labels missing from a Series are filled with NaN.
When a DataFrame is created from a single Series, it has the same index as the Series and one column, named after the Series if the Series has a name.
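The two cases above, a dict of Series and a single named Series, might look like this in practice. The Series values, labels, and the name "score" are placeholders chosen for the example.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])

# The row index is the union of both Series indexes (a, b, c, d).
# s1 has no entry for "d", so that cell is NaN.
df = pd.DataFrame({"one": s1, "two": s2})
print(df)

# A single named Series gives a one-column DataFrame that keeps the
# Series index and uses the Series name as the column label.
named = pd.Series([10, 20, 30], index=["x", "y", "z"], name="score")
single = pd.DataFrame(named)
print(single)
```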
Column Selection, Addition, and Deletion
Just like with a dict, we can select, add, and delete columns in a DataFrame using almost the same syntax.
To select a column, write the DataFrame name followed by the column name in square brackets. This returns all the rows of that column. We can also perform slicing and boolean indexing on a DataFrame.
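Here is a brief sketch of column selection, row slicing, and boolean indexing with square brackets. The DataFrame contents are sample values made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1, 2, 3, 4], "two": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

# Selecting one column returns a Series with every row of that column.
col = df["one"]
print(col)

# An integer slice inside square brackets selects rows by position.
first_two = df[0:2]
print(first_two)

# A boolean condition keeps only the rows where it is True.
big = df[df["two"] > 20]
print(big)
```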
To add a column, write the DataFrame name with the new column name in square brackets and assign the list of values for that column.
If a single value is given, it is broadcast to every row.
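Adding columns could be sketched as follows. The column names are illustrative, and the last assignment (a column computed from two others) goes slightly beyond what the text describes.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3]})

# Assign a list with one value per row to create a new column.
df["two"] = [10, 20, 30]

# A single value is broadcast to every row.
df["flag"] = True

# A new column can also be computed from existing columns.
df["sum"] = df["one"] + df["two"]
print(df)
```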
To delete a column, pass the DataFrame name with the column name in square brackets to the del statement.
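Deleting a column with del might look like the sketch below. It also shows pop, which is not mentioned in the text above but follows the same dict-like pattern: it removes the column and returns it.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2], "two": [3, 4], "three": [5, 6]})

# del removes the column in place, as with a dict key.
del df["two"]

# pop removes the column in place and returns it as a Series.
popped = df.pop("three")

print(df)
print(popped)
```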
Indexing and Slicing
Indexing and selecting is the process of picking out particular data from a DataFrame. There are three main ways to select data from a DataFrame:
- Square Brackets
- .loc accessor
- .iloc accessor
The .loc accessor indexes and slices a DataFrame by label, which means that rows and columns are specified by their row and column labels. It also accepts a boolean condition.
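A short sketch of label-based selection with .loc, using made-up labels and values:

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1, 2, 3, 4], "two": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

# A single row label returns that row as a Series.
row = df.loc["b"]
print(row)

# Row and column labels together. Unlike positional slices, a label
# slice includes both endpoints, so "b":"c" returns rows b and c.
block = df.loc["b":"c", ["one"]]
print(block)

# A boolean condition selects rows; the second argument picks a column.
filtered = df.loc[df["one"] > 2, "two"]
print(filtered)
```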
The .iloc accessor selects data by integer position (from 0 to length-1 of the axis) rather than by label. It can also be used with a boolean array. .iloc raises IndexError if a requested indexer is out of bounds, except for slice indexers, which allow out-of-bounds indexes.
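Position-based selection with .iloc, including the out-of-bounds behavior described above, can be sketched like this. The data is sample data for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1, 2, 3, 4], "two": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

# Position 0 is the first row, whatever its label is.
first = df.iloc[0]
print(first)

# Positional slices exclude the end position: rows 1-2, column 0.
block = df.iloc[1:3, 0:1]
print(block)

# A slice that runs past the end is allowed and is simply truncated.
tail = df.iloc[2:10]
print(tail)

# A single out-of-bounds position raises IndexError.
error = None
try:
    df.iloc[10]
except IndexError as exc:
    error = exc
print(error)
```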
For further information on indexing and slicing, see http://blog.robofied.com/indexing-and-selecting-data/
Transposing is done with the T attribute of a DataFrame. It swaps rows and columns: the column names become the row labels, and the original row labels become the column names (0, 1, 2, 3, … when the DataFrame has the default index).
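As a quick sketch of transposing, with arbitrary sample values:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]})

# T swaps the axes: column names become row labels, and the default
# row labels 0, 1, 2 become the column names.
t = df.T
print(t)
```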