Merge, Join and Concatenate
Pandas provide various facilities to easily combine dataframes, series or panel objects with the help of Merge, Join and Concatenate functions.
Loading Multiple Files
But first of all to start combining dataframes we have to load or import multiple files. To do that we can use glob function which will do the pattern matching of the string passed as shown in the example:
Here we can see that glob function matches all the files in the directory starting with names and ending with .csv. After that, we have created a list of loaded files by list comprehensions.
Now we can combine these loaded files.
- .append() is a method for series and dataframes.
- It’s invocation is (.df1.append(df2)).
- It stacks rows of df2 below df1.
As the append function doesn’t sort the resultant dataset so we have used sort=True to do that otherwise you might get a warning or sometimes an error.
- Concatenation does the same work as append but with more flexibility as it provides more functionality then append.
- It’s invocation is pd.concat([df1,df2]).
- It takes an iterable as the first argument so we can give a dictionary, list, tuple etc.
- It can combine both row wise or column wise, we just have to specify in the axis argument of concat. By default, it is column-wise.
- Concat also do the sorting, so we have to pass sort = True as the parameter.
Concatenate with multi index
Concat function has a keys argument which takes the values as a list of strings which would be provided as an index to all the concatenated dataframes. for example:
If we are using a dictionary as an iterable in the first argument then the keys of the dictionary will automatically be considered as the values in the keys argument.
Join is used to combine rows of multiple tables. Joining table means meaningfully gluing indexed rows together.
- Outer join means the union of index set (all labels, no repeater).
- Missing fields are filled with NaNs.
- An outer join has all the indices of both tables but without repetition like a set union.
- An inner join is basically the intersection of the of the index set (only common labels).
Concatenation and Inner Join
By default, the concat method do the outer join.
So as shown the inner join contains only those entries which have a common index in both the dataframes.
When we perform an outer join then all the row indices from the joined dataframes are present but without repetition i.e when a row occur in one dataframe but not in the other the missing column values are filled with NaNs.
Very often we want to align multiple columns or columns other than index. This is done by merging datframes.
Merge extends concat with the ability to align multiple columns.
Merging dataframes is same as an inner join as it also returns only the rows which occur in both dataframes.
By default, merge uses all the columns common to both dataframes to merge i.e the rows of the merged datafame consist of all rows where the entries in all the columns are same.
Here we can see that the merge has been performed on all the column and all the rows having the same entries in both datasets are returned.
On argument takes the name of the column on which we want to perform merge i.e only entries from that column will be matched.
The resultant dataframe has _x and _y as the suffix indicating the origin.
We can think it as suppose we are concatenating two dataframes which have the selected column as their index and the resulted dataframe has that column only for once.
Merging on Multiple Columns
This is the property of merge which differs it from concat i.e merging on multiple columns.
We can do this by giving a list of column names to on.
We can use suffixes argument to change the _x, _y that comes at the end of nonmerged columns.
We just have to provide a list of names that we want as the suffix, to the suffixes argument.
The merge operation performs inner join by default. It extracts the rows that match in the joining column of both the dataframes and it glues them together in the joined dataframe.
Merging with Left Join
In the left join, all the rows of the left dataframe in the merged dataframe are kept.
For rows in the left dataframe with matches in the right dataframe.
- Non-joining columns of right dataframe are appended to left dataframe.
For rows in the left dataframe with no match in the right dataframe
- Non-joining columns are filled with null value.
Merging With Right Join
The right join does the same work as the left join, It just takes all the values from the right dataframe.
Merging With Outer Join
The union of all rows from the left and right dataframe can be preserved with an outer join.
The .Join method performs left join by default.
Which Method You Should Use?
As we have seen many methods to join or combine our dataframes so this is an obvious question that which one should we use?
The answer to this question is simple. Use the simplest tool that works.
Which means if you need to simply stack two series or dataframes vertically use append.
If need more flexible stacking that is, stacking many rows horizontally or vertically. Or simple inner/outer joins on indexes.
The join method also works on indexes but it gives more flexibility in left and right join.
The Merge tool is the powerful tool for joining when we need to join on multiple columns.