Merge, Join and Concatenate

Pandas provide various facilities to easily combine dataframes, series or panel objects with the help of Merge, Join and Concatenate functions.

 

combining Robofied

Loading Multiple Files

But first of all to start combining dataframes we have to load or import multiple files. To do that we can use glob function which will do the pattern matching of the string passed as shown in the example:

loading Robofied

Here we can see that glob function matches all the files in the directory starting with names and ending with .csv. After that, we have created a list of loaded files by list comprehensions.

Now we can combine these loaded files.

Appending

  • .append() is a method for series and dataframes.
  • It’s invocation is (.df1.append(df2)).
  • It stacks rows of df2 below df1.

append Robofied

As the append function doesn’t sort the resultant dataset so we have used sort=True to do that otherwise you might get a warning or sometimes an error.

Concatenation

  • Concatenation does the same work as append but with more flexibility as it provides more functionality then append.
  • It’s invocation is pd.concat([df1,df2]).
  • It takes an iterable as the first argument so we can give a dictionary, list, tuple etc.
  • It can combine both row wise or column wise, we just have to specify in the axis argument of concat. By default, it is column-wise.
  • Concat also do the sorting, so we have to pass sort = True as the parameter.

concatenate Robofied

Concatenate with multi index

Concat function has a keys argument which takes the values as a list of strings which would be provided as an index to all the concatenated dataframes. for example:

multi index Robofied

If we are using a dictionary as an iterable in the first argument then the keys of the dictionary will automatically be considered as the values in the keys argument.

Join

Join is used to combine rows of multiple tables. Joining table means meaningfully gluing indexed rows together.

Outer join

  • Outer join means the union of index set (all labels, no repeater).
  • Missing fields are filled with NaNs.
  • An outer join has all the indices of both tables but without repetition like a set union.

Inner join

  • An inner join is basically the intersection of the of the index set (only common labels).

Concatenation and Inner Join

By default, the concat method do the outer join.

 

concat inner join Robofied

So as shown the inner join contains only those entries which have a common index in both the dataframes.

When we perform an outer join then all the row indices from the joined dataframes are present but without repetition i.e when a row occur in one dataframe but not in the other the missing column values are filled with NaNs.

Merging Dataframes

Very often we want to align multiple columns or columns other than index. This is done by merging datframes.

Merge extends concat with the ability to align multiple columns.

Merging dataframes is same as an inner join as it also returns only the rows which occur in both dataframes.

By default, merge uses all the columns common to both dataframes to merge i.e the rows of the merged datafame consist of all rows where the entries in all the columns are same.

 

merge Robofied

Here we can see that the merge has been performed on all the column and all the rows having the same entries in both datasets are returned.

Merging ON

On argument takes the name of the column on which we want to perform merge i.e only entries from that column will be matched.

The resultant dataframe has _x and _y as the suffix indicating the origin.

We can think it as suppose we are concatenating two dataframes which have the selected column as their index and the resulted dataframe has that column only for once.

on Robofied

 

Merging on Multiple Columns

This is the property of merge which differs it from concat i.e merging on multiple columns.

We can do this by giving a list of column names to on.

multiple columns merge Robofied

Suffixes

We can use suffixes argument to change the _x, _y that comes at the end of nonmerged columns.

We just have to provide a list of names that we want as the suffix, to the suffixes argument.

suffixes Robofied

 

Joining Datarames

The merge operation performs inner join by default. It extracts the rows that match in the joining column of both the dataframes and it glues them together in the joined dataframe.

Merging with Left Join

In the left join, all the rows of the left dataframe in the merged dataframe are kept.

For rows in the left dataframe with matches in the right dataframe.

  • Non-joining columns of right dataframe are appended to left dataframe.

For rows in the left dataframe with no match in the right dataframe

  • Non-joining columns are filled with null value.

merge left join Robofied

Merging With Right Join

The right join does the same work as the left join, It just takes all the values from the right dataframe.

merge right join Robofied

Merging With Outer Join

The union of all rows from the left and right dataframe can be preserved with an outer join.

merge outer join Robofied

Using.Join()

       how=left

The .Join method performs left join by default.

left join Robofied

 how=right

 right join Robofied

how=inner inner join Robofied

how=outer
outer join Robofied

Which Method You Should Use?

As we have seen many methods to join or combine our dataframes so this is an obvious question that which one should we use?

The answer to this question is simple. Use the simplest tool that works.

Which means if you need to simply stack two series or dataframes vertically use append.

If need more flexible stacking that is, stacking many rows horizontally or vertically. Or simple inner/outer joins on indexes.

Use pd.concat()

The join method also works on indexes but it gives more flexibility in left and right join.

The Merge tool is the powerful tool for joining when we need to join on multiple columns.

Leave a Comment

Your email address will not be published. Required fields are marked *