How to merge and join data using pandas

Published on Aug. 22, 2023, 12:17 p.m.

To merge and join data using pandas, you can use the merge() and join() functions. Here’s how:

  1. merge(): This function is used to merge two DataFrames based on a common column or index. It takes two DataFrames as input and returns a merged DataFrame. Here’s an example:
import pandas as pd

# Create two sample DataFrames
df1 = pd.DataFrame({'employee': ['John', 'Lisa', 'Mike', 'Mary'],
                    'department': ['Accounting', 'Engineering', 'Engineering', 'HR']})

df2 = pd.DataFrame({'employee': ['John', 'Lisa', 'Mike', 'Mark'],
                    'salary': [50000, 60000, 55000, 70000]})

# Merge the two DataFrames on the 'employee' column
merged_df = pd.merge(df1, df2, on='employee')

print(merged_df)

This will output a merged DataFrame with columns ‘employee’, ‘department’, and ‘salary’ based on the ‘employee’ column.

  1. join(): This function is used to join two DataFrames based on their index. It is similar to merge(), but instead of merging on a column, it merges on the index. Here’s an example:
import pandas as pd

# Create two sample DataFrames
df1 = pd.DataFrame({'department': ['Accounting', 'Engineering', 'HR']},
                    index=['John', 'Lisa', 'Mike'])

df2 = pd.DataFrame({'salary': [50000, 60000, 70000]},
                    index=['John', 'Lisa', 'Mark'])

# Join the two DataFrames on their index
joined_df = df1.join(df2)

print(joined_df)

This will output a joined DataFrame with columns ‘department’ and ‘salary’ based on the index. Note that there is no value for ‘salary’ for the ‘Mike’ row, since there is no corresponding index in df2.

Both merge() and join() functions have many options to customize the behavior of the operation, such as how to handle missing values or how to merge on multiple columns. You can refer to the pandas documentation for more information.

Tags: