How to check for duplicates in pandas?

Published on Aug. 22, 2023, 12:18 p.m.

To check for duplicates in pandas, use the `duplicated()` method on a DataFrame. It returns a boolean Series that is True for every row that repeats an earlier row and False otherwise, so the first occurrence of a repeated row is not flagged by default. `duplicated()` compares all columns unless you pass the `subset` parameter, which restricts the comparison to the columns you list. Here are a few examples:

1. Check if any row is a duplicate of another row in the DataFrame:

import pandas as pd

# create a sample dataframe
df = pd.DataFrame({'A': [1, 2, 3, 3], 'B': ['a', 'b', 'a', 'd']})

# check for duplicates
duplicate_rows = df.duplicated()
print(duplicate_rows)


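This will output a boolean Series with True values for the duplicate rows. In this sample no two rows match on every column (rows 2 and 3 share the same A but differ in B), so every value is False:

0    False
1    False
2    False
3    False
dtype: bool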


2. Check if any row is a duplicate of another row based on a subset of columns:

# check for duplicates based on column 'A' only
duplicate_rows = df.duplicated(subset=['A'])
print(duplicate_rows)


This will output a boolean Series that is True for each row whose value in column A already appeared in an earlier row:

0    False
1    False
2    False
3     True
dtype: bool
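
The boolean Series is often reduced to a single answer or used as a filter. The following is a minimal sketch on the same sample data, not part of the original examples; it assumes you want a yes/no check, a count, and a view of every copy of a repeated value. The names `mask`, `has_duplicates`, `n_duplicates` and `all_copies` are illustrative, and `keep=False` is an option of `duplicated()` that flags all occurrences rather than only the later ones.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 3], 'B': ['a', 'b', 'a', 'd']})

# True for rows whose 'A' value was already seen in an earlier row
mask = df.duplicated(subset=['A'])

# is there at least one duplicate, and how many rows are flagged?
has_duplicates = bool(mask.any())
n_duplicates = int(mask.sum())
print(has_duplicates)  # True
print(n_duplicates)    # 1

# keep=False flags every occurrence, so both rows with A == 3 are selected
all_copies = df[df.duplicated(subset=['A'], keep=False)]
print(all_copies)
```

Filtering with `keep=False` is handy when you want to inspect the conflicting rows side by side before deciding which one to keep.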


Note that `duplicated()` does not modify the DataFrame; it only returns a boolean Series. To remove the duplicate rows, use the `drop_duplicates()` method, which by default keeps the first occurrence of each duplicated row and drops the later ones:

# remove duplicates based on column 'A' (the result is a new DataFrame; df is left unchanged)
deduped = df.drop_duplicates(subset=['A'])
print(deduped)


This will output the DataFrame with the duplicates removed:

   A  B
0  1  a
1  2  b
2  3  a
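
To make the relationship between the two methods concrete, here is a small sketch (an illustration under the same sample data, not the author's code). It shows that `drop_duplicates()` returns a new DataFrame and that, with default settings, the result matches filtering with the negated `duplicated()` mask. The variable names `deduped` and `via_mask` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 3], 'B': ['a', 'b', 'a', 'd']})

# drop_duplicates() returns a new DataFrame; df still has all four rows
deduped = df.drop_duplicates(subset=['A'])
print(len(df), len(deduped))  # 4 3

# the same rows can be selected by negating the duplicated() mask
via_mask = df[~df.duplicated(subset=['A'])]
print(deduped.equals(via_mask))  # True
```

The mask form is useful when the duplicate check needs to be combined with other boolean conditions before filtering.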


If you want to keep the last occurrence of each duplicated row, you can set the `keep` parameter to `'last'`:

# remove duplicates based on column 'A', keeping the last occurrence
deduped = df.drop_duplicates(subset=['A'], keep='last')
print(deduped)


This will output the DataFrame with only the last occurrence of each duplicate row remaining:

A B
0 1