Pandas: dropping duplicates based on multiple columns

Published on Aug. 22, 2023, 12:14 p.m.

To drop duplicates in a Pandas DataFrame based on multiple columns, you can use the drop_duplicates() method and specify the subset of columns to consider. Here’s an example:

import pandas as pd
# create example dataframe
df = pd.DataFrame({
    'name': ['John', 'Mary', 'John'],
    'age': [30, 25, 30],
    'city': ['New York', 'Los Angeles', 'New York']
})
# drop duplicates based on name and city
df = df.drop_duplicates(subset=['name', 'city'])
print(df)

Output:

   name  age         city
0  John   30     New York
1  Mary   25  Los Angeles

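In this example, the drop_duplicates() method is called on the DataFrame with a subset argument that lists the columns to compare when identifying duplicates. Two rows count as duplicates when they match on every column in subset, regardless of the other columns. By default the first row of each duplicate group is kept, so the result has one row per unique combination of name and city, and the original index labels are preserved.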
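Which row survives from each duplicate group is controlled by the keep parameter of drop_duplicates(). The sketch below is illustrative rather than part of the original example: the second John row is given a different age (an assumption made only so the effect is visible), and the variable names are hypothetical.

```python
import pandas as pd

# Same shape as the example above, but the two John rows differ in age
# (an assumption for illustration) so that `keep` visibly changes the result.
df = pd.DataFrame({
    'name': ['John', 'Mary', 'John'],
    'age': [30, 25, 31],
    'city': ['New York', 'Los Angeles', 'New York']
})

# keep='first' (the default) retains the first row of each duplicate group
first = df.drop_duplicates(subset=['name', 'city'], keep='first')

# keep='last' retains the last row of each duplicate group
last = df.drop_duplicates(subset=['name', 'city'], keep='last')

# keep=False drops every row that has a duplicate
no_dupes = df.drop_duplicates(subset=['name', 'city'], keep=False)

print(first)
print(last)
print(no_dupes)
```

With keep='first' the John row with age 30 survives, with keep='last' the one with age 31 survives, and with keep=False only the Mary row remains.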

Pandas: removing duplicates with a lambda function

You can also build the deduplication key yourself with a lambda function: call apply() with a lambda that combines the columns of interest into a single string for each row, drop the repeated keys with drop_duplicates(), and use the index labels that remain to select rows from the DataFrame. Here’s an example:

import pandas as pd

# create example dataframe
df = pd.DataFrame({
    'name': ['John', 'Mary', 'John'],
    'age': [30, 25, 30],
    'city': ['New York', 'Los Angeles', 'New York']
})

# drop duplicates based on name and city using lambda function
# .loc selects rows by index label; plain df[...] would look for columns named 0 and 1
df = df.loc[df.apply(lambda x: '{}{}'.format(x['name'], x['city']), axis=1)
            .drop_duplicates().index]

print(df)

Output:

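   name  age         city
0  John   30     New York
1  Mary   25  Los Angeles

In this example, apply() is called on the DataFrame with axis=1, so the lambda runs once per row and returns a string that concatenates the name and city values. The result is a Series of key strings that shares the DataFrame's index. Calling drop_duplicates() on that Series keeps the first occurrence of each key, and the index labels that survive are used to select the matching rows from the original DataFrame. Because the values are joined without a separator, different rows can produce the same key (for example 'ab' + 'c' and 'a' + 'bc'), so this works reliably only when the joined strings are unambiguous.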

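Note that this method is generally slower than drop_duplicates() with the subset argument, especially on a large DataFrame, because apply() with axis=1 calls the Python function once per row. It is mainly useful when the key has to be computed rather than read directly from the columns.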
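As one case where a computed key helps, the sketch below treats rows as duplicates when name and city match after trimming whitespace and ignoring case. This is an illustration under assumptions, not part of the original example: the messy third row and the names key and deduped are hypothetical. The lambda returns a tuple instead of a joined string, which avoids ambiguous keys, and Series.duplicated() produces a boolean mask for the repeats.

```python
import pandas as pd

# Third row repeats the first one with different case and stray whitespace
# (hypothetical data for illustration).
df = pd.DataFrame({
    'name': ['John', 'Mary', 'john '],
    'age': [30, 25, 30],
    'city': ['New York', 'Los Angeles', 'new york']
})

# Build one normalized (name, city) tuple per row.
key = df.apply(
    lambda row: (row['name'].strip().lower(), row['city'].strip().lower()),
    axis=1,
)

# duplicated() marks every repeat after the first occurrence as True;
# inverting the mask keeps the first row of each group.
deduped = df[~key.duplicated()]

print(deduped)
```

Plain drop_duplicates(subset=['name', 'city']) would keep all three rows here, because the raw strings differ.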
