How to group and aggregate data using pandas

Published on Aug. 22, 2023, 12:17 p.m.

To group and aggregate data using pandas, you can use the groupby() method followed by an aggregate function. Here are the steps:

  1. Load the data into a pandas DataFrame using read_csv() or read_excel() functions.
  2. Group the data by one or more columns using the groupby() method.
  3. Apply an aggregate function to the grouped data using the agg() method. You can use built-in functions like sum(), mean(), max(), min(), or provide your own custom function.
  4. Reset the index of the resulting DataFrame using the reset_index() method.

Here’s an example of how to group and aggregate a sample DataFrame using pandas:

import pandas as pd

# Load data into a DataFrame
df = pd.read_csv('data.csv')

# Group data by a column and calculate the sum of another column for each group
grouped_df = df.groupby('group_column').agg({'aggregate_column': 'sum'})

# Reset the index
grouped_df = grouped_df.reset_index()

# Print the resulting DataFrame
print(grouped_df)

In this example, we group the data by the ‘group_column’ column and calculate the sum of the ‘aggregate_column’ for each group. We then reset the index of the resulting DataFrame.

You can also group by multiple columns by providing a list of column names to the groupby() method:

grouped_df = df.groupby(['group_column1', 'group_column2']).agg({'aggregate_column': 'sum'})

This will group the data by both ‘group_column1’ and ‘group_column2’ columns.

There are many other aggregation functions you can use with agg() method like count(), first(), last(), median(), nunique(), etc. You can refer to the pandas documentation for more information on all the available functions.

Tags: