Grouping and Aggregating with Pandas

Grouping and aggregation are two of the main tools pandas provides for summarizing data: you split the rows into groups by one or more keys, then reduce each group to summary values. Let's look at how to use them.

1. Grouping:

The primary method for grouping data in pandas is the groupby() method. It splits the rows into groups that share the same value in one or more columns and returns a GroupBy object; nothing is computed until you aggregate.

Example:

Consider a simple DataFrame:

import pandas as pd

data = {
    'Department': ['HR', 'IT', 'IT', 'Sales', 'HR'],
    'Employee': ['John', 'Mike', 'Anna', 'Samantha', 'Chris'],
    'Salary': [5000, 7000, 6200, 5500, 5300]
}

df = pd.DataFrame(data)

Group by Department:

grouped = df.groupby('Department')
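The GroupBy object does not display the groups by itself, so it can help to inspect it before aggregating. The following is a minimal sketch that rebuilds the sample DataFrame so it runs on its own; the variable names group_sizes and it_rows are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'Sales', 'HR'],
    'Employee': ['John', 'Mike', 'Anna', 'Samantha', 'Chris'],
    'Salary': [5000, 7000, 6200, 5500, 5300],
})

grouped = df.groupby('Department')

# Number of rows in each group (a Series indexed by Department).
group_sizes = grouped.size()

# All rows that belong to one group, returned as a regular DataFrame.
it_rows = grouped.get_group('IT')

# Iterating over a GroupBy yields (group key, sub-DataFrame) pairs.
for department, rows in grouped:
    print(department, list(rows['Employee']))
```

size() and get_group() are convenient for checking that the grouping key splits the data the way you expect.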

2. Aggregation:

Once you've created a GroupBy object, you can compute aggregate values such as sum, mean, max, min, etc.

Example:

Sum of salaries in each department:

grouped['Salary'].sum()

You can use the agg() method to perform multiple aggregations at once:

grouped['Salary'].agg(['sum', 'mean', 'min', 'max'])
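When agg() receives a list of function names, the result is a DataFrame with one row per group and one column per aggregation. A small self-contained sketch using the sample data (the name stats is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'Sales', 'HR'],
    'Employee': ['John', 'Mike', 'Anna', 'Samantha', 'Chris'],
    'Salary': [5000, 7000, 6200, 5500, 5300],
})

stats = df.groupby('Department')['Salary'].agg(['sum', 'mean', 'min', 'max'])

# One row per department, one column per aggregation:
#   HR:    sum 10300, mean 5150.0, min 5000, max 5300
#   IT:    sum 13200, mean 6600.0, min 6200, max 7000
#   Sales: sum  5500, mean 5500.0, min 5500, max 5500
print(stats)

# Individual values can be looked up by (department, aggregation name).
hr_total = stats.loc['HR', 'sum']
```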

3. Combining GroupBy with other functionalities:

Example:

Get the highest salary in each department with a custom function (this returns the salary value, not the employee's name):

def top_salary(s):
    return s.sort_values(ascending=False).iloc[0]

grouped['Salary'].agg(top_salary)
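The aggregation above yields only the salary figure for each department. If you also want to know who earns it, one possible approach is to use idxmax() to find the row label of the largest salary in each group and then select those rows. This is a sketch under the assumption that the DataFrame index is unique; top_rows and top_paid are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'Sales', 'HR'],
    'Employee': ['John', 'Mike', 'Anna', 'Samantha', 'Chris'],
    'Salary': [5000, 7000, 6200, 5500, 5300],
})

# idxmax() returns, for each department, the index label of the row
# holding the largest Salary.
top_labels = df.groupby('Department')['Salary'].idxmax()

# Select those rows from the original DataFrame to recover the names.
top_rows = df.loc[top_labels]
top_paid = top_rows[['Department', 'Employee', 'Salary']].reset_index(drop=True)

print(top_paid)
```

If two employees tie for the top salary, idxmax() keeps only the first one it encounters.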

4. Advanced Aggregations:

Using the agg() method, you can specify which aggregations to apply to each column:

grouped.agg({
    'Salary': ['mean', 'sum', 'max'],
    'Employee': 'count'
})
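The dictionary form produces two-level column labels such as ('Salary', 'mean'), which can be awkward to work with. As an alternative sketch, pandas also supports named aggregation, where each keyword argument names an output column and maps it to a (column, function) pair. The output names used here (mean_salary, total_salary, max_salary, headcount) are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'Sales', 'HR'],
    'Employee': ['John', 'Mike', 'Anna', 'Samantha', 'Chris'],
    'Salary': [5000, 7000, 6200, 5500, 5300],
})

# Each keyword is an output column name; each value is
# (source column, aggregation function).
summary = df.groupby('Department').agg(
    mean_salary=('Salary', 'mean'),
    total_salary=('Salary', 'sum'),
    max_salary=('Salary', 'max'),
    headcount=('Employee', 'count'),
)

print(summary)
```

The result has flat, descriptive column names, so no extra renaming step is needed afterwards.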

You can also define custom aggregation functions:

def range_salary(s):
    return s.max() - s.min()

grouped['Salary'].agg(range_salary)
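A custom function can also be mixed with built-in aggregations in a single agg() call; pandas then uses the function's name as the column label. A self-contained sketch with the sample data (the name spread is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'Sales', 'HR'],
    'Employee': ['John', 'Mike', 'Anna', 'Samantha', 'Chris'],
    'Salary': [5000, 7000, 6200, 5500, 5300],
})

def range_salary(s):
    # Difference between the highest and lowest salary in the group.
    return s.max() - s.min()

# Built-in names are passed as strings, custom functions as callables.
spread = df.groupby('Department')['Salary'].agg(['min', 'max', range_salary])

print(spread)
```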

5. Resetting Index:

By default, the grouping columns become the index of the aggregated result. To turn them back into regular columns, use reset_index():

grouped['Salary'].sum().reset_index()
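Another option is to pass as_index=False to groupby(), which keeps the grouping column as a regular column from the start. A brief sketch comparing the two approaches on the sample data (the names via_reset and via_flag are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'Sales', 'HR'],
    'Employee': ['John', 'Mike', 'Anna', 'Samantha', 'Chris'],
    'Salary': [5000, 7000, 6200, 5500, 5300],
})

# Approach 1: aggregate, then move the index back into a column.
via_reset = df.groupby('Department')['Salary'].sum().reset_index()

# Approach 2: ask groupby() not to use the key as the index at all.
via_flag = df.groupby('Department', as_index=False)['Salary'].sum()

print(via_flag)
```

Both produce a DataFrame with Department and Salary columns and a default integer index.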

6. Grouping by Multiple Columns:

You can group by multiple columns by passing a list of columns:

data['Year'] = [2021, 2022, 2021, 2022, 2021]
df = pd.DataFrame(data)

grouped_multiple = df.groupby(['Department', 'Year'])
grouped_multiple['Salary'].sum()
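Grouping by two columns gives a Series with a two-level index (Department, Year). Depending on what you need next, you might flatten it with reset_index() or pivot one level into columns with unstack(). The sketch below rebuilds the sample data with the Year column; totals, flat and wide are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'Sales', 'HR'],
    'Employee': ['John', 'Mike', 'Anna', 'Samantha', 'Chris'],
    'Salary': [5000, 7000, 6200, 5500, 5300],
    'Year': [2021, 2022, 2021, 2022, 2021],
})

# Series indexed by (Department, Year).
totals = df.groupby(['Department', 'Year'])['Salary'].sum()

# Long format: one row per (Department, Year) combination.
flat = totals.reset_index()

# Wide format: one row per Department, one column per Year.
# Combinations that do not occur in the data are filled with 0.
wide = totals.unstack('Year', fill_value=0)

print(wide)
```

Only combinations that actually appear in the data show up in totals and flat; the wide table makes the missing ones explicit.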

Grouping and aggregating are essential tools when analyzing data in pandas. Together, groupby() and agg() cover most summarization tasks: split on one or more keys, apply built-in or custom aggregations, and use reset_index() when you need the keys back as ordinary columns.

