GROUP BY

Learning Focus

Use this lesson to understand GROUP BY with practical syntax and examples.

Concept Overview

GROUP BY transforms detail rows into summary rows by grouping records that share the same key(s), then applying aggregate functions (like COUNT, SUM, AVG, MIN, MAX) to each group.

Why is it important?

Reporting: metrics per region, per customer, per product
Analytics: cohort summaries, conversion rates, revenue by segment
Data quality: detect duplicates and outliers by grouping
Performance: correct grouping reduces data movement and lets the database aggregate efficiently

Where does it fit?

GROUP BY sits between row-level filtering (WHERE) and group-level filtering (HAVING). It is the core technique for aggregation and is commonly combined with joins, conditional aggregates (FILTER), and ordering.

Syntax & Rules

Core Syntax

SELECT group_key_1, group_key_2, ...,
       aggregate_fn(expr) AS metric
FROM table_name
WHERE row_filters
GROUP BY group_key_1, group_key_2, ...
ORDER BY metric DESC;

Available Options / Parameters

Clause/Feature	What it does	PostgreSQL notes
`WHERE`	filters rows before grouping	reduces work and changes results
`GROUP BY`	defines grouping keys	every non-aggregated selected column must be grouped
`HAVING`	filters groups after aggregation	use for aggregate predicates
`FILTER (WHERE ...)`	conditional aggregate	PostgreSQL feature: `SUM(x) FILTER (WHERE cond)`
`ORDER BY`	sorts final groups	use `NULLS LAST` when aggregates can be NULL

Key Rules and Considerations

Every selected column must either be an aggregate (SUM(amount)) or appear in GROUP BY.
Use WHERE to remove rows you do not want to include in the aggregates.
Use HAVING for conditions on aggregated values.
Conditional aggregates are usually cleaner with FILTER than with CASE expressions.

Step-by-Step Examples

Example 1: Basic Grouping (Beginner)

CREATE TABLE orders (
  order_id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  customer_id bigint NOT NULL,
  region      text NOT NULL,
  status      text NOT NULL,
  amount      numeric(12,2) NOT NULL CHECK (amount >= 0),
  created_at  date NOT NULL
);

INSERT INTO orders (customer_id, region, status, amount, created_at) VALUES
  (10, 'US',   'paid',    25.00, '2026-01-01'),
  (10, 'US',   'paid',    10.00, '2026-01-02'),
  (11, 'EU',   'paid',    50.00, '2026-01-03'),
  (12, 'EU',   'failed',  20.00, '2026-01-03'),
  (13, 'APAC', 'paid',    40.00, '2026-01-04'),
  (13, 'APAC', 'pending', 15.00, '2026-01-05');

SELECT region, COUNT(*) AS order_count
FROM orders
GROUP BY region
ORDER BY region;

Expected output:

 region | order_count
--------+------------
 APAC   | 2
 EU     | 2
 US     | 2
(3 rows)

Explanation:

GROUP BY region produces one output row per region.
COUNT(*) counts all rows in each region.

Example 2: Multiple Aggregates per Group (Intermediate)

SELECT
  region,
  COUNT(*) AS orders_total,
  SUM(amount) AS amount_total,
  AVG(amount) AS avg_amount
FROM orders
WHERE created_at >= '2026-01-01'
GROUP BY region
ORDER BY amount_total DESC;

Expected output:

 region | orders_total | amount_total | avg_amount
--------+-------------+--------------+--------------------
 APAC   | 2           | 55.00        | 27.5000000000000000
 EU     | 2           | 70.00        | 35.0000000000000000
 US     | 2           | 35.00        | 17.5000000000000000
(3 rows)

Explanation:

WHERE filters rows before aggregation.
SUM and AVG ignore NULLs, but amount is NOT NULL here.

Example 3: Conditional Aggregates with FILTER (Advanced)

PostgreSQL's FILTER lets you compute multiple conditional metrics in one pass.

SELECT
  region,
  COUNT(*) AS orders_total,
  COUNT(*) FILTER (WHERE status = 'paid') AS orders_paid,
  SUM(amount) FILTER (WHERE status = 'paid') AS revenue_paid,
  SUM(amount) FILTER (WHERE status = 'failed') AS failed_amount
FROM orders
GROUP BY region
ORDER BY region;

Expected output:

 region | orders_total | orders_paid | revenue_paid | failed_amount
--------+-------------+------------+-------------+--------------
 APAC   | 2           | 1          | 40.00       |
 EU     | 2           | 1          | 50.00       | 20.00
 US     | 2           | 2          | 35.00       |
(3 rows)

Explanation:

FILTER applies only to the aggregate it is attached to.
SUM(...) returns NULL when there are no matching rows (for example, no failed orders in US).

Example 4: Grouping by Time Buckets (Advanced)

SELECT
  date_trunc('month', created_at::timestamptz) AS month,
  COUNT(*) AS orders
FROM orders
GROUP BY month
ORDER BY month;

Expected output:

 month                  | orders
------------------------+-------
 2026-01-01 00:00:00+00 | 6
(1 row)

Explanation:

date_trunc() is a common way to group by month/day/hour.
Use a stable bucket definition and consistent time zone conventions for production reporting.

Practical Use Cases

1) Revenue by region for dashboards

SELECT region, SUM(amount) AS revenue
FROM orders
WHERE status = 'paid'
GROUP BY region
ORDER BY revenue DESC;

2) Customer-level activity summaries

SELECT customer_id, COUNT(*) AS orders
FROM orders
GROUP BY customer_id;

3) Detect duplicates by grouping

SELECT email, COUNT(*) AS copies
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

4) Operational monitoring (fail rate)

SELECT
  region,
  COUNT(*) FILTER (WHERE status = 'failed') AS failed,
  COUNT(*) AS total
FROM orders
WHERE created_at >= current_date - 7
GROUP BY region;

5) Inventory totals by warehouse

SELECT warehouse_id, SUM(quantity_on_hand) AS stock
FROM inventory
GROUP BY warehouse_id;

Common Mistakes & Troubleshooting

1) Selecting a non-aggregated column that is not grouped

Wrong SQL:

SELECT region, customer_id, COUNT(*)
FROM orders
GROUP BY region;

Typical error:

ERROR:  column "orders.customer_id" must appear in the GROUP BY clause or be used in an aggregate function

Fix:

SELECT region, COUNT(*)
FROM orders
GROUP BY region;

2) Using HAVING for row-level filters

Wrong approach:

SELECT region, COUNT(*)
FROM orders
GROUP BY region
HAVING created_at >= '2026-01-01';

Bad outcome:

created_at is a row-level column and this is not the right place for filtering.

Fix:

SELECT region, COUNT(*)
FROM orders
WHERE created_at >= '2026-01-01'
GROUP BY region;

3) Confusing NULL results in SUM/AVG

Bad outcome:

Conditional SUM(...) FILTER (...) can return NULL when no rows match.

Fix:

SELECT COALESCE(SUM(amount) FILTER (WHERE status = 'failed'), 0) AS failed_amount
FROM orders;

4) Inflated metrics after joins

Wrong approach:

-- If each order has multiple items, SUM(order.amount) after joining items overcounts.
SELECT SUM(o.amount)
FROM orders o
JOIN order_items i ON i.order_id = o.order_id;

Fix:

Aggregate at the correct grain (orders) before joining, or aggregate item-level revenue instead.

Debugging checklist:

Confirm the grouping keys match the business question.
Verify WHERE filters are applied before grouping.
Ensure every selected non-aggregate column appears in GROUP BY.
Check NULL behavior for conditional aggregates.
If joins are involved, confirm the join does not multiply rows unexpectedly.

Best Practices

✅ Filter early with WHERE to reduce work and improve correctness. ❌ Avoid grouping huge raw tables without a time window.
✅ Alias aggregate outputs (SUM(amount) AS revenue). ❌ Avoid unlabeled aggregates in reporting queries.
✅ Use FILTER for conditional metrics in PostgreSQL. ❌ Avoid repeating the same query multiple times with slightly different WHERE clauses.
✅ Validate results with a small sample and sanity checks. ❌ Avoid trusting aggregates without spot-checking inputs.
✅ Watch out for join multiplication when aggregating. ❌ Avoid summing fact columns after a join that multiplies rows.

Hands-On Practice

Use this setup for the exercises:

CREATE TABLE practice_orders (
  order_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  region   text NOT NULL,
  status   text NOT NULL,
  amount   numeric(12,2) NOT NULL
);

INSERT INTO practice_orders (region, status, amount) VALUES
  ('US', 'paid', 10.00),
  ('US', 'failed', 5.00),
  ('EU', 'paid', 20.00),
  ('EU', 'paid', 30.00);

Exercise 1 (Easy): Orders per region

Task: Return the number of orders per region.

-- Your SQL here

Solution:

SELECT region, COUNT(*) AS orders
FROM practice_orders
GROUP BY region
ORDER BY region;

Exercise 2 (Medium): Paid revenue per region

Task: Return paid revenue per region.

-- Your SQL here

Solution:

SELECT region, SUM(amount) FILTER (WHERE status = 'paid') AS revenue_paid
FROM practice_orders
GROUP BY region
ORDER BY region;

Exercise 3 (Advanced): Paid order rate per region

Task: Return total orders, paid orders, and the paid ratio per region.

-- Your SQL here

Solution:

SELECT
  region,
  COUNT(*) AS total_orders,
  COUNT(*) FILTER (WHERE status = 'paid') AS paid_orders,
  (COUNT(*) FILTER (WHERE status = 'paid')::numeric / COUNT(*)) AS paid_ratio
FROM practice_orders
GROUP BY region
ORDER BY region;

Connection to Other Concepts

Concept	Why it matters
Aggregate functions	`GROUP BY` is the common way to compute aggregates per key
`WHERE`	filters rows before grouping
`HAVING`	filters groups after aggregation
Joins	grouping after joins requires grain awareness
ROLLUP / GROUPING SETS	advanced group-by constructs for subtotals
NULL Values	affects conditional sums and counts

Visual Learning Diagram

flowchart TD
  A[Raw Rows] --> B[WHERE]
  B --> C[GROUP BY]
  C --> D[Aggregates]
  D --> E[HAVING]
  D --> F[ORDER BY]
  C --> G[Conditional Aggregates (FILTER)]

  classDef allNodes fill:#3e3e3e,stroke:#ffffff,stroke-width:2px,color:#f5f5f5
  classDef highlight fill:#3e3e3e,stroke:#ffffff,stroke-width:4px,color:#f5f5f5
  class A,B,C,D,E,F,G allNodes
  class C highlight

Common Pitfalls

Pitfall	Consequence	Prevention
Selecting non-grouped columns	query error	group those columns or aggregate them
Filtering too late	slow and wrong metrics	use `WHERE` before `GROUP BY`
Wrong grouping key	misleading KPIs	validate with sample rows
Conditional SUM returns NULL	confusing dashboards	use `COALESCE` where appropriate
Join multiplication	inflated totals	aggregate at correct grain

Quick Reference

SELECT k, COUNT(*) FROM t GROUP BY k;
SELECT k, SUM(v) FROM t WHERE status = 'paid' GROUP BY k;
SELECT a, b, AVG(score) FROM t GROUP BY a, b;
SELECT k, SUM(v) FILTER (WHERE ok) FROM t GROUP BY k;
SELECT k, COUNT(*) FROM t GROUP BY k HAVING COUNT(*) > 10;

What's Next

Next: HAVING Clause - Continue to the next concept with incremental complexity.
Module Overview - Return to this module index and choose another related lesson.

Concept Overview​

Why is it important?​

Where does it fit?​

Syntax & Rules​

Core Syntax​

Available Options / Parameters​

Key Rules and Considerations​

Step-by-Step Examples​

Example 1: Basic Grouping (Beginner)​

Example 2: Multiple Aggregates per Group (Intermediate)​

Example 3: Conditional Aggregates with FILTER (Advanced)​

Example 4: Grouping by Time Buckets (Advanced)​

Practical Use Cases​

1) Revenue by region for dashboards​

2) Customer-level activity summaries​

3) Detect duplicates by grouping​

4) Operational monitoring (fail rate)​

5) Inventory totals by warehouse​

Common Mistakes & Troubleshooting​

1) Selecting a non-aggregated column that is not grouped​

2) Using HAVING for row-level filters​

3) Confusing NULL results in SUM/AVG​

4) Inflated metrics after joins​

Best Practices​

Hands-On Practice​

Exercise 1 (Easy): Orders per region​

Exercise 2 (Medium): Paid revenue per region​

Exercise 3 (Advanced): Paid order rate per region​

Connection to Other Concepts​

Visual Learning Diagram​

Common Pitfalls​

Quick Reference​

What's Next​

Concept Overview

Why is it important?

Where does it fit?

Syntax & Rules

Core Syntax

Available Options / Parameters

Key Rules and Considerations

Step-by-Step Examples

Example 1: Basic Grouping (Beginner)

Example 2: Multiple Aggregates per Group (Intermediate)

Example 3: Conditional Aggregates with FILTER (Advanced)

Example 4: Grouping by Time Buckets (Advanced)

Practical Use Cases

1) Revenue by region for dashboards

2) Customer-level activity summaries

3) Detect duplicates by grouping

4) Operational monitoring (fail rate)

5) Inventory totals by warehouse

Common Mistakes & Troubleshooting

1) Selecting a non-aggregated column that is not grouped

2) Using HAVING for row-level filters

3) Confusing NULL results in SUM/AVG

4) Inflated metrics after joins

Best Practices

Hands-On Practice

Exercise 1 (Easy): Orders per region

Exercise 2 (Medium): Paid revenue per region

Exercise 3 (Advanced): Paid order rate per region

Connection to Other Concepts

Visual Learning Diagram

Common Pitfalls

Quick Reference

What's Next