Skip to main content

GROUP BY

Learning Focus

Use this lesson to understand GROUP BY with practical syntax and examples.

Concept Overview

GROUP BY transforms detail rows into summary rows by grouping records that share the same key(s), then applying aggregate functions (like COUNT, SUM, AVG, MIN, MAX) to each group.

Why is it important?

  • Reporting: metrics per region, per customer, per product
  • Analytics: cohort summaries, conversion rates, revenue by segment
  • Data quality: detect duplicates and outliers by grouping
  • Performance: correct grouping reduces data movement and lets the database aggregate efficiently

Where does it fit?

GROUP BY sits between row-level filtering (WHERE) and group-level filtering (HAVING). It is the core technique for aggregation and is commonly combined with joins, conditional aggregates (FILTER), and ordering.


Syntax & Rules

Core Syntax

SELECT group_key_1, group_key_2, ...,
aggregate_fn(expr) AS metric
FROM table_name
WHERE row_filters
GROUP BY group_key_1, group_key_2, ...
ORDER BY metric DESC;

Available Options / Parameters

Clause/FeatureWhat it doesPostgreSQL notes
WHEREfilters rows before groupingreduces work and changes results
GROUP BYdefines grouping keysevery non-aggregated selected column must be grouped
HAVINGfilters groups after aggregationuse for aggregate predicates
FILTER (WHERE ...)conditional aggregatePostgreSQL feature: SUM(x) FILTER (WHERE cond)
ORDER BYsorts final groupsuse NULLS LAST when aggregates can be NULL

Key Rules and Considerations

  • Every selected column must either be an aggregate (SUM(amount)) or appear in GROUP BY.
  • Use WHERE to remove rows you do not want to include in the aggregates.
  • Use HAVING for conditions on aggregated values.
  • Conditional aggregates are usually cleaner with FILTER than with CASE expressions.

Step-by-Step Examples

Example 1: Basic Grouping (Beginner)

CREATE TABLE orders (
order_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
customer_id bigint NOT NULL,
region text NOT NULL,
status text NOT NULL,
amount numeric(12,2) NOT NULL CHECK (amount >= 0),
created_at date NOT NULL
);

INSERT INTO orders (customer_id, region, status, amount, created_at) VALUES
(10, 'US', 'paid', 25.00, '2026-01-01'),
(10, 'US', 'paid', 10.00, '2026-01-02'),
(11, 'EU', 'paid', 50.00, '2026-01-03'),
(12, 'EU', 'failed', 20.00, '2026-01-03'),
(13, 'APAC', 'paid', 40.00, '2026-01-04'),
(13, 'APAC', 'pending', 15.00, '2026-01-05');

SELECT region, COUNT(*) AS order_count
FROM orders
GROUP BY region
ORDER BY region;

Expected output:

 region | order_count
--------+------------
APAC | 2
EU | 2
US | 2
(3 rows)

Explanation:

  • GROUP BY region produces one output row per region.
  • COUNT(*) counts all rows in each region.

Example 2: Multiple Aggregates per Group (Intermediate)

SELECT
region,
COUNT(*) AS orders_total,
SUM(amount) AS amount_total,
AVG(amount) AS avg_amount
FROM orders
WHERE created_at >= '2026-01-01'
GROUP BY region
ORDER BY amount_total DESC;

Expected output:

 region | orders_total | amount_total | avg_amount
--------+-------------+--------------+--------------------
APAC | 2 | 55.00 | 27.5000000000000000
EU | 2 | 70.00 | 35.0000000000000000
US | 2 | 35.00 | 17.5000000000000000
(3 rows)

Explanation:

  • WHERE filters rows before aggregation.
  • SUM and AVG ignore NULLs, but amount is NOT NULL here.

Example 3: Conditional Aggregates with FILTER (Advanced)

PostgreSQL's FILTER lets you compute multiple conditional metrics in one pass.

SELECT
region,
COUNT(*) AS orders_total,
COUNT(*) FILTER (WHERE status = 'paid') AS orders_paid,
SUM(amount) FILTER (WHERE status = 'paid') AS revenue_paid,
SUM(amount) FILTER (WHERE status = 'failed') AS failed_amount
FROM orders
GROUP BY region
ORDER BY region;

Expected output:

 region | orders_total | orders_paid | revenue_paid | failed_amount
--------+-------------+------------+-------------+--------------
APAC | 2 | 1 | 40.00 |
EU | 2 | 1 | 50.00 | 20.00
US | 2 | 2 | 35.00 |
(3 rows)

Explanation:

  • FILTER applies only to the aggregate it is attached to.
  • SUM(...) returns NULL when there are no matching rows (for example, no failed orders in US).

Example 4: Grouping by Time Buckets (Advanced)

SELECT
date_trunc('month', created_at::timestamptz) AS month,
COUNT(*) AS orders
FROM orders
GROUP BY month
ORDER BY month;

Expected output:

 month                  | orders
------------------------+-------
2026-01-01 00:00:00+00 | 6
(1 row)

Explanation:

  • date_trunc() is a common way to group by month/day/hour.
  • Use a stable bucket definition and consistent time zone conventions for production reporting.

Practical Use Cases

1) Revenue by region for dashboards

SELECT region, SUM(amount) AS revenue
FROM orders
WHERE status = 'paid'
GROUP BY region
ORDER BY revenue DESC;

2) Customer-level activity summaries

SELECT customer_id, COUNT(*) AS orders
FROM orders
GROUP BY customer_id;

3) Detect duplicates by grouping

SELECT email, COUNT(*) AS copies
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

4) Operational monitoring (fail rate)

SELECT
region,
COUNT(*) FILTER (WHERE status = 'failed') AS failed,
COUNT(*) AS total
FROM orders
WHERE created_at >= current_date - 7
GROUP BY region;

5) Inventory totals by warehouse

SELECT warehouse_id, SUM(quantity_on_hand) AS stock
FROM inventory
GROUP BY warehouse_id;

Common Mistakes & Troubleshooting

1) Selecting a non-aggregated column that is not grouped

Wrong SQL:

SELECT region, customer_id, COUNT(*)
FROM orders
GROUP BY region;

Typical error:

ERROR:  column "orders.customer_id" must appear in the GROUP BY clause or be used in an aggregate function

Fix:

SELECT region, COUNT(*)
FROM orders
GROUP BY region;

2) Using HAVING for row-level filters

Wrong approach:

SELECT region, COUNT(*)
FROM orders
GROUP BY region
HAVING created_at >= '2026-01-01';

Bad outcome:

  • created_at is a row-level column and this is not the right place for filtering.

Fix:

SELECT region, COUNT(*)
FROM orders
WHERE created_at >= '2026-01-01'
GROUP BY region;

3) Confusing NULL results in SUM/AVG

Bad outcome:

  • Conditional SUM(...) FILTER (...) can return NULL when no rows match.

Fix:

SELECT COALESCE(SUM(amount) FILTER (WHERE status = 'failed'), 0) AS failed_amount
FROM orders;

4) Inflated metrics after joins

Wrong approach:

-- If each order has multiple items, SUM(order.amount) after joining items overcounts.
SELECT SUM(o.amount)
FROM orders o
JOIN order_items i ON i.order_id = o.order_id;

Fix:

  • Aggregate at the correct grain (orders) before joining, or aggregate item-level revenue instead.

Debugging checklist:

  1. Confirm the grouping keys match the business question.
  2. Verify WHERE filters are applied before grouping.
  3. Ensure every selected non-aggregate column appears in GROUP BY.
  4. Check NULL behavior for conditional aggregates.
  5. If joins are involved, confirm the join does not multiply rows unexpectedly.

Best Practices

  • ✅ Filter early with WHERE to reduce work and improve correctness. ❌ Avoid grouping huge raw tables without a time window.
  • ✅ Alias aggregate outputs (SUM(amount) AS revenue). ❌ Avoid unlabeled aggregates in reporting queries.
  • ✅ Use FILTER for conditional metrics in PostgreSQL. ❌ Avoid repeating the same query multiple times with slightly different WHERE clauses.
  • ✅ Validate results with a small sample and sanity checks. ❌ Avoid trusting aggregates without spot-checking inputs.
  • ✅ Watch out for join multiplication when aggregating. ❌ Avoid summing fact columns after a join that multiplies rows.

Hands-On Practice

Use this setup for the exercises:

CREATE TABLE practice_orders (
order_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
region text NOT NULL,
status text NOT NULL,
amount numeric(12,2) NOT NULL
);

INSERT INTO practice_orders (region, status, amount) VALUES
('US', 'paid', 10.00),
('US', 'failed', 5.00),
('EU', 'paid', 20.00),
('EU', 'paid', 30.00);

Exercise 1 (Easy): Orders per region

Task: Return the number of orders per region.

-- Your SQL here

Solution:

SELECT region, COUNT(*) AS orders
FROM practice_orders
GROUP BY region
ORDER BY region;

Exercise 2 (Medium): Paid revenue per region

Task: Return paid revenue per region.

-- Your SQL here

Solution:

SELECT region, SUM(amount) FILTER (WHERE status = 'paid') AS revenue_paid
FROM practice_orders
GROUP BY region
ORDER BY region;

Exercise 3 (Advanced): Paid order rate per region

Task: Return total orders, paid orders, and the paid ratio per region.

-- Your SQL here

Solution:

SELECT
region,
COUNT(*) AS total_orders,
COUNT(*) FILTER (WHERE status = 'paid') AS paid_orders,
(COUNT(*) FILTER (WHERE status = 'paid')::numeric / COUNT(*)) AS paid_ratio
FROM practice_orders
GROUP BY region
ORDER BY region;

Connection to Other Concepts

ConceptWhy it matters
Aggregate functionsGROUP BY is the common way to compute aggregates per key
WHEREfilters rows before grouping
HAVINGfilters groups after aggregation
Joinsgrouping after joins requires grain awareness
ROLLUP / GROUPING SETSadvanced group-by constructs for subtotals
NULL Valuesaffects conditional sums and counts

Visual Learning Diagram

flowchart TD
A[Raw Rows] --> B[WHERE]
B --> C[GROUP BY]
C --> D[Aggregates]
D --> E[HAVING]
D --> F[ORDER BY]
C --> G[Conditional Aggregates (FILTER)]

classDef allNodes fill:#3e3e3e,stroke:#ffffff,stroke-width:2px,color:#f5f5f5
classDef highlight fill:#3e3e3e,stroke:#ffffff,stroke-width:4px,color:#f5f5f5
class A,B,C,D,E,F,G allNodes
class C highlight

Common Pitfalls

PitfallConsequencePrevention
Selecting non-grouped columnsquery errorgroup those columns or aggregate them
Filtering too lateslow and wrong metricsuse WHERE before GROUP BY
Wrong grouping keymisleading KPIsvalidate with sample rows
Conditional SUM returns NULLconfusing dashboardsuse COALESCE where appropriate
Join multiplicationinflated totalsaggregate at correct grain

Quick Reference

SELECT k, COUNT(*) FROM t GROUP BY k;
SELECT k, SUM(v) FROM t WHERE status = 'paid' GROUP BY k;
SELECT a, b, AVG(score) FROM t GROUP BY a, b;
SELECT k, SUM(v) FILTER (WHERE ok) FROM t GROUP BY k;
SELECT k, COUNT(*) FROM t GROUP BY k HAVING COUNT(*) > 10;

What's Next