SELECT DISTINCT

Learning Focus

Use this lesson to understand SELECT DISTINCT with practical syntax and examples.

Concept Overview

SELECT DISTINCT removes duplicate rows from a query result and returns only unique values for the selected columns.

This is useful when your source data contains repeated values (either naturally, or because your query introduces duplicates through joins). DISTINCT helps you answer questions like "Which departments exist?" or "Which customers placed at least one paid order?" without writing a full aggregation.

Why is it important?

Data quality: de-duplicates results so reports and downstream logic are not inflated by repetition
Analysis accuracy: enables correct unique lists and unique counts (COUNT(DISTINCT ...))
Performance: can reduce result size and network transfer (but DISTINCT itself has a cost)
Business intelligence: common for unique customers, unique categories, unique regions

Where does it fit?

SELECT DISTINCT is part of DQL (Data Query Language). It sits between basic SELECT and advanced aggregation (GROUP BY):

If you only need unique values, DISTINCT is often the simplest tool.
If you need metrics per group (counts, sums, averages), use GROUP BY.

PostgreSQL also provides a PostgreSQL-native extension called DISTINCT ON for "first row per key" patterns (covered in the advanced example).

Syntax & Rules

Basic Syntax Structure

SELECT DISTINCT column1, column2, ...
FROM table_name
WHERE condition
ORDER BY column1;

Available Options / Parameters

Option	What it does	Notes
`DISTINCT`	removes duplicate rows from the SELECT list	duplicates are based on the full selected row
`DISTINCT ON (expr, ...)`	keeps the first row per key	PostgreSQL-specific; requires matching ORDER BY prefix
`COUNT(DISTINCT col)`	counts unique non-NULL values	common for metrics
`SUM(DISTINCT col)`	sums unique values	supported, but be careful with semantics

Key Rules and Considerations

DISTINCT applies to the entire selected row. With multiple columns, uniqueness is the combination.
NULL values are treated as equal for DISTINCT purposes (you get one NULL in the output per distinct set).
DISTINCT often requires PostgreSQL to sort or hash data. On large datasets it can be expensive.
For SELECT DISTINCT, PostgreSQL requires ORDER BY expressions to appear in the select list.
For DISTINCT ON, the ORDER BY must start with the same expressions as DISTINCT ON (...).

Step-by-Step Examples

Example 1: Basic Single Column DISTINCT (Beginner)

Create a table with duplicates, then select unique values.

-- Sample table
CREATE TABLE employees (
  employee_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  full_name   text NOT NULL,
  department  text NOT NULL,
  salary      numeric(12,2) NOT NULL CHECK (salary >= 0)
);

-- Sample data (duplicates in department)
INSERT INTO employees (full_name, department, salary) VALUES
('John Smith',   'IT',       75000.00),
('Jane Doe',     'HR',       65000.00),
('Bob Johnson',  'IT',       80000.00),
('Alice Brown',  'Finance',  70000.00),
('Charlie Davis','IT',       75000.00);

-- Query: unique departments
SELECT DISTINCT department
FROM employees
ORDER BY department;

Expected output:

 department
------------
 Finance
 HR
 IT
(3 rows)

Explanation:

DISTINCT department collapses repeated department values into one row per department.
ORDER BY department produces stable, readable output.

Example 2: Multiple Column DISTINCT (Intermediate)

DISTINCT applies to the combination of the selected columns.

-- Query: unique (department, salary) combinations
SELECT DISTINCT department, salary
FROM employees
ORDER BY department, salary;

Expected output:

 department | salary
------------+----------
 Finance    | 70000.00
 HR         | 65000.00
 IT         | 75000.00
 IT         | 80000.00
(4 rows)

Explanation:

IT appears twice because there are two distinct salary values in IT.
This is not "distinct per column"; it is distinct per row combination.

Example 3: DISTINCT with WHERE Clause (Intermediate)

Filter first, then de-duplicate the remaining rows.

-- Query: unique departments where salary is above 70000
SELECT DISTINCT department
FROM employees
WHERE salary > 70000
ORDER BY department;

Expected output:

 department
------------
 IT
(1 row)

Explanation:

WHERE salary > 70000 narrows the input set.
DISTINCT is applied after filtering.

Example 4: Real-World Scenario + PostgreSQL DISTINCT ON (Advanced)

DISTINCT ON solves "latest row per key" problems without a self-join. It is PostgreSQL-specific.

-- Sample orders table (multiple orders per customer)
CREATE TABLE orders (
  order_id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  customer_id bigint NOT NULL,
  status      text NOT NULL CHECK (status IN ('pending', 'paid', 'cancelled')),
  created_at  timestamptz NOT NULL
);

INSERT INTO orders (customer_id, status, created_at) VALUES
(101, 'paid',    '2026-03-01 10:00:00+00'),
(101, 'paid',    '2026-03-03 10:00:00+00'),
(102, 'pending', '2026-03-02 09:00:00+00'),
(102, 'paid',    '2026-03-04 12:00:00+00'),
(103, 'cancelled','2026-03-03 08:30:00+00');

-- Query: latest order per customer
SELECT DISTINCT ON (customer_id)
  customer_id,
  order_id,
  status,
  created_at
FROM orders
ORDER BY customer_id, created_at DESC;

Expected output:

 customer_id | order_id | status    | created_at
-------------+----------+-----------+---------------------------
 101         | 2        | paid      | 2026-03-03 10:00:00+00
 102         | 4        | paid      | 2026-03-04 12:00:00+00
 103         | 5        | cancelled | 2026-03-03 08:30:00+00
(3 rows)

Explanation:

DISTINCT ON (customer_id) groups by customer_id and keeps one row per customer.
ORDER BY customer_id, created_at DESC selects the latest row per customer.
The ORDER BY must start with the DISTINCT ON expressions.

Example 5: Counting Unique Values (Advanced)

Counting distinct values is a common reporting requirement.

-- Query: how many unique customers have at least one paid order?
SELECT COUNT(DISTINCT customer_id) AS paid_customers
FROM orders
WHERE status = 'paid';

Expected output:

 paid_customers
---------------
 2
(1 row)

Explanation:

COUNT(DISTINCT customer_id) counts unique customers.
WHERE status = 'paid' ensures only paid orders contribute.

Practical Use Cases

1. Customer Analysis (E-commerce)

Business problem: Find unique customers who purchased in the last 30 days.

SELECT DISTINCT customer_id
FROM orders
WHERE created_at >= now() - interval '30 days'
  AND status = 'paid'
ORDER BY customer_id;

2. Product Catalog Reporting (Analytics)

Business problem: List all unique categories to build a filter menu.

SELECT DISTINCT category
FROM products
WHERE active = true
ORDER BY category;

3. Multi-Region Ops (SaaS)

Business problem: Identify unique (city, region) combinations with active customers.

SELECT DISTINCT city, region
FROM customers
WHERE deleted_at IS NULL
ORDER BY region, city;

4. Data Quality Auditing (ETL)

Business problem: Detect unexpected email domains.

SELECT DISTINCT split_part(email, '@', 2) AS email_domain
FROM customers
WHERE email IS NOT NULL
ORDER BY email_domain;

5. Latest Event Per Device (IoT)

Business problem: For each device, fetch the newest telemetry record.

SELECT DISTINCT ON (device_id)
  device_id,
  recorded_at,
  temperature_c
FROM sensor_readings
ORDER BY device_id, recorded_at DESC;

Common Mistakes & Troubleshooting

Mistake 1: Misunderstanding multi-column DISTINCT

Wrong assumption:

-- You might think this returns unique names and unique departments separately.
SELECT DISTINCT full_name, department
FROM employees;

Bad outcome: it returns unique combinations, not per-column uniqueness.

Fix:

SELECT DISTINCT department FROM employees;
-- or
SELECT DISTINCT full_name FROM employees;

Mistake 2: Using DISTINCT to "fix" a join bug

Wrong SQL:

-- DISTINCT hides duplication caused by an incorrect join.
SELECT DISTINCT c.customer_id
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
JOIN order_items i ON i.order_id = o.order_id;

Bad outcome: query appears to work, but hides row multiplication and can still be slow.

Fix: validate join keys and aggregate at the right level.

SELECT c.customer_id
FROM customers c
WHERE EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.customer_id
);

Mistake 3: ORDER BY column not in SELECT list (PostgreSQL DISTINCT)

Wrong SQL:

SELECT DISTINCT department
FROM employees
ORDER BY salary DESC;

Error message: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list

Fix: either order by the selected column, or include the ordering expression in the SELECT list.

SELECT DISTINCT department
FROM employees
ORDER BY department;

If you truly need "department ordered by max salary", use GROUP BY:

SELECT department, MAX(salary) AS max_salary
FROM employees
GROUP BY department
ORDER BY max_salary DESC;

Mistake 4: DISTINCT ON ordering does not match

Wrong SQL:

SELECT DISTINCT ON (customer_id)
  customer_id, order_id, created_at
FROM orders
ORDER BY created_at DESC, customer_id;

Error message: ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions

Fix: make the ORDER BY start with the DISTINCT ON expressions.

SELECT DISTINCT ON (customer_id)
  customer_id, order_id, created_at
FROM orders
ORDER BY customer_id, created_at DESC;

Debugging Tips

Verify whether duplicates are in the data or introduced by joins.
Start with SELECT ... LIMIT 10 and remove DISTINCT to see raw rows.
If you are using DISTINCT ON, confirm that ORDER BY starts with the DISTINCT ON key.
Use EXPLAIN if DISTINCT becomes slow on large tables (see module 18).

Best Practices

✅ Always use DISTINCT when you truly need unique values (not to hide query bugs).
✅ Always use ORDER BY with DISTINCT when output order matters.
✅ Always prefer DISTINCT ON for "latest row per key" patterns (with correct ORDER BY).
✅ Always consider indexes for DISTINCT-heavy query patterns (key columns and ordering columns).
✅ Always consider schema constraints (UNIQUE) to prevent duplicates instead of cleaning them in queries.
❌ Avoid using DISTINCT as a band-aid for incorrect joins.
❌ Avoid DISTINCT on very large datasets without checking the execution plan.
❌ Avoid NOT IN patterns with NULL-producing subqueries; prefer EXISTS when excluding sets.

Hands-On Practice

Exercise 1 (Easy)

Task: Return all unique departments from employees.

-- Write your query here

Solution:

SELECT DISTINCT department
FROM employees
ORDER BY department;

Exercise 2 (Medium)

Task: Count unique customers who have at least one paid order.

-- Write your query here

Solution:

SELECT COUNT(DISTINCT customer_id) AS paid_customers
FROM orders
WHERE status = 'paid';

Exercise 3 (Advanced)

Task: Return the latest order per customer using DISTINCT ON.

-- Write your query here

Solution:

SELECT DISTINCT ON (customer_id)
  customer_id, order_id, status, created_at
FROM orders
ORDER BY customer_id, created_at DESC;

Connection to Other Concepts

Related lesson	Why it matters
SELECT Statement	DISTINCT is a modifier on SELECT results
WHERE Clause	Filter first to reduce the workload before de-duplication
GROUP BY	Use GROUP BY when you need metrics per group
Joins	Incorrect joins are a common source of duplicates
Index Strategy	Indexes can reduce DISTINCT cost for common patterns
Window Functions	Alternative approach for "latest per group" using row_number

Visual Learning Diagram

flowchart TD
  A[Basic SELECT] --> B{Need unique rows?}
  B -->|Yes| C[SELECT DISTINCT]
  B -->|No| D[Standard SELECT]
  C --> E[Single column DISTINCT]
  C --> F[Multi-column DISTINCT]
  C --> G[DISTINCT ON (PostgreSQL)]
  F --> H[COUNT(DISTINCT ...)]
  C --> I[Performance check: EXPLAIN]
  I --> J[Index strategy]
  G --> K[Alternative: Window functions]

  classDef allNodes fill:#3e3e3e,stroke:#ffffff,stroke-width:2px,color:#f5f5f5
  classDef highlight fill:#3e3e3e,stroke:#ffffff,stroke-width:4px,color:#f5f5f5

  class A,B,D,E,F,G,H,I,J,K allNodes
  class C highlight

Common Pitfalls

Pitfall	Consequence	Prevention
Using DISTINCT to hide join duplication	Metrics and performance remain wrong	Fix join keys or aggregate at correct grain
Assuming DISTINCT applies per column	Unexpected duplicates remain	Remember: DISTINCT applies to full row combination
DISTINCT on large tables without filters	Slow sorts/hashes	Filter first and add supporting indexes
DISTINCT ON with incorrect ORDER BY prefix	Query fails	Order by key first: `ORDER BY key, ts DESC`
ORDER BY expression not in SELECT list with DISTINCT	Query fails	Order by selected columns or use GROUP BY

Quick Reference

SELECT DISTINCT col FROM t;
SELECT DISTINCT col1, col2 FROM t;
SELECT COUNT(DISTINCT col) FROM t;
SELECT DISTINCT ON (key) key, other FROM t ORDER BY key, ts DESC;
SELECT col, MAX(x) FROM t GROUP BY col; -- alternative to order by non-selected columns

What's Next

Previous: SELECT Statement - Review the previous lesson to reinforce context.
Next: WHERE Clause - Continue to the next concept with incremental complexity.
Module Overview - Return to this module index and choose another related lesson.

Concept Overview​

Why is it important?​

Where does it fit?​

Syntax & Rules​

Basic Syntax Structure​

Available Options / Parameters​

Key Rules and Considerations​

Step-by-Step Examples​

Example 1: Basic Single Column DISTINCT (Beginner)​

Example 2: Multiple Column DISTINCT (Intermediate)​

Example 3: DISTINCT with WHERE Clause (Intermediate)​

Example 4: Real-World Scenario + PostgreSQL DISTINCT ON (Advanced)​

Example 5: Counting Unique Values (Advanced)​

Practical Use Cases​

1. Customer Analysis (E-commerce)​

2. Product Catalog Reporting (Analytics)​

3. Multi-Region Ops (SaaS)​

4. Data Quality Auditing (ETL)​

5. Latest Event Per Device (IoT)​

Common Mistakes & Troubleshooting​

Mistake 1: Misunderstanding multi-column DISTINCT​

Mistake 2: Using DISTINCT to "fix" a join bug​

Mistake 3: ORDER BY column not in SELECT list (PostgreSQL DISTINCT)​

Mistake 4: DISTINCT ON ordering does not match​

Debugging Tips​

Best Practices​

Hands-On Practice​

Exercise 1 (Easy)​

Exercise 2 (Medium)​

Exercise 3 (Advanced)​

Connection to Other Concepts​

Visual Learning Diagram​

Common Pitfalls​

Quick Reference​

What's Next​

Concept Overview

Why is it important?

Where does it fit?

Syntax & Rules

Basic Syntax Structure

Available Options / Parameters

Key Rules and Considerations

Step-by-Step Examples

Example 1: Basic Single Column DISTINCT (Beginner)

Example 2: Multiple Column DISTINCT (Intermediate)

Example 3: DISTINCT with WHERE Clause (Intermediate)

Example 4: Real-World Scenario + PostgreSQL DISTINCT ON (Advanced)

Example 5: Counting Unique Values (Advanced)

Practical Use Cases

1. Customer Analysis (E-commerce)

2. Product Catalog Reporting (Analytics)

3. Multi-Region Ops (SaaS)

4. Data Quality Auditing (ETL)

5. Latest Event Per Device (IoT)

Common Mistakes & Troubleshooting

Mistake 1: Misunderstanding multi-column DISTINCT

Mistake 2: Using DISTINCT to "fix" a join bug

Mistake 3: ORDER BY column not in SELECT list (PostgreSQL DISTINCT)

Mistake 4: DISTINCT ON ordering does not match

Debugging Tips

Best Practices

Hands-On Practice

Exercise 1 (Easy)

Exercise 2 (Medium)

Exercise 3 (Advanced)

Connection to Other Concepts

Visual Learning Diagram

Common Pitfalls

Quick Reference

What's Next