Skip to main content

SELECT DISTINCT

Learning Focus

Use this lesson to understand SELECT DISTINCT with practical syntax and examples.

Concept Overview

SELECT DISTINCT removes duplicate rows from a query result and returns only unique values for the selected columns.

This is useful when your source data contains repeated values (either naturally, or because your query introduces duplicates through joins). DISTINCT helps you answer questions like "Which departments exist?" or "Which customers placed at least one paid order?" without writing a full aggregation.

Why is it important?

  • Data quality: de-duplicates results so reports and downstream logic are not inflated by repetition
  • Analysis accuracy: enables correct unique lists and unique counts (COUNT(DISTINCT ...))
  • Performance: can reduce result size and network transfer (but DISTINCT itself has a cost)
  • Business intelligence: common for unique customers, unique categories, unique regions

Where does it fit?

SELECT DISTINCT is part of DQL (Data Query Language). It sits between basic SELECT and advanced aggregation (GROUP BY):

  • If you only need unique values, DISTINCT is often the simplest tool.
  • If you need metrics per group (counts, sums, averages), use GROUP BY.

PostgreSQL also provides a PostgreSQL-native extension called DISTINCT ON for "first row per key" patterns (covered in the advanced example).


Syntax & Rules

Basic Syntax Structure

SELECT DISTINCT column1, column2, ...
FROM table_name
WHERE condition
ORDER BY column1;

Available Options / Parameters

OptionWhat it doesNotes
DISTINCTremoves duplicate rows from the SELECT listduplicates are based on the full selected row
DISTINCT ON (expr, ...)keeps the first row per keyPostgreSQL-specific; requires matching ORDER BY prefix
COUNT(DISTINCT col)counts unique non-NULL valuescommon for metrics
SUM(DISTINCT col)sums unique valuessupported, but be careful with semantics

Key Rules and Considerations

  • DISTINCT applies to the entire selected row. With multiple columns, uniqueness is the combination.
  • NULL values are treated as equal for DISTINCT purposes (you get one NULL in the output per distinct set).
  • DISTINCT often requires PostgreSQL to sort or hash data. On large datasets it can be expensive.
  • For SELECT DISTINCT, PostgreSQL requires ORDER BY expressions to appear in the select list.
  • For DISTINCT ON, the ORDER BY must start with the same expressions as DISTINCT ON (...).

Step-by-Step Examples

Example 1: Basic Single Column DISTINCT (Beginner)

Create a table with duplicates, then select unique values.

-- Sample table
CREATE TABLE employees (
employee_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
full_name text NOT NULL,
department text NOT NULL,
salary numeric(12,2) NOT NULL CHECK (salary >= 0)
);

-- Sample data (duplicates in department)
INSERT INTO employees (full_name, department, salary) VALUES
('John Smith', 'IT', 75000.00),
('Jane Doe', 'HR', 65000.00),
('Bob Johnson', 'IT', 80000.00),
('Alice Brown', 'Finance', 70000.00),
('Charlie Davis','IT', 75000.00);

-- Query: unique departments
SELECT DISTINCT department
FROM employees
ORDER BY department;

Expected output:

 department
------------
Finance
HR
IT
(3 rows)

Explanation:

  • DISTINCT department collapses repeated department values into one row per department.
  • ORDER BY department produces stable, readable output.

Example 2: Multiple Column DISTINCT (Intermediate)

DISTINCT applies to the combination of the selected columns.

-- Query: unique (department, salary) combinations
SELECT DISTINCT department, salary
FROM employees
ORDER BY department, salary;

Expected output:

 department | salary
------------+----------
Finance | 70000.00
HR | 65000.00
IT | 75000.00
IT | 80000.00
(4 rows)

Explanation:

  • IT appears twice because there are two distinct salary values in IT.
  • This is not "distinct per column"; it is distinct per row combination.

Example 3: DISTINCT with WHERE Clause (Intermediate)

Filter first, then de-duplicate the remaining rows.

-- Query: unique departments where salary is above 70000
SELECT DISTINCT department
FROM employees
WHERE salary > 70000
ORDER BY department;

Expected output:

 department
------------
IT
(1 row)

Explanation:

  • WHERE salary > 70000 narrows the input set.
  • DISTINCT is applied after filtering.

Example 4: Real-World Scenario + PostgreSQL DISTINCT ON (Advanced)

DISTINCT ON solves "latest row per key" problems without a self-join. It is PostgreSQL-specific.

-- Sample orders table (multiple orders per customer)
CREATE TABLE orders (
order_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
customer_id bigint NOT NULL,
status text NOT NULL CHECK (status IN ('pending', 'paid', 'cancelled')),
created_at timestamptz NOT NULL
);

INSERT INTO orders (customer_id, status, created_at) VALUES
(101, 'paid', '2026-03-01 10:00:00+00'),
(101, 'paid', '2026-03-03 10:00:00+00'),
(102, 'pending', '2026-03-02 09:00:00+00'),
(102, 'paid', '2026-03-04 12:00:00+00'),
(103, 'cancelled','2026-03-03 08:30:00+00');

-- Query: latest order per customer
SELECT DISTINCT ON (customer_id)
customer_id,
order_id,
status,
created_at
FROM orders
ORDER BY customer_id, created_at DESC;

Expected output:

 customer_id | order_id | status    | created_at
-------------+----------+-----------+---------------------------
101 | 2 | paid | 2026-03-03 10:00:00+00
102 | 4 | paid | 2026-03-04 12:00:00+00
103 | 5 | cancelled | 2026-03-03 08:30:00+00
(3 rows)

Explanation:

  • DISTINCT ON (customer_id) groups by customer_id and keeps one row per customer.
  • ORDER BY customer_id, created_at DESC selects the latest row per customer.
  • The ORDER BY must start with the DISTINCT ON expressions.

Example 5: Counting Unique Values (Advanced)

Counting distinct values is a common reporting requirement.

-- Query: how many unique customers have at least one paid order?
SELECT COUNT(DISTINCT customer_id) AS paid_customers
FROM orders
WHERE status = 'paid';

Expected output:

 paid_customers
---------------
2
(1 row)

Explanation:

  • COUNT(DISTINCT customer_id) counts unique customers.
  • WHERE status = 'paid' ensures only paid orders contribute.

Practical Use Cases

1. Customer Analysis (E-commerce)

Business problem: Find unique customers who purchased in the last 30 days.

SELECT DISTINCT customer_id
FROM orders
WHERE created_at >= now() - interval '30 days'
AND status = 'paid'
ORDER BY customer_id;

2. Product Catalog Reporting (Analytics)

Business problem: List all unique categories to build a filter menu.

SELECT DISTINCT category
FROM products
WHERE active = true
ORDER BY category;

3. Multi-Region Ops (SaaS)

Business problem: Identify unique (city, region) combinations with active customers.

SELECT DISTINCT city, region
FROM customers
WHERE deleted_at IS NULL
ORDER BY region, city;

4. Data Quality Auditing (ETL)

Business problem: Detect unexpected email domains.

SELECT DISTINCT split_part(email, '@', 2) AS email_domain
FROM customers
WHERE email IS NOT NULL
ORDER BY email_domain;

5. Latest Event Per Device (IoT)

Business problem: For each device, fetch the newest telemetry record.

SELECT DISTINCT ON (device_id)
device_id,
recorded_at,
temperature_c
FROM sensor_readings
ORDER BY device_id, recorded_at DESC;

Common Mistakes & Troubleshooting

Mistake 1: Misunderstanding multi-column DISTINCT

Wrong assumption:

-- You might think this returns unique names and unique departments separately.
SELECT DISTINCT full_name, department
FROM employees;

Bad outcome: it returns unique combinations, not per-column uniqueness.

Fix:

SELECT DISTINCT department FROM employees;
-- or
SELECT DISTINCT full_name FROM employees;

Mistake 2: Using DISTINCT to "fix" a join bug

Wrong SQL:

-- DISTINCT hides duplication caused by an incorrect join.
SELECT DISTINCT c.customer_id
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
JOIN order_items i ON i.order_id = o.order_id;

Bad outcome: query appears to work, but hides row multiplication and can still be slow.

Fix: validate join keys and aggregate at the right level.

SELECT c.customer_id
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
);

Mistake 3: ORDER BY column not in SELECT list (PostgreSQL DISTINCT)

Wrong SQL:

SELECT DISTINCT department
FROM employees
ORDER BY salary DESC;

Error message: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list

Fix: either order by the selected column, or include the ordering expression in the SELECT list.

SELECT DISTINCT department
FROM employees
ORDER BY department;

If you truly need "department ordered by max salary", use GROUP BY:

SELECT department, MAX(salary) AS max_salary
FROM employees
GROUP BY department
ORDER BY max_salary DESC;

Mistake 4: DISTINCT ON ordering does not match

Wrong SQL:

SELECT DISTINCT ON (customer_id)
customer_id, order_id, created_at
FROM orders
ORDER BY created_at DESC, customer_id;

Error message: ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions

Fix: make the ORDER BY start with the DISTINCT ON expressions.

SELECT DISTINCT ON (customer_id)
customer_id, order_id, created_at
FROM orders
ORDER BY customer_id, created_at DESC;

Debugging Tips

  1. Verify whether duplicates are in the data or introduced by joins.
  2. Start with SELECT ... LIMIT 10 and remove DISTINCT to see raw rows.
  3. If you are using DISTINCT ON, confirm that ORDER BY starts with the DISTINCT ON key.
  4. Use EXPLAIN if DISTINCT becomes slow on large tables (see module 18).

Best Practices

✅ Always use DISTINCT when you truly need unique values (not to hide query bugs).
✅ Always use ORDER BY with DISTINCT when output order matters.
✅ Always prefer DISTINCT ON for "latest row per key" patterns (with correct ORDER BY).
✅ Always consider indexes for DISTINCT-heavy query patterns (key columns and ordering columns).
✅ Always consider schema constraints (UNIQUE) to prevent duplicates instead of cleaning them in queries.
❌ Avoid using DISTINCT as a band-aid for incorrect joins.
❌ Avoid DISTINCT on very large datasets without checking the execution plan.
❌ Avoid NOT IN patterns with NULL-producing subqueries; prefer EXISTS when excluding sets.

Hands-On Practice

Exercise 1 (Easy)

Task: Return all unique departments from employees.

-- Write your query here

Solution:

SELECT DISTINCT department
FROM employees
ORDER BY department;

Exercise 2 (Medium)

Task: Count unique customers who have at least one paid order.

-- Write your query here

Solution:

SELECT COUNT(DISTINCT customer_id) AS paid_customers
FROM orders
WHERE status = 'paid';

Exercise 3 (Advanced)

Task: Return the latest order per customer using DISTINCT ON.

-- Write your query here

Solution:

SELECT DISTINCT ON (customer_id)
customer_id, order_id, status, created_at
FROM orders
ORDER BY customer_id, created_at DESC;

Connection to Other Concepts

Related lessonWhy it matters
SELECT StatementDISTINCT is a modifier on SELECT results
WHERE ClauseFilter first to reduce the workload before de-duplication
GROUP BYUse GROUP BY when you need metrics per group
JoinsIncorrect joins are a common source of duplicates
Index StrategyIndexes can reduce DISTINCT cost for common patterns
Window FunctionsAlternative approach for "latest per group" using row_number

Visual Learning Diagram

flowchart TD
A[Basic SELECT] --> B{Need unique rows?}
B -->|Yes| C[SELECT DISTINCT]
B -->|No| D[Standard SELECT]
C --> E[Single column DISTINCT]
C --> F[Multi-column DISTINCT]
C --> G[DISTINCT ON (PostgreSQL)]
F --> H[COUNT(DISTINCT ...)]
C --> I[Performance check: EXPLAIN]
I --> J[Index strategy]
G --> K[Alternative: Window functions]

classDef allNodes fill:#3e3e3e,stroke:#ffffff,stroke-width:2px,color:#f5f5f5
classDef highlight fill:#3e3e3e,stroke:#ffffff,stroke-width:4px,color:#f5f5f5

class A,B,D,E,F,G,H,I,J,K allNodes
class C highlight

Common Pitfalls

PitfallConsequencePrevention
Using DISTINCT to hide join duplicationMetrics and performance remain wrongFix join keys or aggregate at correct grain
Assuming DISTINCT applies per columnUnexpected duplicates remainRemember: DISTINCT applies to full row combination
DISTINCT on large tables without filtersSlow sorts/hashesFilter first and add supporting indexes
DISTINCT ON with incorrect ORDER BY prefixQuery failsOrder by key first: ORDER BY key, ts DESC
ORDER BY expression not in SELECT list with DISTINCTQuery failsOrder by selected columns or use GROUP BY

Quick Reference

SELECT DISTINCT col FROM t;
SELECT DISTINCT col1, col2 FROM t;
SELECT COUNT(DISTINCT col) FROM t;
SELECT DISTINCT ON (key) key, other FROM t ORDER BY key, ts DESC;
SELECT col, MAX(x) FROM t GROUP BY col; -- alternative to order by non-selected columns

What's Next