Aggregate Functions
COUNT, SUM, AVG, MIN, MAX with GROUP BY and HAVING for grouped data analysis.
๐ฏLearning Objectives
- Use
COUNT(*),COUNT(col), andCOUNT(DISTINCT col)and explain the differences between them - Apply
SUM,AVG,MIN, andMAXto summarise numerical data - Explain how aggregate functions treat
NULLvalues - Write
GROUP BYqueries that produce one output row per unique group - Filter aggregated results using
HAVING - Explain the difference between
WHERE(filters rows before grouping) andHAVING(filters groups after grouping) - Order results by an aggregated value using an alias in
ORDER BY
๐Theory
1. What Are Aggregate Functions?
So far, every query has returned one row per row in the table. Aggregate functions are different โ they collapse multiple rows into a single summary value. They are the foundation of analytical queries.
The five core aggregate functions in SQL:
| Function | Returns |
|---|---|
COUNT(*) | Total number of rows in the group |
COUNT(col) | Number of non-NULL values in col |
SUM(col) | Sum of all non-NULL values |
AVG(col) | Average of all non-NULL values |
MIN(col) | Smallest non-NULL value |
MAX(col) | Largest non-NULL value |
SQL
-- How many films are in the database?
SELECT COUNT(*) FROM film;
-- What are the cheapest, most expensive, and average rental rates?
SELECT MIN(rental_rate), MAX(rental_rate), AVG(rental_rate)
FROM film;When used without GROUP BY, aggregate functions collapse the entire table into one row.
2. COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col)
These three forms behave differently when NULL values are present:
SQL
-- COUNT(*): counts every row, including those with NULLs in any column
SELECT COUNT(*) FROM customer; -- 599 rows total
-- COUNT(col): counts only rows where that column is NOT NULL
SELECT COUNT(email) FROM customer; -- could be < 599 if some emails are NULL
-- COUNT(DISTINCT col): counts unique non-NULL values
SELECT COUNT(DISTINCT rating) FROM film; -- 5 (G, PG, PG-13, R, NC-17)Tip: Use COUNT(*) to count rows. Use COUNT(col) to count how many rows have a value in that specific column. Use COUNT(DISTINCT col) to count unique values. Mixing them up is a common source of subtle errors.
3. NULL Behaviour in Aggregate Functions
All aggregate functions except COUNT(*) silently ignore NULL values. This is usually what you want, but you need to be aware of it.
SQL
-- Suppose 10 films have a NULL rental_duration.
-- AVG ignores those 10 rows when computing the average.
SELECT AVG(rental_duration) FROM film;
-- Result is the average of the 990 non-NULL values, not all 1000.
-- To include NULLs in an average (treating them as 0),
-- use COALESCE to substitute a value before averaging.
SELECT AVG(COALESCE(rental_duration, 0)) FROM film;Warning: Be careful when the number of NULL rows is significant. A column with 40% NULL values will produce an average that represents only 60% of the data โ potentially misleading.
4. GROUP BY โ One Row Per Group
GROUP BY divides the result set into groups of rows that share the same values in the specified columns. Each group is then collapsed into a single output row by the aggregate function(s).
SQL
-- How many films belong to each rating category?
SELECT rating, COUNT(*) AS film_count
FROM film
GROUP BY rating;Result:
Code
rating | film_count
-------+-----------
G | 178
PG | 194
PG-13 | 223
R | 195
NC-17 | 210Rule: Every column in SELECT must either appear in GROUP BY or be wrapped in an aggregate function. The database cannot know which row's value to display for an ungrouped column.
SQL
-- ERROR: title is not in GROUP BY and not aggregated
SELECT rating, title, COUNT(*)
FROM film
GROUP BY rating; -- PostgreSQL will raise an error here5. GROUP BY with Multiple Columns
You can group by more than one column. The database forms a group for every unique combination of the grouped columns.
SQL
-- Count films grouped by both rating AND rental_duration
SELECT rating,
rental_duration,
COUNT(*) AS film_count
FROM film
GROUP BY rating, rental_duration
ORDER BY rating, rental_duration;Each row in the output represents a unique (rating, rental_duration) pair โ for example, all G-rated films with a 3-day rental duration form one group.
6. HAVING โ Filtering Groups
WHERE filters individual rows before grouping. HAVING filters the computed groups after grouping. You cannot use WHERE to filter by an aggregate result.
SQL
-- Which customers have rented more than 30 times?
-- WRONG: WHERE cannot reference COUNT(*)
SELECT customer_id, COUNT(*) AS rental_count
FROM rental
WHERE COUNT(*) > 30 -- ERROR: aggregate functions not allowed in WHERE
GROUP BY customer_id;
-- CORRECT: use HAVING
SELECT customer_id, COUNT(*) AS rental_count
FROM rental
GROUP BY customer_id
HAVING COUNT(*) > 30
ORDER BY rental_count DESC;Tip: A useful mental model for the order of operations in a grouped query: FROM โ WHERE โ GROUP BY โ aggregate functions computed โ HAVING โ SELECT list evaluated โ ORDER BY. Understanding this order explains why WHERE cannot reference aggregates but HAVING can.
7. WHERE and HAVING Together
WHERE and HAVING serve different purposes and can appear in the same query:
SQL
-- Among films longer than 60 minutes,
-- which rating categories have a combined rental rate total above 100?
SELECT rating,
COUNT(*) AS film_count,
SUM(rental_rate) AS total_rate
FROM film
WHERE length > 60 -- filters rows BEFORE grouping
GROUP BY rating
HAVING SUM(rental_rate) > 100 -- filters groups AFTER grouping
ORDER BY total_rate DESC;WHERE length > 60 removes short films before any grouping happens โ those films do not participate in the SUM. Then HAVING SUM(rental_rate) > 100 keeps only the rating categories whose total (computed from the remaining films) exceeds 100.
8. Ordering by Aggregated Values
You can reference aggregate results in ORDER BY using either the function directly or the alias defined in the SELECT list.
SQL
-- Most popular ratings by film count
SELECT rating,
COUNT(*) AS film_count,
ROUND(AVG(length), 1) AS avg_length_min
FROM film
GROUP BY rating
ORDER BY film_count DESC; -- alias works in ORDER BYYou can also sort by multiple columns โ aggregate and non-aggregate โ in ORDER BY:
SQL
SELECT rating,
rental_duration,
COUNT(*) AS film_count
FROM film
GROUP BY rating, rental_duration
ORDER BY film_count DESC, rating ASC;๐ปCode Examples
Example A โ Film Count and Average Length per Rating
SQL
-- Summarise the film catalogue by rating.
-- Show how many films each rating has and the average film length.
-- Include only ratings that have more than 150 films.
-- Sort by film count descending.
SELECT rating,
COUNT(*) AS film_count,
ROUND(AVG(length), 1) AS avg_length_min,
MIN(length) AS shortest_min,
MAX(length) AS longest_min
FROM film
GROUP BY rating
HAVING COUNT(*) > 150
ORDER BY film_count DESC;What to notice:
- Five aggregate functions on the same grouped data simultaneously.
ROUND(AVG(length), 1)rounds to one decimal place for readability.HAVING COUNT(*) > 150keeps only ratings with more than 150 films.
Example B โ Top Rental Customers
SQL
-- Find customers who have rented at least 30 items.
-- Show customer_id and their total rental count.
-- Sort by rental count descending to see the most active customers first.
SELECT customer_id,
COUNT(*) AS rental_count
FROM rental
GROUP BY customer_id
HAVING COUNT(*) >= 30
ORDER BY rental_count DESC;What to notice:
- The source table is
rental, notcustomerโ the rental record exists for each transaction. GROUP BY customer_idgroups all rows with the same customer ID.HAVING COUNT(*) >= 30filters out customers with fewer than 30 rentals.- The result shows only customer IDs โ we will join to the
customertable for names in lesson 31.
Example C โ Combined WHERE, GROUP BY, and HAVING
SQL
-- Among films rated PG, PG-13, G, R, or NC-17 with rental rate above $0.99,
-- show the total combined rental rate per rating,
-- but only for ratings where that total exceeds $200.
-- Also show the count and the average rate per film.
SELECT rating,
COUNT(*) AS film_count,
ROUND(AVG(rental_rate), 2) AS avg_rate,
ROUND(SUM(rental_rate), 2) AS total_rate
FROM film
WHERE rental_rate > 0.99 -- filter rows before grouping
GROUP BY rating
HAVING SUM(rental_rate) > 200 -- filter groups after grouping
ORDER BY total_rate DESC;What to notice:
WHERE rental_rate > 0.99removes all $0.99 films from the dataset before groups are formed.GROUP BY ratingthen groups the remaining films.HAVING SUM(rental_rate) > 200keeps only the groups whose total (computed without the $0.99 films) exceeds 200.- All three aggregate functions (
COUNT,AVG,SUM) operate on the already-filtered data.
โ๏ธPractice Tasks
Write a single query that returns four numbers from the film table:
- Total number of films
- Lowest rental rate
- Highest rental rate
- Average rental rate (rounded to 2 decimal places)
Label each column with a meaningful alias.
Hint
All four aggregates in one SELECT with no GROUP BY. This collapses the entire table into one row.
SQL
SELECT COUNT(*) AS total_films,
MIN(rental_rate) AS cheapest,
MAX(rental_rate) AS most_expensive,
ROUND(AVG(rental_rate), 2) AS avg_rate
FROM film;Write a query that shows, for each film rating, the count of films and the average film length (rounded to 1 decimal place). Order the results by average length descending. Include all rating categories.
Then modify the query to show only rating categories where the average length is above 110 minutes.
Hint
For the full listing, use GROUP BY rating ORDER BY avg_length DESC. To filter by the average, add HAVING ROUND(AVG(length), 1) > 110 โ you cannot use WHERE here because the average is computed after grouping.
SQL
-- Full listing
SELECT rating,
COUNT(*) AS film_count,
ROUND(AVG(length), 1) AS avg_length
FROM film
GROUP BY rating
ORDER BY avg_length DESC;
-- Only ratings with avg length > 110
SELECT rating,
COUNT(*) AS film_count,
ROUND(AVG(length), 1) AS avg_length
FROM film
GROUP BY rating
HAVING ROUND(AVG(length), 1) > 110
ORDER BY avg_length DESC;Write a query on the rental table that finds all customers with more than 25 rentals. Show customer_id and a column called rental_count. Sort by rental_count descending.
Then extend the query: add a second condition to HAVING that also requires the customer to have rented before a specific date. Use MAX(rental_date) > '2005-08-01' to keep only customers whose most recent rental was after August 1 2005.
Hint
Two conditions in HAVING are connected with AND. You can use multiple aggregate functions in the same HAVING clause.
SQL
-- Basic version
SELECT customer_id,
COUNT(*) AS rental_count
FROM rental
GROUP BY customer_id
HAVING COUNT(*) > 25
ORDER BY rental_count DESC;
-- Extended version with date condition
SELECT customer_id,
COUNT(*) AS rental_count
FROM rental
GROUP BY customer_id
HAVING COUNT(*) > 25
AND MAX(rental_date) > '2005-08-01'
ORDER BY rental_count DESC;โ ๏ธCommon Mistakes
Non-aggregated column in SELECT without GROUP BY
SELECT rating, title, COUNT(*) FROM film GROUP BY rating raises an error because title is neither in GROUP BY nor wrapped in an aggregate. The database cannot decide which title to show for the group of all PG films. Either add title to GROUP BY or remove it from SELECT.
Using WHERE to filter aggregate results
WHERE COUNT(*) > 30 is a syntax error โ aggregate functions are not allowed in WHERE. Use HAVING COUNT(*) > 30 instead. WHERE filters individual rows before grouping; HAVING filters groups after the aggregates are computed.
COUNT(*) vs COUNT(col) confusion
COUNT(*) counts all rows including those with NULL in any column. COUNT(email) counts only rows where email is not NULL. Using COUNT(col) when you meant COUNT(*) silently produces a smaller number whenever NULL values exist in that column.
Forgetting ROUND on AVG
AVG(rental_rate) returns a high-precision decimal like 2.9800000000000000. Always ROUND() the result to a sensible number of decimal places before displaying it. Two decimal places (ROUND(AVG(rental_rate), 2)) is standard for monetary values.
Referencing a SELECT alias in HAVING
HAVING film_count > 150 โ where film_count is a SELECT alias for COUNT(*) โ works in PostgreSQL but is not standard SQL. For portability and clarity, prefer HAVING COUNT(*) > 150 directly. In ORDER BY, using aliases is standard and preferred.
Grouping by a column not in SELECT
SELECT COUNT(*) FROM film GROUP BY rating is valid โ you can group by a column you do not select. But the result will have one row per rating with no label showing which rating it is, making the output useless. Always include grouped columns in the SELECT list.
๐Instructor Notes
How to run this lesson (~80 min)
- [10 min] Motivation. Ask: "We know there are 1 000 films. Without looking them all up, how many films are rated PG? What's the average rental price?" Students will want to use a spreadsheet โ perfect. Explain that
GROUP BYis SQL's pivot table. RunSELECT COUNT(*) FROM filmfirst to anchor the idea: this is an aggregate over the whole table. - [20 min] Live code: sections 1โ3. Build up from
COUNT(*)throughCOUNT(col)toCOUNT(DISTINCT col). Run all three oncustomer.emailand show the comparison. Then demonstrateAVG,MIN,MAX,SUMonfilm.rental_ratein one query. - [20 min] Live code: sections 4โ6. Add
GROUP BY ratingto the film count query. Show the error when you addtitlewithout grouping. Then introduceHAVINGโ first run the broken version withWHERE COUNT(*) > 150to show the error, then fix it withHAVING. - [10 min] Live code: section 7. Do Example C live. Trace through the order of operations on the board:
WHEREremoves rows โGROUP BYforms groups โ aggregates run โHAVINGfilters groups โORDER BYsorts the final result. This diagram is worth drawing explicitly. - [15 min] Task 1 in class. Task 1 is a single-row summary โ most students get it in 3โ4 minutes. Prompt: "Can you also add the median length? Try
percentile_cont(0.5) WITHIN GROUP (ORDER BY length)." (This is a bonus discovery rather than a requirement.) - [5 min] Assign homework. Tasks 2 and 3 go home. Remind students: if the error says "must appear in the GROUP BY clause", their
SELECTlist has a column that is not grouped or aggregated.
Discussion questions
- "Why can't you use a column alias from
SELECTinside aWHEREclause, but you can inORDER BY? What does this tell you about when each clause is evaluated?" - "If a film has
rental_duration = NULL, does it count towardCOUNT(*)? Does it count towardAVG(rental_duration)? Why is the answer different for each?" - "Imagine you want the top customer per store. How would you modify the GROUP BY to achieve that?"