MOO WebTech
Data Aggregationtheory-practiceintermediate

Aggregate Functions

COUNT, SUM, AVG, MIN, MAX with GROUP BY and HAVING for grouped data analysis.

80 minL29

๐ŸŽฏLearning Objectives

  • Use COUNT(*), COUNT(col), and COUNT(DISTINCT col) and explain the differences between them
  • Apply SUM, AVG, MIN, and MAX to summarise numerical data
  • Explain how aggregate functions treat NULL values
  • Write GROUP BY queries that produce one output row per unique group
  • Filter aggregated results using HAVING
  • Explain the difference between WHERE (filters rows before grouping) and HAVING (filters groups after grouping)
  • Order results by an aggregated value using an alias in ORDER BY

๐Ÿ“–Theory

1. What Are Aggregate Functions?

So far, every query has returned one row per row in the table. Aggregate functions are different โ€” they collapse multiple rows into a single summary value. They are the foundation of analytical queries.

The five core aggregate functions in SQL:

FunctionReturns
COUNT(*)Total number of rows in the group
COUNT(col)Number of non-NULL values in col
SUM(col)Sum of all non-NULL values
AVG(col)Average of all non-NULL values
MIN(col)Smallest non-NULL value
MAX(col)Largest non-NULL value
SQL
-- How many films are in the database?
SELECT COUNT(*) FROM film;

-- What are the cheapest, most expensive, and average rental rates?
SELECT MIN(rental_rate), MAX(rental_rate), AVG(rental_rate)
FROM film;

When used without GROUP BY, aggregate functions collapse the entire table into one row.

2. COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col)

These three forms behave differently when NULL values are present:

SQL
-- COUNT(*): counts every row, including those with NULLs in any column
SELECT COUNT(*) FROM customer;               -- 599 rows total

-- COUNT(col): counts only rows where that column is NOT NULL
SELECT COUNT(email) FROM customer;           -- could be < 599 if some emails are NULL

-- COUNT(DISTINCT col): counts unique non-NULL values
SELECT COUNT(DISTINCT rating) FROM film;     -- 5 (G, PG, PG-13, R, NC-17)

Tip: Use COUNT(*) to count rows. Use COUNT(col) to count how many rows have a value in that specific column. Use COUNT(DISTINCT col) to count unique values. Mixing them up is a common source of subtle errors.

3. NULL Behaviour in Aggregate Functions

All aggregate functions except COUNT(*) silently ignore NULL values. This is usually what you want, but you need to be aware of it.

SQL
-- Suppose 10 films have a NULL rental_duration.
-- AVG ignores those 10 rows when computing the average.
SELECT AVG(rental_duration) FROM film;
-- Result is the average of the 990 non-NULL values, not all 1000.

-- To include NULLs in an average (treating them as 0),
-- use COALESCE to substitute a value before averaging.
SELECT AVG(COALESCE(rental_duration, 0)) FROM film;

Warning: Be careful when the number of NULL rows is significant. A column with 40% NULL values will produce an average that represents only 60% of the data โ€” potentially misleading.

4. GROUP BY โ€” One Row Per Group

GROUP BY divides the result set into groups of rows that share the same values in the specified columns. Each group is then collapsed into a single output row by the aggregate function(s).

SQL
-- How many films belong to each rating category?
SELECT rating, COUNT(*) AS film_count
FROM film
GROUP BY rating;

Result:

Code
rating | film_count
-------+-----------
G      | 178
PG     | 194
PG-13  | 223
R      | 195
NC-17  | 210

Rule: Every column in SELECT must either appear in GROUP BY or be wrapped in an aggregate function. The database cannot know which row's value to display for an ungrouped column.

SQL
-- ERROR: title is not in GROUP BY and not aggregated
SELECT rating, title, COUNT(*)
FROM film
GROUP BY rating;   -- PostgreSQL will raise an error here

5. GROUP BY with Multiple Columns

You can group by more than one column. The database forms a group for every unique combination of the grouped columns.

SQL
-- Count films grouped by both rating AND rental_duration
SELECT rating,
       rental_duration,
       COUNT(*) AS film_count
FROM film
GROUP BY rating, rental_duration
ORDER BY rating, rental_duration;

Each row in the output represents a unique (rating, rental_duration) pair โ€” for example, all G-rated films with a 3-day rental duration form one group.

6. HAVING โ€” Filtering Groups

WHERE filters individual rows before grouping. HAVING filters the computed groups after grouping. You cannot use WHERE to filter by an aggregate result.

SQL
-- Which customers have rented more than 30 times?
-- WRONG: WHERE cannot reference COUNT(*)
SELECT customer_id, COUNT(*) AS rental_count
FROM rental
WHERE COUNT(*) > 30          -- ERROR: aggregate functions not allowed in WHERE
GROUP BY customer_id;

-- CORRECT: use HAVING
SELECT customer_id, COUNT(*) AS rental_count
FROM rental
GROUP BY customer_id
HAVING COUNT(*) > 30
ORDER BY rental_count DESC;

Tip: A useful mental model for the order of operations in a grouped query: FROM โ†’ WHERE โ†’ GROUP BY โ†’ aggregate functions computed โ†’ HAVING โ†’ SELECT list evaluated โ†’ ORDER BY. Understanding this order explains why WHERE cannot reference aggregates but HAVING can.

7. WHERE and HAVING Together

WHERE and HAVING serve different purposes and can appear in the same query:

SQL
-- Among films longer than 60 minutes,
-- which rating categories have a combined rental rate total above 100?
SELECT rating,
       COUNT(*)               AS film_count,
       SUM(rental_rate)       AS total_rate
FROM film
WHERE length > 60            -- filters rows BEFORE grouping
GROUP BY rating
HAVING SUM(rental_rate) > 100  -- filters groups AFTER grouping
ORDER BY total_rate DESC;

WHERE length > 60 removes short films before any grouping happens โ€” those films do not participate in the SUM. Then HAVING SUM(rental_rate) > 100 keeps only the rating categories whose total (computed from the remaining films) exceeds 100.

8. Ordering by Aggregated Values

You can reference aggregate results in ORDER BY using either the function directly or the alias defined in the SELECT list.

SQL
-- Most popular ratings by film count
SELECT rating,
       COUNT(*) AS film_count,
       ROUND(AVG(length), 1) AS avg_length_min
FROM film
GROUP BY rating
ORDER BY film_count DESC;   -- alias works in ORDER BY

You can also sort by multiple columns โ€” aggregate and non-aggregate โ€” in ORDER BY:

SQL
SELECT rating,
       rental_duration,
       COUNT(*) AS film_count
FROM film
GROUP BY rating, rental_duration
ORDER BY film_count DESC, rating ASC;

๐Ÿ’ปCode Examples

Example A โ€” Film Count and Average Length per Rating

SQL
-- Summarise the film catalogue by rating.
-- Show how many films each rating has and the average film length.
-- Include only ratings that have more than 150 films.
-- Sort by film count descending.

SELECT rating,
       COUNT(*)               AS film_count,
       ROUND(AVG(length), 1)  AS avg_length_min,
       MIN(length)            AS shortest_min,
       MAX(length)            AS longest_min
FROM film
GROUP BY rating
HAVING COUNT(*) > 150
ORDER BY film_count DESC;

What to notice:

  • Five aggregate functions on the same grouped data simultaneously.
  • ROUND(AVG(length), 1) rounds to one decimal place for readability.
  • HAVING COUNT(*) > 150 keeps only ratings with more than 150 films.

Example B โ€” Top Rental Customers

SQL
-- Find customers who have rented at least 30 items.
-- Show customer_id and their total rental count.
-- Sort by rental count descending to see the most active customers first.

SELECT customer_id,
       COUNT(*) AS rental_count
FROM rental
GROUP BY customer_id
HAVING COUNT(*) >= 30
ORDER BY rental_count DESC;

What to notice:

  • The source table is rental, not customer โ€” the rental record exists for each transaction.
  • GROUP BY customer_id groups all rows with the same customer ID.
  • HAVING COUNT(*) >= 30 filters out customers with fewer than 30 rentals.
  • The result shows only customer IDs โ€” we will join to the customer table for names in lesson 31.

Example C โ€” Combined WHERE, GROUP BY, and HAVING

SQL
-- Among films rated PG, PG-13, G, R, or NC-17 with rental rate above $0.99,
-- show the total combined rental rate per rating,
-- but only for ratings where that total exceeds $200.
-- Also show the count and the average rate per film.

SELECT rating,
       COUNT(*)                       AS film_count,
       ROUND(AVG(rental_rate), 2)     AS avg_rate,
       ROUND(SUM(rental_rate), 2)     AS total_rate
FROM film
WHERE rental_rate > 0.99               -- filter rows before grouping
GROUP BY rating
HAVING SUM(rental_rate) > 200          -- filter groups after grouping
ORDER BY total_rate DESC;

What to notice:

  • WHERE rental_rate > 0.99 removes all $0.99 films from the dataset before groups are formed.
  • GROUP BY rating then groups the remaining films.
  • HAVING SUM(rental_rate) > 200 keeps only the groups whose total (computed without the $0.99 films) exceeds 200.
  • All three aggregate functions (COUNT, AVG, SUM) operate on the already-filtered data.

โœ๏ธPractice Tasks

Task 1Basic Film Stats
EASY

Write a single query that returns four numbers from the film table:

  • Total number of films
  • Lowest rental rate
  • Highest rental rate
  • Average rental rate (rounded to 2 decimal places)

Label each column with a meaningful alias.

Hint

All four aggregates in one SELECT with no GROUP BY. This collapses the entire table into one row.

SQL
SELECT COUNT(*)                    AS total_films,
       MIN(rental_rate)            AS cheapest,
       MAX(rental_rate)            AS most_expensive,
       ROUND(AVG(rental_rate), 2)  AS avg_rate
FROM film;
Task 2Films per Rating
MEDIUM

Write a query that shows, for each film rating, the count of films and the average film length (rounded to 1 decimal place). Order the results by average length descending. Include all rating categories.

Then modify the query to show only rating categories where the average length is above 110 minutes.

Hint

For the full listing, use GROUP BY rating ORDER BY avg_length DESC. To filter by the average, add HAVING ROUND(AVG(length), 1) > 110 โ€” you cannot use WHERE here because the average is computed after grouping.

SQL
-- Full listing
SELECT rating,
       COUNT(*)               AS film_count,
       ROUND(AVG(length), 1)  AS avg_length
FROM film
GROUP BY rating
ORDER BY avg_length DESC;

-- Only ratings with avg length > 110
SELECT rating,
       COUNT(*)               AS film_count,
       ROUND(AVG(length), 1)  AS avg_length
FROM film
GROUP BY rating
HAVING ROUND(AVG(length), 1) > 110
ORDER BY avg_length DESC;
Task 3High-Volume Renters
HARD

Write a query on the rental table that finds all customers with more than 25 rentals. Show customer_id and a column called rental_count. Sort by rental_count descending.

Then extend the query: add a second condition to HAVING that also requires the customer to have rented before a specific date. Use MAX(rental_date) > '2005-08-01' to keep only customers whose most recent rental was after August 1 2005.

Hint

Two conditions in HAVING are connected with AND. You can use multiple aggregate functions in the same HAVING clause.

SQL
-- Basic version
SELECT customer_id,
       COUNT(*) AS rental_count
FROM rental
GROUP BY customer_id
HAVING COUNT(*) > 25
ORDER BY rental_count DESC;

-- Extended version with date condition
SELECT customer_id,
       COUNT(*) AS rental_count
FROM rental
GROUP BY customer_id
HAVING COUNT(*) > 25
   AND MAX(rental_date) > '2005-08-01'
ORDER BY rental_count DESC;

โš ๏ธCommon Mistakes

Non-aggregated column in SELECT without GROUP BY

SELECT rating, title, COUNT(*) FROM film GROUP BY rating raises an error because title is neither in GROUP BY nor wrapped in an aggregate. The database cannot decide which title to show for the group of all PG films. Either add title to GROUP BY or remove it from SELECT.

Using WHERE to filter aggregate results

WHERE COUNT(*) > 30 is a syntax error โ€” aggregate functions are not allowed in WHERE. Use HAVING COUNT(*) > 30 instead. WHERE filters individual rows before grouping; HAVING filters groups after the aggregates are computed.

COUNT(*) vs COUNT(col) confusion

COUNT(*) counts all rows including those with NULL in any column. COUNT(email) counts only rows where email is not NULL. Using COUNT(col) when you meant COUNT(*) silently produces a smaller number whenever NULL values exist in that column.

Forgetting ROUND on AVG

AVG(rental_rate) returns a high-precision decimal like 2.9800000000000000. Always ROUND() the result to a sensible number of decimal places before displaying it. Two decimal places (ROUND(AVG(rental_rate), 2)) is standard for monetary values.

Referencing a SELECT alias in HAVING

HAVING film_count > 150 โ€” where film_count is a SELECT alias for COUNT(*) โ€” works in PostgreSQL but is not standard SQL. For portability and clarity, prefer HAVING COUNT(*) > 150 directly. In ORDER BY, using aliases is standard and preferred.

Grouping by a column not in SELECT

SELECT COUNT(*) FROM film GROUP BY rating is valid โ€” you can group by a column you do not select. But the result will have one row per rating with no label showing which rating it is, making the output useless. Always include grouped columns in the SELECT list.

๐ŸŽ“Instructor Notes

How to run this lesson (~80 min)

  • [10 min] Motivation. Ask: "We know there are 1 000 films. Without looking them all up, how many films are rated PG? What's the average rental price?" Students will want to use a spreadsheet โ€” perfect. Explain that GROUP BY is SQL's pivot table. Run SELECT COUNT(*) FROM film first to anchor the idea: this is an aggregate over the whole table.
  • [20 min] Live code: sections 1โ€“3. Build up from COUNT(*) through COUNT(col) to COUNT(DISTINCT col). Run all three on customer.email and show the comparison. Then demonstrate AVG, MIN, MAX, SUM on film.rental_rate in one query.
  • [20 min] Live code: sections 4โ€“6. Add GROUP BY rating to the film count query. Show the error when you add title without grouping. Then introduce HAVING โ€” first run the broken version with WHERE COUNT(*) > 150 to show the error, then fix it with HAVING.
  • [10 min] Live code: section 7. Do Example C live. Trace through the order of operations on the board: WHERE removes rows โ†’ GROUP BY forms groups โ†’ aggregates run โ†’ HAVING filters groups โ†’ ORDER BY sorts the final result. This diagram is worth drawing explicitly.
  • [15 min] Task 1 in class. Task 1 is a single-row summary โ€” most students get it in 3โ€“4 minutes. Prompt: "Can you also add the median length? Try percentile_cont(0.5) WITHIN GROUP (ORDER BY length)." (This is a bonus discovery rather than a requirement.)
  • [5 min] Assign homework. Tasks 2 and 3 go home. Remind students: if the error says "must appear in the GROUP BY clause", their SELECT list has a column that is not grouped or aggregated.

Discussion questions

  • "Why can't you use a column alias from SELECT inside a WHERE clause, but you can in ORDER BY? What does this tell you about when each clause is evaluated?"
  • "If a film has rental_duration = NULL, does it count toward COUNT(*)? Does it count toward AVG(rental_duration)? Why is the answer different for each?"
  • "Imagine you want the top customer per store. How would you modify the GROUP BY to achieve that?"