Having, a less understood SQL clause

Hacker News - Sat Aug 6 01:21

2 min read

Understanding how the group by and having clauses work can help us write more efficient sql. I want to demonstrate this with an example based on a real situation.

Let's assume we have two tables - a country table with a country code and name, and an invoice table with an invoice number, country code, year, month, item count, and amount.

ctry_codectry_name
USUSA
CACanada
CNChina
....

7 rows


invoice_nbrctry_cdyear_nbrmonth_nbritem_cntinvoice_amt
2014001BR2014313162875
2021172CA2021102001299355
2020435CN2020112145654
2016201US201689007125125
2021662US20214100800135
2018743MX201811576124
............

4 million rows

We want to write a sql that aggregates the invoice data

  • for each country and year, for all countries other than USA
  • for each country, for all countries other than USA
  • for each year (including data for USA)
  • a grand total (including data for USA)

This is the output we want from our query:

ctry_nameyear_nbrtot_cnttot_amt
Canada20142001300000
Canada20152201500000
Brazil20142001150000
Brazil20151801000000
Brazil2018150750000
........
Canadanull4202800000
Brazilnull5302900000
........
All countries20149006406325
All countries201510007306368
All countries201812008206334
........
All countriesnull612441261346

Here's one that gives us the expected results. It uses 6 sqls for year-wise totals for each of the 6 countries (other than USA), 6 sqls for country-wise totals for all years, one sql for year-wise totals for all countries, and finally one sql for the grand total. These 14 sqls are then combined with 13 UNION operations as shown below.

1

2select c.ctry_name, i.year_nbr,

3 sum(i.item_cnt) as tot_cnt,

4 sum(i.invoice_amt) as tot_amt

5 from country c

6 inner join invoice i on (i.ctry_code = c.ctry_code)

7 where c.ctry_name = 'Mexico'

8 group by c.ctry_name, i.year_nbr

9UNION

10....

11....

12UNION

13

14select c.ctry_name, null as year_nbr,

15 sum(i.item_cnt) as tot_cnt,

16 sum(i.invoice_amt) as tot_amt

17 from country c

18 inner join invoice i on (i.ctry_code = c.ctry_code)

19 where c.ctry_name = 'Mexico'

20 group by c.ctry_name

21UNION

22....

23....

24UNION

25

26select 'All countries' as ctry_name, i.year_nbr,

27 sum(i.item_cnt) as tot_cnt,

28 sum(i.invoice_amt) as tot_amt

29 from country c

30 inner join invoice i on (i.ctry_code = c.ctry_code)

31 group by i.year_nbr

32UNION

33

34select 'All countries' as ctry_name, null as year_nbr,

35 sum(i.item_cnt) as tot_cnt,

36 sum(i.invoice_amt) as tot_amt

37 from country c

38 inner join invoice i on (i.ctry_code = c.ctry_code)

This query returns the expected results, but apart from being difficult to maintain, it is also slow and inefficient. It fetches data from the invoice table 14 times to perform 14 aggregations followed by an expensive set of UNION operations.

We can write a simpler and more efficient query with a better understanding of the capabilities of group by and having. Just as a where clause allows us to filter raw data, having lets us filter the resultset from a group by operation. The following sql uses group by cube() to generate aggregates for all combinations of country and year while filtering out the summary rows for USA with having.

1select coalesce(c.ctry_name,'All countries') as ctry_name,

2 i.year_nbr,

3 sum(i.item_cnt) as tot_cnt,

4 sum(i.invoice_amt) as tot_amt

5 from country c

6 inner join invoice i on (i.ctry_code = c.ctry_code)

7 group by cube(c.ctry_name, i.year_nbr)

8having coalesce(c.ctry_name,'x') != 'USA'

What is this query doing? It uses cube with group by to get-

  • one row for each country, year combination
  • one row for each country (including data for all years)
  • one row for each year (including data for all countries)
  • one row for all years, all countries

It then excludes the rows where the ctry_name is USA using the having clause. This query fetches data from the invoice table only once and is significantly faster then the initial sql.

Note that country-wise summary rows will have a null year. Similarly, year-wise summary rows will have a null country name, which we convert to All countries using the coalesce function. We also use coalesce in the having clause so we don't drop summary rows where the country name is null.

In this specific example, we needed to exclude data for USA in only two of the four aggregate groups. If the requirement was to exclude data for USA in all the summary rows, we would use the where clause to filter USA data before any aggregation occurs, rather than doing it in the having clause. The following query meets these requirements.

1select coalesce(c.ctry_name,'All countries') as ctry_name,

2 i.year_nbr,

3 sum(i.item_cnt) as tot_cnt,

4 sum(i.invoice_amt) as tot_amt

5 from country c

6 inner join invoice i on (i.ctry_code = c.ctry_code)

7 where c.ctry_name != 'USA'

8 group by cube(c.ctry_name, i.year_nbr)

group by supports extensions like cube, rollup and others that are extremely useful and are worth spending the time to understand better. We will take a look at group by in more detail soon.


The actual sql this example is based on had 24 small queries that were combined with union and had more than two tables joined in each query. The rewritten sql was just 10 lines long compared to over 200 lines for the original code and ran in a few seconds compared to over half an hour for the original query.