How to solve Median with GROUP BY?

742    Asked by CarolynBuckland in SQL Server , Asked on Jul 12, 2021

Suppose the following table t1:

=================

|  tag  |  val  |       --+ for the sake of simplicity, val is non NULL

=================

|   a1  |  v1   |

|   a1  |  v2   |

|   a1  |  v3   |

|   a1  |  v4   |

|   a1  |  v5   |

|   a2  |  v6   |

|   a2  |  v7   |

|   a2  |  v8   |

|   a2  |  v9   |

|   ... | ...   |

=================

If you execute the script below in MySQL: SELECT `tag`, AVG(`val`) FROM `t1` GROUP BY `tag` You would get the average values grouped by the column tag:

=================

|  tag  | AVG() |

=================

|   a1  | avg1  |

|   a2  | avg2  |

|   a3  | avg3  |

|   a4  | avg4  |

|   ... |  ...  |

=================

Besides AVG(), MySQL has several other built-in functions to calculate aggregate values (e.g. SUM(), MAX(), COUNT(), and STD()) that could be used in the same way as in the script aforementioned. However, there is no built-in function for the median.

This issue has already come up several other times at SE; however, most of them are related to tables without GROUP BY. The only one with GROUP BY seems to be MySql: Count median grouped by day; however, the script seems to be overcomplicated.

Question

What would be an easy and simple way (if possible) to calculate this median?

Follow-up

Excellent article that complements the accepted answer:

http://danielsetzermann.com/howto/how-to-calculate-the-median-per-group-with-mysql/

Answered by Chloe Burgess

This query could answer your question: SQL Server median value and group by

SELECT tag, AVG(val) as median FROM ( SELECT tag, val, (SELECT count(*) FROM median t2 WHERE t2.tag = t3.tag) as ct, seq, (SELECT count(*) FROM median t2 WHERE t2.tag < t3 xss=removed xss=removed xss=removed> 0 and seq-delta = (ct+1)/2) ) T GROUP BY tag ORDER BY tag;
I tried it on this dataset (mainly from here):
+------+------+ | tag | val | +------+------+ | 1 | 3 | | 1 | 13 |
... (see explanation below)
| 3 | 12 | | 3 | 43 | | 3 | 15 | +------+------+
and the result was:
+------+---------+ | tag | median | +------+---------+ | 1 | 23.0000 | | 2 | 22.0000 | | 3 | 15.0000 | +------+---------+

ExplanationInner subqueries will be computed first: sequence is (1)(2)(3)(4).

-- (4) compute the average (of 2 lines or 1 line)
SELECT tag, AVG(Val) as median FROM (
-- (3) get lines to compute the median value
SELECT tag, Val, (SELECT count(*) FROM median t2 -- +number of lines for the current tag value as ct WHERE t2.tag = t3.tag) as ct, seq, (SELECT count(*) FROM median t2 -- +number of lines before the current tag value as delta WHERE t2.tag < t3 xss=removed xss=removed xss=removed> 0 -- +when ct is odd, select the one line in the middle and seq-delta = (floor(ct+1)/2)) ) T
-- (4) continue here
GROUP BY tag ORDER BY tag;
Dataset:
after (1) after (2) processing (3) +------+------+ | tag | val | ct delta seq seq-delta +------+------+ | 1 | 3 | 15 0 1 1 ct : odd ct%2 <> 0 | 1 | 5 | 15 0 2 2 floor((ct+1)/2) : 8 | 1 | 7 | 15 0 3 3 | 1 | 12 | 15 0 4 4 | 1 | 13 | 15 0 5 5 | 1 | 14 | 15 0 6 6 | 1 | 21 | 15 0 7 7 | 1 | 23 | 15 0 8 8 ---> keep this line | 1 | 23 | 15 0 9 9 | 1 | 23 | 15 0 10 10 | 1 | 23 | 15 0 11 11 | 1 | 29 | 15 0 12 12 | 1 | 39 | 15 0 13 13 | 1 | 40 | 15 0 14 14 | 1 | 56 | 15 0 15 15 | 2 | 3 | 14 15 16 1 ct : even (ct%2 = 0 ) | 2 | 5 | 14 15 17 2 floor((ct+1)/2) : 7 | 2 | 7 | 14 15 18 3 floor((ct+1)/2)+1 : 8 | 2 | 12 | 14 15 19 4 | 2 | 13 | 14 15 20 5 | 2 | 14 | 14 15 21 6 | 2 | 21 | 14 15 22 7 ---> keep this line | 2 | 23 | 14 15 23 8 ---> keep this line | 2 | 23 | 14 15 24 9 | 2 | 23 | 14 15 25 10 | 2 | 23 | 14 15 26 11 | 2 | 29 | 14 15 27 12 | 2 | 40 | 14 15 28 13 | 2 | 56 | 14 15 29 14 | 3 | 12 | 3 29 30 1 ct : odd ct%2 <> 0 | 3 | 15 | 3 29 31 2 ---> keep floor((ct+1)/2) : 2 | 3 | 43 | 3 29 32 3 +------+------+

Dataset after (3)

+------+------+------+------+-------+ | tag | val | ct | seq | delta | +------+------+------+------+-------+ | 1 | 23 | 15 | 8 | 0 | | 2 | 21 | 14 | 22 | 15 | | 2 | 23 | 14 | 23 | 15 | | 3 | 15 | 3 | 31 | 29 | +------+------+------+------+-------+
Outer query will compute the avg(val) group by tag value.
Hope this helps.
But what about median computing when there are null values? See EDIT2 below
Alternative: using a function
DELIMITER // CREATE FUNCTION median(pTag int) RETURNS real READS SQL DATA DETERMINISTIC BEGIN DECLARE r real; -- result SELECT AVG(val) INTO r FROM ( SELECT val, (SELECT count(*) FROM median WHERE tag = pTag) as ct, seq FROM (SELECT val, @rownum := @rownum + 1 as seq FROM (SELECT * FROM median WHERE tag = pTag ORDER BY val ) t1 ORDER BY seq ) t3 CROSS JOIN (SELECT @rownum := 0) x HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1) or (ct%2 <> 0 and seq = (ct+1)/2) ) T; return r; END// DELIMITER ;
But the function will be called for each row:
SELECT tag, median(tag) FROM median; -- my test table is 'median' too...
This query will be "better":
select tag, median(tag) from (select distinct tag from median) t;

That's all I can do! Hope it helps!

EDIT2: Things about null values in data (column value in the example) null values show be omitted from the source data using a WHERE clause: WHERE value IS NOT NULL, in both 2 subqueriés that count lines and the subquery that gets data. EDIT3 (LAST EDIT): Change the initialization of the @rownum position

It should be put at the deepest level: so that it is declared the soonest in the execution of the query.
DELIMITER // CREATE FUNCTION median(pTag int) RETURNS real READS SQL DATA DETERMINISTIC BEGIN DECLARE r real; -- result SELECT AVG(val) INTO r FROM ( SELECT val, (SELECT count(*) FROM median WHERE tag = pTag and val is not null) as ct, seq FROM (SELECT val, @rownum := @rownum + 1 as seq FROM (SELECT * FROM median CROSS JOIN (SELECT @rownum := 0) x -- INIT @rownum here WHERE tag = pTag and val is not null ORDER BY val ) t1 ORDER BY seq ) t3 HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1) or (ct%2 <> 0 and seq = (ct+1)/2) ) T; return r; END// DELIMITER ;
That is the same for the query.
Test with 2 data sets more :
| 4 | NULL | | 4 | 10 | | 4 | 15 | | 4 | 20 | | 5 | NULL | | 5 | NULL | | 5 | NULL | +------+------+
39 rows in set (0.00 sec)
+------+--------------+ | tag | median2(tag) | +------+--------------+ | 1 | 23 | | 2 | 22 | | 3 | 15 | | 4 | 15 | | 5 | NULL | +------+--------------+ 5 rows in set (0.08 sec)
Hope this helps you resolve SQL server median!






Your Answer

Interviews

Parent Categories