
PostgreSQL Aggregate Functions

Complete reference for PostgreSQL aggregate functions covering count, sum, average, statistical aggregates, ordered-set aggregates, and the FILTER clause for conditional aggregation. Every function includes syntax, GROUP BY examples, and anti-patterns. Updated for PostgreSQL 16.

39 functions

What are PostgreSQL Aggregate Functions?

PostgreSQL aggregate functions compute a single result from multiple input rows and are used with GROUP BY to summarise data. PostgreSQL also supports ordered-set aggregates (percentile_cont, mode), hypothetical-set aggregates (rank, dense_rank within a group), and the FILTER clause for conditional aggregation without CASE. Common aggregate functions include count(), sum(), avg(), max(), min(), string_agg(), and array_agg().

array_agg

PG 8.4+ · returns anyarray

Collects all input values, including NULLs, into an array. Supports ORDER BY to control element order.

Developer · Data Eng

Signature

array_agg ( expression [ORDER BY ...] ) → array

Parameters

Parameter | Type | Description
expression | any | Value to collect; NULLs are included in the array

Examples

sql
SELECT array_agg(name ORDER BY name) FROM tags;
{css,html,javascript}
sql
SELECT user_id, array_agg(tag) FROM user_tags GROUP BY user_id;
Array of tags per user
sql
SELECT array_agg(DISTINCT status) FROM orders;
{cancelled,completed,pending}
sql
SELECT author_id, array_agg(title ORDER BY published_at DESC) AS recent_books FROM books GROUP BY author_id;
Books per author sorted newest-first
Anti-Pattern: Using array_agg without ORDER BY and relying on element order

Without an explicit `ORDER BY`, `array_agg` produces elements in an undefined, non-deterministic order that can change between query executions, PostgreSQL versions, or after autovacuum.

✓ Instead: Always specify `array_agg(col ORDER BY col)` or an appropriate sort key when the order of array elements matters downstream.

Use `array_agg` to fetch parent + children in one query: `SELECT u.id, array_agg(t.name) AS tags FROM users u JOIN user_tags t ON t.user_id = u.id GROUP BY u.id`. This avoids N+1 queries.

example
SELECT author_id, array_agg(title ORDER BY published_at) AS books FROM books GROUP BY author_id;
All book titles per author in chronological order

avg

PG 8.0+ · returns numeric for integer/numeric input; double precision for float

Computes the arithmetic mean (average) of all non-NULL input values.

Developer · Data Eng · DBA

Signature

avg ( expression ) → numeric or double precision

Parameters

Parameter | Type | Description
expression | numeric or interval | Value to average; NULLs are ignored

Examples

sql
SELECT avg(score) FROM results;
73.4
sql
SELECT avg(score)::numeric(5,2) FROM results;
73.40
sql
SELECT avg(response_time) FROM requests WHERE endpoint = '/api/v1/search';
Average response time
sql
SELECT department, avg(salary)::numeric(10,2) AS avg_salary FROM employees GROUP BY department;
Average salary per department rounded to 2 decimal places
Anti-Pattern: Averaging integer columns and losing decimal precision

When `avg()` is applied to integer columns, PostgreSQL returns a `numeric` type, but many developers cast back to integer with `::int`, silently truncating fractional values (e.g., an average of 3.7 becomes 3).

✓ Instead: Cast to `numeric(10,2)` or `float8` rather than `int` to preserve meaningful decimal precision: `avg(score)::numeric(10,2)`.

Instead of multiple subqueries, use `FILTER (WHERE ...)` to compute multiple averages in a single pass: `avg(score) FILTER (WHERE grade = 'A')` alongside `avg(score) FILTER (WHERE grade = 'B')`.

example
SELECT avg(score) FILTER (WHERE subject = 'math') AS math_avg, avg(score) FILTER (WHERE subject = 'science') AS sci_avg FROM grades;
Two averages in one query pass

bit_and

PG 8.0+ · returns same as input

Computes the bitwise AND of all non-NULL input values.

Developer · DBA · Security

Signature

bit_and ( expression ) → integer or bigint or bit

Parameters

Parameter | Type | Description
expression | integer, bigint, or bit | Value to bitwise-AND

Examples

sql
SELECT bit_and(permissions) FROM user_roles WHERE user_id = 1;
Intersection of all permission bitmasks
sql
SELECT bit_and(flags) FROM feature_flags;
Flags set in ALL rows
sql
SELECT role_group, bit_and(permission_mask) AS shared_perms FROM role_assignments GROUP BY role_group;
Permissions shared by every role in each group
sql
SELECT bit_and(status_flags) & 4 AS all_have_bit_2 FROM sessions WHERE started_at > now() - interval '1 hour';
Non-zero if every recent session has bit 2 (value 4) set
Anti-Pattern: Checking individual bits with equality instead of bitmask operations

Testing `WHERE permissions = 7` to check if a user has all three permission bits set conflates exact equality with bit intersection, breaking when other bits are also set.

✓ Instead: Use `bit_and(permissions) & required_mask = required_mask` to verify that specific bits are set across all rows.
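A self-contained sketch of that check, using inline `VALUES` in place of a real permissions table:

```sql
-- 7 = 0b0111, 5 = 0b0101, 13 = 0b1101; required_mask 5 means bits 0 and 2
SELECT bit_and(perms) & 5 = 5 AS all_have_required
FROM (VALUES (7), (5), (13)) AS t(perms);
-- bit_and = 0b0101 = 5, so the check returns true
```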

`bit_and` gives the intersection of permission bitmasks — bits that are set in every row. Use it to find the minimum common permissions for a user with multiple roles.

example
SELECT bit_and(permission_mask) AS common_perms FROM user_roles WHERE user_id = $1;
Bitmask of permissions common to all roles
See also: bit_or, bit_xor

bit_or

PG 8.0+ · returns same as input

Computes the bitwise OR of all non-NULL input values.

Developer · DBA · Security

Signature

bit_or ( expression ) → integer or bigint or bit

Parameters

Parameter | Type | Description
expression | integer, bigint, or bit | Value to bitwise-OR

Examples

sql
SELECT bit_or(permissions) FROM user_roles WHERE user_id = 1;
Union of all permission bitmasks
sql
SELECT bit_or(flags) FROM feature_flags;
All flags set in ANY row
sql
SELECT team_id, bit_or(capability_flags) AS team_capabilities FROM team_members GROUP BY team_id;
Union of all capability flags across each team
sql
SELECT bit_or(event_type_mask) AS seen_event_types FROM audit_log WHERE session_id = $1;
Bitmask of all distinct event types seen in the session
Anti-Pattern: Using multiple OR-joined equality checks instead of bitmask union

Writing `WHERE perms = 1 OR perms = 2 OR perms = 4` instead of leveraging bitmask aggregation becomes unmaintainable as the flag set grows and cannot summarize across multiple roles.

✓ Instead: Store capabilities as bitmasks and use `bit_or(permission_mask)` to compute the effective union in a single aggregate scan.

`bit_or` gives the union of permission bitmasks — every bit set in any row. Perfect for computing effective user permissions across multiple roles.

example
SELECT bit_or(permission_mask) AS effective_perms FROM user_roles WHERE user_id = $1;
All permissions the user has via any role

bit_xor

PG 14+ · returns same as input

Returns the bitwise XOR of all non-NULL input values. Added in PostgreSQL 14. Useful for change detection and parity checks.

Developer · DBA

Signature

bit_xor ( expression smallint/integer/bigint/bit ) → same type

Parameters

Parameter | Type | Description
expression | smallint, integer, bigint, or bit | Numeric or bit-string expression to XOR across all rows

Examples

XOR all permission flags for a role

sql
SELECT bit_xor(flags) FROM permissions WHERE role_id = 1;
5

Change detection: any row change flips the fingerprint

sql
SELECT bit_xor(checksum) AS row_fingerprint FROM audit_log WHERE session_id = 42;
1847263

As a window aggregate

sql
SELECT id, bit_xor(value) OVER (PARTITION BY group_id) FROM data;
(rows)

3 XOR 5 XOR 6 = 0 (parity cancels)

sql
SELECT bit_xor(val) FROM (VALUES (3),(5),(6)) t(val);
0

Lightweight batch change detection

sql
SELECT bit_xor(extract(epoch FROM updated_at)::bigint) FROM records WHERE batch_id = 7;
(bigint)
Anti-Pattern: Using XOR fingerprint for ordered-sequence validation

Since XOR is commutative and associative, bit_xor() produces the same result regardless of row order. It cannot detect row reordering or duplicate rows that cancel each other out.

✓ Instead: For order-sensitive checksums use md5(string_agg(col::text, ',' ORDER BY id)) or range checksums.
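A sketch of an order-sensitive checksum over inline data; with a real table, `id` would be the ordering key:

```sql
-- Swapping any two rows changes the aggregated string, and therefore the hash
SELECT md5(string_agg(val, ',' ORDER BY id)) AS ordered_checksum
FROM (VALUES (1, 'a'::text), (2, 'b'), (3, 'c')) AS t(id, val);
```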

bit_xor() over a set of row hashes or checksums gives a lightweight fingerprint: if any row changes, the XOR changes. Faster than md5(string_agg()) for large sets, though order-independent (XOR is commutative).

example
SELECT bit_xor(hashtext(col1 || col2)::bigint) AS fingerprint FROM t;
A single integer that changes if any row changes

bool_and

PG 8.0+ · returns boolean

Returns true if all non-NULL input values are true, otherwise false.

Developer · Data Eng

Signature

bool_and ( expression boolean ) → boolean

Parameters

Parameter | Type | Description
expression | boolean | Boolean expression to evaluate across rows

Examples

sql
SELECT bool_and(is_verified) FROM users;
false (if any user is unverified)
sql
SELECT order_id, bool_and(shipped) AS all_shipped FROM order_items GROUP BY order_id;
true only if every item in the order has shipped
sql
SELECT bool_and(constraint_check) FROM (SELECT col > 0 AS constraint_check FROM measurements) t;
true if every measurement is positive
sql
SELECT project_id, bool_and(passed) AS all_tests_passed FROM test_runs WHERE suite = 'regression' GROUP BY project_id;
true only when every regression test passed
Anti-Pattern: Using COUNT + GROUP BY to check universal conditions instead of bool_and

A common workaround is `HAVING count(*) = count(*) FILTER (WHERE cond)` to verify all rows match a condition. This is hard to read and easy to get wrong.

✓ Instead: Use `bool_and(condition)` for a clear, intent-revealing check that every row satisfies the condition.

`bool_and` is the aggregate equivalent of ALL. Use it to check if every item in a group passes a condition — e.g., all order items shipped, all tests passed.

example
SELECT project_id, bool_and(passed) AS all_tests_passed FROM test_runs GROUP BY project_id;
true only if every test in the project passed

bool_or

PG 8.0+ · returns boolean

Returns true if at least one non-NULL input value is true.

Developer · Data Eng

Signature

bool_or ( expression boolean ) → boolean

Parameters

Parameter | Type | Description
expression | boolean | Boolean expression to evaluate across rows

Examples

sql
SELECT bool_or(is_admin) FROM group_members WHERE group_id = 5;
true if any member is an admin
sql
SELECT user_id, bool_or(failed) AS had_failure FROM login_attempts GROUP BY user_id;
true if the user had any failed login
sql
SELECT bool_or(price < 0) AS has_negative_price FROM products;
Data quality check: true if any product has a negative price
sql
SELECT customer_id, bool_or(amount > 1000) AS has_large_order FROM orders GROUP BY customer_id;
true if customer ever placed an order over $1000
Anti-Pattern: Using a correlated EXISTS subquery per group instead of bool_or

Writing a correlated `EXISTS (SELECT 1 FROM t WHERE t.group_id = outer.id AND condition)` for every group in a larger query causes repeated subplan executions.

✓ Instead: Use `bool_or(condition)` inside a `GROUP BY` query for a single-pass, cleaner equivalent.

When computing per-group boolean checks in a GROUP BY query, `bool_or(condition)` is cleaner and often faster than a correlated EXISTS subquery for each group.

example
SELECT customer_id, bool_or(amount > 1000) AS has_large_order FROM orders GROUP BY customer_id;
true if customer ever placed a large order

corr

PG 8.0+ · returns double precision

Returns the correlation coefficient between two sets of numbers (Pearson's r).

Developer · Data Eng · DBA

Signature

corr ( Y double precision, X double precision ) → double precision

Parameters

Parameter | Type | Description
Y | double precision | Dependent variable values
X | double precision | Independent variable values

Examples

sql
SELECT corr(revenue, ad_spend) FROM campaigns;
0.87 (strong positive correlation)
sql
SELECT corr(score, study_hours) FROM students;
Correlation between study time and scores
sql
SELECT corr(churn_rate, avg_response_time) FROM monthly_metrics;
How strongly latency correlates with churn
sql
SELECT product_id, corr(daily_sales, avg_temp) AS sales_weather_corr FROM daily_product_weather GROUP BY product_id HAVING count(*) > 30;
Correlation between weather and sales per product (minimum sample filter)
Anti-Pattern: Interpreting correlation as causation without sufficient sample size

`corr()` returns a value between -1 and 1 but gives no indication of whether the sample is large enough for the correlation to be statistically significant. A corr of 0.95 from 3 data points is essentially noise.

✓ Instead: Always pair `corr()` with a `HAVING count(*) > N` guard (typically N ≥ 30) and consider computing a p-value or confidence interval via external tooling for critical decisions.
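One way to sanity-check significance without leaving SQL is the t-statistic for Pearson's r, t = r * sqrt(n - 2) / sqrt(1 - r^2); this sketch reuses the hypothetical monthly_metrics columns from the examples (guard against r = ±1 before dividing):

```sql
SELECT corr(churn_rate, avg_response_time) AS r,
       count(*)                            AS n,
       corr(churn_rate, avg_response_time)
         * sqrt(count(*) - 2)
         / sqrt(1 - corr(churn_rate, avg_response_time) ^ 2) AS t_stat
FROM monthly_metrics;
-- Compare t_stat against a t distribution with n - 2 degrees of freedom
```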

Use `corr(y, x)` to measure linear correlation (-1 to 1) without leaving SQL. A value near 1 means strong positive correlation, near -1 strong negative, near 0 no linear relationship.

example
SELECT corr(churn_rate, avg_response_time) AS latency_churn_correlation FROM monthly_metrics;
How strongly latency correlates with churn

count

PG 8.0+ · returns bigint

Counts the number of input rows (with *) or the number of input rows where the expression is not NULL.

Developer · Data Eng

Signatures

count ( * ) → bigint
count ( expression ) → bigint

Parameters

Parameter | Type | Description
expression | any | Value to count; NULLs are excluded. Use * to count all rows.

Examples

sql
SELECT count(*) FROM orders;
1042
sql
SELECT count(email) FROM users;
Counts only rows where email is not NULL
sql
SELECT count(DISTINCT status) FROM orders;
Number of unique statuses
sql
SELECT department, count(*) AS headcount FROM employees GROUP BY department ORDER BY headcount DESC;
Row count per department
Anti-Pattern: Confusing count(*) with count(col) for NULL-counting

A common mistake is using `count(col)` expecting it to count all rows, then being surprised that NULL values are silently skipped. Equally, using `count(*)` to count non-null values in a specific column is incorrect.

✓ Instead: Use `count(*)` for total row count and `count(col)` explicitly when NULL-exclusion is desired. To count NULLs: `count(*) - count(col)`.

`count(*)` counts all rows including NULLs. `count(col)` skips NULLs. Use `count(*) - count(col)` to count NULLs in a column without a CASE expression.

example
SELECT count(*) AS total, count(email) AS with_email, count(*) - count(email) AS missing_email FROM users;
Total rows, non-null emails, and missing email count
See also: sum, avg, max, min

covar_pop

PG 8.0+ · returns double precision

Returns the population covariance of (Y, X) pairs. Measures how much two variables change together across the entire population.

Signature

covar_pop ( Y double precision, X double precision ) → double precision

Parameters

Parameter | Type | Description
Y | double precision | Dependent variable values
X | double precision | Independent variable values

Examples

Positive: sales and ad spend move together

sql
SELECT covar_pop(sales, advertising_spend) FROM campaigns;
1234.56

Negative: higher price → fewer returns

sql
SELECT covar_pop(return_rate, price) FROM products;
-0.023

Covariance and correlation together

sql
SELECT covar_pop(y, x), corr(y, x) FROM data;
5.2 | 0.87

Covariance per product category

sql
SELECT category, covar_pop(revenue, cost) FROM financials GROUP BY category;
(rows)

Manual OLS slope (= regr_slope())

sql
SELECT covar_pop(val_y, val_x) / (stddev_pop(val_x) * stddev_pop(val_x)) AS slope FROM data;
0.73
Anti-Pattern: Interpreting raw covariance magnitude without context

Covariance depends on the scale of both variables — a covariance of 1000 between salary and years_experience says nothing by itself about relationship strength.

✓ Instead: Always pair covariance with corr() to get a scale-free measure, and with regr_slope() to get the actual change per unit.

covar_pop(Y,X) / (stddev_pop(Y) * stddev_pop(X)) equals corr(Y,X). Computing both lets you separate scale (covariance) from direction (correlation) in one query.

example
SELECT covar_pop(y,x), corr(y,x), regr_slope(y,x) FROM data;
Full bivariate summary in one pass
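The covariance/correlation identity can be verified on inline data; both columns below should agree (up to floating-point rounding):

```sql
SELECT covar_pop(y, x) / (stddev_pop(y) * stddev_pop(x)) AS corr_manual,
       corr(y, x)                                        AS corr_builtin
FROM (VALUES (1.0, 2.0), (2.0, 4.0), (3.0, 5.0)) AS t(y, x);
```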

covar_samp

PG 8.0+ · returns double precision

Returns the sample covariance of (Y, X) pairs (Bessel-corrected, divides by N-1). Use when data is a sample from a larger population.

Signature

covar_samp ( Y double precision, X double precision ) → double precision

Parameters

Parameter | Type | Description
Y | double precision | Dependent variable values
X | double precision | Independent variable values

Examples

Sample covariance of salary and experience

sql
SELECT covar_samp(salary, years_exp) FROM employee_sample;
8500.3

Sample vs population covariance differ by N/(N-1)

sql
SELECT covar_samp(y, x), covar_pop(y, x) FROM data;
12.5 | 12.0

Dept-level covariance estimate

sql
SELECT dept, covar_samp(performance, tenure) FROM hr_sample GROUP BY dept;
(rows)

Recent ad performance covariance

sql
SELECT covar_samp(clicks, impressions) FROM ad_sample WHERE date >= current_date - 30;
45678.9

Guard against single-row NULL

sql
SELECT count(*), covar_samp(x,y) FROM t HAVING count(*) >= 2;
(row)
Anti-Pattern: Using covar_samp without checking sample size

covar_samp() returns NULL when N <= 1. With tiny groups (N=2,3), the estimate is very noisy and should not be used for decisions.

✓ Instead: Always check HAVING COUNT(*) >= 30 (or another minimum) before treating covar_samp as meaningful.

A large covar_samp can occur simply because both variables have large scales. Always check corr() alongside it. corr() is scale-free and ranges from -1 to 1.

example
SELECT covar_samp(revenue, units_sold), corr(revenue, units_sold), count(*) FROM sales_sample;
Full picture: raw covariance, correlation coefficient, and sample size

json_agg

PG 9.3+ · returns json

Aggregates values into a JSON array, including NULLs (as JSON null).

Developer · Data Eng

Signature

json_agg ( expression [ORDER BY ...] ) → json

Parameters

Parameter | Type | Description
expression | any | Value to include in the JSON array; NULLs become JSON null

Examples

sql
SELECT json_agg(name ORDER BY name) FROM tags;
["css", "html", "javascript"]
sql
SELECT user_id, json_agg(row_to_json(t)) FROM orders t GROUP BY user_id;
JSON array of order objects per user
sql
SELECT json_agg(json_build_object('id', id, 'name', name) ORDER BY name) FROM products WHERE active;
[{"id":1,"name":"Alpha"}, ...]
sql
SELECT department, json_agg(employee_name) FILTER (WHERE active = true) AS active_staff FROM employees GROUP BY department;
JSON array of active employees per department
Anti-Pattern: Using json_agg when JSONB operators will be needed on the result

`json_agg` returns the `json` type, which cannot be directly queried with operators like `@>`, `?`, or `->>` in a performant way as a stored value.

✓ Instead: Use `jsonb_agg` when the aggregated array will be stored in a JSONB column or queried with JSONB containment and key-existence operators.

Combine `json_agg(row_to_json(t))` to build nested JSON directly: `SELECT u.id, json_agg(row_to_json(o)) AS orders FROM users u JOIN orders o ON o.user_id = u.id GROUP BY u.id`. No application-side assembly needed.

example
SELECT author, json_agg(json_build_object('title', title, 'year', year) ORDER BY year) AS books FROM books GROUP BY author;
Nested JSON with author and their books

json_object_agg

PG 9.3+ · returns json

Aggregates name/value pairs into a JSON object. Duplicate keys are not removed; they appear repeatedly in the output text.

Developer · Data Eng

Signature

json_object_agg ( key "any", value "any" [ORDER BY ...] ) → json

Parameters

Parameter | Type | Description
key | any | Object key (converted to text)
value | any | Object value

Examples

sql
SELECT json_object_agg(key, value) FROM config;
{"host": "localhost", "port": "5432"}
sql
SELECT json_object_agg(month, revenue ORDER BY month) FROM monthly_revenue;
{"1": 5000, "2": 6200, ...}
sql
SELECT json_object_agg(attribute, value) AS metadata FROM product_attrs WHERE product_id = 42;
{"color": "red", "size": "L", "weight": "0.5kg"}
sql
SELECT entity_id, json_object_agg(attr_name, attr_value ORDER BY attr_name) AS props FROM eav_table GROUP BY entity_id;
Alphabetically-keyed JSON property bag per entity
Anti-Pattern: Assuming duplicate keys are deduplicated in the output object

`json_object_agg` does not enforce unique keys. When source rows contain duplicate key values, the resulting JSON object will have repeated keys, which is technically valid JSON but breaks most parsers and libraries.

✓ Instead: Ensure keys are unique before aggregating, or use `jsonb_object_agg` which overwrites duplicate keys with the last value (deterministic with `ORDER BY`).

Entity-Attribute-Value tables can be pivoted into JSON objects with `json_object_agg(attribute, value)`. This is far more flexible than CASE-based pivoting for dynamic attribute sets.

example
SELECT entity_id, json_object_agg(attribute, value) AS props FROM eav_table GROUP BY entity_id;
Each entity's attributes as a JSON object

jsonb_agg

PG 9.5+ · returns jsonb

Aggregates values into a JSONB array. Equivalent to json_agg but returns the binary JSONB type.

Developer · Data Eng

Signature

jsonb_agg ( expression [ORDER BY ...] ) → jsonb

Parameters

Parameter | Type | Description
expression | any | Value to include; NULLs become JSON null

Examples

sql
SELECT jsonb_agg(name ORDER BY name) FROM tags;
["css", "html", "javascript"]
sql
SELECT id, jsonb_agg(detail) FILTER (WHERE detail IS NOT NULL) FROM items GROUP BY id;
JSONB arrays without nulls
sql
SELECT user_id, jsonb_agg(jsonb_build_object('role', role, 'granted_at', granted_at) ORDER BY granted_at) FROM user_roles GROUP BY user_id;
Ordered JSONB array of role objects per user
sql
SELECT project_id, jsonb_agg(DISTINCT tag) AS tags FROM project_tags GROUP BY project_id;
Deduplicated JSONB tag arrays per project
Anti-Pattern: Storing jsonb_agg results without considering deduplication

Unlike sets, `jsonb_agg` preserves duplicates. If you aggregate from a JOIN that produces duplicate rows (e.g., without DISTINCT), the resulting JSONB array will contain repeated entries silently.

✓ Instead: Use `jsonb_agg(DISTINCT col)` or ensure the source rows are deduplicated before aggregation.

Prefer `jsonb_agg` over `json_agg` when the result will be queried with JSONB operators (`->`, `@>`, `?`). JSONB is stored in a parsed binary format, making containment checks much faster.

example
SELECT user_id, jsonb_agg(tag) AS tags FROM user_tags GROUP BY user_id;
JSONB arrays that support @> and ? operators

jsonb_object_agg

PG 9.5+ · returns jsonb

Aggregates name/value pairs into a JSONB object. Like json_object_agg() but returns jsonb (with deduplication of duplicate keys — last writer wins).

Developer

Signature

jsonb_object_agg ( key "any", value "any" ) → jsonb

Parameters

Parameter | Type | Description
key | any | Key for the JSON object; converted to text
value | any | Value; converted to JSON representation

Examples

Collect key-value rows into a JSONB object

sql
SELECT jsonb_object_agg(key, value) FROM settings WHERE user_id = 1;
{"theme": "dark", "lang": "en"}

ORDER BY within aggregation

sql
SELECT jsonb_object_agg(name, score ORDER BY name) FROM scores WHERE user_id = 5;
{"math": 95, "science": 88}

One JSONB object per user

sql
SELECT user_id, jsonb_object_agg(pref_key, pref_val) FROM preferences GROUP BY user_id;
(rows)

Duplicate keys: last value wins in jsonb

sql
SELECT jsonb_object_agg(k, v) FROM (VALUES ('a',1),('a',2)) t(k,v);
{"a": 2}

Build column type map for a table

sql
SELECT jsonb_object_agg(attname, atttypid) FROM pg_attribute WHERE attrelid = 'mytable'::regclass AND attnum > 0;
{"id": 23, "name": 25}
Anti-Pattern: Using jsonb_object_agg when row order must be preserved

JSONB objects are unordered by definition (jsonb stores keys sorted). If key insertion order matters, use json_object_agg() which preserves insertion order in its text representation.

✓ Instead: Use json_object_agg(key, value ORDER BY sort_col) when output key order is significant.
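A quick demonstration on inline rows: with `json_object_agg`, the `ORDER BY` inside the aggregate controls key order in the output text (with `jsonb_object_agg` the keys would come back sorted regardless):

```sql
SELECT json_object_agg(k, v ORDER BY ord) AS obj
FROM (VALUES (2, 'b', 20), (1, 'a', 10)) AS t(ord, k, v);
-- keys appear in ord order: "a" before "b"
```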

The jsonb variant deduplicates keys and stores data in a normalised binary form, making subsequent ->> and @> operators much faster than on json. Choose jsonb_object_agg when the output will be queried.

example
SELECT prefs->>'theme' FROM (SELECT jsonb_object_agg(k,v) AS prefs FROM user_prefs WHERE uid=1) t;
dark

max

PG 8.0+ · returns same as input

Returns the maximum value of all non-NULL input values.

Developer · Data Eng

Signature

max ( expression ) → same as input

Parameters

Parameter | Type | Description
expression | any sortable type | Value to find the maximum of

Examples

sql
SELECT max(amount) FROM orders;
9999.99
sql
SELECT max(created_at) FROM events;
Most recent timestamp
sql
SELECT max(name) FROM products;
Lexicographically last name
sql
SELECT user_id, max(score) AS personal_best FROM game_scores GROUP BY user_id;
Each user's highest score
Anti-Pattern: Using ORDER BY + LIMIT 1 inside a subquery instead of max()

A frequent anti-pattern is `SELECT * FROM t WHERE val = (SELECT val FROM t ORDER BY val DESC LIMIT 1)` for each group, which is verbose and may not use indexes as effectively as a simple aggregate.

✓ Instead: Use `max()` with `GROUP BY` for per-group maximums. To fetch the full row containing the maximum, use `DISTINCT ON` or a window function such as `rank() OVER (...)`.
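A sketch of the `DISTINCT ON` approach against the game_scores table from the examples; `played_at` is a hypothetical extra column included to show that the whole row comes back:

```sql
-- One row per user: the row holding that user's highest score
SELECT DISTINCT ON (user_id) user_id, score, played_at
FROM game_scores
ORDER BY user_id, score DESC;
```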

`max` works on any sortable type: dates, timestamps, text, UUIDs, enums. Use `max(created_at)` to find the most recent event, or `max(name)` for the last alphabetically.

example
SELECT category, max(price) AS highest_price FROM products GROUP BY category;
Highest price per category

min

PG 8.0+ · returns same as input

Returns the minimum value of all non-NULL input values.

Developer · Data Eng

Signature

min ( expression ) → same as input

Parameters

Parameter | Type | Description
expression | any sortable type | Value to find the minimum of

Examples

sql
SELECT min(price) FROM products;
0.99
sql
SELECT min(created_at) FROM orders;
Oldest order timestamp
sql
SELECT category, min(price) AS lowest_price FROM products GROUP BY category;
Lowest price per category
sql
SELECT min(response_ms) AS fastest_response FROM api_logs WHERE endpoint = '/health';
Fastest recorded health-check response
Anti-Pattern: Fetching the full min-value row with a redundant subquery

Using `WHERE price = (SELECT min(price) FROM products)` in a subquery is correct but becomes a repeated scan when done per group. It also returns multiple rows on ties without a tiebreaker.

✓ Instead: For per-group minimum rows, prefer `DISTINCT ON (group_col) ORDER BY group_col, price ASC` or a window function with `RANK()` to handle ties explicitly.

When you need the min or max value for each group, `GROUP BY` with `min()`/`max()` is typically faster than a self-join or a subquery with `ORDER BY ... LIMIT 1`.

example
SELECT user_id, min(created_at) AS first_order, max(created_at) AS last_order FROM orders GROUP BY user_id;
First and last order dates per user

mode

PG 9.4+ · returns same as sort expression

Returns the most frequent value (mode) of the input. If multiple values are equally frequent, the first one in the sort order is returned.

Developer · Data Eng

Signature

mode ( ) WITHIN GROUP ( ORDER BY sort_expression ) → same type as sort expression

Parameters

Parameter | Type | Description
sort_expression | sortable | Values to find the mode of

Examples

sql
SELECT mode() WITHIN GROUP (ORDER BY status) FROM orders;
completed (most common status)
sql
SELECT mode() WITHIN GROUP (ORDER BY country_code) FROM customers;
US (most common country)
sql
SELECT customer_id, mode() WITHIN GROUP (ORDER BY product_id) AS fav_product FROM orders GROUP BY customer_id;
Most frequently ordered product per customer
sql
SELECT mode() WITHIN GROUP (ORDER BY error_code) AS most_common_error FROM application_logs WHERE level = 'ERROR' AND logged_at > now() - interval '24 hours';
Most frequent error code in the last 24 hours
Anti-Pattern: Assuming mode() handles ties by frequency rank

When two or more values share the highest frequency, `mode()` breaks ties by returning the smallest value (per the ORDER BY direction), not by any secondary frequency criterion. This can be surprising.

✓ Instead: If tie-breaking matters for your use case, use `GROUP BY value ORDER BY count(*) DESC, value ASC LIMIT 1` in a subquery, which makes the tie-breaking rule explicit.
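The explicit form of that tie-break, on inline data where 'a' and 'b' are equally frequent:

```sql
SELECT val
FROM (VALUES ('a'::text), ('a'), ('b'), ('b'), ('c')) AS t(val)
GROUP BY val
ORDER BY count(*) DESC, val ASC  -- frequency first, then smallest value
LIMIT 1;
-- returns 'a'
```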

`mode()` returns the most frequent value in the group directly, without needing a nested subquery or `GROUP BY value ORDER BY count(*) DESC LIMIT 1`.

example
SELECT customer_id, mode() WITHIN GROUP (ORDER BY product_id) AS fav_product FROM orders GROUP BY customer_id;
Most purchased product per customer

percentile_cont

PG 9.4+ · returns double precision or interval

Computes a continuous percentile value that interpolates between adjacent values if needed.

Developer · Data Eng · DBA

Signature

percentile_cont ( fraction ) WITHIN GROUP ( ORDER BY sort_expression ) → double precision or interval

Parameters

Parameter | Type | Description
fraction | double precision | Percentile as a fraction between 0 and 1 (e.g., 0.5 = median)
sort_expression | sortable | Values to compute the percentile over

Examples

sql
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY salary) AS median FROM employees;
62000 (varies)
sql
SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY response_ms) AS p95 FROM requests;
p95 response time
sql
SELECT percentile_cont(ARRAY[0.25, 0.5, 0.75]) WITHIN GROUP (ORDER BY value) AS quartiles FROM data;
{Q1, median, Q3}
sql
SELECT department, percentile_cont(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary FROM employees GROUP BY department;
Median salary per department
Anti-Pattern: Using avg() as a proxy for median on skewed distributions

Averages are heavily influenced by outliers. In skewed distributions (e.g., income, response times), the average can be much higher than what a typical value looks like, misleading stakeholders.

✓ Instead: Use `percentile_cont(0.5) WITHIN GROUP (ORDER BY col)` for the true median, and report p95/p99 for tail latency rather than average.
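The gap is easy to see on a small skewed sample (inline values standing in for response times):

```sql
SELECT avg(ms) AS mean,                                           -- 109.2, dragged up by the outlier
       percentile_cont(0.5) WITHIN GROUP (ORDER BY ms) AS median  -- 12, the typical value
FROM (VALUES (10), (11), (12), (13), (500)) AS t(ms);
```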

Pass an array to `percentile_cont` to compute multiple percentiles in a single scan: `percentile_cont(ARRAY[0.5, 0.95, 0.99]) WITHIN GROUP (ORDER BY x)` returns an array with p50, p95, p99.

example
SELECT percentile_cont(ARRAY[0.5, 0.95, 0.99]) WITHIN GROUP (ORDER BY duration_ms) AS percentiles FROM api_calls;
[median, p95, p99] in one pass

percentile_disc

PG 9.4+ · returns same as sort expression

Computes a discrete percentile — returns the first value whose cumulative distribution is >= fraction, always an actual input value.

Developer · Data Eng · DBA

Signature

percentile_disc ( fraction ) WITHIN GROUP ( ORDER BY sort_expression ) → same type as sort expression

Parameters

Parameter | Type | Description
fraction | double precision | Percentile as a fraction 0–1
sort_expression | sortable | Values to rank

Examples

sql
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY salary) AS median FROM employees;
60000 (actual value in dataset)
sql
SELECT percentile_disc(1.0) WITHIN GROUP (ORDER BY score) AS max_score FROM results;
Maximum score
sql
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY status_code) AS median_status FROM requests;
Actual status code at the 50th percentile
sql
SELECT region, percentile_disc(0.75) WITHIN GROUP (ORDER BY order_amount) AS p75_order FROM orders GROUP BY region;
75th percentile order value per region, always an actual order amount
Anti-Pattern: Using percentile_disc expecting an interpolated result for continuous data

`percentile_disc` always returns an actual row value and never interpolates. For continuous numeric data (e.g., response times), this can give a noticeably different result from the true mathematical percentile.

✓ Instead: Use `percentile_cont` for continuous numeric or interval data where interpolation between values is statistically correct.
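The difference is visible with an even number of inline values, where the mathematical median falls between two rows:

```sql
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY v) AS disc_median,  -- 2: an actual input value
       percentile_cont(0.5) WITHIN GROUP (ORDER BY v) AS cont_median   -- 2.5: interpolated
FROM (VALUES (1), (2), (3), (4)) AS t(v);
```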

Use `percentile_disc` when the result must be an actual data value — e.g., finding the median product category rank or the typical text response. `percentile_cont` may interpolate between values, `percentile_disc` always picks a real row.

example
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY status_code) AS median_status FROM requests;
Actual status code at the median

range_agg

PG 14+ · returns anymultirange

Computes the union of all non-NULL range values, returning a multirange.

Developer · Data Eng

Signature

range_agg ( expression anyrange ) → anymultirange

Parameters

ParameterTypeDescription
expressionanyrangeRange values to union together

Examples

sql
SELECT range_agg(during) FROM reservations WHERE room_id = 1;
{[2025-01-01,2025-01-05),[2025-01-10,2025-01-15)}
sql
SELECT room_id, range_agg(during) AS occupied FROM reservations GROUP BY room_id;
Occupied periods per room
sql
SELECT range_agg(valid_during) AS total_coverage FROM subscription_periods WHERE user_id = $1;
All subscription periods merged into a multirange
sql
SELECT tsmultirange(tsrange('2025-01-01', '2025-02-01')) - range_agg(booked_at) AS available_slots FROM bookings WHERE resource_id = 42;
Free time slots in January as a multirange
Anti-PatternManually merging overlapping ranges in application code instead of using range_agg

A common pattern before PG 14 was to fetch all ranges into the application layer and merge overlaps using custom loop logic. This is slow for large datasets and moves work out of the database.

✓ Instead: Use `range_agg` (PG 14+) to compute the union of all ranges server-side in a single aggregate pass, then subtract from the full interval to find gaps.

Use `range_agg` to merge overlapping bookings/reservations into a single multirange, then subtract from the full range to find available slots — all in SQL.

example
SELECT int8multirange(int8range(0, 86400)) - range_agg(during) AS free_slots FROM bookings WHERE day = current_date;
Free time slots as a multirange
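
The merged multirange can also be expanded back into one row per gap. A sketch, assuming a hypothetical `reservations` table with a `daterange` column `during`:

```sql
-- Subtract the union of reservations from the full month, then unnest
-- the resulting multirange into individual free periods (PG 14+).
SELECT unnest(
         datemultirange(daterange('2025-01-01', '2025-02-01'))
         - range_agg(during)
       ) AS free_gap
FROM reservations
WHERE room_id = 1;
```

Note that both operands of `-` must be multiranges, hence the `datemultirange(...)` constructor around the full-month range.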

regr_avgx

PG 8.0+double precision

Returns the average of the independent variable (X) over non-null (Y, X) pairs. Equivalent to AVG(X) FILTER (WHERE Y IS NOT NULL AND X IS NOT NULL).

Signature

regr_avgx ( Y double precision, X double precision ) → double precision

Parameters

ParameterTypeDescription
Ydouble precisionDependent variable (used only to filter NULL pairs)
Xdouble precisionIndependent variable whose average is returned

Examples

Average ad spend over complete pairs

sql
SELECT regr_avgx(revenue, ad_spend) FROM campaigns;
12500.0

Full bivariate summary

sql
SELECT regr_avgx(y, x), regr_avgy(y, x), regr_count(y, x) FROM data;
5.0 | 8.2 | 100

Mean experience per department (complete cases only)

sql
SELECT dept, regr_avgx(salary, years_exp) FROM employees GROUP BY dept;
(rows)

Manual check of regression line through means

sql
SELECT regr_intercept(y,x) - regr_slope(y,x)*regr_avgx(y,x) AS verify_intercept FROM data;
(close to regr_avgy(y,x))

Average CTR for complete click/CTR pairs

sql
SELECT regr_avgx(clicks, ctr) FROM ads WHERE date = current_date;
0.045
Anti-PatternComputing averages separately and forgetting NULL-pair filtering

AVG(X) alone includes X values where Y is NULL. regr_avgx(Y,X) only averages X where both Y and X are non-null, keeping the regression dataset consistent.

✓ Instead: Use regr_avgx(Y,X) instead of AVG(X) FILTER (WHERE X IS NOT NULL AND Y IS NOT NULL) for cleaner code.

Run regr_slope, regr_intercept, regr_r2, regr_count, regr_avgx, regr_avgy in a single SELECT to get a full OLS regression summary without external tools.

example
SELECT regr_slope(y,x), regr_intercept(y,x), regr_r2(y,x), regr_count(y,x), regr_avgx(y,x), regr_avgy(y,x) FROM data;
Complete linear regression summary in one query
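
The FILTER equivalence stated in the description can be checked directly; both columns below should match (sketch against a hypothetical `data` table):

```sql
-- avg() already skips NULL x values, so filtering on y alone is enough
-- to reproduce regr_avgx's complete-pair semantics.
SELECT regr_avgx(y, x) AS regr_version,
       avg(x) FILTER (WHERE y IS NOT NULL) AS filter_version
FROM data;
```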

regr_avgy

PG 8.0+double precision

Returns the average of the dependent variable (Y) over non-null (Y, X) pairs. The regression line always passes through (regr_avgx, regr_avgy).

Signature

regr_avgy ( Y double precision, X double precision ) → double precision

Parameters

ParameterTypeDescription
Ydouble precisionDependent variable whose average is returned
Xdouble precisionIndependent variable (used only to filter NULL pairs)

Examples

Average revenue over complete pairs

sql
SELECT regr_avgy(revenue, ad_spend) FROM campaigns;
85000.0

Mean point and slope together

sql
SELECT regr_avgy(y,x), regr_avgx(y,x), regr_slope(y,x) FROM data;
8.2 | 5.0 | 1.35

Mean salary for complete-case pairs

sql
SELECT regr_avgy(salary, years_exp) FROM employees WHERE dept = 3;
72500.0

Regression line passes through mean point

sql
SELECT abs(regr_avgy(y,x) - (regr_slope(y,x)*regr_avgx(y,x) + regr_intercept(y,x))) < 0.001 FROM data;
true

Mean margin per product

sql
SELECT product, regr_avgy(margin, volume) FROM sales GROUP BY product;
(rows)
Anti-PatternForgetting that regr_avgy filters NULL pairs

regr_avgy(Y,X) excludes rows where either Y or X is NULL, so it may differ from AVG(Y). This ensures consistency with the regression dataset.

✓ Instead: If you need the unconditional mean of Y, use AVG(Y). Use regr_avgy(Y,X) only when you want the mean restricted to the complete-case regression dataset.

A fundamental property of OLS: the regression line passes through (regr_avgx, regr_avgy). Use this as a sanity check: regr_avgy ≈ regr_slope*regr_avgx + regr_intercept.

example
SELECT regr_avgy(y,x) AS mean_y, regr_slope(y,x)*regr_avgx(y,x) + regr_intercept(y,x) AS predicted_mean_y FROM data;
Both values should be equal (within floating-point precision)

regr_count

PG 8.0+bigint

Returns the number of non-null (Y, X) input pairs. Useful to check regression sample size before interpreting regr_slope, regr_r2 etc.

Signature

regr_count ( Y double precision, X double precision ) → bigint

Parameters

ParameterTypeDescription
Ydouble precisionDependent variable
Xdouble precisionIndependent variable

Examples

Complete pairs used in regression

sql
SELECT regr_count(revenue, ad_spend) FROM campaigns;
142

Total rows vs complete-case count

sql
SELECT count(*), regr_count(y, x) FROM data;
200 | 142

Minimum sample size check

sql
SELECT regr_count(y,x) >= 30 AS sufficient_sample FROM data;
true

Filter groups with too few complete pairs

sql
SELECT dept, regr_count(salary, experience), regr_slope(salary, experience) FROM employees GROUP BY dept HAVING regr_count(salary, experience) >= 10;
(rows)

Report N alongside correlation

sql
SELECT regr_count(y,x), corr(y,x) FROM t;
50 | 0.82
Anti-PatternUsing COUNT(*) to check regression sample size

COUNT(*) counts all rows including those with NULL Y or X. The actual regression uses only complete pairs — regr_count() gives the true N.

✓ Instead: Use regr_count(Y,X) for the actual regression sample size instead of COUNT(*) or COUNT(X) separately.

A high regr_r2 with a tiny regr_count is meaningless. Always include regr_count() in regression output to show the sample size supporting the coefficients.

example
SELECT regr_slope(y,x), regr_r2(y,x), regr_count(y,x) AS n FROM data GROUP BY segment;
Slope, R², and N for each segment

regr_intercept

PG 8.0+double precision

Returns the Y-intercept of the least-squares-fit linear equation determined by (X, Y) pairs. The intercept is the predicted Y when X = 0.

Signature

regr_intercept ( Y double precision, X double precision ) → double precision

Parameters

ParameterTypeDescription
Ydouble precisionDependent variable
Xdouble precisionIndependent variable

Examples

Baseline revenue when ad_spend = 0

sql
SELECT regr_intercept(revenue, ad_spend) FROM campaigns;
15000.0

Full linear model parameters

sql
SELECT regr_slope(y,x) AS slope, regr_intercept(y,x) AS intercept FROM data;
1.35 | 4.2

Predict Y at X=500

sql
SELECT regr_slope(y,x) * 500 + regr_intercept(y,x) AS predicted_at_500 FROM data;
679.2

Base salary per dept (0 years experience)

sql
SELECT dept, regr_intercept(salary, years_exp) FROM employees GROUP BY dept;
(rows)

Verify: intercept = avgy - slope*avgx

sql
SELECT abs(regr_intercept(y,x) - (regr_avgy(y,x) - regr_slope(y,x)*regr_avgx(y,x))) < 0.001 FROM t;
true
Anti-PatternExtrapolating far outside the training range

The linear model from regr_intercept/regr_slope is only valid near the range of the training data. Predicting at X=0 when all training X values are in the thousands may give nonsensical intercepts.

✓ Instead: Centre and scale X before regression, or use the model only for interpolation within the data range.

regr_slope() and regr_intercept() let you do simple OLS prediction directly in SQL. Compute the model once in a CTE, then apply it to new X values in the main query.

example
WITH model AS (SELECT regr_slope(y,x) AS m, regr_intercept(y,x) AS b FROM training) SELECT x, m*x+b AS predicted FROM model, new_data;
Predictions for new X values using the trained model
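
Centring X, as suggested above, is a one-line change. A sketch against a hypothetical `data` table:

```sql
-- Subtracting avg(x) leaves the slope unchanged but moves the intercept
-- to the predicted Y at the mean of X, a point inside the data range.
SELECT regr_slope(y, x - m.ax)      AS slope,
       regr_intercept(y, x - m.ax)  AS y_at_mean_x
FROM data
CROSS JOIN (SELECT avg(x) AS ax FROM data) m;
```

The centred intercept is far easier to interpret than "predicted Y at X = 0" when zero lies outside the observed X range.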

regr_r2

PG 8.0+double precision

Returns the square of the correlation coefficient (R²) for the linear regression of Y on X. Indicates what fraction of Y variance is explained by X (0=none, 1=perfect).

Signature

regr_r2 ( Y double precision, X double precision ) → double precision

Parameters

ParameterTypeDescription
Ydouble precisionDependent variable
Xdouble precisionIndependent variable

Examples

76% of revenue variance explained by ad spend

sql
SELECT regr_r2(revenue, ad_spend) FROM campaigns;
0.76

Rounded R² for reporting

sql
SELECT round(regr_r2(y,x)::numeric, 4) AS r_squared FROM data;
0.8312

corr² equals regr_r2 for simple linear regression

sql
SELECT corr(y,x)^2 AS corr_squared, regr_r2(y,x) FROM data;
0.69 | 0.69

Feature importance ranking by R²

sql
SELECT feature, regr_r2(target, feature_val) FROM feature_data GROUP BY feature ORDER BY 2 DESC;
(rows)

Flag models with poor explanatory power

sql
SELECT regr_r2(y,x) < 0.3 AS weak_fit FROM data;
false
Anti-PatternTreating high R² as proof of causation

R² measures linear fit quality, not causation. A high R² can come from confounders, overfitting a small dataset, or pure coincidence.

✓ Instead: Report R² alongside regr_count() (sample size) and corr() (directionality). Always ask whether the relationship makes domain sense.

Run regr_r2(target, feature) for each candidate feature column to quickly rank predictors by univariate explanatory power. A fast feature selection step before more expensive multivariate analysis.

example
SELECT unnest(ARRAY['age','income','score']) AS feat, unnest(ARRAY[regr_r2(y,age), regr_r2(y,income), regr_r2(y,score)]) AS r2 FROM t ORDER BY 2 DESC;
Features ranked by R² with the target variable

regr_slope

PG 8.0+double precision

Returns the slope of the least-squares fit linear equation for the data points (Y, X).

DeveloperData EngDBA

Signature

regr_slope ( Y double precision, X double precision ) → double precision

Parameters

ParameterTypeDescription
Ydouble precisionDependent variable
Xdouble precisionIndependent variable

Examples

sql
SELECT regr_slope(revenue, ad_spend) FROM campaigns;
3.24 (revenue per dollar of ad spend)
sql
SELECT regr_slope(y, x), regr_intercept(y, x) FROM data_points;
Slope and intercept of best-fit line
sql
SELECT regr_slope(revenue, month_num) AS monthly_growth, regr_intercept(revenue, month_num) AS base FROM monthly_revenue;
Linear growth rate and base value
sql
SELECT service, regr_slope(error_count, request_count) AS error_rate_trend FROM daily_service_metrics GROUP BY service HAVING count(*) >= 14;
Error rate trend slope per service over observed days
Anti-PatternApplying linear regression to non-linear relationships

`regr_slope` fits a straight line. If the relationship between Y and X is exponential, logarithmic, or cyclical, the slope will be misleading and forecasts will be inaccurate.

✓ Instead: Always plot or inspect the data first. For exponential growth, apply `ln()` to the Y values before regression. Use `regr_r2` to validate how well the linear model actually fits.

Compute trend line coefficients directly in SQL: `regr_slope(y, x) * new_x + regr_intercept(y, x)` gives a forecasted value. No need for Python/R for simple linear projections.

example
SELECT regr_slope(revenue, month_num) AS monthly_growth, regr_intercept(revenue, month_num) AS base FROM monthly_revenue;
Linear growth rate and base value
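
For exponential growth, the log-transform mentioned in the anti-pattern looks like this (sketch; assumes strictly positive revenue):

```sql
-- Regress ln(y) on x: the slope is then a continuous growth rate and
-- exp(intercept) the fitted base level.
SELECT regr_slope(ln(revenue), month_num)          AS log_growth_rate,
       exp(regr_intercept(ln(revenue), month_num)) AS fitted_base
FROM monthly_revenue
WHERE revenue > 0;
```

The `WHERE revenue > 0` guard is required because `ln()` raises an error on zero or negative inputs.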

regr_sxx

PG 8.0+double precision

Returns SUM((X - avgX)²), the sum of squared deviations of X from its mean. Used internally for regression calculations; also useful for manual OLS verification.

Signature

regr_sxx ( Y double precision, X double precision ) → double precision

Parameters

ParameterTypeDescription
Ydouble precisionDependent variable (used to filter NULL pairs)
Xdouble precisionIndependent variable whose sum of squares is computed

Examples

Sum of squared deviations of X

sql
SELECT regr_sxx(y, x) FROM data;
450.0

Verify: slope = Sxy/Sxx

sql
SELECT regr_sxy(y,x) / regr_sxx(y,x) AS manual_slope, regr_slope(y,x) FROM data;
1.35 | 1.35

Derive sample variance of X

sql
SELECT regr_sxx(y,x), regr_count(y,x), regr_sxx(y,x)/(regr_count(y,x)-1) AS sample_varX FROM data;
450 | 50 | 9.18

X dispersion per department

sql
SELECT dept, regr_sxx(salary, experience) FROM employees GROUP BY dept;
(rows)

Sxx and Syy measure different dispersions

sql
SELECT regr_sxx(y,x) = regr_syy(y,x) FROM data;
false (X and Y dispersions generally differ)
Anti-PatternManually computing SUM((x-avg(x))^2) in application code

Computing sum of squares manually with two database round-trips (first for avg, then for sum) is slower and less numerically stable than regr_sxx() which uses a single-pass online algorithm.

✓ Instead: Use regr_sxx(y,x) directly — single-pass, numerically stable, and eliminates the need for a correlated subquery.

The standard error of the regression slope = sqrt(MSE / regr_sxx(y,x)), where MSE = (regr_syy(y,x) - regr_sxy(y,x)²/regr_sxx(y,x)) / (N-2). Useful for confidence intervals without external tools.

example
SELECT sqrt((regr_syy(y,x) - regr_sxy(y,x)^2/regr_sxx(y,x)) / ((regr_count(y,x)-2) * regr_sxx(y,x))) AS slope_stderr FROM data;
Standard error of the OLS slope estimate

regr_sxy

PG 8.0+double precision

Returns SUM((X - avgX) * (Y - avgY)), the sum of cross-deviations (cross-product). This is the numerator of the OLS slope formula.

Signature

regr_sxy ( Y double precision, X double precision ) → double precision

Parameters

ParameterTypeDescription
Ydouble precisionDependent variable
Xdouble precisionIndependent variable

Examples

Sum of cross-deviations

sql
SELECT regr_sxy(y, x) FROM data;
607.5

Manual slope = Sxy / Sxx

sql
SELECT regr_sxy(y,x) / regr_sxx(y,x) AS slope_check, regr_slope(y,x) FROM data;
1.35 | 1.35

Manual correlation = Sxy / sqrt(Sxx*Syy)

sql
SELECT regr_sxy(y,x) / sqrt(regr_sxx(y,x) * regr_syy(y,x)) AS corr_check, corr(y,x) FROM data;
0.82 | 0.82

Positive: Y and X move in same direction

sql
SELECT sign(regr_sxy(y,x)) AS direction FROM data;
1

Products with strongest revenue-cost co-movement

sql
SELECT product, regr_sxy(revenue, cost) FROM financials GROUP BY product ORDER BY 2 DESC;
(rows)
Anti-PatternTreating regr_sxy as a direct measure of correlation strength

regr_sxy is not bounded; its magnitude depends on sample size and the scale of both variables. A large regr_sxy does not mean a strong correlation.

✓ Instead: Use corr(Y,X) for a scale-free [-1, 1] measure of linear association strength.

regr_slope = Sxy/Sxx and corr = Sxy/sqrt(Sxx*Syy). If your computed slope or correlation does not match these formulas, there is a data or NULL-handling issue.

example
SELECT regr_sxy(y,x)/regr_sxx(y,x) AS manual_slope, regr_slope(y,x), regr_sxy(y,x)/sqrt(regr_sxx(y,x)*regr_syy(y,x)) AS manual_corr, corr(y,x) FROM data;
Manual and built-in values should match exactly

regr_syy

PG 8.0+double precision

Returns SUM((Y - avgY)²), the sum of squared deviations of Y from its mean. Represents total variance of Y in the regression dataset (used for R² computation).

Signature

regr_syy ( Y double precision, X double precision ) → double precision

Parameters

ParameterTypeDescription
Ydouble precisionDependent variable whose sum of squares is computed
Xdouble precisionIndependent variable (used to filter NULL pairs)

Examples

Total sum of squares for Y

sql
SELECT regr_syy(y, x) FROM data;
891.25

Fraction of Y variance unexplained by X

sql
SELECT 1 - regr_sxy(y,x)^2 / (regr_sxx(y,x)*regr_syy(y,x)) AS sse_fraction FROM data;
0.31

Manual R² matches built-in

sql
SELECT regr_sxy(y,x)^2 / (regr_sxx(y,x)*regr_syy(y,x)) AS r2_check, regr_r2(y,x) FROM data;
0.69 | 0.69

Derive sample variance of Y from Syy

sql
SELECT regr_syy(y,x)/(regr_count(y,x)-1) AS sample_varY, var_samp(y) FROM data;
18.2 | 18.2

Total salary dispersion per dept (complete cases)

sql
SELECT dept, regr_syy(salary, tenure) FROM employees GROUP BY dept;
(rows)
Anti-PatternConfusing regr_syy with total row count variance

regr_syy filters to complete (Y,X) pairs, so it may differ from var_samp(Y)*(N-1) computed over all rows if there are NULLs in X.

✓ Instead: Use regr_syy(Y,X) only within the context of the regression family — it is consistent with regr_count, regr_sxx, etc. For the unconditional variance of Y use var_samp(Y); for the unconditional total sum of squares use var_samp(Y) * (count(Y) - 1).

regr_syy = SSTotal, regr_sxy²/regr_sxx = SSRegression (explained), their difference = SSResidual (unexplained). R² = SSRegression/SSTotal.

example
SELECT regr_syy(y,x) AS ss_total, regr_sxy(y,x)^2/regr_sxx(y,x) AS ss_regression, regr_syy(y,x) - regr_sxy(y,x)^2/regr_sxx(y,x) AS ss_residual FROM data;
ANOVA decomposition in a single query

stddev

PG 8.0+double precision or numeric

Returns the sample standard deviation of all non-NULL input values (alias for stddev_samp).

DeveloperData EngDBA

Signature

stddev ( expression ) → double precision or numeric

Parameters

ParameterTypeDescription
expressionnumeric or floatValues to compute standard deviation over

Examples

sql
SELECT stddev(score) FROM test_results;
12.34 (varies)
sql
SELECT avg(score) AS mean, stddev(score) AS sd FROM scores;
Mean and standard deviation
sql
SELECT product_id, avg(price) AS avg_price, stddev(price) AS price_spread FROM price_history GROUP BY product_id;
Price volatility per product
sql
SELECT stddev(latency_ms) AS jitter FROM network_samples WHERE host = 'db-primary' AND sampled_at > now() - interval '1 hour';
Network jitter (latency variability) over the last hour
Anti-PatternUsing stddev on a very small sample and treating it as population stddev

`stddev` (alias for `stddev_samp`) applies Bessel's correction (divides by N-1), which is appropriate for samples but inflates the value for tiny groups. On a group of 2 rows, the result can be misleading.

✓ Instead: Use `stddev_pop` when working with complete populations (all data, not a sample). Always check `count(*) > 30` before drawing statistical conclusions from `stddev`.

Use `avg(x) ± 2 * stddev(x)` to define a normal range. Values outside this range are statistical outliers. Combine with a CTE or subquery to label outliers inline.

example
SELECT id, value, CASE WHEN abs(value - avg_val) > 2 * sd THEN 'outlier' ELSE 'normal' END FROM (SELECT *, avg(value) OVER () AS avg_val, stddev(value) OVER () AS sd FROM measurements) t;
Each row labelled as outlier or normal

stddev_pop

PG 8.0+double precision (float8) or numeric

Returns the population standard deviation of the input values. Use when your data IS the entire population (not a sample). Returns NULL if no non-null inputs.

Developer

Signature

stddev_pop ( expression numeric/real/double precision ) → double precision/numeric

Parameters

ParameterTypeDescription
expressionnumeric, real, or double precisionNumeric values to compute population standard deviation over

Examples

Population stddev when you have all data points

sql
SELECT stddev_pop(score) FROM census_scores;
12.34

Per-class grade spread

sql
SELECT class_id, stddev_pop(grade) FROM grades GROUP BY class_id;
(rows)

Population vs sample stddev differ slightly

sql
SELECT stddev_pop(price), stddev_samp(price) FROM products;
14.2 | 14.3

As a window function

sql
SELECT stddev_pop(val) OVER (PARTITION BY group_id) FROM data;
(rows)

Rounded population stddev for reporting

sql
SELECT round(stddev_pop(salary)::numeric, 2) FROM employees WHERE dept = 1;
5432.10
Anti-PatternUsing stddev_pop() on survey samples

stddev_pop() assumes you have all data points. For sample data estimating a larger group, the population formula underestimates variability.

✓ Instead: Use stddev_samp() (or the alias stddev()) for samples — it uses the Bessel-corrected formula (divides by N-1).

stddev_pop() divides by N (use when data is the entire population). stddev_samp() / stddev() divides by N-1 (use for samples estimating a larger population). For typical survey or sample data, use stddev_samp().

example
SELECT stddev_pop(val) AS pop_std, stddev_samp(val) AS sample_std FROM measurements;
Both values; pop_std is always ≤ sample_std

stddev_samp

PG 8.0+double precision (float8) or numeric

Returns the sample standard deviation (Bessel-corrected, divides by N-1). This is an alias for stddev(). Use when your data is a sample from a larger population.

Developer

Signature

stddev_samp ( expression numeric/real/double precision ) → double precision/numeric

Parameters

ParameterTypeDescription
expressionnumeric, real, or double precisionNumeric values to compute sample standard deviation over

Examples

Variation in sampled response times

sql
SELECT stddev_samp(response_time_ms) FROM request_sample;
45.67

stddev_samp and stddev are identical

sql
SELECT stddev_samp(val), stddev(val) FROM data;
12.3 | 12.3

Revenue variability per region

sql
SELECT region, round(stddev_samp(revenue)::numeric,2) FROM sales GROUP BY region;
(rows)

Rolling 10-row sample stddev

sql
SELECT stddev_samp(val) OVER (ORDER BY ts ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) FROM readings;
(rows)

Sample stddev of 1-5 = sqrt(2.5)

sql
SELECT stddev_samp(x) FROM (VALUES (1::float),(2),(3),(4),(5)) t(x);
1.5811388300841898
Anti-PatternUsing stddev_samp() on a single-row result

stddev_samp() with only one input returns NULL (dividing by N-1 = 0 is undefined). Always check COUNT(*) >= 2 before interpreting results.

✓ Instead: Guard with HAVING count(val) >= 2 (count(val) already ignores NULLs) so groups with fewer than two values are excluded instead of silently producing NULL.

Rows beyond avg ± 2*stddev_samp are statistical outliers. Window functions cannot appear in WHERE, so compute them once in a subquery and filter in the outer query.

example
SELECT * FROM (SELECT *, avg(val) OVER () AS mean_val, stddev_samp(val) OVER () AS sd FROM t) s WHERE abs(val - mean_val) > 2 * sd;
Rows more than 2 standard deviations from the mean

string_agg

PG 9.0+text

Concatenates non-NULL string values with a delimiter. Supports ORDER BY to control output order.

DeveloperData Eng

Signature

string_agg ( value text, delimiter text [ORDER BY ...] ) → text

Parameters

ParameterTypeDescription
valuetext or byteaString to concatenate; NULLs are skipped
delimitertext or byteaSeparator between values

Examples

sql
SELECT string_agg(name, ', ' ORDER BY name) FROM tags;
css, html, javascript
sql
SELECT user_id, string_agg(role, ' | ') FROM user_roles GROUP BY user_id;
admin | editor | viewer
sql
SELECT string_agg(email, E'\n') FROM newsletter_subscribers;
Newline-separated email list
sql
SELECT category, string_agg(DISTINCT tag, ', ' ORDER BY tag) FROM items GROUP BY category;
Deduplicated, sorted tag list per category
Anti-PatternOmitting the delimiter argument or using an empty string unintentionally

`string_agg` requires exactly two arguments. Passing an empty string `''` as the delimiter will concatenate values without any separator, which is often a mistake when a comma or space was intended.

✓ Instead: Always explicitly pass the desired delimiter. If you truly want no separator, document it clearly. Use `array_agg` + `array_to_string` if you need to switch delimiters later.

`string_agg` is the cleanest way to produce comma-separated value strings in SQL. Use it in reporting queries where the application expects a single string field with multiple values.

example
SELECT category, string_agg(product_name, ', ' ORDER BY product_name) AS products FROM items GROUP BY category;
"Electronics: laptop, phone, tablet"
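
When the delimiter may change later, aggregate into an array first and render at the end, as the anti-pattern suggests (sketch):

```sql
-- array_agg keeps the raw values; array_to_string picks the delimiter at
-- render time, so switching from ', ' to ' / ' is a one-character edit.
SELECT category,
       array_to_string(array_agg(product_name ORDER BY product_name), ' / ') AS products
FROM items
GROUP BY category;
```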

sum

PG 8.0+bigint for smallint/int, numeric for bigint, same type for float/numeric

Computes the sum of all non-NULL input values.

DeveloperData Eng

Signature

sum ( expression ) → same as input or numeric

Parameters

ParameterTypeDescription
expressionnumeric typeValue to sum; NULLs are ignored

Examples

sql
SELECT sum(amount) FROM payments;
94823.50

sum over zero rows (or all-NULL values) is NULL, not 0

sql
SELECT sum(amount) FROM payments WHERE amount IS NULL;
NULL
sql
SELECT COALESCE(sum(amount), 0) FROM payments WHERE user_id = 999;
0
sql
SELECT product_id, sum(quantity) AS total_sold FROM order_items GROUP BY product_id ORDER BY total_sold DESC;
Total units sold per product
Anti-PatternAssuming sum() returns 0 for empty groups

`sum()` returns NULL — not 0 — when there are no rows or all values are NULL. This can propagate silently through calculations and produce unexpected NULL results downstream.

✓ Instead: Always wrap with `COALESCE(sum(col), 0)` when a zero value is semantically correct for empty groups.

This NULL-not-zero behaviour applies anywhere the aggregate runs (window frames, FILTER clauses, correlated subqueries), so apply `COALESCE(sum(x), 0)` at the outermost point where a zero result is meaningful.

example
SELECT user_id, COALESCE(sum(points), 0) AS total_points FROM user_points GROUP BY user_id;
0 for users with no points rows
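
The return-type promotion noted above can be observed directly with `pg_typeof` (sketch):

```sql
-- sum(integer) is promoted to bigint and sum(bigint) to numeric,
-- so large totals cannot silently overflow.
SELECT pg_typeof(sum(g))          AS int_sum_type,     -- bigint
       pg_typeof(sum(g::bigint))  AS bigint_sum_type   -- numeric
FROM generate_series(1, 3) AS g;
```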

var_pop

PG 8.0+double precision (float8) or numeric

Returns the population variance of the input values (square of stddev_pop). Use when your data represents the entire population.

Developer

Signature

var_pop ( expression numeric/real/double precision ) → double precision/numeric

Parameters

ParameterTypeDescription
expressionnumeric, real, or double precisionNumeric values to compute population variance over

Examples

Population variance when you have all data

sql
SELECT var_pop(score) FROM complete_census;
152.25

Equivalent to stddev_pop()

sql
SELECT sqrt(var_pop(price)) AS pop_stddev FROM products;
14.2

Pop vs sample variance comparison

sql
SELECT var_pop(val), var_samp(val) FROM measurements;
100.0 | 102.0

Sales variance per product

sql
SELECT product_id, var_pop(daily_sales) FROM sales GROUP BY product_id;
(rows)

Partition-level variance as window function

sql
SELECT var_pop(x) OVER (PARTITION BY group_id) FROM data;
(rows)
Anti-PatternComputing variance manually with AVG

Manually computing avg((x - mean)^2) requires two passes over the data (or a subquery to obtain the mean first). var_pop() computes the same result in a single pass using a numerically stable algorithm.

✓ Instead: Use var_pop(col) directly — it is more accurate (avoids catastrophic cancellation for large values) and requires only one scan.

Variances of independent variables add (var(X+Y) = var(X) + var(Y)), while standard deviations do not. For multi-level rollups, combine per-group variances and means via the law of total variance and take sqrt() only at the end.

example
WITH g AS (SELECT category, var_pop(price) AS v, avg(price) AS m, count(*) AS n FROM products GROUP BY category) SELECT sqrt(sum(n*v)/sum(n) + sum(n*m^2)/sum(n) - (sum(n*m)/sum(n))^2) AS overall_stddev FROM g;
Population stddev over all products, reconstructed from per-category statistics

var_samp

PG 8.0+double precision (float8) or numeric

Returns the sample variance (Bessel-corrected, divides by N-1). Alias for variance(). Use when your data is a sample estimating a larger population.

Developer

Signature

var_samp ( expression numeric/real/double precision ) → double precision/numeric

Parameters

ParameterTypeDescription
expressionnumeric, real, or double precisionNumeric values to compute sample variance over

Examples

Variance of latency sample

sql
SELECT var_samp(latency_ms) FROM perf_sample;
2345.6

var_samp and variance are identical

sql
SELECT var_samp(val), variance(val) FROM t;
25.0 | 25.0

Most variable regions first

sql
SELECT region, round(var_samp(revenue)::numeric, 2) FROM sales GROUP BY region ORDER BY 2 DESC;
(rows)

Rolling 5-row sample variance

sql
SELECT var_samp(score) OVER (ORDER BY exam_date ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) FROM results;
(rows)

Classify sensor stability

sql
SELECT CASE WHEN var_samp(val) < 10 THEN 'stable' ELSE 'variable' END FROM sensors WHERE sensor_id = 3;
stable
Anti-PatternUsing var_samp on complete population data

If your table contains every member of the population (not a sample), var_samp inflates variance by dividing by N-1 instead of N.

✓ Instead: Use var_pop() when you have the full population — it gives the exact variance, not an unbiased estimate.

var_samp() returns NULL for a single row (N-1=0). Always pair with COUNT() >= 2 when the group might have only one member to avoid silent NULLs in reports.

example
SELECT g, var_samp(val) FROM t GROUP BY g HAVING COUNT(*) >= 2;
Only groups with enough data for a meaningful sample variance

variance

PG 8.0+double precision or numeric

Returns the sample variance of all non-NULL input values (alias for var_samp).

DeveloperData EngDBA

Signature

variance ( expression ) → double precision or numeric

Parameters

ParameterTypeDescription
expressionnumeric or floatValues to compute variance over

Examples

sql
SELECT variance(score) FROM test_results;
152.3 (varies)
sql
SELECT variance(response_time) AS variability FROM requests;
Variance of response times
sql
SELECT sensor_id, avg(reading) AS mean_val, variance(reading) AS instability FROM sensor_data GROUP BY sensor_id;
Sensor stability report
sql
SELECT sqrt(variance(x) + variance(y)) AS combined_stddev FROM measurements;
Combined standard deviation of two independent variables
Anti-PatternComparing variance values across different units or scales

Variance is scale-dependent: doubling measurement units quadruples the variance. Comparing `variance(price_usd)` to `variance(quantity)` is meaningless because the units differ.

✓ Instead: Standardize values (z-score normalize) before comparing spread across different columns, or compare coefficients of variation (`stddev / avg`) instead.

`variance(x)` = `stddev(x)^2`. Use variance when you need to add contributions from multiple independent sources (variances add, standard deviations don't). For human-readable spread, use stddev.

example
SELECT sqrt(variance(x) + variance(y)) AS combined_stddev FROM measurements;
Combined standard deviation from two independent variables

xmlagg

PG 8.4+xml

Concatenates XML values into an XML sequence. All non-null inputs are concatenated in aggregate order. Returns NULL if no non-null values exist.

DeveloperDBA

Signature

xmlagg ( xml ) → xml

Parameters

ParameterTypeDescription
expressionxmlXML value to aggregate; NULLs are ignored

Examples

Build XML fragment from rows

sql
SELECT xmlagg(xmlelement(name item, id, name)) FROM products WHERE active;
<item>1Widget</item><item>2Gadget</item>

Wrap aggregated items in root element

sql
SELECT xmlelement(name catalog, xmlagg(xmlelement(name product, xmlattributes(id AS id), name))) FROM products;
<catalog><product id="1">Widget</product>...</catalog>

ORDER BY within aggregation

sql
SELECT xmlagg(xmlelement(name row, val) ORDER BY val) FROM t;
<row>a</row><row>b</row>

One XML fragment per department

sql
SELECT department, xmlagg(xmlelement(name emp, name)) FROM employees GROUP BY department;
(rows)

Aggregate pre-built XML values

sql
SELECT xmlagg(data) FROM (SELECT '<?xml version="1.0"?><x>1</x>'::xml AS data) t;
<?xml version="1.0"?><x>1</x>
Anti-PatternConcatenating XML with string_agg()

Using string_agg() to concatenate XML-formatted strings bypasses XML validation and escaping, producing invalid XML if any value contains <, >, or & characters.

✓ Instead: Use xmlagg(xmlelement(name tag, col)) to let PostgreSQL handle proper XML escaping automatically.

Pair xmlagg() with xmlelement() and xmlattributes() to build well-formed XML documents directly in SQL, avoiding round-trips to application code for simple XML serialisation.

example
SELECT xmlelement(name orders, xmlagg(xmlelement(name order, xmlattributes(id AS id), total))) FROM orders WHERE status = 'shipped';
<orders><order id="1">99.99</order>...</orders>
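
The escaping difference behind the anti-pattern above can be demonstrated inline (sketch with literal values):

```sql
-- xmlelement escapes reserved characters, so the aggregated result stays
-- well-formed even when values contain < or &; a string_agg of the same
-- hand-built tags would emit the raw characters.
SELECT xmlagg(xmlelement(name item, v) ORDER BY v) AS safe_xml
FROM (VALUES ('a < b'), ('c & d')) t(v);
-- → <item>a &lt; b</item><item>c &amp; d</item>
```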