
PostgreSQL Aggregate Functions

Complete reference for PostgreSQL aggregate functions covering count, sum, average, statistical aggregates, ordered-set aggregates, and the FILTER clause for conditional aggregation. Every function includes syntax, GROUP BY examples, and anti-patterns. Updated for PostgreSQL 16.

39 functions

What are PostgreSQL Aggregate Functions?

PostgreSQL aggregate functions compute a single result from multiple input rows and are used with GROUP BY to summarise data. PostgreSQL also supports ordered-set aggregates (percentile_cont, mode), hypothetical-set aggregates (rank, dense_rank within a group), and the FILTER clause for conditional aggregation without CASE. Common aggregate functions include count(), sum(), avg(), max(), min(), string_agg(), and array_agg().

array_agg

PG 8.4+ · returns anyarray

Collects all input values, including NULLs, into an array. Supports ORDER BY to control element order.

Developer · Data Eng

Signature

array_agg ( expression [ORDER BY ...] ) → array

Parameters

Parameter | Type | Description
expression | any | Value to collect; NULLs are included in the array

Examples

sql
SELECT array_agg(name ORDER BY name) FROM tags;
{css,html,javascript}
sql
SELECT user_id, array_agg(tag) FROM user_tags GROUP BY user_id;
Array of tags per user
sql
SELECT array_agg(DISTINCT status) FROM orders;
{cancelled,completed,pending}
sql
SELECT author_id, array_agg(title ORDER BY published_at DESC) AS recent_books FROM books GROUP BY author_id;
Books per author sorted newest-first
Anti-Pattern: Using array_agg without ORDER BY and relying on element order

Without an explicit `ORDER BY`, `array_agg` produces elements in an undefined, non-deterministic order that can change between query executions, PostgreSQL versions, or after autovacuum.

✓ Instead: Always specify `array_agg(col ORDER BY col)` or an appropriate sort key when the order of array elements matters downstream.

Use `array_agg` to fetch parent + children in one query: `SELECT u.id, array_agg(t.name) AS tags FROM users u JOIN user_tags t ON t.user_id = u.id GROUP BY u.id`. This avoids N+1 queries.

example
SELECT author_id, array_agg(title ORDER BY published_at) AS books FROM books GROUP BY author_id;
All book titles per author in chronological order

avg

PG 8.0+ · returns numeric for integer/numeric input; double precision for float

Computes the arithmetic mean (average) of all non-NULL input values.

Developer · Data Eng · DBA

Signature

avg ( expression ) → numeric or double precision

Parameters

Parameter | Type | Description
expression | numeric or interval | Value to average; NULLs are ignored

Examples

sql
SELECT avg(score) FROM results;
73.4
sql
SELECT avg(score)::numeric(5,2) FROM results;
73.40
sql
SELECT avg(response_time) FROM requests WHERE endpoint = '/api/v1/search';
Average response time
sql
SELECT department, avg(salary)::numeric(10,2) AS avg_salary FROM employees GROUP BY department;
Average salary per department rounded to 2 decimal places
Anti-Pattern: Averaging integer columns and losing decimal precision

When `avg()` is applied to integer columns, PostgreSQL returns a `numeric` type, but many developers cast back to integer with `::int`, silently truncating fractional values (e.g., an average of 3.7 becomes 3).

✓ Instead: Cast to `numeric(10,2)` or `float8` rather than `int` to preserve meaningful decimal precision: `avg(score)::numeric(10,2)`.

Instead of multiple subqueries, use `FILTER (WHERE ...)` to compute multiple averages in a single pass: `avg(score) FILTER (WHERE grade = 'A')` alongside `avg(score) FILTER (WHERE grade = 'B')`.

example
SELECT avg(score) FILTER (WHERE subject = 'math') AS math_avg, avg(score) FILTER (WHERE subject = 'science') AS sci_avg FROM grades;
Two averages in one query pass

bit_and

PG 8.0+ · returns same as input

Computes the bitwise AND of all non-NULL input values.

Developer · DBA · Security

Signature

bit_and ( expression ) → integer or bigint or bit

Parameters

Parameter | Type | Description
expression | integer, bigint, or bit | Value to bitwise-AND

Examples

sql
SELECT bit_and(permissions) FROM user_roles WHERE user_id = 1;
Intersection of all permission bitmasks
sql
SELECT bit_and(flags) FROM feature_flags;
Flags set in ALL rows
sql
SELECT role_group, bit_and(permission_mask) AS shared_perms FROM role_assignments GROUP BY role_group;
Permissions shared by every role in each group
sql
SELECT bit_and(status_flags) & 4 AS all_have_bit_2 FROM sessions WHERE started_at > now() - interval '1 hour';
Non-zero if every recent session has bit 2 (value 4) set
Anti-Pattern: Checking individual bits with equality instead of bitmask operations

Testing `WHERE permissions = 7` to check if a user has all three permission bits set conflates exact equality with bit intersection, breaking when other bits are also set.

✓ Instead: Use `bit_and(permissions) & required_mask = required_mask` to verify that specific bits are set across all rows.
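A self-contained sketch of that check, using inline `VALUES` in place of a real permissions table:

```sql
-- 7 = 0b0111, 5 = 0b0101, 13 = 0b1101; required_mask 5 means bits 0 and 2
SELECT bit_and(perms) & 5 = 5 AS all_have_required
FROM (VALUES (7), (5), (13)) AS t(perms);
-- bit_and = 0b0101 = 5, so the check returns true
```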

`bit_and` gives the intersection of permission bitmasks — bits that are set in every row. Use it to find the minimum common permissions for a user with multiple roles.

example
SELECT bit_and(permission_mask) AS common_perms FROM user_roles WHERE user_id = $1;
Bitmask of permissions common to all roles
See also: bit_or, bit_xor

bit_or

PG 8.0+ · returns same as input

Computes the bitwise OR of all non-NULL input values.

Developer · DBA · Security

Signature

bit_or ( expression ) → integer or bigint or bit

Parameters

Parameter | Type | Description
expression | integer, bigint, or bit | Value to bitwise-OR

Examples

sql
SELECT bit_or(permissions) FROM user_roles WHERE user_id = 1;
Union of all permission bitmasks
sql
SELECT bit_or(flags) FROM feature_flags;
All flags set in ANY row
sql
SELECT team_id, bit_or(capability_flags) AS team_capabilities FROM team_members GROUP BY team_id;
Union of all capability flags across each team
sql
SELECT bit_or(event_type_mask) AS seen_event_types FROM audit_log WHERE session_id = $1;
Bitmask of all distinct event types seen in the session
Anti-Pattern: Using multiple OR-joined equality checks instead of bitmask union

Writing `WHERE perms = 1 OR perms = 2 OR perms = 4` instead of leveraging bitmask aggregation becomes unmaintainable as the flag set grows and cannot summarize across multiple roles.

✓ Instead: Store capabilities as bitmasks and use `bit_or(permission_mask)` to compute the effective union in a single aggregate scan.

`bit_or` gives the union of permission bitmasks — every bit set in any row. Perfect for computing effective user permissions across multiple roles.

example
SELECT bit_or(permission_mask) AS effective_perms FROM user_roles WHERE user_id = $1;
All permissions the user has via any role

bit_xor

PG 14+ · returns same as input

Returns the bitwise XOR of all non-NULL input values. Added in PostgreSQL 14. Useful for change detection and parity checks.

Developer · DBA

Signature

bit_xor ( expression smallint/integer/bigint/bit ) → same type

Parameters

Parameter | Type | Description
expression | smallint, integer, bigint, or bit | Numeric or bit-string expression to XOR across all rows

Examples

XOR all permission flags for a role

sql
SELECT bit_xor(flags) FROM permissions WHERE role_id = 1;
5

Change detection: any row change flips the fingerprint

sql
SELECT bit_xor(checksum) AS row_fingerprint FROM audit_log WHERE session_id = 42;
1847263

As a window aggregate

sql
SELECT id, bit_xor(value) OVER (PARTITION BY group_id) FROM data;
(rows)

3 XOR 5 XOR 6 = 0 (parity cancels)

sql
SELECT bit_xor(val) FROM (VALUES (3),(5),(6)) t(val);
0

Lightweight batch change detection

sql
SELECT bit_xor(extract(epoch FROM updated_at)::bigint) FROM records WHERE batch_id = 7;
(bigint)
Anti-Pattern: Using XOR fingerprint for ordered-sequence validation

Since XOR is commutative and associative, bit_xor() produces the same result regardless of row order. It cannot detect row reordering or duplicate rows that cancel each other out.

✓ Instead: For order-sensitive checksums use md5(string_agg(col::text, ',' ORDER BY id)) or range checksums.
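A sketch of an order-sensitive checksum over inline data; with a real table, `id` would be the ordering key:

```sql
-- Swapping any two rows changes the aggregated string, and therefore the hash
SELECT md5(string_agg(val, ',' ORDER BY id)) AS ordered_checksum
FROM (VALUES (1, 'a'::text), (2, 'b'), (3, 'c')) AS t(id, val);
```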

bit_xor() over a set of row hashes or checksums gives a lightweight fingerprint: if any row changes, the XOR changes. Faster than md5(string_agg()) for large sets, though order-independent (XOR is commutative).

example
SELECT bit_xor(hashtext(col1 || col2)::bigint) AS fingerprint FROM t;
A single integer that changes if any row changes

bool_and

PG 8.0+ · returns boolean

Returns true if all non-NULL input values are true, otherwise false.

Developer · Data Eng

Signature

bool_and ( expression boolean ) → boolean

Parameters

Parameter | Type | Description
expression | boolean | Boolean expression to evaluate across rows

Examples

sql
SELECT bool_and(is_verified) FROM users;
false (if any user is unverified)
sql
SELECT order_id, bool_and(shipped) AS all_shipped FROM order_items GROUP BY order_id;
true only if every item in the order has shipped
sql
SELECT bool_and(constraint_check) FROM (SELECT col > 0 AS constraint_check FROM measurements) t;
true if every measurement is positive
sql
SELECT project_id, bool_and(passed) AS all_tests_passed FROM test_runs WHERE suite = 'regression' GROUP BY project_id;
true only when every regression test passed
Anti-Pattern: Using COUNT + GROUP BY to check universal conditions instead of bool_and

A common workaround is `HAVING count(*) = count(*) FILTER (WHERE cond)` to verify all rows match a condition. This is hard to read and easy to get wrong.

✓ Instead: Use `bool_and(condition)` for a clear, intent-revealing check that every row satisfies the condition.

`bool_and` is the aggregate equivalent of ALL. Use it to check if every item in a group passes a condition — e.g., all order items shipped, all tests passed.

example
SELECT project_id, bool_and(passed) AS all_tests_passed FROM test_runs GROUP BY project_id;
true only if every test in the project passed

bool_or

PG 8.0+ · returns boolean

Returns true if at least one non-NULL input value is true.

Developer · Data Eng

Signature

bool_or ( expression boolean ) → boolean

Parameters

Parameter | Type | Description
expression | boolean | Boolean expression to evaluate across rows

Examples

sql
SELECT bool_or(is_admin) FROM group_members WHERE group_id = 5;
true if any member is an admin
sql
SELECT user_id, bool_or(failed) AS had_failure FROM login_attempts GROUP BY user_id;
true if the user had any failed login
sql
SELECT bool_or(price < 0) AS has_negative_price FROM products;
Data quality check: true if any product has a negative price
sql
SELECT customer_id, bool_or(amount > 1000) AS has_large_order FROM orders GROUP BY customer_id;
true if customer ever placed an order over $1000
Anti-Pattern: Using a correlated EXISTS subquery per group instead of bool_or

Writing a correlated `EXISTS (SELECT 1 FROM t WHERE t.group_id = outer.id AND condition)` for every group in a larger query causes repeated subplan executions.

✓ Instead: Use `bool_or(condition)` inside a `GROUP BY` query for a single-pass, cleaner equivalent.

When computing per-group boolean checks in a GROUP BY query, `bool_or(condition)` is cleaner and often faster than a correlated EXISTS subquery for each group.

example
SELECT customer_id, bool_or(amount > 1000) AS has_large_order FROM orders GROUP BY customer_id;
true if customer ever placed a large order

corr

PG 8.0+ · returns double precision

Returns the correlation coefficient between two sets of numbers (Pearson's r).

Developer · Data Eng · DBA

Signature

corr ( Y double precision, X double precision ) → double precision

Parameters

Parameter | Type | Description
Y | double precision | Dependent variable values
X | double precision | Independent variable values

Examples

sql
SELECT corr(revenue, ad_spend) FROM campaigns;
0.87 (strong positive correlation)
sql
SELECT corr(score, study_hours) FROM students;
Correlation between study time and scores
sql
SELECT corr(churn_rate, avg_response_time) FROM monthly_metrics;
How strongly latency correlates with churn
sql
SELECT product_id, corr(daily_sales, avg_temp) AS sales_weather_corr FROM daily_product_weather GROUP BY product_id HAVING count(*) > 30;
Correlation between weather and sales per product (minimum sample filter)
Anti-Pattern: Interpreting correlation as causation without sufficient sample size

`corr()` returns a value between -1 and 1 but gives no indication of whether the sample is large enough for the correlation to be statistically significant. A corr of 0.95 from 3 data points is essentially noise.

✓ Instead: Always pair `corr()` with a `HAVING count(*) > N` guard (typically N ≥ 30) and consider computing a p-value or confidence interval via external tooling for critical decisions.
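One way to sanity-check significance without leaving SQL is the t-statistic for Pearson's r, t = r * sqrt(n - 2) / sqrt(1 - r^2); this sketch reuses the hypothetical monthly_metrics columns from the examples (guard against r = ±1 before dividing):

```sql
SELECT corr(churn_rate, avg_response_time) AS r,
       count(*)                            AS n,
       corr(churn_rate, avg_response_time)
         * sqrt(count(*) - 2)
         / sqrt(1 - corr(churn_rate, avg_response_time) ^ 2) AS t_stat
FROM monthly_metrics;
-- Compare t_stat against a t distribution with n - 2 degrees of freedom
```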

Use `corr(y, x)` to measure linear correlation (-1 to 1) without leaving SQL. A value near 1 means strong positive correlation, near -1 strong negative, near 0 no linear relationship.

example
SELECT corr(churn_rate, avg_response_time) AS latency_churn_correlation FROM monthly_metrics;
How strongly latency correlates with churn

count

PG 8.0+ · returns bigint

Counts the number of input rows (with *) or the number of input rows where the expression is not NULL.

Developer · Data Eng

Signatures

count ( * ) → bigint
count ( expression ) → bigint

Parameters

Parameter | Type | Description
expression | any | Value to count; NULLs are excluded. Use * to count all rows.

Examples

sql
SELECT count(*) FROM orders;
1042
sql
SELECT count(email) FROM users;
Counts only rows where email is not NULL
sql
SELECT count(DISTINCT status) FROM orders;
Number of unique statuses
sql
SELECT department, count(*) AS headcount FROM employees GROUP BY department ORDER BY headcount DESC;
Row count per department
Anti-Pattern: Confusing count(*) with count(col) for NULL-counting

A common mistake is using `count(col)` expecting it to count all rows, then being surprised that NULL values are silently skipped. Equally, using `count(*)` to count non-null values in a specific column is incorrect.

✓ Instead: Use `count(*)` for total row count and `count(col)` explicitly when NULL-exclusion is desired. To count NULLs: `count(*) - count(col)`.

`count(*)` counts all rows including NULLs. `count(col)` skips NULLs. Use `count(*) - count(col)` to count NULLs in a column without a CASE expression.

example
SELECT count(*) AS total, count(email) AS with_email, count(*) - count(email) AS missing_email FROM users;
Total rows, non-null emails, and missing email count
See also: sum, avg, max, min

covar_pop

PG 8.0+ · returns double precision

Returns the population covariance of (Y, X) pairs. Measures how much two variables change together across the entire population.

Signature

covar_pop ( Y double precision, X double precision ) → double precision

Parameters

Parameter | Type | Description
Y | double precision | Dependent variable values
X | double precision | Independent variable values

Examples

Positive: sales and ad spend move together

sql
SELECT covar_pop(sales, advertising_spend) FROM campaigns;
1234.56

Negative: higher price → fewer returns

sql
SELECT covar_pop(return_rate, price) FROM products;
-0.023

Covariance and correlation together

sql
SELECT covar_pop(y, x), corr(y, x) FROM data;
5.2 | 0.87

Covariance per product category

sql
SELECT category, covar_pop(revenue, cost) FROM financials GROUP BY category;
(rows)

Manual OLS slope (= regr_slope())

sql
SELECT covar_pop(val_y, val_x) / (stddev_pop(val_x) * stddev_pop(val_x)) AS slope FROM data;
0.73
Anti-Pattern: Interpreting raw covariance magnitude without context

Covariance depends on the scale of both variables — a covariance of 1000 between salary and years_experience says nothing by itself about relationship strength.

✓ Instead: Always pair covariance with corr() to get a scale-free measure, and with regr_slope() to get the actual change per unit.

covar_pop(Y,X) / (stddev_pop(Y) * stddev_pop(X)) equals corr(Y,X). Computing both lets you separate scale (covariance) from direction (correlation) in one query.

example
SELECT covar_pop(y,x), corr(y,x), regr_slope(y,x) FROM data;
Full bivariate summary in one pass
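The covariance/correlation identity can be verified on inline data; both columns below should agree (up to floating-point rounding):

```sql
SELECT covar_pop(y, x) / (stddev_pop(y) * stddev_pop(x)) AS corr_manual,
       corr(y, x)                                        AS corr_builtin
FROM (VALUES (1.0, 2.0), (2.0, 4.0), (3.0, 5.0)) AS t(y, x);
```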

covar_samp

PG 8.0+ · returns double precision

Returns the sample covariance of (Y, X) pairs (Bessel-corrected, divides by N-1). Use when data is a sample from a larger population.

Signature

covar_samp ( Y double precision, X double precision ) → double precision

Parameters

Parameter | Type | Description
Y | double precision | Dependent variable values
X | double precision | Independent variable values

Examples

Sample covariance of salary and experience

sql
SELECT covar_samp(salary, years_exp) FROM employee_sample;
8500.3

Sample vs population covariance differ by N/(N-1)

sql
SELECT covar_samp(y, x), covar_pop(y, x) FROM data;
12.5 | 12.0

Dept-level covariance estimate

sql
SELECT dept, covar_samp(performance, tenure) FROM hr_sample GROUP BY dept;
(rows)

Recent ad performance covariance

sql
SELECT covar_samp(clicks, impressions) FROM ad_sample WHERE date >= current_date - 30;
45678.9

Guard against single-row NULL

sql
SELECT count(*), covar_samp(x,y) FROM t HAVING count(*) >= 2;
(row)
Anti-Pattern: Using covar_samp without checking sample size

covar_samp() returns NULL when N <= 1. With tiny groups (N=2,3), the estimate is very noisy and should not be used for decisions.

✓ Instead: Always check HAVING COUNT(*) >= 30 (or another minimum) before treating covar_samp as meaningful.

A large covar_samp can occur simply because both variables have large scales. Always check corr() alongside it. corr() is scale-free and ranges from -1 to 1.

example
SELECT covar_samp(revenue, units_sold), corr(revenue, units_sold), count(*) FROM sales_sample;
Full picture: raw covariance, correlation coefficient, and sample size

json_agg

PG 9.3+ · returns json

Aggregates values into a JSON array, including NULLs (as JSON null).

Developer · Data Eng

Signature

json_agg ( expression [ORDER BY ...] ) → json

Parameters

Parameter | Type | Description
expression | any | Value to include in the JSON array; NULLs become JSON null

Examples

sql
SELECT json_agg(name ORDER BY name) FROM tags;
["css", "html", "javascript"]
sql
SELECT user_id, json_agg(row_to_json(t)) FROM orders t GROUP BY user_id;
JSON array of order objects per user
sql
SELECT json_agg(json_build_object('id', id, 'name', name) ORDER BY name) FROM products WHERE active;
[{"id":1,"name":"Alpha"}, ...]
sql
SELECT department, json_agg(employee_name) FILTER (WHERE active = true) AS active_staff FROM employees GROUP BY department;
JSON array of active employees per department
Anti-Pattern: Using json_agg when JSONB operators will be needed on the result

`json_agg` returns the `json` type, which cannot be directly queried with operators like `@>`, `?`, or `->>` in a performant way as a stored value.

✓ Instead: Use `jsonb_agg` when the aggregated array will be stored in a JSONB column or queried with JSONB containment and key-existence operators.

Combine `json_agg(row_to_json(t))` to build nested JSON directly: `SELECT u.id, json_agg(row_to_json(o)) AS orders FROM users u JOIN orders o ON o.user_id = u.id GROUP BY u.id`. No application-side assembly needed.

example
SELECT author, json_agg(json_build_object('title', title, 'year', year) ORDER BY year) AS books FROM books GROUP BY author;
Nested JSON with author and their books

json_object_agg

PG 9.3+ · returns json

Aggregates name/value pairs into a JSON object. Duplicate keys are not removed; they appear repeatedly in the output text.

Developer · Data Eng

Signature

json_object_agg ( key "any", value "any" [ORDER BY ...] ) → json

Parameters

Parameter | Type | Description
key | any | Object key (converted to text)
value | any | Object value

Examples

sql
SELECT json_object_agg(key, value) FROM config;
{"host": "localhost", "port": "5432"}
sql
SELECT json_object_agg(month, revenue ORDER BY month) FROM monthly_revenue;
{"1": 5000, "2": 6200, ...}
sql
SELECT json_object_agg(attribute, value) AS metadata FROM product_attrs WHERE product_id = 42;
{"color": "red", "size": "L", "weight": "0.5kg"}
sql
SELECT entity_id, json_object_agg(attr_name, attr_value ORDER BY attr_name) AS props FROM eav_table GROUP BY entity_id;
Alphabetically-keyed JSON property bag per entity
Anti-Pattern: Assuming duplicate keys are deduplicated in the output object

`json_object_agg` does not enforce unique keys. When source rows contain duplicate key values, the resulting JSON object will have repeated keys, which is technically valid JSON but breaks most parsers and libraries.

✓ Instead: Ensure keys are unique before aggregating, or use `jsonb_object_agg` which overwrites duplicate keys with the last value (deterministic with `ORDER BY`).

Entity-Attribute-Value tables can be pivoted into JSON objects with `json_object_agg(attribute, value)`. This is far more flexible than CASE-based pivoting for dynamic attribute sets.

example
SELECT entity_id, json_object_agg(attribute, value) AS props FROM eav_table GROUP BY entity_id;
Each entity's attributes as a JSON object

jsonb_agg

PG 9.5+ · returns jsonb

Aggregates values into a JSONB array. Equivalent to json_agg but returns the binary JSONB type.

Developer · Data Eng

Signature

jsonb_agg ( expression [ORDER BY ...] ) → jsonb

Parameters

Parameter | Type | Description
expression | any | Value to include; NULLs become JSON null

Examples

sql
SELECT jsonb_agg(name ORDER BY name) FROM tags;
["css", "html", "javascript"]
sql
SELECT id, jsonb_agg(detail) FILTER (WHERE detail IS NOT NULL) FROM items GROUP BY id;
JSONB arrays without nulls
sql
SELECT user_id, jsonb_agg(jsonb_build_object('role', role, 'granted_at', granted_at) ORDER BY granted_at) FROM user_roles GROUP BY user_id;
Ordered JSONB array of role objects per user
sql
SELECT project_id, jsonb_agg(DISTINCT tag) AS tags FROM project_tags GROUP BY project_id;
Deduplicated JSONB tag arrays per project
Anti-Pattern: Storing jsonb_agg results without considering deduplication

Unlike sets, `jsonb_agg` preserves duplicates. If you aggregate from a JOIN that produces duplicate rows (e.g., without DISTINCT), the resulting JSONB array will contain repeated entries silently.

✓ Instead: Use `jsonb_agg(DISTINCT col)` or ensure the source rows are deduplicated before aggregation.

Prefer `jsonb_agg` over `json_agg` when the result will be queried with JSONB operators (`->`, `@>`, `?`). JSONB is stored in a parsed binary format, making containment checks much faster.

example
SELECT user_id, jsonb_agg(tag) AS tags FROM user_tags GROUP BY user_id;
JSONB arrays that support @> and ? operators

jsonb_object_agg

PG 9.5+ · returns jsonb

Aggregates name/value pairs into a JSONB object. Like json_object_agg() but returns jsonb (with deduplication of duplicate keys — last writer wins).

Developer

Signature

jsonb_object_agg ( key "any", value "any" ) → jsonb

Parameters

Parameter | Type | Description
key | any | Key for the JSON object; converted to text
value | any | Value; converted to JSON representation

Examples

Collect key-value rows into a JSONB object

sql
SELECT jsonb_object_agg(key, value) FROM settings WHERE user_id = 1;
{"theme": "dark", "lang": "en"}

ORDER BY within aggregation

sql
SELECT jsonb_object_agg(name, score ORDER BY name) FROM scores WHERE user_id = 5;
{"math": 95, "science": 88}

One JSONB object per user

sql
SELECT user_id, jsonb_object_agg(pref_key, pref_val) FROM preferences GROUP BY user_id;
(rows)

Duplicate keys: last value wins in jsonb

sql
SELECT jsonb_object_agg(k, v) FROM (VALUES ('a',1),('a',2)) t(k,v);
{"a": 2}

Build column type map for a table

sql
SELECT jsonb_object_agg(attname, atttypid) FROM pg_attribute WHERE attrelid = 'mytable'::regclass AND attnum > 0;
{"id": 23, "name": 25}
Anti-Pattern: Using jsonb_object_agg when row order must be preserved

JSONB objects are unordered by definition (jsonb stores keys sorted). If key insertion order matters, use json_object_agg() which preserves insertion order in its text representation.

✓ Instead: Use json_object_agg(key, value ORDER BY sort_col) when output key order is significant.
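A quick demonstration on inline rows: with `json_object_agg`, the `ORDER BY` inside the aggregate controls key order in the output text (with `jsonb_object_agg` the keys would come back sorted regardless):

```sql
SELECT json_object_agg(k, v ORDER BY ord) AS obj
FROM (VALUES (2, 'b', 20), (1, 'a', 10)) AS t(ord, k, v);
-- keys appear in ord order: "a" before "b"
```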

The jsonb variant deduplicates keys and stores data in a normalised binary form, making subsequent ->> and @> operators much faster than on json. Choose jsonb_object_agg when the output will be queried.

example
SELECT prefs->>'theme' FROM (SELECT jsonb_object_agg(k,v) AS prefs FROM user_prefs WHERE uid=1) t;
dark

max

PG 8.0+ · returns same as input

Returns the maximum value of all non-NULL input values.

Developer · Data Eng

Signature

max ( expression ) → same as input

Parameters

Parameter | Type | Description
expression | any sortable type | Value to find the maximum of

Examples

sql
SELECT max(amount) FROM orders;
9999.99
sql
SELECT max(created_at) FROM events;
Most recent timestamp
sql
SELECT max(name) FROM products;
Lexicographically last name
sql
SELECT user_id, max(score) AS personal_best FROM game_scores GROUP BY user_id;
Each user's highest score
Anti-Pattern: Using ORDER BY + LIMIT 1 inside a subquery instead of max()

A frequent anti-pattern is `SELECT * FROM t WHERE val = (SELECT val FROM t ORDER BY val DESC LIMIT 1)` for each group, which is verbose and may not use indexes as effectively as a simple aggregate.

✓ Instead: Use `max()` with `GROUP BY` for per-group maximums. To fetch the full row containing the maximum, use `DISTINCT ON` or a window function such as `rank() OVER (...)`.
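A sketch of the `DISTINCT ON` approach against the game_scores table from the examples; `played_at` is a hypothetical extra column included to show that the whole row comes back:

```sql
-- One row per user: the row holding that user's highest score
SELECT DISTINCT ON (user_id) user_id, score, played_at
FROM game_scores
ORDER BY user_id, score DESC;
```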

`max` works on any sortable type: dates, timestamps, text, UUIDs, enums. Use `max(created_at)` to find the most recent event, or `max(name)` for the last alphabetically.

example
SELECT category, max(price) AS highest_price FROM products GROUP BY category;
Highest price per category

min

PG 8.0+ · returns same as input

Returns the minimum value of all non-NULL input values.

Developer · Data Eng

Signature

min ( expression ) → same as input

Parameters

Parameter | Type | Description
expression | any sortable type | Value to find the minimum of

Examples

sql
SELECT min(price) FROM products;
0.99
sql
SELECT min(created_at) FROM orders;
Oldest order timestamp
sql
SELECT category, min(price) AS lowest_price FROM products GROUP BY category;
Lowest price per category
sql
SELECT min(response_ms) AS fastest_response FROM api_logs WHERE endpoint = '/health';
Fastest recorded health-check response
Anti-Pattern: Fetching the full min-value row with a redundant subquery

Using `WHERE price = (SELECT min(price) FROM products)` in a subquery is correct but becomes a repeated scan when done per group. It also returns multiple rows on ties without a tiebreaker.

✓ Instead: For per-group minimum rows, prefer `DISTINCT ON (group_col) ORDER BY group_col, price ASC` or a window function with `RANK()` to handle ties explicitly.

When you need the min or max value for each group, `GROUP BY` with `min()`/`max()` is typically faster than a self-join or a subquery with `ORDER BY ... LIMIT 1`.

example
SELECT user_id, min(created_at) AS first_order, max(created_at) AS last_order FROM orders GROUP BY user_id;
First and last order dates per user

mode

PG 9.4+ · returns same as sort expression

Returns the most frequent value (mode) of the input. If multiple values are equally frequent, the first one in the sort order is returned.

Developer · Data Eng

Signature

mode ( ) WITHIN GROUP ( ORDER BY sort_expression ) → same type as sort expression

Parameters

Parameter | Type | Description
sort_expression | sortable | Values to find the mode of

Examples

sql
SELECT mode() WITHIN GROUP (ORDER BY status) FROM orders;
completed (most common status)
sql
SELECT mode() WITHIN GROUP (ORDER BY country_code) FROM customers;
US (most common country)
sql
SELECT customer_id, mode() WITHIN GROUP (ORDER BY product_id) AS fav_product FROM orders GROUP BY customer_id;
Most frequently ordered product per customer
sql
SELECT mode() WITHIN GROUP (ORDER BY error_code) AS most_common_error FROM application_logs WHERE level = 'ERROR' AND logged_at > now() - interval '24 hours';
Most frequent error code in the last 24 hours
Anti-Pattern: Assuming mode() handles ties by frequency rank

When two or more values share the highest frequency, `mode()` breaks ties by returning the smallest value (per the ORDER BY direction), not by any secondary frequency criterion. This can be surprising.

✓ Instead: If tie-breaking matters for your use case, use `GROUP BY value ORDER BY count(*) DESC, value ASC LIMIT 1` in a subquery, which makes the tie-breaking rule explicit.
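The explicit form of that tie-break, on inline data where 'a' and 'b' are equally frequent:

```sql
SELECT val
FROM (VALUES ('a'::text), ('a'), ('b'), ('b'), ('c')) AS t(val)
GROUP BY val
ORDER BY count(*) DESC, val ASC  -- frequency first, then smallest value
LIMIT 1;
-- returns 'a'
```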

`mode()` returns the most frequent value in the group directly, without needing a nested subquery or `GROUP BY value ORDER BY count(*) DESC LIMIT 1`.

example
SELECT customer_id, mode() WITHIN GROUP (ORDER BY product_id) AS fav_product FROM orders GROUP BY customer_id;
Most purchased product per customer

percentile_cont

PG 9.4+ · returns double precision or interval

Computes a continuous percentile value that interpolates between adjacent values if needed.

Developer · Data Eng · DBA

Signature

percentile_cont ( fraction ) WITHIN GROUP ( ORDER BY sort_expression ) → double precision or interval

Parameters

Parameter | Type | Description
fraction | double precision | Percentile as a fraction between 0 and 1 (e.g., 0.5 = median)
sort_expression | sortable | Values to compute the percentile over

Examples

sql
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY salary) AS median FROM employees;
62000 (varies)
sql
SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY response_ms) AS p95 FROM requests;
p95 response time
sql
SELECT percentile_cont(ARRAY[0.25, 0.5, 0.75]) WITHIN GROUP (ORDER BY value) AS quartiles FROM data;
{Q1, median, Q3}
sql
SELECT department, percentile_cont(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary FROM employees GROUP BY department;
Median salary per department
Anti-Pattern: Using avg() as a proxy for median on skewed distributions

Averages are heavily influenced by outliers. In skewed distributions (e.g., income, response times), the average can be much higher than what a typical value looks like, misleading stakeholders.

✓ Instead: Use `percentile_cont(0.5) WITHIN GROUP (ORDER BY col)` for the true median, and report p95/p99 for tail latency rather than average.
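The gap is easy to see on a small skewed sample (inline values standing in for response times):

```sql
SELECT avg(ms) AS mean,                                           -- 109.2, dragged up by the outlier
       percentile_cont(0.5) WITHIN GROUP (ORDER BY ms) AS median  -- 12, the typical value
FROM (VALUES (10), (11), (12), (13), (500)) AS t(ms);
```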

Pass an array to `percentile_cont` to compute multiple percentiles in a single scan: `percentile_cont(ARRAY[0.5, 0.95, 0.99]) WITHIN GROUP (ORDER BY x)` returns an array with p50, p95, p99.

example
SELECT percentile_cont(ARRAY[0.5, 0.95, 0.99]) WITHIN GROUP (ORDER BY duration_ms) AS percentiles FROM api_calls;
[median, p95, p99] in one pass

percentile_disc

PG 9.4+ · returns same as sort expression

Computes a discrete percentile — returns the first value whose cumulative distribution is >= fraction, always an actual input value.

Developer · Data Eng · DBA

Signature

percentile_disc ( fraction ) WITHIN GROUP ( ORDER BY sort_expression ) → same type as sort expression

Parameters

Parameter | Type | Description
fraction | double precision | Percentile as a fraction 0–1
sort_expression | sortable | Values to rank

Examples

sql
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY salary) AS median FROM employees;
60000 (actual value in dataset)
sql
SELECT percentile_disc(1.0) WITHIN GROUP (ORDER BY score) AS max_score FROM results;
Maximum score
sql
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY status_code) AS median_status FROM requests;
Actual status code at the 50th percentile
sql
SELECT region, percentile_disc(0.75) WITHIN GROUP (ORDER BY order_amount) AS p75_order FROM orders GROUP BY region;
75th percentile order value per region, always an actual order amount
Anti-Pattern: Using percentile_disc expecting an interpolated result for continuous data

`percentile_disc` always returns an actual row value and never interpolates. For continuous numeric data (e.g., response times), this can give a noticeably different result from the true mathematical percentile.

✓ Instead: Use `percentile_cont` for continuous numeric or interval data where interpolation between values is statistically correct.
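The difference is visible with an even number of inline values, where the mathematical median falls between two rows:

```sql
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY v) AS disc_median,  -- 2: an actual input value
       percentile_cont(0.5) WITHIN GROUP (ORDER BY v) AS cont_median   -- 2.5: interpolated
FROM (VALUES (1), (2), (3), (4)) AS t(v);
```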

Use `percentile_disc` when the result must be an actual data value — e.g., finding the median product category rank or the typical text response. `percentile_cont` may interpolate between values, `percentile_disc` always picks a real row.

example
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY status_code) AS median_status FROM requests;
Actual status code at the median

range_agg

PG 14+ · returns anymultirange

Computes the union of all non-NULL range values, returning a multirange.

Developer · Data Eng

Signature

range_agg ( expression anyrange ) → anymultirange

Parameters

ParameterTypeDescription
expressionanyrangeRange values to union together

Examples

sql
SELECT range_agg(during) FROM reservations WHERE room_id = 1;
{[2025-01-01,2025-01-05),[2025-01-10,2025-01-15)}
sql
SELECT room_id, range_agg(during) AS occupied FROM reservations GROUP BY room_id;
Occupied periods per room
sql
SELECT range_agg(valid_during) AS total_coverage FROM subscription_periods WHERE user_id = $1;
All subscription periods merged into a multirange
sql
SELECT tsmultirange(tsrange('2025-01-01', '2025-02-01')) - range_agg(booked_at) AS available_slots FROM bookings WHERE resource_id = 42;
Free time slots in January as a multirange
Anti-PatternManually merging overlapping ranges in application code instead of using range_agg

A common pattern before PG 14 was to fetch all ranges into the application layer and merge overlaps using custom loop logic. This is slow for large datasets and moves work out of the database.

✓ Instead: Use `range_agg` (PG 14+) to compute the union of all ranges server-side in a single aggregate pass, then subtract from the full interval to find gaps.

Use `range_agg` to merge overlapping bookings/reservations into a single multirange, then subtract from the full range to find available slots — all in SQL.

example
SELECT int8multirange(int8range(0, 86400)) - range_agg(during) AS free_slots FROM bookings WHERE day = current_date;
Free time slots as a multirange
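
The merged multirange can also be expanded back into one row per gap. A sketch, assuming a hypothetical `reservations` table with a `daterange` column `during`:

```sql
-- Subtract the union of reservations from the full month, then unnest
-- the resulting multirange into individual free periods (PG 14+).
SELECT unnest(
         datemultirange(daterange('2025-01-01', '2025-02-01'))
         - range_agg(during)
       ) AS free_gap
FROM reservations
WHERE room_id = 1;
```

Note that both operands of `-` must be multiranges, hence the `datemultirange(...)` constructor around the full-month range.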

regr_avgx

PG 8.0+double precision

Returns the average of the independent variable (X) over non-null (Y, X) pairs. Equivalent to AVG(X) FILTER (WHERE Y IS NOT NULL AND X IS NOT NULL).

Signature

regr_avgx ( Y double precision, X double precision ) → double precision

Parameters

ParameterTypeDescription
Ydouble precisionDependent variable (used only to filter NULL pairs)
Xdouble precisionIndependent variable whose average is returned

Examples

Average ad spend over complete pairs

sql
SELECT regr_avgx(revenue, ad_spend) FROM campaigns;
12500.0

Full bivariate summary

sql
SELECT regr_avgx(y, x), regr_avgy(y, x), regr_count(y, x) FROM data;
5.0 | 8.2 | 100

Mean experience per department (complete cases only)

sql
SELECT dept, regr_avgx(salary, years_exp) FROM employees GROUP BY dept;
(rows)

Manual check of regression line through means

sql
SELECT regr_intercept(y,x) - regr_slope(y,x)*regr_avgx(y,x) AS verify_intercept FROM data;
(close to regr_avgy(y,x))

Average CTR for complete click/CTR pairs

sql
SELECT regr_avgx(clicks, ctr) FROM ads WHERE date = current_date;
0.045
Anti-PatternComputing averages separately and forgetting NULL-pair filtering

AVG(X) alone includes X values where Y is NULL. regr_avgx(Y,X) only averages X where both Y and X are non-null, keeping the regression dataset consistent.

✓ Instead: Use regr_avgx(Y,X) instead of AVG(X) FILTER (WHERE X IS NOT NULL AND Y IS NOT NULL) for cleaner code.

Run regr_slope, regr_intercept, regr_r2, regr_count, regr_avgx, regr_avgy in a single SELECT to get a full OLS regression summary without external tools.

example
SELECT regr_slope(y,x), regr_intercept(y,x), regr_r2(y,x), regr_count(y,x), regr_avgx(y,x), regr_avgy(y,x) FROM data;
Complete linear regression summary in one query
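
The FILTER equivalence stated in the description can be checked directly; both columns below should match (sketch against a hypothetical `data` table):

```sql
-- avg() already skips NULL x values, so filtering on y alone is enough
-- to reproduce regr_avgx's complete-pair semantics.
SELECT regr_avgx(y, x) AS regr_version,
       avg(x) FILTER (WHERE y IS NOT NULL) AS filter_version
FROM data;
```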

regr_avgy

PG 8.0+double precision

Returns the average of the dependent variable (Y) over non-null (Y, X) pairs. The regression line always passes through (regr_avgx, regr_avgy).

Signature

regr_avgy ( Y double precision, X double precision ) → double precision

Parameters

ParameterTypeDescription
Ydouble precisionDependent variable whose average is returned
Xdouble precisionIndependent variable (used only to filter NULL pairs)

Examples

Average revenue over complete pairs

sql
SELECT regr_avgy(revenue, ad_spend) FROM campaigns;
85000.0

Mean point and slope together

sql
SELECT regr_avgy(y,x), regr_avgx(y,x), regr_slope(y,x) FROM data;
8.2 | 5.0 | 1.35

Mean salary for complete-case pairs

sql
SELECT regr_avgy(salary, years_exp) FROM employees WHERE dept = 3;
72500.0

Regression line passes through mean point

sql
SELECT abs(regr_avgy(y,x) - (regr_slope(y,x)*regr_avgx(y,x) + regr_intercept(y,x))) < 0.001 FROM data;
true

Mean margin per product

sql
SELECT product, regr_avgy(margin, volume) FROM sales GROUP BY product;
(rows)
Anti-PatternForgetting that regr_avgy filters NULL pairs

regr_avgy(Y,X) excludes rows where either Y or X is NULL, so it may differ from AVG(Y). This ensures consistency with the regression dataset.

✓ Instead: If you need the unconditional mean of Y, use AVG(Y). Use regr_avgy(Y,X) only when you want the mean restricted to the complete-case regression dataset.

A fundamental property of OLS: the regression line passes through (regr_avgx, regr_avgy). Use this as a sanity check: regr_avgy ≈ regr_slope*regr_avgx + regr_intercept.

example
SELECT regr_avgy(y,x) AS mean_y, regr_slope(y,x)*regr_avgx(y,x) + regr_intercept(y,x) AS predicted_mean_y FROM data;
Both values should be equal (within floating-point precision)

regr_count

PG 8.0+bigint

Returns the number of non-null (Y, X) input pairs. Useful to check regression sample size before interpreting regr_slope, regr_r2 etc.

Signature

regr_count ( Y double precision, X double precision ) → bigint

Parameters

ParameterTypeDescription
Ydouble precisionDependent variable
Xdouble precisionIndependent variable

Examples

Complete pairs used in regression

sql
SELECT regr_count(revenue, ad_spend) FROM campaigns;
142

Total rows vs complete-case count

sql
SELECT count(*), regr_count(y, x) FROM data;
200 | 142

Minimum sample size check

sql
SELECT regr_count(y,x) >= 30 AS sufficient_sample FROM data;
true

Filter groups with too few complete pairs

sql
SELECT dept, regr_count(salary, experience), regr_slope(salary, experience) FROM employees GROUP BY dept HAVING regr_count(salary, experience) >= 10;
(rows)

Report N alongside correlation

sql
SELECT regr_count(y,x), corr(y,x) FROM t;
50 | 0.82
Anti-PatternUsing COUNT(*) to check regression sample size

COUNT(*) counts all rows including those with NULL Y or X. The actual regression uses only complete pairs — regr_count() gives the true N.

✓ Instead: Use regr_count(Y,X) for the actual regression sample size instead of COUNT(*) or COUNT(X) separately.

A high regr_r2 with a tiny regr_count is meaningless. Always include regr_count() in regression output to show the sample size supporting the coefficients.

example
SELECT regr_slope(y,x), regr_r2(y,x), regr_count(y,x) AS n FROM data GROUP BY segment;
Slope, R², and N for each segment

regr_intercept

PG 8.0+double precision

Returns the Y-intercept of the least-squares-fit linear equation determined by (X, Y) pairs. The intercept is the predicted Y when X = 0.

Signature

regr_intercept ( Y double precision, X double precision ) → double precision

Parameters

ParameterTypeDescription
Ydouble precisionDependent variable
Xdouble precisionIndependent variable

Examples

Baseline revenue when ad_spend = 0

sql
SELECT regr_intercept(revenue, ad_spend) FROM campaigns;
15000.0

Full linear model parameters

sql
SELECT regr_slope(y,x) AS slope, regr_intercept(y,x) AS intercept FROM data;
1.35 | 4.2

Predict Y at X=500

sql
SELECT regr_slope(y,x) * 500 + regr_intercept(y,x) AS predicted_at_500 FROM data;
679.2

Base salary per dept (0 years experience)

sql
SELECT dept, regr_intercept(salary, years_exp) FROM employees GROUP BY dept;
(rows)

Verify: intercept = avgy - slope*avgx

sql
SELECT abs(regr_intercept(y,x) - (regr_avgy(y,x) - regr_slope(y,x)*regr_avgx(y,x))) < 0.001 FROM t;
true
Anti-PatternExtrapolating far outside the training range

The linear model from regr_intercept/regr_slope is only valid near the range of the training data. Predicting at X=0 when all training X values are in the thousands may give nonsensical intercepts.

✓ Instead: Centre and scale X before regression, or use the model only for interpolation within the data range.

regr_slope() and regr_intercept() let you do simple OLS prediction directly in SQL. Compute the model once in a CTE, then apply it to new X values in the main query.

example
WITH model AS (SELECT regr_slope(y,x) AS m, regr_intercept(y,x) AS b FROM training) SELECT x, m*x+b AS predicted FROM model, new_data;
Predictions for new X values using the trained model
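
Centring X, as suggested above, is a one-line change. A sketch against a hypothetical `data` table:

```sql
-- Subtracting avg(x) leaves the slope unchanged but moves the intercept
-- to the predicted Y at the mean of X, a point inside the data range.
SELECT regr_slope(y, x - m.ax)      AS slope,
       regr_intercept(y, x - m.ax)  AS y_at_mean_x
FROM data
CROSS JOIN (SELECT avg(x) AS ax FROM data) m;
```

The centred intercept is far easier to interpret than "predicted Y at X = 0" when zero lies outside the observed X range.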

regr_r2

PG 8.0+double precision

Returns the square of the correlation coefficient (R²) for the linear regression of Y on X. Indicates what fraction of Y variance is explained by X (0=none, 1=perfect).

Signature

regr_r2 ( Y double precision, X double precision ) → double precision

Parameters

ParameterTypeDescription
Ydouble precisionDependent variable
Xdouble precisionIndependent variable

Examples

76% of revenue variance explained by ad spend

sql
SELECT regr_r2(revenue, ad_spend) FROM campaigns;
0.76

Rounded R² for reporting

sql
SELECT round(regr_r2(y,x)::numeric, 4) AS r_squared FROM data;
0.8312

corr² equals regr_r2 for simple linear regression

sql
SELECT corr(y,x)^2 AS corr_squared, regr_r2(y,x) FROM data;
0.69 | 0.69

Feature importance ranking by R²

sql
SELECT feature, regr_r2(target, feature_val) FROM feature_data GROUP BY feature ORDER BY 2 DESC;
(rows)

Flag models with poor explanatory power

sql
SELECT regr_r2(y,x) < 0.3 AS weak_fit FROM data;
false
Anti-PatternTreating high R² as proof of causation

R² measures linear fit quality, not causation. A high R² can come from confounders, overfitting a small dataset, or pure coincidence.

✓ Instead: Report R² alongside regr_count() (sample size) and corr() (directionality). Always ask whether the relationship makes domain sense.

Run regr_r2(target, feature) for each candidate feature column to quickly rank predictors by univariate explanatory power. A fast feature selection step before more expensive multivariate analysis.

example
SELECT unnest(ARRAY['age','income','score']) AS feat, unnest(ARRAY[regr_r2(y,age), regr_r2(y,income), regr_r2(y,score)]) AS r2 FROM t ORDER BY 2 DESC;
Features ranked by R² with the target variable

regr_slope

PG 8.0+double precision

Returns the slope of the least-squares fit linear equation for the data points (Y, X).

DeveloperData EngDBA

Signature

regr_slope ( Y double precision, X double precision ) → double precision

Parameters

ParameterTypeDescription
Ydouble precisionDependent variable
Xdouble precisionIndependent variable

Examples

sql
SELECT regr_slope(revenue, ad_spend) FROM campaigns;
3.24 (revenue per dollar of ad spend)
sql
SELECT regr_slope(y, x), regr_intercept(y, x) FROM data_points;
Slope and intercept of best-fit line
sql
SELECT regr_slope(revenue, month_num) AS monthly_growth, regr_intercept(revenue, month_num) AS base FROM monthly_revenue;
Linear growth rate and base value
sql
SELECT service, regr_slope(error_count, request_count) AS error_rate_trend FROM daily_service_metrics GROUP BY service HAVING count(*) >= 14;
Error rate trend slope per service over observed days
Anti-PatternApplying linear regression to non-linear relationships

`regr_slope` fits a straight line. If the relationship between Y and X is exponential, logarithmic, or cyclical, the slope will be misleading and forecasts will be inaccurate.

✓ Instead: Always plot or inspect the data first. For exponential growth, apply `ln()` to the Y values before regression. Use `regr_r2` to validate how well the linear model actually fits.

Compute trend line coefficients directly in SQL: `regr_slope(y, x) * new_x + regr_intercept(y, x)` gives a forecasted value. No need for Python/R for simple linear projections.

example
SELECT regr_slope(revenue, month_num) AS monthly_growth, regr_intercept(revenue, month_num) AS base FROM monthly_revenue;
Linear growth rate and base value
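
For exponential growth, the log-transform mentioned in the anti-pattern looks like this (sketch; assumes strictly positive revenue):

```sql
-- Regress ln(y) on x: the slope is then a continuous growth rate and
-- exp(intercept) the fitted base level.
SELECT regr_slope(ln(revenue), month_num)          AS log_growth_rate,
       exp(regr_intercept(ln(revenue), month_num)) AS fitted_base
FROM monthly_revenue
WHERE revenue > 0;
```

The `WHERE revenue > 0` guard is required because `ln()` raises an error on zero or negative inputs.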

regr_sxx

PG 8.0+double precision

Returns SUM((X - avgX)²), the sum of squared deviations of X from its mean. Used internally for regression calculations; also useful for manual OLS verification.

Signature

regr_sxx ( Y double precision, X double precision ) → double precision

Parameters

ParameterTypeDescription
Ydouble precisionDependent variable (used to filter NULL pairs)
Xdouble precisionIndependent variable whose sum of squares is computed

Examples

Sum of squared deviations of X

sql
SELECT regr_sxx(y, x) FROM data;
450.0

Verify: slope = Sxy/Sxx

sql
SELECT regr_sxy(y,x) / regr_sxx(y,x) AS manual_slope, regr_slope(y,x) FROM data;
1.35 | 1.35

Derive sample variance of X

sql
SELECT regr_sxx(y,x), regr_count(y,x), regr_sxx(y,x)/(regr_count(y,x)-1) AS sample_varX FROM data;
450 | 50 | 9.18

X dispersion per department

sql
SELECT dept, regr_sxx(salary, experience) FROM employees GROUP BY dept;
(rows)

Sxx and Syy measure different dispersions

sql
SELECT regr_sxx(y,x) = regr_syy(y,x) FROM data;
false (X and Y dispersions generally differ)
Anti-PatternManually computing SUM((x-avg(x))^2) in application code

Computing sum of squares manually with two database round-trips (first for avg, then for sum) is slower and less numerically stable than regr_sxx() which uses a single-pass online algorithm.

✓ Instead: Use regr_sxx(y,x) directly — single-pass, numerically stable, and eliminates the need for a correlated subquery.

The standard error of the regression slope = sqrt(MSE / regr_sxx(y,x)), where MSE = (regr_syy(y,x) - regr_sxy(y,x)²/regr_sxx(y,x)) / (N-2). Useful for confidence intervals without external tools.

example
SELECT sqrt((regr_syy(y,x) - regr_sxy(y,x)^2/regr_sxx(y,x)) / ((regr_count(y,x)-2) * regr_sxx(y,x))) AS slope_stderr FROM data;
Standard error of the OLS slope estimate

regr_sxy

PG 8.0+double precision

Returns SUM((X - avgX) * (Y - avgY)), the sum of cross-deviations (cross-product). This is the numerator of the OLS slope formula.

Signature

regr_sxy ( Y double precision, X double precision ) → double precision

Parameters

ParameterTypeDescription
Ydouble precisionDependent variable
Xdouble precisionIndependent variable

Examples

Sum of cross-deviations

sql
SELECT regr_sxy(y, x) FROM data;
607.5

Manual slope = Sxy / Sxx

sql
SELECT regr_sxy(y,x) / regr_sxx(y,x) AS slope_check, regr_slope(y,x) FROM data;
1.35 | 1.35

Manual correlation = Sxy / sqrt(Sxx*Syy)

sql
SELECT regr_sxy(y,x) / sqrt(regr_sxx(y,x) * regr_syy(y,x)) AS corr_check, corr(y,x) FROM data;
0.82 | 0.82

Positive: Y and X move in same direction

sql
SELECT sign(regr_sxy(y,x)) AS direction FROM data;
1

Products with strongest revenue-cost co-movement

sql
SELECT product, regr_sxy(revenue, cost) FROM financials GROUP BY product ORDER BY 2 DESC;
(rows)
Anti-PatternTreating regr_sxy as a direct measure of correlation strength

regr_sxy is not bounded; its magnitude depends on sample size and the scale of both variables. A large regr_sxy does not mean a strong correlation.

✓ Instead: Use corr(Y,X) for a scale-free [-1, 1] measure of linear association strength.

regr_slope = Sxy/Sxx and corr = Sxy/sqrt(Sxx*Syy). If your computed slope or correlation does not match these formulas, there is a data or NULL-handling issue.

example
SELECT regr_sxy(y,x)/regr_sxx(y,x) AS manual_slope, regr_slope(y,x), regr_sxy(y,x)/sqrt(regr_sxx(y,x)*regr_syy(y,x)) AS manual_corr, corr(y,x) FROM data;
Manual and built-in values should match exactly

regr_syy

PG 8.0+double precision

Returns SUM((Y - avgY)²), the sum of squared deviations of Y from its mean. Represents total variance of Y in the regression dataset (used for R² computation).

Signature

regr_syy ( Y double precision, X double precision ) → double precision

Parameters

ParameterTypeDescription
Ydouble precisionDependent variable whose sum of squares is computed
Xdouble precisionIndependent variable (used to filter NULL pairs)

Examples

Total sum of squares for Y

sql
SELECT regr_syy(y, x) FROM data;
891.25

Fraction of Y variance unexplained by X

sql
SELECT 1 - regr_sxy(y,x)^2 / (regr_sxx(y,x)*regr_syy(y,x)) AS sse_fraction FROM data;
0.31

Manual R² matches built-in

sql
SELECT regr_sxy(y,x)^2 / (regr_sxx(y,x)*regr_syy(y,x)) AS r2_check, regr_r2(y,x) FROM data;
0.69 | 0.69

Derive sample variance of Y from Syy

sql
SELECT regr_syy(y,x)/(regr_count(y,x)-1) AS sample_varY, var_samp(y) FROM data;
18.2 | 18.2

Total salary dispersion per dept (complete cases)

sql
SELECT dept, regr_syy(salary, tenure) FROM employees GROUP BY dept;
(rows)
Anti-PatternConfusing regr_syy with total row count variance

regr_syy filters to complete (Y,X) pairs, so it may differ from var_samp(Y)*(N-1) computed over all rows if there are NULLs in X.

✓ Instead: Use regr_syy(Y,X) only within the context of the regression family — it is consistent with regr_count, regr_sxx, etc. For the unconditional variance of Y use var_samp(Y); for the unconditional total sum of squares use var_samp(Y) * (count(Y) - 1).

regr_syy = SSTotal, regr_sxy²/regr_sxx = SSRegression (explained), their difference = SSResidual (unexplained). R² = SSRegression/SSTotal.

example
SELECT regr_syy(y,x) AS ss_total, regr_sxy(y,x)^2/regr_sxx(y,x) AS ss_regression, regr_syy(y,x) - regr_sxy(y,x)^2/regr_sxx(y,x) AS ss_residual FROM data;
ANOVA decomposition in a single query

stddev

PG 8.0+double precision or numeric

Returns the sample standard deviation of all non-NULL input values (alias for stddev_samp).

DeveloperData EngDBA

Signature

stddev ( expression ) → double precision or numeric

Parameters

ParameterTypeDescription
expressionnumeric or floatValues to compute standard deviation over

Examples

sql
SELECT stddev(score) FROM test_results;
12.34 (varies)
sql
SELECT avg(score) AS mean, stddev(score) AS sd FROM scores;
Mean and standard deviation
sql
SELECT product_id, avg(price) AS avg_price, stddev(price) AS price_spread FROM price_history GROUP BY product_id;
Price volatility per product
sql
SELECT stddev(latency_ms) AS jitter FROM network_samples WHERE host = 'db-primary' AND sampled_at > now() - interval '1 hour';
Network jitter (latency variability) over the last hour
Anti-PatternUsing stddev on a very small sample and treating it as population stddev

`stddev` (alias for `stddev_samp`) applies Bessel's correction (divides by N-1), which is appropriate for samples but inflates the value for tiny groups. On a group of 2 rows, the result can be misleading.

✓ Instead: Use `stddev_pop` when working with complete populations (all data, not a sample). Always check `count(*) > 30` before drawing statistical conclusions from `stddev`.

Use `avg(x) ± 2 * stddev(x)` to define a normal range. Values outside this range are statistical outliers. Combine with a CTE or subquery to label outliers inline.

example
SELECT id, value, CASE WHEN abs(value - avg_val) > 2 * sd THEN 'outlier' ELSE 'normal' END FROM (SELECT *, avg(value) OVER () AS avg_val, stddev(value) OVER () AS sd FROM measurements) t;
Each row labelled as outlier or normal

stddev_pop

PG 8.0+double precision (float8) or numeric

Returns the population standard deviation of the input values. Use when your data IS the entire population (not a sample). Returns NULL if no non-null inputs.

Developer

Signature

stddev_pop ( expression numeric/real/double precision ) → double precision/numeric

Parameters

ParameterTypeDescription
expressionnumeric, real, or double precisionNumeric values to compute population standard deviation over

Examples

Population stddev when you have all data points

sql
SELECT stddev_pop(score) FROM census_scores;
12.34

Per-class grade spread

sql
SELECT class_id, stddev_pop(grade) FROM grades GROUP BY class_id;
(rows)

Population vs sample stddev differ slightly

sql
SELECT stddev_pop(price), stddev_samp(price) FROM products;
14.2 | 14.3

As a window function

sql
SELECT stddev_pop(val) OVER (PARTITION BY group_id) FROM data;
(rows)

Rounded population stddev for reporting

sql
SELECT round(stddev_pop(salary)::numeric, 2) FROM employees WHERE dept = 1;
5432.10
Anti-PatternUsing stddev_pop() on survey samples

stddev_pop() assumes you have all data points. For sample data estimating a larger group, the population formula underestimates variability.

✓ Instead: Use stddev_samp() (or the alias stddev()) for samples — it uses the Bessel-corrected formula (divides by N-1).

stddev_pop() divides by N (use when data is the entire population). stddev_samp() / stddev() divides by N-1 (use for samples estimating a larger population). For typical survey or sample data, use stddev_samp().

example
SELECT stddev_pop(val) AS pop_std, stddev_samp(val) AS sample_std FROM measurements;
Both values; pop_std is always ≤ sample_std

stddev_samp

PG 8.0+double precision (float8) or numeric

Returns the sample standard deviation (Bessel-corrected, divides by N-1). This is an alias for stddev(). Use when your data is a sample from a larger population.

Developer

Signature

stddev_samp ( expression numeric/real/double precision ) → double precision/numeric

Parameters

ParameterTypeDescription
expressionnumeric, real, or double precisionNumeric values to compute sample standard deviation over

Examples

Variation in sampled response times

sql
SELECT stddev_samp(response_time_ms) FROM request_sample;
45.67

stddev_samp and stddev are identical

sql
SELECT stddev_samp(val), stddev(val) FROM data;
12.3 | 12.3

Revenue variability per region

sql
SELECT region, round(stddev_samp(revenue)::numeric,2) FROM sales GROUP BY region;
(rows)

Rolling 10-row sample stddev

sql
SELECT stddev_samp(val) OVER (ORDER BY ts ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) FROM readings;
(rows)

Sample stddev of 1-5 = sqrt(2.5)

sql
SELECT stddev_samp(x) FROM (VALUES (1::float),(2),(3),(4),(5)) t(x);
1.5811388300841898
Anti-PatternUsing stddev_samp() on a single-row result

stddev_samp() with only one input returns NULL (dividing by N-1 = 0 is undefined). Always check COUNT(*) >= 2 before interpreting results.

✓ Instead: Guard with HAVING count(val) >= 2 (count(val) already ignores NULLs) so groups with fewer than two values are excluded instead of silently producing NULL.

Rows beyond avg ± 2*stddev_samp are statistical outliers. Window functions cannot appear in WHERE, so compute them once in a subquery and filter in the outer query.

example
SELECT * FROM (SELECT *, avg(val) OVER () AS mean_val, stddev_samp(val) OVER () AS sd FROM t) s WHERE abs(val - mean_val) > 2 * sd;
Rows more than 2 standard deviations from the mean

string_agg

PG 9.0+text

Concatenates non-NULL string values with a delimiter. Supports ORDER BY to control output order.

DeveloperData Eng

Signature

string_agg ( value text, delimiter text [ORDER BY ...] ) → text

Parameters

ParameterTypeDescription
valuetext or byteaString to concatenate; NULLs are skipped
delimitertext or byteaSeparator between values

Examples

sql
SELECT string_agg(name, ', ' ORDER BY name) FROM tags;
css, html, javascript
sql
SELECT user_id, string_agg(role, ' | ') FROM user_roles GROUP BY user_id;
admin | editor | viewer
sql
SELECT string_agg(email, E'\n') FROM newsletter_subscribers;
Newline-separated email list
sql
SELECT category, string_agg(DISTINCT tag, ', ' ORDER BY tag) FROM items GROUP BY category;
Deduplicated, sorted tag list per category
Anti-PatternOmitting the delimiter argument or using an empty string unintentionally

`string_agg` requires exactly two arguments. Passing an empty string `''` as the delimiter will concatenate values without any separator, which is often a mistake when a comma or space was intended.

✓ Instead: Always explicitly pass the desired delimiter. If you truly want no separator, document it clearly. Use `array_agg` + `array_to_string` if you need to switch delimiters later.

`string_agg` is the cleanest way to produce comma-separated value strings in SQL. Use it in reporting queries where the application expects a single string field with multiple values.

example
SELECT category, string_agg(product_name, ', ' ORDER BY product_name) AS products FROM items GROUP BY category;
"Electronics: laptop, phone, tablet"
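
When the delimiter may change later, aggregate into an array first and render at the end, as the anti-pattern suggests (sketch):

```sql
-- array_agg keeps the raw values; array_to_string picks the delimiter at
-- render time, so switching from ', ' to ' / ' is a one-character edit.
SELECT category,
       array_to_string(array_agg(product_name ORDER BY product_name), ' / ') AS products
FROM items
GROUP BY category;
```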

sum

PG 8.0+bigint for smallint/int, numeric for bigint, same type for float/numeric

Computes the sum of all non-NULL input values.

DeveloperData Eng

Signature

sum ( expression ) → same as input or numeric

Parameters

ParameterTypeDescription
expressionnumeric typeValue to sum; NULLs are ignored

Examples

sql
SELECT sum(amount) FROM payments;
94823.50

sum over zero rows (or all-NULL values) is NULL, not 0

sql
SELECT sum(amount) FROM payments WHERE amount IS NULL;
NULL
sql
SELECT COALESCE(sum(amount), 0) FROM payments WHERE user_id = 999;
0
sql
SELECT product_id, sum(quantity) AS total_sold FROM order_items GROUP BY product_id ORDER BY total_sold DESC;
Total units sold per product
Anti-PatternAssuming sum() returns 0 for empty groups

`sum()` returns NULL — not 0 — when there are no rows or all values are NULL. This can propagate silently through calculations and produce unexpected NULL results downstream.

✓ Instead: Always wrap with `COALESCE(sum(col), 0)` when a zero value is semantically correct for empty groups.

This NULL-not-zero behaviour applies anywhere the aggregate runs (window frames, FILTER clauses, correlated subqueries), so apply `COALESCE(sum(x), 0)` at the outermost point where a zero result is meaningful.

example
SELECT user_id, COALESCE(sum(points), 0) AS total_points FROM user_points GROUP BY user_id;
0 for users with no points rows
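
The return-type promotion noted above can be observed directly with `pg_typeof` (sketch):

```sql
-- sum(integer) is promoted to bigint and sum(bigint) to numeric,
-- so large totals cannot silently overflow.
SELECT pg_typeof(sum(g))          AS int_sum_type,     -- bigint
       pg_typeof(sum(g::bigint))  AS bigint_sum_type   -- numeric
FROM generate_series(1, 3) AS g;
```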

var_pop

PG 8.0+double precision (float8) or numeric

Returns the population variance of the input values (square of stddev_pop). Use when your data represents the entire population.

Developer

Signature

var_pop ( expression numeric/real/double precision ) → double precision/numeric

Parameters

ParameterTypeDescription
expressionnumeric, real, or double precisionNumeric values to compute population variance over

Examples

Population variance when you have all data

sql
SELECT var_pop(score) FROM complete_census;
152.25

Equivalent to stddev_pop()

sql
SELECT sqrt(var_pop(price)) AS pop_stddev FROM products;
14.2

Pop vs sample variance comparison

sql
SELECT var_pop(val), var_samp(val) FROM measurements;
100.0 | 102.0

Sales variance per product

sql
SELECT product_id, var_pop(daily_sales) FROM sales GROUP BY product_id;
(rows)

Partition-level variance as window function

sql
SELECT var_pop(x) OVER (PARTITION BY group_id) FROM data;
(rows)
Anti-PatternComputing variance manually with AVG

Manually computing avg((x - mean)^2) requires two passes over the data (or a subquery to obtain the mean first). var_pop() computes the same result in a single pass using a numerically stable algorithm.

✓ Instead: Use var_pop(col) directly — it is more accurate (avoids catastrophic cancellation for large values) and requires only one scan.

Variances of independent variables add (var(X+Y) = var(X) + var(Y)), while standard deviations do not. For multi-level rollups, combine per-group variances and means via the law of total variance and take sqrt() only at the end.

example
WITH g AS (SELECT category, var_pop(price) AS v, avg(price) AS m, count(*) AS n FROM products GROUP BY category) SELECT sqrt(sum(n*v)/sum(n) + sum(n*m^2)/sum(n) - (sum(n*m)/sum(n))^2) AS overall_stddev FROM g;
Population stddev over all products, reconstructed from per-category statistics

var_samp

PG 8.0+double precision (float8) or numeric

Returns the sample variance (Bessel-corrected, divides by N-1). Alias for variance(). Use when your data is a sample estimating a larger population.

Developer

Signature

var_samp ( expression numeric/real/double precision ) → double precision/numeric

Parameters

ParameterTypeDescription
expressionnumeric, real, or double precisionNumeric values to compute sample variance over

Examples

Variance of latency sample

sql
SELECT var_samp(latency_ms) FROM perf_sample;
2345.6

var_samp and variance are identical

sql
SELECT var_samp(val), variance(val) FROM t;
25.0 | 25.0

Most variable regions first

sql
SELECT region, round(var_samp(revenue)::numeric, 2) FROM sales GROUP BY region ORDER BY 2 DESC;
(rows)

Rolling 5-row sample variance

sql
SELECT var_samp(score) OVER (ORDER BY exam_date ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) FROM results;
(rows)

Classify sensor stability

sql
SELECT CASE WHEN var_samp(val) < 10 THEN 'stable' ELSE 'variable' END FROM sensors WHERE sensor_id = 3;
stable
Anti-PatternUsing var_samp on complete population data

If your table contains every member of the population (not a sample), var_samp inflates variance by dividing by N-1 instead of N.

✓ Instead: Use var_pop() when you have the full population — it gives the exact variance, not an unbiased estimate.

var_samp() returns NULL for a single row (N-1=0). Always pair with COUNT() >= 2 when the group might have only one member to avoid silent NULLs in reports.

example
SELECT g, var_samp(val) FROM t GROUP BY g HAVING COUNT(*) >= 2;
Only groups with enough data for a meaningful sample variance

variance

PG 8.0+double precision or numeric

Returns the sample variance of all non-NULL input values (alias for var_samp).

DeveloperData EngDBA

Signature

variance ( expression ) → double precision or numeric

Parameters

ParameterTypeDescription
expressionnumeric or floatValues to compute variance over

Examples

sql
SELECT variance(score) FROM test_results;
152.3 (varies)
sql
SELECT variance(response_time) AS variability FROM requests;
Variance of response times
sql
SELECT sensor_id, avg(reading) AS mean_val, variance(reading) AS instability FROM sensor_data GROUP BY sensor_id;
Sensor stability report
sql
SELECT sqrt(variance(x) + variance(y)) AS combined_stddev FROM measurements;
Combined standard deviation of two independent variables
Anti-PatternComparing variance values across different units or scales

Variance is scale-dependent: doubling measurement units quadruples the variance. Comparing `variance(price_usd)` to `variance(quantity)` is meaningless because the units differ.

✓ Instead: Standardize values (z-score normalize) before comparing spread across different columns, or compare coefficients of variation (`stddev / avg`) instead.

`variance(x)` = `stddev(x)^2`. Use variance when you need to add contributions from multiple independent sources (variances add, standard deviations don't). For human-readable spread, use stddev.

example
SELECT sqrt(variance(x) + variance(y)) AS combined_stddev FROM measurements;
Combined standard deviation from two independent variables

xmlagg

PG 8.4+xml

Concatenates XML values into an XML sequence. All non-null inputs are concatenated in aggregate order. Returns NULL if no non-null values exist.

DeveloperDBA

Signature

xmlagg ( xml ) → xml

Parameters

ParameterTypeDescription
expressionxmlXML value to aggregate; NULLs are ignored

Examples

Build XML fragment from rows

sql
SELECT xmlagg(xmlelement(name item, id, name)) FROM products WHERE active;
<item>1Widget</item><item>2Gadget</item>

Wrap aggregated items in root element

sql
SELECT xmlelement(name catalog, xmlagg(xmlelement(name product, xmlattributes(id AS id), name))) FROM products;
<catalog><product id="1">Widget</product>...</catalog>

ORDER BY within aggregation

sql
SELECT xmlagg(xmlelement(name row, val) ORDER BY val) FROM t;
<row>a</row><row>b</row>

One XML fragment per department

sql
SELECT department, xmlagg(xmlelement(name emp, name)) FROM employees GROUP BY department;
(rows)

Aggregate pre-built XML values

sql
SELECT xmlagg(data) FROM (SELECT '<?xml version="1.0"?><x>1</x>'::xml AS data) t;
<?xml version="1.0"?><x>1</x>
Anti-PatternConcatenating XML with string_agg()

Using string_agg() to concatenate XML-formatted strings bypasses XML validation and escaping, producing invalid XML if any value contains <, >, or & characters.

✓ Instead: Use xmlagg(xmlelement(name tag, col)) to let PostgreSQL handle proper XML escaping automatically.

Pair xmlagg() with xmlelement() and xmlattributes() to build well-formed XML documents directly in SQL, avoiding round-trips to application code for simple XML serialisation.

example
SELECT xmlelement(name orders, xmlagg(xmlelement(name order, xmlattributes(id AS id), total))) FROM orders WHERE status = 'shipped';
<orders><order id="1">99.99</order>...</orders>
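
The escaping difference behind the anti-pattern above can be demonstrated inline (sketch with literal values):

```sql
-- xmlelement escapes reserved characters, so the aggregated result stays
-- well-formed even when values contain < or &; a string_agg of the same
-- hand-built tags would emit the raw characters.
SELECT xmlagg(xmlelement(name item, v) ORDER BY v) AS safe_xml
FROM (VALUES ('a < b'), ('c & d')) t(v);
-- → <item>a &lt; b</item><item>c &amp; d</item>
```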