
PostgreSQL Statistical Aggregate Functions

Complete reference for PostgreSQL statistical aggregate functions covering correlation, linear regression, variance, standard deviation, percentiles, and mode. Every function includes its signature, use cases for data analysis, and worked examples. Updated for PostgreSQL 16.

19 functions

What are PostgreSQL Statistical Aggregate Functions?

PostgreSQL statistical functions are aggregate functions for performing statistical analysis directly in SQL without pulling data into a separate tool. They include var_pop() and var_samp() for variance, stddev() for standard deviation, corr() for Pearson correlation, regr_slope() and regr_intercept() for linear regression, percentile_cont() for interpolated percentiles, and mode() for the most frequent value.
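
For orientation, several of these aggregates can run side by side in a single query. A minimal sketch; the `orders` table and its columns are hypothetical:
sql
SELECT avg(total) AS mean, percentile_cont(0.5) WITHIN GROUP (ORDER BY total) AS median, stddev_samp(total) AS spread, mode() WITHIN GROUP (ORDER BY status) AS most_common_status FROM orders;
Mean, median, spread, and mode computed in one scan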

corr

PG 8.0+ · Returns: double precision

Returns the Pearson correlation coefficient between two sets of values. Returns a value from -1 (perfect negative correlation) to 1 (perfect positive correlation).


Signature

corr ( Y double precision, X double precision ) → double precision

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| Y | double precision | Dependent variable values |
| X | double precision | Independent variable values |

Examples

sql
SELECT corr(revenue, ad_spend) FROM campaigns;
Correlation between ad spend and revenue
sql
SELECT corr(price, quantity_sold) FROM sales_data;
Price-demand correlation (expected negative)
sql
SELECT corr(page_views, signups) AS pv_signup_corr FROM daily_metrics WHERE date >= CURRENT_DATE - 90;
90-day correlation between page views and signups
sql
SELECT corr(load_avg, response_time) AS load_vs_latency, corr(connection_count, response_time) AS conn_vs_latency FROM server_metrics;
Scan multiple causes of latency in one query
Anti-pattern: Treating a high correlation as evidence of a linear or causal relationship

- `corr()` returns NULL when either variable has zero standard deviation (all values identical); unguarded, this produces silent NULLs in dashboards. Check `NULLIF(stddev_pop(x), 0)` before relying on the result.
- Treating a high Pearson correlation as evidence of a linear relationship without inspecting the data: a non-linear relationship or outliers can produce misleadingly high or low `corr()` values.
- Passing arguments in the wrong order: `corr(X, Y)` vs `corr(Y, X)` give the same number (Pearson is symmetric), but the convention `corr(Y, X)` matches the regression functions such as `regr_slope(Y, X)`; mixing conventions causes confusion.
- Using `corr()` on time-series data without accounting for autocorrelation: two independently trending series will show high correlation even with no real relationship (spurious correlation).

Use `corr()` to quickly scan for relationships between columns. A correlation near 0 means no linear relationship. Values above 0.7 or below -0.7 usually warrant further investigation.

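A follow-up sketch in that spirit, reusing the `campaigns` table from the examples above: flag only correlations strong enough to warrant a closer look.
sql
SELECT corr(revenue, ad_spend) AS r, abs(corr(revenue, ad_spend)) > 0.7 AS worth_investigating FROM campaigns;
Boolean flag for |r| > 0.7, the usual 'investigate further' threshold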

covar_pop / covar_samp

PG 8.0+ · Returns: double precision

Returns the population or sample covariance of the (X, Y) pairs. Covariance measures how two variables change together.


Signatures

covar_pop ( Y double precision, X double precision ) → double precision
covar_samp ( Y double precision, X double precision ) → double precision

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| Y | double precision | First variable |
| X | double precision | Second variable |

Examples

sql
SELECT covar_samp(revenue, ad_spend) FROM campaigns;
Sample covariance between revenue and spend
sql
SELECT covar_pop(height_cm, weight_kg) FROM patients;
Population covariance between height and weight
sql
SELECT covar_pop(revenue, spend) / nullif(stddev_pop(revenue) * stddev_pop(spend), 0) AS manual_corr, corr(revenue, spend) AS builtin FROM campaigns;
Same value — two equivalent ways to compute Pearson correlation
sql
SELECT covar_samp(height_cm, weight_kg) AS sample_cov, covar_pop(height_cm, weight_kg) AS pop_cov FROM patients;
Sample vs population covariance — differ by N-1 vs N denominator
Anti-pattern: Interpreting covariance magnitude to compare relationships across different scales

- Interpreting covariance magnitude directly to compare relationships across different variable scales: a covariance of 1000 between height (cm) and weight (kg) says nothing about 'strength' compared to the covariance between other pairs; always normalize to correlation.
- Using `covar_pop` on sampled data instead of `covar_samp`, producing a downward-biased estimate (divides by N instead of N-1).
- Not protecting the normalization formula `covar_pop(y, x) / (stddev_pop(y) * stddev_pop(x))` against division by zero when either variable is constant; wrap the denominator with `NULLIF`.
- Running separate queries for `covar_pop` and `stddev_pop` when a single SELECT computes all three aggregates in one table scan.

`corr(y, x) = covar_pop(y, x) / (stddev_pop(y) * stddev_pop(x))`. Covariance alone is hard to interpret because its magnitude depends on the scales of Y and X. Prefer `corr()` for interpretable results.

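A per-group sketch of that advice (the `segment` column is assumed for illustration): report the raw covariance alongside its normalized form.
sql
SELECT segment, covar_samp(revenue, ad_spend) AS raw_cov, corr(revenue, ad_spend) AS normalized FROM campaigns GROUP BY segment;
Covariance is scale-bound; the correlation column makes groups comparable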

cume_dist (hypothetical)

PG 9.4+ · Returns: double precision

Computes the cumulative distribution of the hypothetical row: the fraction of rows with values <= the hypothetical value.


Signature

cume_dist ( args ) WITHIN GROUP ( ORDER BY sort_expression ) → double precision

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| args | any | Hypothetical value |
| sort_expression | any sortable type | Distribution to compare against |

Examples

sql
SELECT cume_dist(500) WITHIN GROUP (ORDER BY order_value) FROM orders;
Fraction of orders with value <= 500
sql
SELECT cume_dist(75000) WITHIN GROUP (ORDER BY salary) FROM employees;
Fraction of employees earning <= $75,000
sql
SELECT cume_dist(500) WITHIN GROUP (ORDER BY response_ms) AS below_500ms FROM api_logs WHERE date = CURRENT_DATE;
Fraction of requests completing in under 500ms
sql
SELECT segment, cume_dist(100) WITHIN GROUP (ORDER BY score) AS fraction_below_100 FROM user_scores GROUP BY segment;
What fraction of each segment scores below 100
Anti-pattern: Using cume_dist when only the rank position, not the cumulative fraction, is needed

- Confusing `cume_dist(x)` with `percentile_cont(f)`: `cume_dist` takes a value and returns its fractional position, while `percentile_cont` takes a fraction and returns the value at that position; they are inverses of each other.
- Expecting `cume_dist` to return 0.0 for the minimum value in the group: unlike `percent_rank`, `cume_dist` uses rank/N, so the minimum gets 1/N, not 0; code that checks `cume_dist(...) = 0` will never be true.
- Passing a hypothetical value whose type does not match the `ORDER BY` expression type: implicit casts can produce unexpected sort order and an incorrect cumulative fraction.
- Not accounting for tied values: when the hypothetical value equals several existing rows, `cume_dist` counts all tied rows as <= the value, which makes the fraction jump discontinuously at tie points.

Unlike `percent_rank`, `cume_dist` computes rank/N so the minimum value in the group gets 1/N (not 0). Use it when 'what fraction of data is at or below X' is the right question.

example
SELECT cume_dist(now() - interval '30 days') WITHIN GROUP (ORDER BY last_login) FROM users;
Fraction of users who last logged in 30+ days ago

dense_rank (hypothetical)

PG 9.4+ · Returns: bigint

Like rank(), but no gaps are left after tied values — ranks are consecutive integers.


Signature

dense_rank ( args ) WITHIN GROUP ( ORDER BY sort_expression ) → bigint

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| args | any | Hypothetical values |
| sort_expression | any sortable type | Sorted group |

Examples

sql
SELECT dense_rank(500) WITHIN GROUP (ORDER BY score DESC) FROM leaderboard;
Dense rank of score 500 on the leaderboard
sql
SELECT rank(90) WITHIN GROUP (ORDER BY score DESC) AS with_gaps, dense_rank(90) WITHIN GROUP (ORDER BY score DESC) AS no_gaps FROM scores;
Shows the difference when tied scores exist above 90
sql
SELECT dense_rank(50000) WITHIN GROUP (ORDER BY salary) FROM employees;
Dense rank of $50,000 salary — no gaps between adjacent rank levels
sql
SELECT dept, dense_rank(80000) WITHIN GROUP (ORDER BY salary) AS rank_in_dept FROM employees GROUP BY dept;
How $80,000 ranks within each department's salary distribution
Anti-pattern: Using dense_rank when downstream logic relies on gap-based tie counting

- Using `dense_rank()` when downstream logic relies on rank gaps to count tied entries: because `dense_rank` compresses gaps, you cannot infer how many rows share a rank without an additional count.
- Confusing the hypothetical `dense_rank(val) WITHIN GROUP (ORDER BY col)` with the window function `DENSE_RANK() OVER (ORDER BY col)`: the hypothetical aggregate asks where a single external value would fall in the aggregate group, not each row's rank within a partition.
- Choosing between `rank` and `dense_rank` without considering the business requirement: use `rank` for 'position in a competition with gaps' and `dense_rank` for 'number of distinct scores above this value'.
- Not testing the tie boundary: if the hypothetical value itself ties with existing rows, `dense_rank` includes those ties at the same rank level, which may or may not be the intended behavior for threshold checks.

If three rows tie for rank 2, `rank()` assigns 2,2,2 then jumps to 5. `dense_rank()` assigns 2,2,2 then 3. Use `dense_rank` when the count of distinct rank positions matters more than the exact gap.

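Because `dense_rank` counts distinct levels, a small sketch like this (reusing the `scores` table from the examples) turns it into a distinct-value counter:
sql
SELECT dense_rank(90) WITHIN GROUP (ORDER BY score DESC) - 1 AS distinct_scores_above_90 FROM scores;
Number of distinct score values strictly greater than 90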

mode (statistics)

PG 9.4+ · Returns: same as sort type

Returns the most frequent value in the sorted group. If multiple values are equally frequent, returns the smallest.


Signature

mode () WITHIN GROUP ( ORDER BY sort_expression ) → same as sort_expression

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| sort_expression | any sortable type | Values to find mode of |

Examples

sql
SELECT mode() WITHIN GROUP (ORDER BY product_id) FROM order_items;
The most-ordered product
sql
SELECT mode() WITHIN GROUP (ORDER BY error_code) AS most_common_error FROM error_log WHERE date = CURRENT_DATE;
Most frequent error code today
sql
SELECT user_id, mode() WITHIN GROUP (ORDER BY category) AS favorite_category FROM purchases GROUP BY user_id LIMIT 5;
Each user's most-purchased category
Anti-pattern: Using mode() when the distribution is nearly uniform

- Relying on `mode()` when the distribution is nearly uniform: the 'mode' is then arbitrary (the smallest among equally frequent values) and carries no analytical meaning.
- Expecting `mode()` to handle ties in a specific business-defined way; it silently returns the smallest tied value per the `ORDER BY` sort order, which may not match the desired tie-breaking rule.
- Building a manual 'most frequent value' query with `GROUP BY val ORDER BY COUNT(*) DESC LIMIT 1` inside a subquery per outer group, which is O(N log N) per group and far less efficient than `mode() WITHIN GROUP`.
- Confusing `mode()` (most frequent value) with `avg()` or the median: for multimodal distributions, none of these alone describes the distribution well, so always pair the mode with a count or histogram.

Instead of `SELECT val FROM t GROUP BY val ORDER BY count(*) DESC LIMIT 1`, use `mode() WITHIN GROUP (ORDER BY val)` — it's one aggregation step and composes cleanly in GROUP BY queries.

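Per the caveat above, pairing `mode()` with its share of rows shows whether the mode actually dominates. A sketch against the `purchases` table from the examples:
sql
WITH m AS (SELECT mode() WITHIN GROUP (ORDER BY category) AS top_cat FROM purchases) SELECT m.top_cat, round(100.0 * count(*) FILTER (WHERE p.category = m.top_cat) / count(*), 1) AS pct_share FROM purchases p CROSS JOIN m GROUP BY m.top_cat;
A mode holding 12% of rows is a very different signal from one holding 90%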

percent_rank (hypothetical)

PG 9.4+ · Returns: double precision

Computes the relative rank (fraction) of the hypothetical row: (rank - 1) / (N - 1).


Signature

percent_rank ( args ) WITHIN GROUP ( ORDER BY sort_expression ) → double precision

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| args | any | Hypothetical values |
| sort_expression | any sortable type | Sorted group |

Examples

sql
SELECT percent_rank(75000) WITHIN GROUP (ORDER BY salary) FROM employees;
Fraction of salaries below $75,000
sql
SELECT percent_rank(100) WITHIN GROUP (ORDER BY test_score) FROM students;
0.0 if 100 is the minimum score, 1.0 if it's the maximum
sql
SELECT cume_dist(75) WITHIN GROUP (ORDER BY score) AS cume, percent_rank(75) WITHIN GROUP (ORDER BY score) AS prank FROM test_results;
Compares cume_dist and percent_rank for the same value
sql
SELECT product_id, percent_rank(49.99) WITHIN GROUP (ORDER BY price) AS price_percentile FROM order_items GROUP BY product_id;
$49.99's percentile rank within each product's price history
Anti-pattern: Confusing percent_rank with cume_dist

- Confusing `percent_rank` with `cume_dist`: `percent_rank` uses (rank-1)/(N-1), so the minimum value returns 0.0, while `cume_dist` uses rank/N, so the minimum returns 1/N; they answer subtly different questions.
- Using `percent_rank` on a group with N=1: it returns 0 by definition, but a percentile from a single data point is meaningless.
- Interpreting `percent_rank(x)` as 'the x-th percentile value': it returns the percentile position OF x, not the value AT a given percentile; use `percentile_cont` for the inverse operation.
- Ignoring sort direction: `ORDER BY col DESC` gives the fraction of values that x ranks above from the top, while `ORDER BY col ASC` gives the fraction x ranks above from the bottom; document the direction explicitly.

A result of 0.75 means the hypothetical value ranks higher than 75% of the group. Unlike `cume_dist`, `percent_rank` uses (rank-1)/(N-1) so the lowest value always gets 0.

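The two framings can sit side by side; a sketch using the `employees` table from the examples:
sql
SELECT percent_rank(75000) WITHIN GROUP (ORDER BY salary) AS fraction_below, 1 - cume_dist(75000) WITHIN GROUP (ORDER BY salary) AS fraction_strictly_above FROM employees;
Both perspectives on the same hypothetical salary in one query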

percentile_cont (statistics)

PG 9.4+ · Returns: same as sort type

Computes a continuous percentile: the value at the given fraction of the sorted set, interpolating between adjacent values if necessary. The multi-fraction form returns an array.


Signatures

percentile_cont ( fraction ) WITHIN GROUP ( ORDER BY sort_expression ) → same as sort_expression
percentile_cont ( fractions anyarray ) WITHIN GROUP ( ORDER BY sort_expression ) → anyarray

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| fraction | double precision | Fraction between 0 and 1 (e.g., 0.5 = median) |
| sort_expression | numeric or comparable type | Values to compute percentile over |

Examples

sql
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary FROM employees;
Median salary (interpolated)
sql
SELECT percentile_cont(ARRAY[0.25, 0.5, 0.75]) WITHIN GROUP (ORDER BY response_ms) AS quartiles FROM api_logs;
Q1, median, Q3 response times
sql
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY salary) AS median, avg(salary) AS mean FROM employees WHERE dept = 'Engineering';
Median vs mean salary — if they differ significantly, distribution is skewed
sql
SELECT percentile_cont(ARRAY[0.5, 0.95, 0.99]) WITHIN GROUP (ORDER BY response_ms) AS p50_p95_p99 FROM api_requests;
{120, 450, 980} — median, p95, p99 latencies in one pass
Anti-pattern: Calling percentile_cont multiple times for different percentiles

- Using `AVG` instead of `percentile_cont(0.5)` for skewed distributions such as income, latency, or response times: the mean is pulled by outliers and does not represent the 'typical' value; the median is almost always more appropriate for skewed data.
- Calling `percentile_cont` once per percentile in separate queries (e.g., p50, p95, p99 in three statements) instead of using the array form `percentile_cont(ARRAY[0.5, 0.95, 0.99])`, which sorts once.
- Using `percentile_cont` on non-numeric types such as text or enums: it interpolates between values and requires a numeric sort expression; use `percentile_disc` for non-numeric types.
- Assuming `percentile_cont(0.5)` always returns a value that exists in the table: it interpolates between adjacent values, so the result may not be a real observation in the dataset.

Pass an array of fractions to get multiple percentiles in one pass: `percentile_cont(ARRAY[0.1,0.25,0.5,0.75,0.9,0.99]) WITHIN GROUP (ORDER BY latency)` gives a complete distribution summary without scanning the table multiple times.

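The array form also composes with GROUP BY; a sketch assuming an `endpoint` column on the `api_requests` table:
sql
SELECT endpoint, percentile_cont(ARRAY[0.5, 0.95, 0.99]) WITHIN GROUP (ORDER BY response_ms) AS p50_p95_p99 FROM api_requests GROUP BY endpoint;
Per-endpoint latency distribution, one sort per group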

percentile_disc (statistics)

PG 9.4+ · Returns: same as sort type

Computes a discrete percentile: the first value in sorted order whose cumulative distribution function is >= the given fraction. Returns an actual value from the dataset (no interpolation).


Signatures

percentile_disc ( fraction ) WITHIN GROUP ( ORDER BY sort_expression ) → same as sort_expression
percentile_disc ( fractions anyarray ) WITHIN GROUP ( ORDER BY sort_expression ) → anyarray

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| fraction | double precision | Fraction between 0 and 1 |
| sort_expression | any sortable type | Values to compute percentile over |

Examples

sql
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY score) AS median_score FROM test_results WHERE exam_id = 7;
Actual middle score value — always exists in the dataset
sql
SELECT percentile_disc(0.9) WITHIN GROUP (ORDER BY page_load_ms) AS p90_load FROM perf_log WHERE date = CURRENT_DATE;
P90 page load — an actual observed millisecond value
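Like `percentile_cont`, `percentile_disc` accepts an array of fractions, so several discrete percentiles need only one sort:
sql
SELECT percentile_disc(ARRAY[0.5, 0.9, 0.99]) WITHIN GROUP (ORDER BY page_load_ms) AS p50_p90_p99 FROM perf_log;
Three discrete percentiles, each an actual observed value, in one pass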
Anti-pattern: Using percentile_disc when an interpolated estimate is needed

- Using `percentile_disc` when a smoothed (interpolated) estimate is needed: it always returns an actual row value, which can differ from `percentile_cont` on even-sized datasets. Choose based on whether a real observation or an interpolated estimate is more appropriate.
- Issuing one `percentile_disc` call per fraction in separate queries; like `percentile_cont`, it accepts an array of fractions, which sorts the data once.
- Assuming NULLs in the sort column distort the result: the ordered-set aggregates (`mode`, `percentile_cont`, `percentile_disc`) ignore null values in their sorted input; it is the hypothetical-set functions (`rank`, `cume_dist`, etc.) that sort NULLs according to the `ORDER BY` rule.
- Using `MIN` or `MAX` where `percentile_disc(0.0)` or `percentile_disc(1.0)` would be clearer in the context of a percentile analysis, making the intent less obvious to readers.

`percentile_disc` works on any sortable type (text, date, enum). `percentile_cont` requires numeric types (it interpolates). Use `percentile_disc` when you need an actual value from the dataset.

example
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY created_at) AS median_signup_date FROM users;
The date that splits users into two equal halves

rank (hypothetical)

PG 9.4+ · Returns: bigint

Computes the rank that the hypothetical row formed from args would have in the sorted group. Ties receive the same rank and leave gaps.


Signature

rank ( args ) WITHIN GROUP ( ORDER BY sort_expression ) → bigint

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| args | any | Hypothetical values to rank |
| sort_expression | any sortable type | Sorted group to rank against |

Examples

sql
SELECT rank(50000) WITHIN GROUP (ORDER BY salary) FROM employees;
Rank of $50,000 salary among all employees
sql
SELECT rank('2025-01-01') WITHIN GROUP (ORDER BY created_at) FROM events;
Rank of a specific date in the events timeline
sql
SELECT dept, rank(80000) WITHIN GROUP (ORDER BY salary) AS rank_of_80k FROM employees GROUP BY dept;
Rank of $80,000 within each department's salary distribution
sql
SELECT rank(90) WITHIN GROUP (ORDER BY score DESC) AS gapped_rank, dense_rank(90) WITHIN GROUP (ORDER BY score DESC) AS dense FROM scores;
Side-by-side comparison showing gap vs no-gap behavior when ties exist above 90
Anti-pattern: Confusing hypothetical rank() with the window function RANK() OVER

- Confusing the hypothetical-set `rank(val) WITHIN GROUP (ORDER BY col)` with the window function `RANK() OVER (ORDER BY col)`: the hypothetical form asks 'where would this value rank?' against the aggregate group, not a window partition.
- Ignoring gap semantics: `rank()` leaves gaps after ties (ranks 1, 1, 3), so converting a rank to a percentile by dividing by COUNT(*) gives a misleading result when ties exist; use `percent_rank` or `cume_dist` instead.
- Using `rank()` to determine whether a hypothetical value is 'in the top N%' when `cume_dist()` directly returns the cumulative fraction without requiring division.
- Passing the hypothetical value as the wrong type (e.g., integer vs double precision): implicit casting differences can shift the rank result when the argument type and sort expression type differ subtly.

Use `cume_dist(value) WITHIN GROUP (ORDER BY col)` to answer 'what fraction of values are <= this?'. For '1 in N' framing, use `rank(value) WITHIN GROUP (ORDER BY col)` against the total count.

example
SELECT cume_dist(75) WITHIN GROUP (ORDER BY score) FROM test_results;
Fraction of scores <= 75 (e.g., 0.85 = 85th percentile)
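
For the '1 in N' framing, the hypothetical rank and the group size can come from the same query (reusing the `test_results` table):
sql
SELECT rank(98) WITHIN GROUP (ORDER BY score DESC) AS position, count(*) + 1 AS out_of FROM test_results;
Reads as, e.g., 'a score of 98 would rank 3rd out of 1,251'; the +1 counts the hypothetical row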

regr_avgx / regr_avgy

PG 8.0+ · Returns: double precision

Returns the average of the independent (X) or dependent (Y) variable over rows where both values are non-null.


Signatures

regr_avgx ( Y double precision, X double precision ) → double precision
regr_avgy ( Y double precision, X double precision ) → double precision

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| Y | double precision | Dependent variable |
| X | double precision | Independent variable |

Examples

sql
SELECT avg(ad_spend) AS all_avg, regr_avgx(revenue, ad_spend) AS paired_avg FROM campaigns;
May differ if some rows have NULL revenue — shows the paired-filter effect
sql
SELECT regr_avgx(revenue, ad_spend) AS mean_x, regr_avgy(revenue, ad_spend) AS mean_y FROM campaigns;
Mean ad spend and mean revenue over the same paired subset
sql
SELECT regr_avgy(y, x) - regr_slope(y, x) * regr_avgx(y, x) AS manual_intercept, regr_intercept(y, x) AS builtin FROM data;
Verifies intercept = mean_y - slope * mean_x
Anti-pattern: Using avg(X) instead of regr_avgx when computing regression-aligned means

- Using `avg(X)` instead of `regr_avgx(Y, X)` when computing residuals or centering variables for regression: `avg(X)` includes rows where Y is NULL, producing a different mean than the one used internally by `regr_slope`.
- Confusing `regr_avgx(Y, X)` and `regr_avgy(Y, X)`: the argument order is always (Y, X) for both, but the first returns the mean of X and the second the mean of Y; the naming is counterintuitive.
- Treating `regr_avgx` and `avg(x)` as interchangeable in non-regression contexts: use plain `avg` for general-purpose averages and `regr_avg*` only when you need the paired-row subset.
- Skipping `regr_avgx` when manually verifying that the regression line passes through the point (mean_x, mean_y), and thereby computing the means over a different NULL subset than the regression used.

`regr_avgx(Y, X)` is not the same as `avg(X)` — it only includes rows where both Y and X are non-null. Use it when you need means computed over the same paired dataset used for regression.

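A sketch making the paired-NULL filter visible, again on the `campaigns` table:
sql
SELECT avg(ad_spend) AS all_rows_avg, regr_avgx(revenue, ad_spend) AS paired_avg, count(ad_spend) - regr_count(revenue, ad_spend) AS rows_dropped_by_pairing FROM campaigns;
How many non-null ad_spend rows the paired filter excludes because revenue is NULL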

regr_count

PG 8.0+ · Returns: bigint

Returns the number of rows in which both Y and X are not null. Used as the N for other regression statistics.


Signature

regr_count ( Y double precision, X double precision ) → bigint

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| Y | double precision | Dependent variable |
| X | double precision | Independent variable |

Examples

sql
SELECT regr_count(revenue, ad_spend) FROM campaigns;
Number of rows with both values non-null
sql
SELECT segment, regr_slope(ltv, age), regr_r2(ltv, age) FROM customers GROUP BY segment HAVING regr_count(ltv, age) >= 30;
Regression only for segments with sufficient data
sql
SELECT count(*) AS total_rows, regr_count(revenue, ad_spend) AS paired_rows FROM campaigns;
Reveals how many rows have NULL in either variable
sql
SELECT regr_count(y, x) AS n, regr_slope(y, x) AS slope FROM dataset HAVING regr_count(y, x) > 1;
Guard: requires at least 2 data points before computing slope
Anti-pattern: Using COUNT(*) instead of regr_count to verify regression sample size

- Using `COUNT(*)` instead of `regr_count(Y, X)` to verify the regression N: `COUNT(*)` counts all rows, including those with NULL in either variable, overstating the effective sample size.
- Not gating regression results with a `HAVING regr_count(...) >= threshold` clause: a slope computed from 2 rows is always a perfect fit (R² = 1) yet entirely unreliable.
- Assuming `regr_count` equals `COUNT(x)`: it only counts rows where BOTH X and Y are non-null; one nullable column can silently halve your effective N.
- Skipping `regr_count` validation in production pipelines, allowing near-zero-N regression outputs to propagate into forecasting models as if they were reliable.

Always check `regr_count` alongside regression results. A slope computed from 3 data points is unreliable. Filter groups with too few data points using HAVING.

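A quick data-quality sketch built on the same idea (the `customers` columns follow the example above):
sql
SELECT count(*) AS total_rows, regr_count(ltv, age) AS usable_pairs, round(100.0 * regr_count(ltv, age) / nullif(count(*), 0), 1) AS pct_usable FROM customers;
Share of rows that actually contribute to the regression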

regr_intercept

PG 8.0+ · Returns: double precision

Returns the y-intercept of the least-squares fit linear equation through the (X, Y) data points.


Signature

regr_intercept ( Y double precision, X double precision ) → double precision

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| Y | double precision | Dependent variable |
| X | double precision | Independent variable |

Examples

sql
SELECT regr_intercept(sale_price, sq_footage) FROM properties;
Base price when square footage is 0 (theoretical minimum)
sql
SELECT regr_slope(price, sqft) AS slope, regr_intercept(price, sqft) AS base, regr_r2(price, sqft) AS r_sq FROM homes;
Complete property price model: base value + $/sqft rate
sql
SELECT regr_intercept(revenue, months_active) AS baseline_revenue FROM customers WHERE cohort = '2024-Q1';
Estimated revenue at account creation (months_active = 0)
sql
SELECT regr_count(revenue, months_active) AS n, regr_r2(revenue, months_active) AS fit FROM customers HAVING regr_count(revenue, months_active) >= 30;
Only compute model when N >= 30 for statistical reliability
Anti-pattern: Interpreting the intercept as meaningful when X=0 is outside the data range

- Interpreting the intercept as a physically meaningful 'base value' when X=0 is far outside the observed data range: the intercept is often nonsensical in context (e.g., a house with 0 square feet).
- Using only the intercept or only the slope when predicting Y: predictions need both, as `intercept + slope * x`.
- Calling `regr_intercept` and `regr_slope` in separate subqueries, scanning the table twice, when a single SELECT can compute both aggregates in one pass.
- Ignoring that `regr_intercept` returns NULL when `regr_count` < 2, leading to NULL forecasts without any error or warning.

Retrieve slope, intercept, and R² together: `SELECT regr_slope(y,x), regr_intercept(y,x), regr_r2(y,x)`. The R² value tells you how well the line fits (0 = no fit, 1 = perfect fit).

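Slope and intercept together give a point prediction; a sketch on the `homes` table from above (2,000 sqft is an arbitrary illustrative input):
sql
SELECT regr_intercept(price, sqft) + regr_slope(price, sqft) * 2000 AS predicted_price_2000sqft FROM homes;
Point prediction; only meaningful if 2,000 sqft lies within the observed range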

regr_r2

PG 8.0+ · Returns: double precision

Returns the square of the Pearson correlation coefficient (coefficient of determination). Indicates how well a linear model fits the data.


Signature

regr_r2 ( Y double precision, X double precision ) → double precision

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| Y | double precision | Dependent variable |
| X | double precision | Independent variable |

Examples

sql
SELECT regr_r2(revenue, ad_spend) FROM campaigns;
Fraction of revenue variance explained by ad spend
sql
SELECT feature_name, regr_r2(churn::int, feature_value) AS predictive_power FROM feature_table GROUP BY feature_name ORDER BY predictive_power DESC;
Rank features by linear predictive power for churn
sql
SELECT regr_r2(revenue, ad_spend) AS r2, regr_count(revenue, ad_spend) AS n FROM campaigns WHERE channel = 'email';
Fit quality alongside sample size — r2 is unreliable without checking n
sql
SELECT channel, regr_r2(revenue, ad_spend) AS r2 FROM campaigns GROUP BY channel ORDER BY r2 DESC;
Compare ad-spend/revenue linearity across channels
Anti-pattern: Treating a high R² as proof of a good model

- Assuming a high R² (e.g., 0.95) means the linear model is appropriate: a curved relationship can still produce a high R² while the linearity assumption is badly violated.
- Using R² as the sole criterion for model quality without inspecting residuals: outliers or heteroscedasticity can give a misleadingly high or low R².
- Confusing `regr_r2(Y, X)` with `corr(Y, X)^2`: they are mathematically equivalent for simple linear regression, but mixing them in the same codebase causes confusion; use the most expressive form for your intent.
- Reporting R² without also reporting sample size (`regr_count`): an R² of 0.99 from 4 data points is not evidence of a good model.

An R² of 0.8 means 80% of the variance in Y is explained by X. Note: a high R² does not guarantee the relationship is causal or that the linear model is appropriate.

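R² discards the sign of the relationship, so it can help to report the correlation alongside it; a sketch on the `campaigns` table:
sql
SELECT regr_r2(revenue, ad_spend) AS r2, corr(revenue, ad_spend) AS r FROM campaigns;
r2 says how much variance is explained; the sign of r says in which direction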

regr_slope

PG 8.0+ · Returns: double precision

Returns the slope of the least-squares fit linear equation through the (X, Y) data points.


Signature

regr_slope ( Y double precision, X double precision ) → double precision

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| Y | double precision | Dependent variable |
| X | double precision | Independent variable |

Examples

sql
SELECT regr_slope(revenue, months_active) FROM customers;
Average revenue increase per additional month of activity
sql
SELECT regr_slope(cpu_pct, connection_count) FROM server_stats;
CPU increase per additional connection
sql
SELECT regr_slope(revenue, week_num) * 52 + regr_intercept(revenue, week_num) AS projected_week_52 FROM weekly_sales;
Trend-line projection of weekly revenue at week 52
sql
SELECT regr_slope(y, x) AS slope, regr_intercept(y, x) AS intercept, regr_r2(y, x) AS r_squared, regr_count(y, x) AS n FROM measurement_pairs;
Full linear model in one query — slope, intercept, fit quality, and N
Anti-pattern: Swapping Y and X arguments in regression functions

- Swapping Y and X: `regr_slope(X, Y)` gives the inverse slope, a common argument-order mistake that silently produces wrong forecasts.
- Using `regr_slope` without checking `regr_r2` or `regr_count` first: a slope computed from 3 points, or on data with R² = 0.02, is statistically meaningless but looks like a real result.
- Extrapolating far outside the observed X range: the linear model is only valid within (or near) the range of the training data, and extreme extrapolation amplifies any model misfit.
- Assuming a significant slope implies causality: a non-zero slope only means a linear association exists, not that X causes Y.

Use `regr_slope(y, x) * future_x + regr_intercept(y, x)` to forecast values in-database. This is a lightweight alternative to exporting data to a separate analytics tool.

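Folding in the gating advice from the anti-patterns, a hedged forecast sketch on the `weekly_sales` table (the thresholds are illustrative):
sql
SELECT CASE WHEN regr_r2(revenue, week_num) >= 0.5 AND regr_count(revenue, week_num) >= 12 THEN regr_slope(revenue, week_num) * 60 + regr_intercept(revenue, week_num) END AS forecast_week_60 FROM weekly_sales;
Returns NULL instead of a forecast when the fit or sample size is inadequate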

regr_sxx / regr_syy / regr_sxy

PG 8.0+ · Returns: double precision

Returns the sum of squares of X deviations, Y deviations, or the cross-product of X and Y deviations. Building blocks for manual regression analysis.


Signatures

regr_sxx ( Y double precision, X double precision ) → double precision
regr_syy ( Y double precision, X double precision ) → double precision
regr_sxy ( Y double precision, X double precision ) → double precision

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| Y | double precision | Dependent variable |
| X | double precision | Independent variable |

Examples

sql
SELECT regr_sxy(y, x) / nullif(regr_sxx(y, x), 0) AS manual_slope FROM data;
Same as regr_slope() — manual verification via building blocks
sql
SELECT regr_sxy(y, x) / nullif(sqrt(regr_sxx(y, x) * regr_syy(y, x)), 0) AS manual_corr FROM data;
Equivalent to corr(y, x) — manual Pearson correlation
sql
SELECT regr_sxx(y, x), regr_syy(y, x), regr_sxy(y, x) FROM measurement_pairs WHERE experiment_id = 3;
All three sum-of-squares components in one pass for custom statistics
Anti-pattern: Dividing regr_sxy by regr_sxx without guarding against zero

- Dividing `regr_sxy` by `regr_sxx` without guarding against `regr_sxx = 0` (all X values identical); use `NULLIF(regr_sxx(...), 0)` to avoid division by zero.
- Recomputing these building blocks in separate subqueries instead of computing all three (`regr_sxx`, `regr_syy`, `regr_sxy`) in a single aggregate scan.
- Manually computing `SUM((x - avg_x) * (y - avg_y))` via a self-join or subquery when `regr_sxy` does the same thing in one pass with better numerical stability.
- Forgetting that all three functions apply the same paired-NULL filter as `regr_count`: rows with NULL in either variable are excluded, so manual `SUM` formulas using `avg(x)` over the full table will produce different results.

`regr_slope = regr_sxy / regr_sxx` and `corr = regr_sxy / sqrt(regr_sxx * regr_syy)`. Having these building blocks allows you to compute weighted or conditional regression variants in SQL.

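One conditional-variant sketch: aggregate `FILTER` clauses give per-cohort slopes from the building blocks in a single scan (the `cohort` column is assumed for illustration):
sql
SELECT regr_sxy(y, x) FILTER (WHERE cohort = 'A') / nullif(regr_sxx(y, x) FILTER (WHERE cohort = 'A'), 0) AS slope_a, regr_sxy(y, x) FILTER (WHERE cohort = 'B') / nullif(regr_sxx(y, x) FILTER (WHERE cohort = 'B'), 0) AS slope_b FROM data;
Two conditional regression slopes without scanning the table twice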

stddev / stddev_samp

PG 8.0+ · Returns: double precision

Returns the sample standard deviation of all non-null input values. `stddev` is an alias for `stddev_samp`.


Signatures

stddev_samp ( numeric ) → double precision
stddev ( numeric ) → double precision

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| expression | numeric | Numeric values to compute sample standard deviation over |

Examples

sql
SELECT stddev(price) FROM products;
Sample standard deviation of product prices
sql
SELECT category, round(stddev(price)::numeric, 2) AS price_sd, count(*) AS n FROM products GROUP BY category HAVING count(*) >= 10;
Price spread per category — filtered to groups with enough data
sql
SELECT stddev(response_ms) AS latency_sd FROM api_logs WHERE endpoint = '/api/search' AND DATE(logged_at) = CURRENT_DATE;
Today's response time variability for the search endpoint
Anti-pattern: Using stddev on full-population data and ignoring single-row NULL behavior

- Using `stddev` (sample) on full-population data such as a complete census table: the N-1 Bessel correction slightly inflates the spread.
- Expecting `stddev` to return 0 for a single-row group instead of NULL, letting silent NULLs propagate into downstream calculations.
- Comparing standard deviations across groups with very different sample sizes without also checking group counts: a tiny-N group can produce a wildly unstable stddev.
- Running multiple query passes (one for avg, one for stddev) when both can be computed in a single aggregate scan.

Sample standard deviation requires at least 2 values. With only 1 row in a group, `stddev` returns NULL. Use `COALESCE(stddev(x), 0)` if you need 0 instead of NULL for single-row groups.

example
SELECT group_id, COALESCE(stddev(value), 0) AS spread FROM measurements GROUP BY group_id;
0 for single-measurement groups, actual spread otherwise
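
Since `avg` and `stddev` are ordinary aggregates, a mean, a spread, and a derived band come from one scan; a sketch against the `products` table used above:
sql
SELECT avg(price) AS mean, stddev(price) AS sd, avg(price) + 2 * stddev(price) AS upper_band FROM products;
Mean, spread, and a two-sigma band in a single pass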

stddev_pop

PG 8.0+ · Returns: double precision

Returns the population standard deviation of all non-null input values.


Signature

stddev_pop ( numeric ) → double precision

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| expression | numeric | Numeric values to compute population standard deviation over |

Examples

sql
SELECT stddev_pop(score) FROM test_results;
Population std dev of all scores
sql
SELECT dept, stddev_pop(salary) AS spread, avg(salary) AS mean FROM employees GROUP BY dept ORDER BY spread DESC;
Department with the highest salary spread listed first
sql
SELECT stddev_pop(score) AS pop_sd, stddev_samp(score) AS sample_sd, count(*) AS n FROM test_results;
Compare population vs sample std dev — nearly identical when n is large
Anti-pattern: Using stddev_pop on sample data instead of stddev_samp

- Using `stddev_pop` on a sample dataset instead of `stddev_samp`, producing a biased (too-small) estimate of spread.
- Confusing `stddev_pop` with `stddev`: `stddev` is an alias for `stddev_samp`, not `stddev_pop`; always check which denominator (N vs N-1) you need.
- Interpreting a zero result as 'no data': `stddev_pop` returns 0 when all values are identical, and NULL only when there are no non-null rows.
- Running separate queries for `stddev_pop` and `avg` when a single grouped query can compute both in one pass over the data.

Use `stddev_pop` when your data IS the entire population. Use `stddev_samp` (or `stddev`) when your data is a sample. For small samples, the difference is significant.

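The N vs N-1 gap is easy to demonstrate inline; for any column, the ratio of the two estimators is sqrt(n / (n - 1)):
sql
SELECT stddev_samp(x) / stddev_pop(x) AS inflation_factor, count(*) AS n FROM (VALUES (1.0), (2.0), (4.0)) AS t(x);
With n = 3 the sample estimate is about 22% larger: sqrt(3/2) ≈ 1.2247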

var_pop

PG 8.0+ · Returns: double precision

Returns the population variance of all non-null input values (square of population standard deviation).


Signature

var_pop ( numeric ) → double precision

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| expression | numeric | Numeric values to compute population variance over |

Examples

sql
SELECT var_pop(response_time_ms) FROM api_logs;
Variance of response times across all requests
sql
SELECT var_pop(latency) AS variance, sqrt(var_pop(latency)) AS stddev FROM perf_logs WHERE service = 'auth';
Verify sqrt(variance) = stddev — both express the same spread
sql
SELECT dept, var_pop(salary) AS salary_variance FROM employees GROUP BY dept ORDER BY salary_variance DESC;
Departments ranked by salary spread
sql
SELECT var_pop(daily_sales) AS sales_variance FROM (SELECT date, sum(amount) AS daily_sales FROM sales GROUP BY date) sub;
Variance of daily revenue — higher variance means more unpredictable cash flow
Anti-pattern: Treating var_pop output as a standard deviation

- Treating `var_pop` output as if it were a standard deviation: variance is in squared units, so threshold comparisons like 'variance > 100 ms' are meaningless without taking the square root.
- Using `var_pop` on sampled data instead of `var_samp`, producing a downward-biased estimate that understates the true population variance.
- Summing variances from different groups when the groups are not independent: variance addition only holds under statistical independence.
- Computing variance with a manual `AVG((x - mean)^2)` subquery instead of the built-in `var_pop`, which is numerically stabler and runs in a single pass.

`var_pop(x) = stddev_pop(x)^2`. Use variance when composing statistics (variances add under independence) and convert to std dev for human-readable output.

example
SELECT sqrt(var_pop(latency)) AS stddev_check, stddev_pop(latency) AS direct FROM perf_logs;
Both return the same value
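
The 'variances add' claim is checkable in SQL; a sketch assuming a hypothetical `paired_measurements` table with columns `a` and `b`:
sql
SELECT var_pop(a) + var_pop(b) AS sum_of_variances, var_pop(a + b) AS variance_of_sum, 2 * covar_pop(a, b) AS the_gap FROM paired_measurements;
variance_of_sum = sum_of_variances + 2 * covar_pop(a, b); equal only when a and b are uncorrelated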

var_samp / variance

PG 8.0+ · Returns: double precision

Returns the sample variance of all non-null input values. `variance` is an alias for `var_samp`.


Signatures

var_samp ( numeric ) → double precision
variance ( numeric ) → double precision

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| expression | numeric | Numeric values to compute sample variance over |

Examples

sql
SELECT variance(amount) AS tx_variance, stddev(amount) AS tx_sd FROM transactions WHERE month = '2025-01';
Variance and std dev of January transaction amounts
sql
SELECT category, variance(price) AS price_variance, count(*) AS n FROM products GROUP BY category HAVING count(*) >= 5;
Price variance per category — only statistically meaningful groups
sql
SELECT stddev(revenue) / nullif(avg(revenue), 0) AS cv FROM monthly_sales;
Coefficient of variation — relative variability across months
Anti-pattern: Displaying raw variance values to end users

- Using `variance` (sample) on a complete population table, introducing a small upward bias via the N-1 Bessel correction when N should be used instead.
- Displaying raw variance values to end users: variance is in squared units and hard to interpret; convert to standard deviation via `sqrt(variance(...))` for presentation.
- Ignoring NULL rows: `var_samp` silently skips NULLs, so the effective N can be much smaller than the table row count; cross-check `COUNT(expression)` against `COUNT(*)`.
- Computing the mean in a subquery and then summing squared deviations manually: `var_samp` does this in a single numerically stable pass without a self-join.

The coefficient of variation (CV = stddev/mean) tells you relative variability. Compute it with `stddev(x) / nullif(avg(x), 0)` — useful for comparing variability across different scales.

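The CV composes with GROUP BY for cross-scale comparisons; a sketch assuming a `product_line` column on `monthly_sales`:
sql
SELECT product_line, stddev(revenue) / nullif(avg(revenue), 0) AS cv FROM monthly_sales GROUP BY product_line ORDER BY cv;
Most consistent product lines first; CV is unitless, so different scales compare fairly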