The RStats type implements numerically stable running statistics using Welford's algorithm. This document describes the design decisions, implementation details, and rationale for the sentinel value approach used to represent empty state.
The RStats type uses count=0 with all other fields set to 0.0 as the canonical empty state representation.
- Unambiguous: count=0 is the natural definition of "no data accumulated"
- Simple: Single boolean check
(count == 0)for emptiness - No conflicts: Cannot collide with legitimate statistics (any real data has count >= 1)
While count=0 alone is sufficient to identify empty state, we enforce that all other fields (mean, m2, min, max) must also be 0.0 when count=0. This provides:
-
Canonical Representation
- Empty state has exactly one valid serialization:
(count:0,mean:0,min:0,max:0,stddev:0) - Enables byte-for-byte comparison of empty values
- Simplifies debugging and testing
- Empty state has exactly one valid serialization:
-
Clean Display
- Text output shows intuitive zeros:
(count:0,mean:0,min:0,max:0,stddev:0) - No confusing sentinel values like -1 that might suggest data
- Text output shows intuitive zeros:
-
Corruption Detection
- If count=0 but other fields are non-zero, we know data is corrupted
- Catches bugs in serialization/deserialization
- Detects memory corruption or improper initialization
-
Consistency Across Formats
- Binary and text formats produce identical empty representations
- No format-specific edge cases
The canonical empty state (count=0, all zeros) is enforced at four validation points:
Purpose: Reject malformed text input from users or external systems
When: Parsing SQL input like SELECT '(count:0,mean:1,...)'::rstats
Error Code: ERRCODE_INVALID_TEXT_REPRESENTATION
Purpose: Catch corruption in network protocol or disk storage
When: Receiving binary data from client or reading from disk
Error Code: ERRCODE_DATA_CORRUPTED
Purpose: Catch bugs in C code that corrupts RStats in memory
When: Before sending binary data to client
Error Code: ERRCODE_DATA_CORRUPTED
Note: Includes hint that this is an internal bug
Purpose: Guard any code path checking if statistics are empty
When: Called by rstats_add_value() to detect lazy initialization
Error Code: ERRCODE_DATA_CORRUPTED
The current design cannot distinguish between:
- Truly empty statistics:
rstats_set_empty() - Statistics initialized with single zero value:
rstats_init_internal(&stats, 0.0)
Both result in: count=1 for init(0), but count=0 for empty.
This is acceptable because:
- The count field correctly reflects the difference (0 vs. 1)
- Both are valid states with clear semantics
- The distinction is not operationally important
- User can check count if they need to distinguish
typedef struct RStats {
int64 count; /* 8 bytes - Primary empty indicator */
double mean; /* 8 bytes - Must be 0.0 when count=0 */
double m2; /* 8 bytes - Must be 0.0 when count=0 */
double min; /* 8 bytes - Must be 0.0 when count=0 */
double max; /* 8 bytes - Must be 0.0 when count=0 */
} RStats;
/* Total: 40 bytes, fixed-size, no varlena header */The sentinel validation has minimal performance impact:
- Validation Cost: 4 floating-point equality comparisons
- Compiler Optimization: Often optimized to SIMD comparison
- Zero Overhead in Happy Path: If count > 0, skip validation entirely
- Only on Boundaries: Validation only at I/O and emptiness checks, not during computation
If sentinel approach needs modification in the future:
- Update this documentation with rationale
- Increment type version (create
rstats--0.1--0.2.sqlmigration script) - Add compatibility layer if needed
- Update all four validation points consistently
This section documents the available operators and their semantics.
Implicit casts from numeric types initialize RStats with a single value:
-- From double precision
SELECT 42.5::rstats;
-- Result: (count:1,mean:42.5,min:42.5,max:42.5,stddev:0)
-- From integer
SELECT 100::rstats;
-- Result: (count:1,mean:100,min:100,max:100,stddev:0)
-- From numeric
SELECT 3.14159::numeric::rstats;
-- Result: (count:1,mean:3.14159,min:3.14159,max:3.14159,stddev:0)Semantics: Creates RStats initialized with a single data point.
Assignment casts for binary serialization (backup/restore):
-- RStats to bytea (serialization)
SELECT rstats()::bytea;
-- Result: \x0000000000000000...
-- Bytea to RStats (deserialization)
SELECT '\x0000000000000000...'::bytea::rstats;
-- Result: (count:0,mean:0,min:0,max:0,stddev:0)
-- Round-trip preserves state
SELECT (10::rstats + 20 + 30)::bytea::rstats;
-- Result: (count:3,mean:20,min:10,max:30,stddev:10)Semantics: Enables binary backup/restore and external storage.
Accumulates a new value into running statistics:
-- Add double precision
SELECT 10.0::rstats + 20.0 + 30.0;
-- Result: (count:3,mean:20,min:10,max:30,stddev:10)
-- Add integer
SELECT 5::rstats + 10 + 15;
-- Result: (count:3,mean:10,min:5,max:15,stddev:5)
-- Chain multiple additions
SELECT ((10.0::rstats + 20.0) + 30.0) + 40.0;
-- Result: (count:4,mean:25,min:10,max:40,stddev:12.91)Semantics:
- Updates mean, stddev (via m2), min, max using Welford's algorithm
- Order-dependent: accumulates values sequentially
- Deterministic: same sequence → same result
- Handles NULL gracefully:
stats + NULLreturns NULL
Tests for exact identity (bit-identical state):
-- Exact match: same sequence
SELECT (10::rstats + 20 + 30) = (10::rstats + 20 + 30);
-- Result: true
-- Different sequence, same statistics
SELECT (10::rstats + 20) = (15::rstats + 15);
-- Result: false (mean is same, but internal state differs)
-- Empty state comparison
SELECT rstats() = rstats();
-- Result: trueSemantics:
- Identity, not numerical equivalence: Tests if two RStats accumulated the exact same sequence of values
- Deterministic algorithm: Welford's algorithm guarantees same inputs → bit-identical state
- Exact float comparison: No epsilon tolerance (by design)
- Use cases: Caching, deduplication, identity checks in hash tables
For numerical similarity, compare extracted statistics:
-- Check if means are approximately equal
SELECT ABS((stats1 -> 'mean') - (stats2 -> 'mean')) < 0.001;Calculates the Mahalanobis distance between two statistical distributions:
-- Compare two distributions
SELECT (10::rstats + 20 + 30) <-> (15::rstats + 25 + 35);
-- Result: small value (similar distributions)
SELECT (10::rstats + 20 + 30) <-> (100::rstats + 200 + 300);
-- Result: larger value (dissimilar distributions)
-- Find queries with statistics most different from a reference
SELECT queryid, stats <-> reference_stats AS distance
FROM pg_track_optimizer, (SELECT 0::rstats + 1 + 2 AS reference_stats) ref
ORDER BY distance DESC;Semantics:
- Returns
double precisionrepresenting statistical distance - Lower values indicate more similar distributions
- Commutative:
a <-> bequalsb <-> a - Useful for clustering queries by statistical similarity or detecting outliers
Extracts statistical properties as double precision:
SELECT
stats -> 'count' AS count,
stats -> 'mean' AS mean,
stats -> 'stddev' AS stddev,
stats -> 'min' AS min,
stats -> 'max' AS max
FROM (SELECT (10::rstats + 20 + 30) AS stats) t;Available Fields:
count: Number of accumulated values (int64 → float8)mean: Arithmetic meanstddev: Sample standard deviation (n-1 denominator)min: Minimum value observedmax: Maximum value observed
Semantics:
- Returns
double precision(cast as needed) - Stddev uses sample formula:
sqrt(m2 / (count - 1)) - Returns 0 stddev for count ≤ 1
- Works with expression indexes:
CREATE INDEX ON table ((stats -> 'mean'))
SELECT rstats();
-- Result: (count:0,mean:0,min:0,max:0,stddev:0)Semantics: Creates canonical empty state.
-- Accepts any numeric-convertible type
SELECT rstats(42::int2);
SELECT rstats(3.14::float4);
SELECT rstats(100::bigint);Semantics: Initializes with a single value (delegates to type-specific casts).
Aggregates multiple values into a single RStats object:
-- Aggregate values from a table column
SELECT rstats_agg(value) FROM measurements;
-- Result: (count:N,mean:...,min:...,max:...,stddev:...)
-- Aggregate with grouping
SELECT category, rstats_agg(price) AS price_stats
FROM products
GROUP BY category;
-- Combine with field accessor
SELECT
category,
rstats_agg(price) -> 'mean' AS avg_price,
rstats_agg(price) -> 'stddev' AS price_stddev
FROM products
GROUP BY category;Semantics:
- Collects all non-NULL values in the group into running statistics
- Returns canonical empty state if no values are aggregated
- Uses Welford's algorithm internally for numerical stability
- Order of aggregation may affect floating-point rounding (typically negligible)
- Welford's Algorithm: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
- PostgreSQL Type System: https://www.postgresql.org/docs/current/xtypes.html
- Error Codes: https://www.postgresql.org/docs/current/errcodes-appendix.html
Last Updated: 2026-01-10 Author: Andrei Lepikhov Reviewers: Claude Code Review