Performance
All performance claims in this project are backed by measured data from Criterion.rs benchmarks with 95% confidence intervals, validated across three or more consecutive runs. Improvements are accepted only when confidence intervals do not overlap.
Current Baseline
Measured on the development machine. Absolute numbers vary by hardware; relative improvements and algorithmic scaling are hardware-independent.
Headline Numbers
| Function | Scale | Wall Clock | Throughput |
|---|---|---|---|
sessionize | 1 billion | 1.20 s | 830 Melem/s |
retention (combine) | 100 million | 274 ms | 365 Melem/s |
window_funnel | 100 million | 791 ms | 126 Melem/s |
sequence_match | 100 million | 1.05 s | 95 Melem/s |
sequence_count | 100 million | 1.18 s | 85 Melem/s |
sequence_match_events | 100 million | 1.07 s | 93 Melem/s |
sequence_next_node | 10 million | 546 ms | 18 Melem/s |
Per-Element Cost
| Function | Cost per Element | Bound |
|---|---|---|
sessionize | 1.2 ns | CPU-bound (register-only loop) |
retention | 2.7 ns | CPU-bound (bitmask OR) |
window_funnel | 7.9 ns | Memory-bound at 100M (1.6 GB working set) |
sequence_match | 10.5 ns | Memory-bound (NFA + event traversal) |
sequence_count | 11.8 ns | Memory-bound (NFA restart overhead) |
Algorithmic Complexity
| Function | Update | Combine | Finalize | Space |
|---|---|---|---|---|
sessionize | O(1) | O(1) | O(1) | O(1) |
retention | O(k) | O(1) | O(k) | O(1) |
window_funnel | O(1) | O(m) | O(n*k) | O(n) |
sequence_match | O(1) | O(m) | O(n*s) | O(n) |
sequence_count | O(1) | O(m) | O(n*s) | O(n) |
sequence_match_events | O(1) | O(m) | O(n*s) | O(n) |
sequence_next_node | O(1) | O(m) | O(n*k) | O(n) |
Where n = number of events in the state, m = number of events in the other (merged) state, k = number of conditions (up to 32), s = number of pattern steps.
Optimization History
Fifteen engineering sessions have produced the current performance profile. Each optimization below was measured independently with before/after Criterion data and non-overlapping confidence intervals (except where noted as negative results).
Event Bitmask Optimization
Replaced Vec<bool> per event with a u32 bitmask, eliminating per-event heap
allocation and enabling Copy semantics. The Event struct shrank from 32+
bytes with heap allocation to 16 bytes with zero heap allocation.
| Function | Scale | Before | After | Speedup |
|---|---|---|---|---|
window_funnel | 1,000 events | 18.6 us | 1.65 us | 11.3x |
window_funnel | 100,000 events | 2.35 ms | 188 us | 12.5x |
sequence_match | 10,000 events | 260 us | 28.0 us | 9.3x |
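A minimal sketch of the before/after layout described above, assuming illustrative field names (`timestamp`, `conditions`) rather than the project's actual definitions:

```rust
/// Illustrative "before": one heap allocation per event, not Copy.
#[allow(dead_code)]
struct BoolVecEvent {
    timestamp: i64,
    conditions: Vec<bool>, // 24-byte Vec header plus a heap buffer per event
}

/// Illustrative "after": conditions packed into a bitmask, fully Copy.
#[derive(Clone, Copy)]
struct Event {
    timestamp: i64,
    conditions: u32, // bit i set <=> condition i matched this event
}

impl Event {
    /// Check condition i (0-based) without touching the heap.
    fn has_condition(&self, i: u32) -> bool {
        self.conditions & (1 << i) != 0
    }
}

fn main() {
    // 8-byte i64 + 4-byte u32 + alignment padding = 16 bytes, no allocation.
    assert_eq!(std::mem::size_of::<Event>(), 16);
    assert!(std::mem::size_of::<BoolVecEvent>() >= 32); // the Vec header alone is 24 bytes
    let e = Event { timestamp: 1_700_000_000_000, conditions: 0b101 };
    assert!(e.has_condition(0) && !e.has_condition(1) && e.has_condition(2));
}
```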
In-Place Combine
Replaced O(N^2) merge-allocate combine with O(N) in-place extend. Switched from stable sort (TimSort) to unstable sort (pdqsort). Eliminated redundant pattern cloning.
| Function | Scale | Before | After | Speedup |
|---|---|---|---|---|
window_funnel_combine | 100 states | 10.9 us | 466 ns | 23x |
window_funnel_combine | 1,000 states | 781 us | 3.03 us | 258x |
window_funnel_combine | 10,000 states | 75.3 ms | 30.9 us | 2,436x |
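A sketch of the combine shape described above, using simplified types; the method names echo this document, but the bodies are illustrative, not the extension's actual implementation:

```rust
#[derive(Clone, Copy)]
struct Event {
    timestamp: i64,
    conditions: u32,
}

#[derive(Default)]
struct FunnelState {
    events: Vec<Event>,
}

impl FunnelState {
    /// O(m) combine: append the other state's events into the existing buffer
    /// instead of allocating a freshly merged Vec on every combine call.
    fn combine_in_place(&mut self, other: &FunnelState) {
        self.events.extend_from_slice(&other.events);
    }

    /// Ordering is deferred to finalize, where a single unstable sort (pdqsort)
    /// runs in place, with no auxiliary buffer as a stable TimSort would need.
    fn finalize_sort(&mut self) {
        self.events.sort_unstable_by_key(|e| e.timestamp);
    }
}

fn main() {
    let mut target = FunnelState::default();
    let other = FunnelState {
        events: vec![
            Event { timestamp: 20, conditions: 0b10 },
            Event { timestamp: 10, conditions: 0b01 },
        ],
    };
    target.combine_in_place(&other);
    target.finalize_sort();
    assert_eq!(target.events[0].timestamp, 10);
    assert_eq!(target.events[0].conditions, 0b01);
}
```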
NFA Lazy Matching
Fixed .* exploration order in the NFA executor. The NFA was using greedy
matching (consume event first), causing O(n^2) backtracking. Switching to lazy
matching (advance pattern first) reduced complexity to O(n * pattern_len).
| Function | Scale | Before | After | Speedup |
|---|---|---|---|---|
sequence_match | 100 events | 750 ns | 454 ns | 1.76x |
sequence_match | 1,000,000 events | 4.43 s | 2.31 ms | 1,961x |
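A simplified recursive matcher (not the project's NFA) that illustrates the exploration-order difference; the `Step` type and the bit-index mapping are assumptions for the sketch:

```rust
#[derive(Clone, Copy)]
enum Step {
    Cond(u32), // (?k): the event must have condition bit k set
    Wildcard,  // .*: zero or more arbitrary events
}

/// Lazy ordering: on `.*`, advance the pattern first (match zero events) and
/// only then consume an event. The greedy ordering (consume first) explores
/// the longest wildcard spans first and degrades to O(n^2) backtracking on
/// long non-matching inputs.
fn matches(pattern: &[Step], events: &[u32]) -> bool {
    match pattern.first().copied() {
        None => true, // pattern exhausted: match found
        Some(Step::Wildcard) => {
            matches(&pattern[1..], events) // lazy: advance the pattern first
                || (!events.is_empty() && matches(pattern, &events[1..]))
        }
        Some(Step::Cond(k)) => {
            !events.is_empty()
                && events[0] & (1 << k) != 0
                && matches(&pattern[1..], &events[1..])
        }
    }
}

fn main() {
    // (?1).*(?2), with (?1) on bit 0 and (?2) on bit 1.
    let pattern = [Step::Cond(0), Step::Wildcard, Step::Cond(1)];
    assert!(matches(&pattern, &[0b01, 0b00, 0b00, 0b10]));
    assert!(!matches(&pattern, &[0b10, 0b00, 0b01]));
}
```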
Billion-Row Benchmarks
Expanded benchmarks to 100M for event-collecting functions and 1B for O(1)-state functions. Validated that all functions scale as documented with no hidden bottlenecks at extreme scale.
| Function | Scale | Wall Clock | Throughput |
|---|---|---|---|
sessionize_update | 1 billion | 1.21 s | 824 Melem/s |
retention_combine | 1 billion | 3.06 s | 327 Melem/s |
window_funnel | 100 million | 898 ms | 111 Melem/s |
sequence_match | 100 million | 1.08 s | 92.7 Melem/s |
sequence_count | 100 million | 1.57 s | 63.5 Melem/s |
ClickHouse Feature Parity
Architectural refactoring session. Implemented combinable FunnelMode bitflags,
three new window_funnel modes (strict_increase, strict_once,
allow_reentry), multi-step advancement per event, and the
sequence_match_events function. No performance optimization targets.
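A hedged sketch of what combinable mode flags can look like; the mode names come from this section, while the bit layout and methods are assumptions rather than the extension's actual API:

```rust
/// Illustrative combinable mode flags for window_funnel.
#[derive(Clone, Copy)]
struct FunnelMode(u8);

impl FunnelMode {
    const STRICT_INCREASE: FunnelMode = FunnelMode(1 << 0);
    const STRICT_ONCE: FunnelMode = FunnelMode(1 << 1);
    const ALLOW_REENTRY: FunnelMode = FunnelMode(1 << 2);

    fn contains(self, other: FunnelMode) -> bool {
        self.0 & other.0 == other.0
    }
}

impl std::ops::BitOr for FunnelMode {
    type Output = FunnelMode;
    fn bitor(self, rhs: FunnelMode) -> FunnelMode {
        FunnelMode(self.0 | rhs.0)
    }
}

fn main() {
    // Modes combine freely instead of being mutually exclusive variants.
    let mode = FunnelMode::STRICT_INCREASE | FunnelMode::ALLOW_REENTRY;
    assert!(mode.contains(FunnelMode::STRICT_INCREASE));
    assert!(!mode.contains(FunnelMode::STRICT_ONCE));
}
```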
Presorted Detection + 32-Condition Support
Added an O(n) is_sorted check before sorting. When events arrive in timestamp
order (the common case for ORDER BY queries), the O(n log n) sort is skipped
entirely. The O(n) overhead for unsorted input is negligible.
Also expanded condition support from u8 (8 conditions) to u32 (32
conditions). This was a zero-cost change: the Event struct remains 16 bytes
due to alignment padding from the i64 field.
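A sketch of the presorted check using std's `is_sorted_by_key` (stable since Rust 1.82); field and function names are illustrative:

```rust
#[derive(Clone, Copy)]
struct Event {
    timestamp: i64,
    conditions: u32,
}

/// Skip the O(n log n) sort when events already arrive in timestamp order
/// (the common case for ORDER BY input); the is_sorted check itself is O(n).
fn sort_events_if_needed(events: &mut [Event]) {
    if !events.is_sorted_by_key(|e| e.timestamp) {
        events.sort_unstable_by_key(|e| e.timestamp);
    }
}

fn main() {
    let mut presorted = [
        Event { timestamp: 1, conditions: 0b01 },
        Event { timestamp: 2, conditions: 0b10 },
    ];
    sort_events_if_needed(&mut presorted); // already sorted: the sort is skipped
    assert_eq!(presorted[0].timestamp, 1);
    assert_eq!(presorted[1].conditions, 0b10);
}
```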
Negative Results
Three optimization hypotheses were tested and all produced negative results. These are documented to prevent redundant investigation:
- **LSD Radix Sort**: 8-bit radix, 8 passes. Measured 4.3x slower at 100M elements. The scatter pattern has poor cache locality for 16-byte structs. pdqsort's in-place partitioning is superior for embedded-key structs.
- **Branchless Sessionize**: Replaced conditional branches with `i64::from(bool)` and `max`/`min`. Measured 5-10% slower. The branch predictor handles the 90/10 gap-exceeds-threshold pattern with near-perfect accuracy. CMOV adds fixed overhead that exceeds the near-zero cost of correctly predicted branches (see the sketch after this list).
- **Retention Update Micro-optimization**: Analysis showed the benchmark bottleneck is `Vec<bool>` allocation in the test harness, not the bitmask update loop. No optimization possible without changing the benchmark structure.
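For reference, a simplified contrast of the two sessionize update shapes from the branchless experiment; state fields and timeout handling are illustrative, not the actual implementation:

```rust
/// Simplified sessionize update state; fields are illustrative.
struct SessionState {
    last_ts: i64,
    sessions: i64,
}

/// Branchy version (the one that wins): the gap check is a highly predictable
/// branch, so correctly predicted iterations cost essentially nothing.
fn update_branchy(state: &mut SessionState, ts: i64, timeout: i64) {
    if ts - state.last_ts > timeout {
        state.sessions += 1; // gap exceeded the threshold: new session
    }
    state.last_ts = ts;
}

/// Branchless variant that measured 5-10% slower: the arithmetic/CMOV form
/// pays a fixed cost on every element, predicted or not.
fn update_branchless(state: &mut SessionState, ts: i64, timeout: i64) {
    state.sessions += i64::from(ts - state.last_ts > timeout);
    state.last_ts = ts;
}

fn main() {
    let mut a = SessionState { last_ts: 0, sessions: 0 };
    let mut b = SessionState { last_ts: 0, sessions: 0 };
    for &ts in &[10, 20, 500, 510, 2_000] {
        update_branchy(&mut a, ts, 100);
        update_branchless(&mut b, ts, 100);
    }
    // Both count the two gaps (480 and 1,490) that exceed the 100-unit timeout.
    assert_eq!(a.sessions, b.sessions);
    assert_eq!(a.sessions, 2);
}
```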
sequence_next_node Implementation
Implemented sequence_next_node with full direction (forward/backward) and
base (head/tail/first_match/last_match) support. Uses a dedicated
NextNodeEvent struct with per-event string storage (separate from the
Copy Event struct). Simple sequential matching algorithm.
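An illustrative sketch of the parameter and event shapes described above; the exact definitions (field names, enum layout) are assumptions, though the 40-byte size matches the figure quoted in the next section:

```rust
/// Direction and base options for sequence_next_node; variant names mirror
/// this section, the exact representation is an assumption.
#[allow(dead_code)]
enum Direction { Forward, Backward }

#[allow(dead_code)]
enum Base { Head, Tail, FirstMatch, LastMatch }

/// Per-event storage with its own string payload, separate from the Copy
/// Event struct used by the other functions. With an owned String this
/// layout is 40 bytes on 64-bit targets.
struct NextNodeEvent {
    timestamp: i64,
    conditions: u32,
    value: Option<String>, // the event's string value (see the Arc<str> change below)
}

fn main() {
    assert_eq!(std::mem::size_of::<NextNodeEvent>(), 40);
    let _ = (Direction::Forward, Base::FirstMatch);
    let e = NextNodeEvent { timestamp: 0, conditions: 0b1, value: Some("checkout".to_string()) };
    assert_eq!(e.value.as_deref(), Some("checkout"));
}
```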
Added a dedicated benchmark (benches/sequence_next_node_bench.rs) measuring
update + finalize throughput and combine operations at multiple scales. No
performance optimization targets — this session focused on feature
completeness, achieving complete ClickHouse behavioral analytics parity.
Arc<str> Optimization
Replaced Option<String> with Option<Arc<str>> in NextNodeEvent, reducing
struct size from 40 to 32 bytes and enabling O(1) clone via reference counting.
| Function | Scale | Improvement |
|---|---|---|
sequence_next_node update+finalize | all scales | 2.1x-5.8x |
sequence_next_node combine | all scales | 2.8x-6.4x |
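A sketch of the struct after the change, with assumed field names; the size assertions show where the 40 to 32 byte reduction and the O(1) clone come from:

```rust
use std::sync::Arc;

/// After the change: the string payload is an Option<Arc<str>>, so cloning an
/// event bumps a reference count instead of copying bytes.
struct NextNodeEvent {
    timestamp: i64,
    conditions: u32,
    value: Option<Arc<str>>,
}

fn main() {
    // Option<String> is 24 bytes; Option<Arc<str>> is a 16-byte fat pointer
    // (data pointer + length) with a niche for None, so the struct shrinks 40 -> 32.
    assert_eq!(std::mem::size_of::<Option<String>>(), 24);
    assert_eq!(std::mem::size_of::<Option<Arc<str>>>(), 16);
    assert_eq!(std::mem::size_of::<NextNodeEvent>(), 32);

    let event = NextNodeEvent {
        timestamp: 0,
        conditions: 0b1,
        value: Some(Arc::from("checkout")),
    };
    // O(1) clone: an atomic increment, no byte copy regardless of string length.
    let cheap_copy = NextNodeEvent {
        timestamp: event.timestamp,
        conditions: event.conditions,
        value: event.value.clone(),
    };
    assert_eq!(cheap_copy.value.as_deref(), Some("checkout"));
}
```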
A string pool attempt (PooledEvent with Copy semantics, 24 bytes) was
measured 10-55% slower at most scales due to dual-vector overhead. The key
insight: below one cache line (64 bytes), clone cost matters more than absolute
struct size. Arc::clone (~1ns atomic increment) consistently beats
String::clone (copies len bytes).
Also added a realistic cardinality benchmark using a pool of 100 distinct
Arc<str> values, showing 35% improvement over unique strings at 1M events.
E2E Validation + Custom C Entry Point
Discovered and fixed 3 critical bugs that the full unit suite (375 passing tests at the time, now 434) missed:
- **SEGFAULT on extension load**: `extract_raw_connection` used incorrect pointer arithmetic. Replaced with a custom C entry point (`behavioral_init_c_api`) that obtains `duckdb_connection` via `duckdb_connect` directly from `duckdb_extension_access`.
- **Silent function registration failure**: `duckdb_aggregate_function_set_name` must be called on each function in a set. Without this, 6 of 7 functions silently failed to register.
- **Incorrect combine propagation**: `combine_in_place` did not propagate `window_size_us` or `mode`. DuckDB's segment tree creates fresh zero-initialized target states, so these fields remained zero at finalize time (a sketch of the fix follows below).
No performance optimization targets. 11 E2E tests added against real DuckDB v1.4.4 CLI.
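A minimal sketch of the combine-propagation fix from the third bug above, using the field names mentioned there (`window_size_us`, `mode`) on otherwise assumed types:

```rust
/// Assumed state shape for the sketch; only the named fields come from the bug report.
#[derive(Clone, Copy, Default, PartialEq)]
struct FunnelMode(u8);

#[derive(Default)]
struct FunnelState {
    window_size_us: i64,
    mode: FunnelMode,
    events: Vec<(i64, u32)>, // (timestamp, condition bitmask)
}

/// DuckDB's segment tree hands combine a fresh zero-initialized target, so the
/// scalar configuration must be propagated along with the events. The original
/// bug merged only the events, leaving window_size_us == 0 at finalize time.
fn combine_in_place(target: &mut FunnelState, source: &FunnelState) {
    if target.window_size_us == 0 {
        target.window_size_us = source.window_size_us;
    }
    if target.mode == FunnelMode::default() {
        target.mode = source.mode;
    }
    target.events.extend_from_slice(&source.events);
}

fn main() {
    let source = FunnelState {
        window_size_us: 3_600_000_000, // 1-hour window
        mode: FunnelMode(1),
        events: vec![(1, 0b1)],
    };
    let mut target = FunnelState::default(); // fresh zero-initialized state
    combine_in_place(&mut target, &source);
    assert_eq!(target.window_size_us, 3_600_000_000);
    assert!(target.mode == FunnelMode(1));
    assert_eq!(target.events.len(), 1);
}
```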
NFA Fast Paths
Two Criterion-validated optimizations for the NFA pattern executor:
- **NFA reusable stack**: Pre-allocate the NFA state Vec once and reuse it across all starting positions via `clear()`. Eliminates O(N) alloc/free pairs in `sequence_count` (sketched after the table below).
- **Fast-path linear scan**: Pattern classification dispatches common shapes to specialized O(n) scans:
  - Adjacent conditions (`(?1)(?2)`): sliding window scan
  - Wildcard-separated (`(?1).*(?2)`): single-pass step counter
  - Complex patterns: fall back to NFA
| Function | Scale | Improvement |
|---|---|---|
sequence_count | 100-1M events | 33-47% (NFA reusable stack) |
sequence_count | 100-1M events | 39-40% additional (fast-path linear scan) |
sequence_count | 100-1M events | 56-61% combined improvement |
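A self-contained sketch of the reusable-stack idea (not the project's NFA): the work stack is allocated once and cleared per starting position; the pattern representation and match semantics are simplified:

```rust
#[derive(Clone, Copy)]
enum Step {
    Cond(u32), // (?k): the event must have condition bit k set
    Wildcard,  // .*: zero or more arbitrary events
}

/// Count non-overlapping matches, restarting after each match. The work stack
/// is allocated once and cleared for every starting position, so the scan does
/// no allocation after the first reservation.
fn sequence_count(pattern: &[Step], events: &[u32]) -> usize {
    let mut stack: Vec<(usize, usize)> = Vec::with_capacity(64); // (pattern_pos, event_pos)
    let mut count = 0;
    let mut start = 0;
    while start < events.len() {
        stack.clear(); // reuse the buffer: no alloc/free pair per start
        stack.push((0, start));
        let mut matched_end = None;
        while let Some((p, e)) = stack.pop() {
            match pattern.get(p).copied() {
                None => {
                    matched_end = Some(e); // pattern exhausted: match found
                    break;
                }
                Some(Step::Wildcard) => {
                    if e < events.len() {
                        stack.push((p, e + 1)); // consume an event (tried second)
                    }
                    stack.push((p + 1, e)); // lazy: advance the pattern first
                }
                Some(Step::Cond(k)) => {
                    if e < events.len() && events[e] & (1 << k) != 0 {
                        stack.push((p + 1, e + 1));
                    }
                }
            }
        }
        match matched_end {
            Some(end) => {
                count += 1;
                start = end.max(start + 1); // continue after the match
            }
            None => start += 1,
        }
    }
    count
}

fn main() {
    // (?1).*(?2), with (?1) on bit 0 and (?2) on bit 1.
    let pattern = [Step::Cond(0), Step::Wildcard, Step::Cond(1)];
    assert_eq!(sequence_count(&pattern, &[0b01, 0b00, 0b10, 0b01, 0b10]), 2);
}
```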
A first-condition pre-check was tested and produced negative results (0-8% regression). The NFA already handles non-matching starts efficiently (~5ns per check), so the extra branch in the hot loop costs more than it saves. Reverted.
Benchmark Inventory
| Benchmark | Function | Input Sizes |
|---|---|---|
sessionize_update | sessionize | 100 to 1 billion |
sessionize_combine | sessionize | 100 to 1 million |
retention_update | retention | 4, 8, 16, 32 conditions |
retention_combine | retention | 100 to 1 billion |
window_funnel_finalize | window_funnel | 100 to 100 million |
window_funnel_combine | window_funnel | 100 to 1 million |
sequence_match | sequence_match | 100 to 100 million |
sequence_count | sequence_count | 100 to 100 million |
sequence_combine | sequence_* | 100 to 1 million |
sequence_match_events | sequence_match_events | 100 to 100 million |
sequence_match_events_combine | sequence_match_events | 100 to 1 million |
sort_events | (isolated) | 100 to 100 million |
sort_events_presorted | (isolated) | 100 to 100 million |
sequence_next_node | sequence_next_node | 100 to 10 million |
sequence_next_node_combine | sequence_next_node | 100 to 1 million |
sequence_next_node_realistic | sequence_next_node | 1,000 to 1 million |
Reproduction
```sh
# Run all benchmarks
cargo bench

# Run a specific group
cargo bench -- sessionize
cargo bench -- window_funnel
cargo bench -- sequence_match_events

# Results stored in target/criterion/ for comparison
```
Criterion automatically compares against previous runs and reports confidence intervals. Run benchmarks three or more times before drawing conclusions from the data.
Methodology
- Framework: Criterion.rs 0.8, 100 samples per benchmark, 95% CI
- Large-scale benchmarks: 10 samples with 60-second measurement time for 100M and 1B element counts
- Validation: Every claim requires three consecutive runs with non-overlapping confidence intervals
- Negative results: Documented with the same rigor as positive results, including measured data and root cause analysis
Full optimization history with raw Criterion data, confidence intervals, and
detailed analysis is maintained in
PERF.md
in the repository.