Performance

All performance claims in this project are backed by measured data from Criterion.rs benchmarks with 95% confidence intervals, validated across three or more consecutive runs. An improvement is accepted only when the before and after confidence intervals do not overlap.

Current Baseline

Measured on the development machine. Absolute numbers vary by hardware; relative improvements and algorithmic scaling are hardware-independent.

Headline Numbers

| Function | Scale | Wall Clock | Throughput |
|---|---|---|---|
| sessionize | 1 billion | 1.20 s | 830 Melem/s |
| retention (combine) | 100 million | 274 ms | 365 Melem/s |
| window_funnel | 100 million | 791 ms | 126 Melem/s |
| sequence_match | 100 million | 1.05 s | 95 Melem/s |
| sequence_count | 100 million | 1.18 s | 85 Melem/s |
| sequence_match_events | 100 million | 1.07 s | 93 Melem/s |
| sequence_next_node | 10 million | 546 ms | 18 Melem/s |

Per-Element Cost

| Function | Cost per Element | Bound |
|---|---|---|
| sessionize | 1.2 ns | CPU-bound (register-only loop) |
| retention | 2.7 ns | CPU-bound (bitmask OR) |
| window_funnel | 7.9 ns | Memory-bound at 100M (1.6 GB working set) |
| sequence_match | 10.5 ns | Memory-bound (NFA + event traversal) |
| sequence_count | 11.8 ns | Memory-bound (NFA restart overhead) |

Algorithmic Complexity

| Function | Update | Combine | Finalize | Space |
|---|---|---|---|---|
| sessionize | O(1) | O(1) | O(1) | O(1) |
| retention | O(k) | O(1) | O(k) | O(1) |
| window_funnel | O(1) | O(m) | O(n*k) | O(n) |
| sequence_match | O(1) | O(m) | O(n*s) | O(n) |
| sequence_count | O(1) | O(m) | O(n*s) | O(n) |
| sequence_match_events | O(1) | O(m) | O(n*s) | O(n) |
| sequence_next_node | O(1) | O(m) | O(n*k) | O(n) |

Where n = events, m = events in other state, k = conditions (up to 32), s = pattern steps.

Optimization History

Fifteen engineering sessions have produced the current performance profile. Each optimization below was measured independently with before/after Criterion data and non-overlapping confidence intervals (except where noted as negative results).

Event Bitmask Optimization

Replaced the per-event Vec<bool> with an integer bitmask (originally u8, later widened to u32; see Presorted Detection + 32-Condition Support), eliminating per-event heap allocation and enabling Copy semantics. The Event struct shrank from 32+ bytes with a heap allocation to 16 bytes with zero heap allocation.
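A minimal sketch of the shape of this change; the field and method names are illustrative, not the crate's actual definitions:

```rust
// Before (illustrative): one heap allocation per event, no Copy.
#[allow(dead_code)]
struct EventBefore {
    timestamp_us: i64,
    conditions: Vec<bool>, // 24-byte Vec header plus a heap buffer
}

// After (illustrative): bit i set means condition i matched this event.
#[derive(Clone, Copy)]
struct Event {
    timestamp_us: i64,
    condition_mask: u32, // up to 32 conditions, zero heap allocation
}

impl Event {
    fn matches(&self, condition: u32) -> bool {
        self.condition_mask & (1 << condition) != 0
    }
}

fn main() {
    // i64 (8) + u32 (4) + padding (4) = 16 bytes, and Event is Copy.
    assert_eq!(std::mem::size_of::<Event>(), 16);
    let e = Event { timestamp_us: 1_000, condition_mask: 0b101 };
    assert_eq!(e.timestamp_us, 1_000);
    assert!(e.matches(0) && !e.matches(1) && e.matches(2));
}
```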

| Function | Scale | Before | After | Speedup |
|---|---|---|---|---|
| window_funnel | 1,000 events | 18.6 us | 1.65 us | 11.3x |
| window_funnel | 100,000 events | 2.35 ms | 188 us | 12.5x |
| sequence_match | 10,000 events | 260 us | 28.0 us | 9.3x |

In-Place Combine

Replaced O(N^2) merge-allocate combine with O(N) in-place extend. Switched from stable sort (TimSort) to unstable sort (pdqsort). Eliminated redundant pattern cloning.
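A hedged sketch of the in-place pattern, assuming combine merges event buffers ordered by timestamp; names are illustrative and the real function also carries pattern and mode fields:

```rust
#[derive(Clone, Copy)]
struct Event {
    timestamp_us: i64,
    condition_mask: u32,
}

// Illustrative "after": extend the target's existing allocation in O(N) and
// sort once with the in-place unstable sort (pdqsort). The "before" built a
// fresh merged Vec and stable-sorted it on every combine call.
fn combine_in_place(target: &mut Vec<Event>, other: &[Event]) {
    target.extend_from_slice(other);
    target.sort_unstable_by_key(|e| e.timestamp_us);
}

fn main() {
    let mut a = vec![Event { timestamp_us: 3, condition_mask: 1 }];
    let b = [Event { timestamp_us: 1, condition_mask: 2 }];
    combine_in_place(&mut a, &b);
    assert_eq!(a[0].timestamp_us, 1);
    assert_eq!(a[1].condition_mask, 1);
}
```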

| Function | Scale | Before | After | Speedup |
|---|---|---|---|---|
| window_funnel_combine | 100 states | 10.9 us | 466 ns | 23x |
| window_funnel_combine | 1,000 states | 781 us | 3.03 us | 258x |
| window_funnel_combine | 10,000 states | 75.3 ms | 30.9 us | 2,436x |

NFA Lazy Matching

Fixed .* exploration order in the NFA executor. The NFA was using greedy matching (consume event first), causing O(n^2) backtracking. Switching to lazy matching (advance pattern first) reduced complexity to O(n * pattern_len).
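A toy recursive matcher that illustrates the ordering change only; it is not the project's NFA executor and does not reproduce its complexity characteristics:

```rust
// Toy pattern steps: Cond(i) must match the next event's condition id;
// AnyStar stands in for '.*'. All names here are illustrative.
#[derive(Clone, Copy)]
enum Step {
    Cond(u32),
    AnyStar,
}

fn matches(pattern: &[Step], events: &[u32]) -> bool {
    match pattern.first() {
        None => true, // pattern exhausted: match found
        Some(Step::Cond(c)) => match events.first() {
            Some(e) if e == c => matches(&pattern[1..], &events[1..]),
            _ => false,
        },
        Some(Step::AnyStar) => {
            // Lazy order: try to advance the pattern FIRST (consume zero
            // events), and only then fall back to consuming an event. The
            // greedy order (consume the event first) is what produced the
            // O(n^2) backtracking in the original executor.
            matches(&pattern[1..], events)
                || (!events.is_empty() && matches(pattern, &events[1..]))
        }
    }
}

fn main() {
    use Step::*;
    let pattern = [Cond(1), AnyStar, Cond(2)];
    assert!(matches(&pattern, &[1, 9, 9, 2]));
    assert!(!matches(&pattern, &[2, 1]));
}
```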

| Function | Scale | Before | After | Speedup |
|---|---|---|---|---|
| sequence_match | 100 events | 750 ns | 454 ns | 1.76x |
| sequence_match | 1,000,000 events | 4.43 s | 2.31 ms | 1,961x |

Billion-Row Benchmarks

Expanded benchmarks to 100M for event-collecting functions and 1B for O(1)-state functions. Validated that all functions scale as documented with no hidden bottlenecks at extreme scale.

| Function | Scale | Wall Clock | Throughput |
|---|---|---|---|
| sessionize_update | 1 billion | 1.21 s | 824 Melem/s |
| retention_combine | 1 billion | 3.06 s | 327 Melem/s |
| window_funnel | 100 million | 898 ms | 111 Melem/s |
| sequence_match | 100 million | 1.08 s | 92.7 Melem/s |
| sequence_count | 100 million | 1.57 s | 63.5 Melem/s |

ClickHouse Feature Parity

Architectural refactoring session. Implemented combinable FunnelMode bitflags, three new window_funnel modes (strict_increase, strict_once, allow_reentry), multi-step advancement per event, and the sequence_match_events function. No performance optimization targets.
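A minimal sketch of what combinable mode flags can look like; the constants and representation are illustrative assumptions (the real code may use the bitflags crate or different values):

```rust
// Illustrative flag set; bit values and names are assumptions.
#[derive(Clone, Copy, PartialEq, Eq)]
struct FunnelMode(u8);

impl FunnelMode {
    const STRICT_INCREASE: FunnelMode = FunnelMode(1 << 0);
    const STRICT_ONCE: FunnelMode = FunnelMode(1 << 1);
    const ALLOW_REENTRY: FunnelMode = FunnelMode(1 << 2);

    fn contains(self, other: FunnelMode) -> bool {
        self.0 & other.0 == other.0
    }
}

impl std::ops::BitOr for FunnelMode {
    type Output = FunnelMode;
    fn bitor(self, rhs: FunnelMode) -> FunnelMode {
        FunnelMode(self.0 | rhs.0)
    }
}

fn main() {
    // Modes combine freely, which is what "combinable bitflags" buys here.
    let mode = FunnelMode::STRICT_INCREASE | FunnelMode::ALLOW_REENTRY;
    assert!(mode.contains(FunnelMode::STRICT_INCREASE));
    assert!(!mode.contains(FunnelMode::STRICT_ONCE));
}
```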

Presorted Detection + 32-Condition Support

Added an O(n) is_sorted check before sorting. When events arrive in timestamp order (the common case for ORDER BY queries), the O(n log n) sort is skipped entirely. The O(n) overhead for unsorted input is negligible.
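A hedged sketch of the guard; names are illustrative:

```rust
#[derive(Clone, Copy)]
struct Event {
    timestamp_us: i64,
    condition_mask: u32,
}

// Illustrative sort guard: one O(n) pass detects already-sorted input (the
// common ORDER BY case) and skips the O(n log n) sort entirely.
fn sort_if_needed(events: &mut [Event]) {
    let presorted = events
        .windows(2)
        .all(|w| w[0].timestamp_us <= w[1].timestamp_us);
    if !presorted {
        events.sort_unstable_by_key(|e| e.timestamp_us);
    }
}

fn main() {
    let mut evs = [
        Event { timestamp_us: 2, condition_mask: 0 },
        Event { timestamp_us: 1, condition_mask: 1 },
    ];
    sort_if_needed(&mut evs);
    assert_eq!(evs[0].timestamp_us, 1);
    assert_eq!(evs[0].condition_mask, 1);
}
```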

Also expanded condition support from u8 (8 conditions) to u32 (32 conditions). This was a zero-cost change: the Event struct remains 16 bytes due to alignment padding from the i64 field.

Negative Results

Three optimization hypotheses were tested and all produced negative results. These are documented to prevent redundant investigation:

  • LSD Radix Sort: 8-bit radix, 8 passes. Measured 4.3x slower at 100M elements. The scatter pattern has poor cache locality for 16-byte structs. pdqsort's in-place partitioning is superior for embedded-key structs.

  • Branchless Sessionize: Replaced conditional branches with i64::from(bool) and max/min (see the sketch after this list). Measured 5-10% slower. The branch predictor handles the 90/10 gap-exceeds-threshold pattern with near-perfect accuracy. CMOV adds fixed overhead that exceeds the near-zero cost of correctly predicted branches.

  • Retention Update Micro-optimization: Analysis showed the benchmark bottleneck is Vec<bool> allocation in the test harness, not the bitmask update loop. No optimization possible without changing the benchmark structure.
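For reference, a hedged sketch of the rejected branchless idea, assuming sessionize's inner loop reduces to a gap-versus-threshold test per consecutive timestamp pair (the max/min part of the variant is not shown):

```rust
// Branchy version (kept): the predictor handles the ~90/10 split of
// "gap exceeds threshold" with near-perfect accuracy, so the branch is
// effectively free.
fn count_sessions_branchy(timestamps: &[i64], gap_us: i64) -> i64 {
    let mut sessions = i64::from(!timestamps.is_empty());
    for w in timestamps.windows(2) {
        if w[1] - w[0] > gap_us {
            sessions += 1;
        }
    }
    sessions
}

// Branchless variant (rejected): i64::from(bool) forces a setcc/cmov-style
// sequence whose fixed cost exceeded that of the predicted branch,
// measuring 5-10% slower.
fn count_sessions_branchless(timestamps: &[i64], gap_us: i64) -> i64 {
    let mut sessions = i64::from(!timestamps.is_empty());
    for w in timestamps.windows(2) {
        sessions += i64::from(w[1] - w[0] > gap_us);
    }
    sessions
}

fn main() {
    let ts = [0, 10, 2_000_000, 2_000_005];
    assert_eq!(count_sessions_branchy(&ts, 1_000_000), 2);
    assert_eq!(count_sessions_branchless(&ts, 1_000_000), 2);
}
```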

sequence_next_node Implementation

Implemented sequence_next_node with full direction (forward/backward) and base (head/tail/first_match/last_match) support. Uses a dedicated NextNodeEvent struct with per-event string storage (separate from the Copy Event struct). Simple sequential matching algorithm.

Added a dedicated benchmark (benches/sequence_next_node_bench.rs) measuring update + finalize throughput and combine operations at multiple scales. No performance optimization targets — this session focused on feature completeness, achieving complete ClickHouse behavioral analytics parity.

Arc<str> Optimization

Replaced Option<String> with Option<Arc<str>> in NextNodeEvent, reducing struct size from 40 to 32 bytes and enabling O(1) clone via reference counting.
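The shape of the change, sketched with illustrative field names:

```rust
use std::sync::Arc;

// Before (illustrative): 40 bytes, and clone copies the string bytes.
#[allow(dead_code)]
struct NextNodeEventBefore {
    timestamp_us: i64,
    condition_mask: u32,
    value: Option<String>, // 24-byte String header plus heap buffer
}

// After (illustrative): 32 bytes, and clone is an atomic refcount bump.
#[derive(Clone)]
struct NextNodeEvent {
    timestamp_us: i64,
    condition_mask: u32,
    value: Option<Arc<str>>, // 16-byte fat pointer to a shared buffer
}

fn main() {
    // On a 64-bit target these print 40 and 32 bytes respectively.
    println!("before: {} bytes", std::mem::size_of::<NextNodeEventBefore>());
    println!("after:  {} bytes", std::mem::size_of::<NextNodeEvent>());
    let e = NextNodeEvent {
        timestamp_us: 0,
        condition_mask: 1,
        value: Some(Arc::from("checkout")),
    };
    let copy = e.clone(); // O(1): no byte copy of the string data
    assert_eq!(copy.value.as_deref(), Some("checkout"));
    assert_eq!(copy.timestamp_us, 0);
    assert_eq!(copy.condition_mask, 1);
}
```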

| Function | Scale | Improvement |
|---|---|---|
| sequence_next_node update+finalize | all scales | 2.1x-5.8x |
| sequence_next_node combine | all scales | 2.8x-6.4x |

A string pool attempt (PooledEvent with Copy semantics, 24 bytes) was measured 10-55% slower at most scales due to dual-vector overhead. The key insight: below one cache line (64 bytes), clone cost matters more than absolute struct size. Arc::clone (~1ns atomic increment) consistently beats String::clone (copies len bytes).

Also added a realistic cardinality benchmark using a pool of 100 distinct Arc<str> values, showing 35% improvement over unique strings at 1M events.

E2E Validation + Custom C Entry Point

Discovered and fixed three critical bugs that the full unit-test suite (375 tests passing at the time, now 434) had missed:

  1. SEGFAULT on extension load: extract_raw_connection used incorrect pointer arithmetic. Replaced with a custom C entry point (behavioral_init_c_api) that obtains duckdb_connection via duckdb_connect directly from duckdb_extension_access.

  2. Silent function registration failure: duckdb_aggregate_function_set_name must be called on each function in a set. Without this, 6 of 7 functions silently failed to register.

  3. Incorrect combine propagation: combine_in_place did not propagate window_size_us or mode. DuckDB's segment tree creates fresh zero-initialized target states, so these fields remained zero at finalize time (see the sketch after this list).
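A sketch of the shape of that third fix, with illustrative state and field names (the real state differs):

```rust
// Illustrative state; DuckDB's segment tree hands combine a fresh,
// zero-initialized target, so scalar config must be copied across too.
#[derive(Default)]
struct FunnelState {
    window_size_us: i64,
    mode: u32,
    events: Vec<(i64, u32)>, // (timestamp_us, condition_mask)
}

fn combine_in_place(target: &mut FunnelState, other: &FunnelState) {
    // The bug: only `events` was merged, so a zero-initialized target kept
    // window_size_us == 0 and mode == 0 all the way to finalize.
    if target.window_size_us == 0 {
        target.window_size_us = other.window_size_us;
    }
    if target.mode == 0 {
        target.mode = other.mode;
    }
    target.events.extend_from_slice(&other.events);
}

fn main() {
    let mut fresh = FunnelState::default();
    let populated = FunnelState {
        window_size_us: 3_600_000_000,
        mode: 1,
        events: vec![(0, 1), (10, 2)],
    };
    combine_in_place(&mut fresh, &populated);
    assert_eq!(fresh.window_size_us, 3_600_000_000);
    assert_eq!(fresh.events.len(), 2);
}
```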

No performance optimization targets. 11 E2E tests added against real DuckDB v1.4.4 CLI.

NFA Fast Paths

Two Criterion-validated optimizations for the NFA pattern executor:

  1. NFA reusable stack: Pre-allocate the NFA state Vec once and reuse it across all starting positions via clear(), eliminating O(N) alloc/free pairs in sequence_count (see the sketch after this list).

  2. Fast-path linear scan: Pattern classification dispatches common shapes to specialized O(n) scans:

    • Adjacent conditions ((?1)(?2)): sliding window scan
    • Wildcard-separated ((?1).*(?2)): single-pass step counter
    • Complex patterns: fall back to NFA
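A hedged sketch of the reusable-stack idea only; the toy below matches a simple adjacent-condition pattern, ignores wildcards, and uses illustrative types:

```rust
/// Illustrative NFA state: (pattern step index, event index).
type NfaState = (usize, usize);

/// Count matches starting at every position, reusing one stack allocation.
/// `step_matches` stands in for the real per-step predicate.
fn count_matches(
    events: &[u32],
    pattern_len: usize,
    step_matches: impl Fn(usize, u32) -> bool,
) -> usize {
    let mut stack: Vec<NfaState> = Vec::with_capacity(64);
    let mut count = 0;
    for start in 0..events.len() {
        stack.clear(); // reuse the allocation instead of a fresh Vec per start
        stack.push((0, start));
        while let Some((step, pos)) = stack.pop() {
            if step == pattern_len {
                count += 1;
                break; // count at most one match per starting position
            }
            if pos < events.len() && step_matches(step, events[pos]) {
                stack.push((step + 1, pos + 1));
            }
        }
    }
    count
}

fn main() {
    // Pattern "(?1)(?2)" over condition ids: step 0 wants 1, step 1 wants 2.
    let events = [1, 2, 1, 2, 3];
    let n = count_matches(&events, 2, |step, ev| ev == [1, 2][step]);
    assert_eq!(n, 2);
}
```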

| Function | Scale | Improvement |
|---|---|---|
| sequence_count | 100-1M events | 33-47% (NFA reusable stack) |
| sequence_count | 100-1M events | 39-40% additional (fast-path linear scan) |
| sequence_count | 100-1M events | 56-61% combined improvement |

A first-condition pre-check was tested and produced negative results (0-8% regression). The NFA already handles non-matching starts efficiently (~5ns per check), so the extra branch in the hot loop costs more than it saves. Reverted.

Benchmark Inventory

| Benchmark | Function | Input Sizes |
|---|---|---|
| sessionize_update | sessionize | 100 to 1 billion |
| sessionize_combine | sessionize | 100 to 1 million |
| retention_update | retention | 4, 8, 16, 32 conditions |
| retention_combine | retention | 100 to 1 billion |
| window_funnel_finalize | window_funnel | 100 to 100 million |
| window_funnel_combine | window_funnel | 100 to 1 million |
| sequence_match | sequence_match | 100 to 100 million |
| sequence_count | sequence_count | 100 to 100 million |
| sequence_combine | sequence_* | 100 to 1 million |
| sequence_match_events | sequence_match_events | 100 to 100 million |
| sequence_match_events_combine | sequence_match_events | 100 to 1 million |
| sort_events | (isolated) | 100 to 100 million |
| sort_events_presorted | (isolated) | 100 to 100 million |
| sequence_next_node | sequence_next_node | 100 to 10 million |
| sequence_next_node_combine | sequence_next_node | 100 to 1 million |
| sequence_next_node_realistic | sequence_next_node | 1,000 to 1 million |

Reproduction

```sh
# Run all benchmarks
cargo bench

# Run a specific group
cargo bench -- sessionize
cargo bench -- window_funnel
cargo bench -- sequence_match_events

# Results stored in target/criterion/ for comparison
```

Criterion automatically compares against previous runs and reports confidence intervals. Run benchmarks three or more times before drawing conclusions from the data.

Methodology

  • Framework: Criterion.rs 0.8, 100 samples per benchmark, 95% CI
  • Large-scale benchmarks: 10 samples with a 60-second measurement time for 100M and 1B element counts (see the sketch after this list)
  • Validation: Every claim requires three consecutive runs with non-overlapping confidence intervals
  • Negative results: Documented with the same rigor as positive results, including measured data and root cause analysis
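A hedged sketch of how a large-scale group can be configured in Criterion; the group and input names are illustrative, not the project's actual bench code:

```rust
use std::time::Duration;

use criterion::{criterion_group, criterion_main, Criterion};

fn large_scale(c: &mut Criterion) {
    // 100M/1B-element groups: fewer samples, longer measurement window.
    let mut group = c.benchmark_group("sessionize_update_large");
    group.sample_size(10);
    group.measurement_time(Duration::from_secs(60));
    group.bench_function("1B_events", |b| {
        b.iter(|| {
            // build the aggregate state and drive update() over the input here
        })
    });
    group.finish();
}

criterion_group!(benches, large_scale);
criterion_main!(benches);
```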

Full optimization history with raw Criterion data, confidence intervals, and detailed analysis is maintained in PERF.md in the repository.