duckdb-behavioral
Behavioral analytics for DuckDB -- session analysis, conversion funnels, retention cohorts, and event sequence pattern matching, all inside your SQL queries.
duckdb-behavioral is a loadable DuckDB extension written in Rust that brings
ClickHouse-style behavioral analytics functions
to DuckDB. It ships seven battle-tested functions that cover the core patterns
of user behavior analysis, with complete ClickHouse feature parity and
benchmark-validated performance at billion-row scale.
No external services, no data pipelines, no additional infrastructure. Load the extension, write SQL, get answers.
What Can You Do With This?
Session Analysis
Break a continuous stream of events into logical sessions based on inactivity gaps. Identify how many sessions a user has per day, how long each session lasts, and where sessions begin and end.
SELECT user_id, event_time,
sessionize(event_time, INTERVAL '30 minutes') OVER (
PARTITION BY user_id ORDER BY event_time
) as session_id
FROM events;
Conversion Funnels
Track how far users progress through a multi-step conversion funnel (page view, add to cart, checkout, purchase) within a time window. Identify exactly where users drop off.
SELECT user_id,
window_funnel(INTERVAL '1 hour', event_time,
event_type = 'page_view',
event_type = 'add_to_cart',
event_type = 'checkout',
event_type = 'purchase'
) as furthest_step
FROM events
GROUP BY user_id;
Retention Cohorts
Measure whether users who appeared in a cohort (e.g., signed up in January) returned in subsequent periods. Build the classic retention triangle directly in SQL.
SELECT cohort_month,
retention(
activity_date = cohort_month,
activity_date = cohort_month + INTERVAL '1 month',
activity_date = cohort_month + INTERVAL '2 months',
activity_date = cohort_month + INTERVAL '3 months'
) as retained
FROM user_activity
GROUP BY user_id, cohort_month;
Event Sequence Pattern Matching
Detect complex behavioral patterns using a mini-regex over event conditions. Find users who viewed a product, then purchased within one hour -- with any number of intervening events.
SELECT user_id,
sequence_match('(?1).*(?t<=3600)(?2)', event_time,
event_type = 'view',
event_type = 'purchase'
) as converted_within_hour
FROM events
GROUP BY user_id;
User Journey / Flow Analysis
Discover what users do after a specific behavioral sequence. What page do users visit after navigating from Home to Product?
SELECT
sequence_next_node('forward', 'first_match', event_time, page,
page = 'Home',
page = 'Home',
page = 'Product'
) as next_page,
COUNT(*) as user_count
FROM events
GROUP BY ALL
ORDER BY user_count DESC;
Quick Installation
Community Extension
The extension is listed in the DuckDB Community Extensions repository:
INSTALL behavioral FROM community;
LOAD behavioral;
No build tools, compilation, or -unsigned flag required.
From Source
git clone https://github.com/tomtom215/duckdb-behavioral.git
cd duckdb-behavioral
cargo build --release
Then load the extension in DuckDB:
LOAD 'path/to/target/release/libbehavioral.so'; -- Linux
LOAD 'path/to/target/release/libbehavioral.dylib'; -- macOS
Note: DuckDB requires the
-unsignedflag for locally-built extensions:duckdb -unsigned
For detailed installation instructions, troubleshooting, and a complete worked example, see the Getting Started guide.
Functions
Choosing the Right Function
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#ffffff', 'primaryTextColor': '#1a1a1a', 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'secondaryColor': '#f5f5f5', 'tertiaryColor': '#e0e0e0', 'textColor': '#1a1a1a'}}}%%
flowchart TD
Q{What do you want<br/>to analyze?}
Q -->|"Break events<br/>into sessions"| S["sessionize"]
Q -->|"Did users come<br/>back over time?"| R["retention"]
Q -->|"How far through<br/>a multi-step flow?"| WF["window_funnel"]
Q -->|"Did a specific<br/>event pattern occur?"| SM{Need details?}
SM -->|"Yes/No answer"| SEQ["sequence_match"]
SM -->|"How many times?"| SC["sequence_count"]
SM -->|"When did each<br/>step happen?"| SME["sequence_match_events"]
Q -->|"What happened<br/>next/before?"| SNN["sequence_next_node"]
style Q fill:#ffffff,stroke:#333333,stroke-width:2px,color:#1a1a1a
style S fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
style R fill:#d9d9d9,stroke:#333333,stroke-width:2px,color:#1a1a1a
style WF fill:#f0f0f0,stroke:#333333,stroke-width:2px,color:#1a1a1a
style SM fill:#ffffff,stroke:#333333,stroke-width:2px,color:#1a1a1a
style SEQ fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
style SC fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
style SME fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
style SNN fill:#d9d9d9,stroke:#333333,stroke-width:2px,color:#1a1a1a
Seven functions covering the full spectrum of behavioral analytics:
| Function | Type | Returns | Description |
|---|---|---|---|
sessionize | Window | BIGINT | Assigns session IDs based on inactivity gaps |
retention | Aggregate | BOOLEAN[] | Cohort retention analysis |
window_funnel | Aggregate | INTEGER | Conversion funnel step tracking |
sequence_match | Aggregate | BOOLEAN | Pattern matching over event sequences |
sequence_count | Aggregate | BIGINT | Count non-overlapping pattern matches |
sequence_match_events | Aggregate | LIST(TIMESTAMP) | Return matched condition timestamps |
sequence_next_node | Aggregate | VARCHAR | Next event value after pattern match |
All functions support 2 to 32 boolean conditions, matching ClickHouse's limit. See the ClickHouse Compatibility page for the full parity matrix.
Performance
All functions are engineered for large-scale analytical workloads. Every performance claim below is backed by Criterion.rs benchmarks with 95% confidence intervals, validated across multiple runs.
| Function | Scale | Wall Clock | Throughput |
|---|---|---|---|
sessionize | 1 billion rows | 1.20 s | 830 Melem/s |
retention | 100 million rows | 274 ms | 365 Melem/s |
window_funnel | 100 million rows | 791 ms | 126 Melem/s |
sequence_match | 100 million rows | 1.05 s | 95 Melem/s |
sequence_count | 100 million rows | 1.18 s | 85 Melem/s |
sequence_match_events | 100 million rows | 1.07 s | 93 Melem/s |
sequence_next_node | 10 million rows | 546 ms | 18 Melem/s |
Key design choices that enable this performance:
- 16-byte
Copyevents withu32bitmask conditions -- four events per cache line, zero heap allocation per event - O(1) combine for
sessionizeandretentionvia boundary tracking and bitmask OR - In-place combine for event-collecting functions -- O(N) amortized instead of O(N^2) from repeated allocation
- NFA fast paths -- common pattern shapes dispatch to specialized O(n) linear scans instead of full NFA backtracking
- Presorted detection -- O(n) check skips O(n log n) sort when events arrive
in timestamp order (common for
ORDER BYqueries)
Full methodology, per-element cost analysis, and optimization history are documented in the Performance section.
Engineering Highlights
This project demonstrates depth across systems programming, database internals, algorithm design, performance engineering, and software quality practices. For a comprehensive technical overview, see the Engineering Overview.
| Area | Highlights |
|---|---|
| Language & Safety | Pure Rust core with unsafe confined to 6 FFI files. Zero clippy warnings under pedantic, nursery, and cargo lint groups. |
| Testing Rigor | 434 unit tests, 27 E2E tests against real DuckDB, 26 property-based tests (proptest), 88.4% mutation testing kill rate (cargo-mutants). |
| Performance | Fifteen sessions of measured optimization with Criterion.rs. Billion-row benchmarks with 95% confidence intervals. Five negative results documented honestly. |
| Algorithm Design | Custom NFA pattern engine with recursive descent parser, fast-path classification, and lazy backtracking. Bitmask-based retention with O(1) combine. |
| Database Internals | Raw DuckDB C API integration via custom entry point. 31 function set overloads per variadic function. Correct combine semantics for segment tree windowing. |
| CI/CD | 13 CI jobs, 4-platform release builds, SemVer validation, artifact attestation, MSRV verification. |
| Feature Completeness | Complete ClickHouse behavioral analytics parity: 7 functions, 6 combinable funnel modes, 32-condition support, time-constrained pattern syntax. |
Documentation
| Section | Contents |
|---|---|
| Engineering Overview | Technical depth, architecture, quality standards, domain significance |
| Getting Started | Installation, loading, troubleshooting, your first analysis |
| Function Reference | Detailed docs for all 7 functions with examples |
| Use Cases | Five complete real-world examples with sample data and queries |
| FAQ | Common questions about loading, patterns, modes, NULLs |
| Architecture | Module structure, design decisions, FFI bridge |
| Performance | Benchmarks, algorithmic complexity, optimization history |
| ClickHouse Compatibility | Syntax mapping, semantic parity matrix |
| Operations | CI/CD, security and supply chain, benchmarking methodology |
| Contributing | Development setup, testing expectations, PR process |
Requirements
- DuckDB 1.4.4 (C API version v1.2.0)
- Rust 1.80+ (MSRV) for building from source
- A C compiler for DuckDB system bindings
Source Code
The source code is available at github.com/tomtom215/duckdb-behavioral.
License
MIT