duckdb-behavioral

duckdb-behavioral

Behavioral analytics for DuckDB -- session analysis, conversion funnels, retention cohorts, and event sequence pattern matching, all inside your SQL queries.

duckdb-behavioral is a loadable DuckDB extension written in Rust that brings ClickHouse-style behavioral analytics functions to DuckDB. It ships seven battle-tested functions that cover the core patterns of user behavior analysis, with complete ClickHouse feature parity and benchmark-validated performance at billion-row scale.

No external services, no data pipelines, no additional infrastructure. Load the extension, write SQL, get answers.


What Can You Do With This?

Session Analysis

Break a continuous stream of events into logical sessions based on inactivity gaps. Identify how many sessions a user has per day, how long each session lasts, and where sessions begin and end.

SELECT user_id, event_time,
  sessionize(event_time, INTERVAL '30 minutes') OVER (
    PARTITION BY user_id ORDER BY event_time
  ) as session_id
FROM events;

Conversion Funnels

Track how far users progress through a multi-step conversion funnel (page view, add to cart, checkout, purchase) within a time window. Identify exactly where users drop off.

SELECT user_id,
  window_funnel(INTERVAL '1 hour', event_time,
    event_type = 'page_view',
    event_type = 'add_to_cart',
    event_type = 'checkout',
    event_type = 'purchase'
  ) as furthest_step
FROM events
GROUP BY user_id;

Retention Cohorts

Measure whether users who appeared in a cohort (e.g., signed up in January) returned in subsequent periods. Build the classic retention triangle directly in SQL.

SELECT cohort_month,
  retention(
    activity_date = cohort_month,
    activity_date = cohort_month + INTERVAL '1 month',
    activity_date = cohort_month + INTERVAL '2 months',
    activity_date = cohort_month + INTERVAL '3 months'
  ) as retained
FROM user_activity
GROUP BY user_id, cohort_month;

Event Sequence Pattern Matching

Detect complex behavioral patterns using a mini-regex over event conditions. Find users who viewed a product, then purchased within one hour -- with any number of intervening events.

SELECT user_id,
  sequence_match('(?1).*(?t<=3600)(?2)', event_time,
    event_type = 'view',
    event_type = 'purchase'
  ) as converted_within_hour
FROM events
GROUP BY user_id;

User Journey / Flow Analysis

Discover what users do after a specific behavioral sequence. What page do users visit after navigating from Home to Product?

SELECT
  sequence_next_node('forward', 'first_match', event_time, page,
    page = 'Home',
    page = 'Home',
    page = 'Product'
  ) as next_page,
  COUNT(*) as user_count
FROM events
GROUP BY ALL
ORDER BY user_count DESC;

Quick Installation

Community Extension

The extension is listed in the DuckDB Community Extensions repository:

INSTALL behavioral FROM community;
LOAD behavioral;

No build tools, compilation, or -unsigned flag required.

From Source

git clone https://github.com/tomtom215/duckdb-behavioral.git
cd duckdb-behavioral
cargo build --release

Then load the extension in DuckDB:

LOAD 'path/to/target/release/libbehavioral.so';  -- Linux
LOAD 'path/to/target/release/libbehavioral.dylib'; -- macOS

Note: DuckDB requires the -unsigned flag for locally-built extensions: duckdb -unsigned

For detailed installation instructions, troubleshooting, and a complete worked example, see the Getting Started guide.


Functions

Choosing the Right Function

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#ffffff', 'primaryTextColor': '#1a1a1a', 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'secondaryColor': '#f5f5f5', 'tertiaryColor': '#e0e0e0', 'textColor': '#1a1a1a'}}}%%
flowchart TD
    Q{What do you want<br/>to analyze?}
    Q -->|"Break events<br/>into sessions"| S["sessionize"]
    Q -->|"Did users come<br/>back over time?"| R["retention"]
    Q -->|"How far through<br/>a multi-step flow?"| WF["window_funnel"]
    Q -->|"Did a specific<br/>event pattern occur?"| SM{Need details?}
    SM -->|"Yes/No answer"| SEQ["sequence_match"]
    SM -->|"How many times?"| SC["sequence_count"]
    SM -->|"When did each<br/>step happen?"| SME["sequence_match_events"]
    Q -->|"What happened<br/>next/before?"| SNN["sequence_next_node"]

    style Q fill:#ffffff,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style S fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style R fill:#d9d9d9,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style WF fill:#f0f0f0,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style SM fill:#ffffff,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style SEQ fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style SC fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style SME fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style SNN fill:#d9d9d9,stroke:#333333,stroke-width:2px,color:#1a1a1a

Seven functions covering the full spectrum of behavioral analytics:

FunctionTypeReturnsDescription
sessionizeWindowBIGINTAssigns session IDs based on inactivity gaps
retentionAggregateBOOLEAN[]Cohort retention analysis
window_funnelAggregateINTEGERConversion funnel step tracking
sequence_matchAggregateBOOLEANPattern matching over event sequences
sequence_countAggregateBIGINTCount non-overlapping pattern matches
sequence_match_eventsAggregateLIST(TIMESTAMP)Return matched condition timestamps
sequence_next_nodeAggregateVARCHARNext event value after pattern match

All functions support 2 to 32 boolean conditions, matching ClickHouse's limit. See the ClickHouse Compatibility page for the full parity matrix.


Performance

All functions are engineered for large-scale analytical workloads. Every performance claim below is backed by Criterion.rs benchmarks with 95% confidence intervals, validated across multiple runs.

FunctionScaleWall ClockThroughput
sessionize1 billion rows1.20 s830 Melem/s
retention100 million rows274 ms365 Melem/s
window_funnel100 million rows791 ms126 Melem/s
sequence_match100 million rows1.05 s95 Melem/s
sequence_count100 million rows1.18 s85 Melem/s
sequence_match_events100 million rows1.07 s93 Melem/s
sequence_next_node10 million rows546 ms18 Melem/s

Key design choices that enable this performance:

  • 16-byte Copy events with u32 bitmask conditions -- four events per cache line, zero heap allocation per event
  • O(1) combine for sessionize and retention via boundary tracking and bitmask OR
  • In-place combine for event-collecting functions -- O(N) amortized instead of O(N^2) from repeated allocation
  • NFA fast paths -- common pattern shapes dispatch to specialized O(n) linear scans instead of full NFA backtracking
  • Presorted detection -- O(n) check skips O(n log n) sort when events arrive in timestamp order (common for ORDER BY queries)

Full methodology, per-element cost analysis, and optimization history are documented in the Performance section.


Engineering Highlights

This project demonstrates depth across systems programming, database internals, algorithm design, performance engineering, and software quality practices. For a comprehensive technical overview, see the Engineering Overview.

AreaHighlights
Language & SafetyPure Rust core with unsafe confined to 6 FFI files. Zero clippy warnings under pedantic, nursery, and cargo lint groups.
Testing Rigor434 unit tests, 27 E2E tests against real DuckDB, 26 property-based tests (proptest), 88.4% mutation testing kill rate (cargo-mutants).
PerformanceFifteen sessions of measured optimization with Criterion.rs. Billion-row benchmarks with 95% confidence intervals. Five negative results documented honestly.
Algorithm DesignCustom NFA pattern engine with recursive descent parser, fast-path classification, and lazy backtracking. Bitmask-based retention with O(1) combine.
Database InternalsRaw DuckDB C API integration via custom entry point. 31 function set overloads per variadic function. Correct combine semantics for segment tree windowing.
CI/CD13 CI jobs, 4-platform release builds, SemVer validation, artifact attestation, MSRV verification.
Feature CompletenessComplete ClickHouse behavioral analytics parity: 7 functions, 6 combinable funnel modes, 32-condition support, time-constrained pattern syntax.

Documentation

SectionContents
Engineering OverviewTechnical depth, architecture, quality standards, domain significance
Getting StartedInstallation, loading, troubleshooting, your first analysis
Function ReferenceDetailed docs for all 7 functions with examples
Use CasesFive complete real-world examples with sample data and queries
FAQCommon questions about loading, patterns, modes, NULLs
ArchitectureModule structure, design decisions, FFI bridge
PerformanceBenchmarks, algorithmic complexity, optimization history
ClickHouse CompatibilitySyntax mapping, semantic parity matrix
OperationsCI/CD, security and supply chain, benchmarking methodology
ContributingDevelopment setup, testing expectations, PR process

Requirements

  • DuckDB 1.4.4 (C API version v1.2.0)
  • Rust 1.80+ (MSRV) for building from source
  • A C compiler for DuckDB system bindings

Source Code

The source code is available at github.com/tomtom215/duckdb-behavioral.

License

MIT