duckdb-behavioral

Behavioral analytics for DuckDB -- session analysis, conversion funnels, retention cohorts, and event sequence pattern matching, all inside your SQL queries.

duckdb-behavioral is a loadable DuckDB extension written in Rust that brings ClickHouse-style behavioral analytics functions to DuckDB. It ships seven battle-tested functions that cover the core patterns of user behavior analysis, with complete ClickHouse feature parity and benchmark-validated performance at billion-row scale.

No external services, no data pipelines, no additional infrastructure. Load the extension, write SQL, get answers.


What Can You Do With This?

Session Analysis

Break a continuous stream of events into logical sessions based on inactivity gaps. Identify how many sessions a user has per day, how long each session lasts, and where sessions begin and end.

SELECT user_id, event_time,
  sessionize(event_time, INTERVAL '30 minutes') OVER (
    PARTITION BY user_id ORDER BY event_time
  ) as session_id
FROM events;
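The gap-based logic behind sessionize can be sketched in plain Python (an illustration of the semantics only, not the extension's Rust implementation; the behavior at a gap of exactly the threshold is an assumption here):

```python
from datetime import datetime, timedelta

def sessionize(times, gap):
    """Assign 1-based session IDs to timestamps already sorted ascending."""
    ids, session, prev = [], 0, None
    for t in times:
        if prev is None or t - prev > gap:
            session += 1  # inactivity gap exceeded: start a new session
        ids.append(session)
        prev = t
    return ids

events = [datetime(2024, 1, 15, 10, 0),
          datetime(2024, 1, 15, 10, 20),
          datetime(2024, 1, 15, 12, 30)]  # 2h10m gap before this event
print(sessionize(events, timedelta(minutes=30)))  # [1, 1, 2]
```

The PARTITION BY user_id / ORDER BY event_time clause in the SQL above is what provides the per-user, time-ordered input this sketch assumes.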

Conversion Funnels

Track how far users progress through a multi-step conversion funnel (page view, add to cart, checkout, purchase) within a time window. Identify exactly where users drop off.

SELECT user_id,
  window_funnel(INTERVAL '1 hour', event_time,
    event_type = 'page_view',
    event_type = 'add_to_cart',
    event_type = 'checkout',
    event_type = 'purchase'
  ) as furthest_step
FROM events
GROUP BY user_id;
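A greedy version of the funnel scan can be sketched in Python (a simplification for intuition; the extension also supports modes such as strict_increase, and the exact tie-breaking here is an assumption):

```python
def window_funnel(window_secs, events):
    """events: (timestamp_secs, step_flags) tuples sorted by timestamp,
    where step_flags[i] is True if the event satisfies condition i.
    Returns the deepest funnel step reached within the window."""
    best = 0
    n_steps = len(events[0][1]) if events else 0
    for i, (t0, flags) in enumerate(events):
        if not flags[0]:
            continue  # only events matching step 1 can anchor a funnel
        step = 1
        for t, f in events[i + 1:]:
            if t - t0 > window_secs:
                break  # outside the conversion window
            if step < n_steps and f[step]:
                step += 1
        best = max(best, step)
    return best

# view at t=0, add_to_cart at t=300, checkout at t=5400 (outside a 1h window)
evts = [(0,    (True,  False, False)),
        (300,  (False, True,  False)),
        (5400, (False, False, True))]
print(window_funnel(3600, evts))  # 2
```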

Retention Cohorts

Measure whether users who appeared in a cohort (e.g., signed up in January) returned in subsequent periods. Build the classic retention triangle directly in SQL.

SELECT user_id, cohort_month,
  retention(
    activity_date = cohort_month,
    activity_date = cohort_month + INTERVAL '1 month',
    activity_date = cohort_month + INTERVAL '2 months',
    activity_date = cohort_month + INTERVAL '3 months'
  ) as retained
FROM user_activity
GROUP BY user_id, cohort_month;
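The aggregate semantics (ClickHouse-style retention, per the parity claim) can be sketched in Python: each condition only needs to hold on some row of the group, and later positions are gated on the first:

```python
def retention(rows):
    """rows: one boolean tuple per event in the group.
    Returns the aggregated retention array."""
    n = len(rows[0])
    seen = [any(r[i] for r in rows) for i in range(n)]
    # positions 2..n count only if the anchor condition was met somewhere
    return [seen[0]] + [seen[0] and s for s in seen[1:]]

rows = [(True, False, False),   # a row from the cohort period
        (False, True, False)]   # a row from the following period
print(retention(rows))  # [True, True, False]
```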

Event Sequence Pattern Matching

Detect complex behavioral patterns using a mini-regex over event conditions. Find users who viewed a product, then purchased within one hour -- with any number of intervening events.

SELECT user_id,
  sequence_match('(?1).*(?t<=3600)(?2)', event_time,
    event_type = 'view',
    event_type = 'purchase'
  ) as converted_within_hour
FROM events
GROUP BY user_id;
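For this particular pattern the matching reduces to a single pass (a sketch of the semantics; the extension compiles arbitrary patterns to an NFA rather than special-casing them like this):

```python
def seq_match_view_then_purchase(events, limit_secs):
    """events: time-ordered (timestamp_secs, is_view, is_purchase) tuples.
    True if some view precedes a purchase by at most limit_secs, with any
    number of intervening events (the .* in the pattern)."""
    last_view = None
    for t, is_view, is_purchase in events:
        # test the purchase before recording the view, so an event is
        # paired only with a strictly earlier view
        if is_purchase and last_view is not None and t - last_view <= limit_secs:
            return True
        if is_view:
            last_view = t  # the latest view is always the best candidate
    return False

# the first view (t=0) is too old, but the view at t=7200 pairs with the
# purchase at t=9000 -- which is why the latest view is tracked
evts = [(0, True, False), (7200, True, False), (9000, False, True)]
print(seq_match_view_then_purchase(evts, 3600))  # True
```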

User Journey / Flow Analysis

Discover what users do after a specific behavioral sequence. What page do users visit after navigating from Home to Product?

SELECT
  sequence_next_node('forward', 'first_match', event_time, page,
    page = 'Home',
    page = 'Home',
    page = 'Product'
  ) as next_page,
  COUNT(*) as user_count
FROM events
GROUP BY ALL
ORDER BY user_count DESC;
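Under 'forward' / 'first_match', the behavior can be sketched as follows (a simplification that folds the base condition into the chain and assumes the chain must match consecutive events, as in ClickHouse's sequenceNextNode):

```python
def next_node_forward_first_match(values, chain):
    """values: time-ordered event values for one user; chain: predicates
    that must match consecutive events. Returns the value immediately
    after the first matched chain, or None if there is no match."""
    k = len(chain)
    for i in range(len(values) - k):
        if all(chain[j](values[i + j]) for j in range(k)):
            return values[i + k]  # the "next node" after the match
    return None

pages = ['Home', 'Product', 'Product', 'Cart']
chain = [lambda p: p == 'Home', lambda p: p == 'Product']
print(next_node_forward_first_match(pages, chain))  # Product
```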

Quick Installation

Community Extension

The extension is listed in the DuckDB Community Extensions repository:

INSTALL behavioral FROM community;
LOAD behavioral;

No build tools, compilation, or -unsigned flag required.

From Source

git clone https://github.com/tomtom215/duckdb-behavioral.git
cd duckdb-behavioral
cargo build --release

Then load the extension in DuckDB:

LOAD 'path/to/target/release/libbehavioral.so';  -- Linux
LOAD 'path/to/target/release/libbehavioral.dylib'; -- macOS

Note: DuckDB requires the -unsigned flag for locally-built extensions: duckdb -unsigned

For detailed installation instructions, troubleshooting, and a complete worked example, see the Getting Started guide.


Functions

Choosing the Right Function

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#ffffff', 'primaryTextColor': '#1a1a1a', 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'secondaryColor': '#f5f5f5', 'tertiaryColor': '#e0e0e0', 'textColor': '#1a1a1a'}}}%%
flowchart TD
    Q{What do you want<br/>to analyze?}
    Q -->|"Break events<br/>into sessions"| S["sessionize"]
    Q -->|"Did users come<br/>back over time?"| R["retention"]
    Q -->|"How far through<br/>a multi-step flow?"| WF["window_funnel"]
    Q -->|"Did a specific<br/>event pattern occur?"| SM{Need details?}
    SM -->|"Yes/No answer"| SEQ["sequence_match"]
    SM -->|"How many times?"| SC["sequence_count"]
    SM -->|"When did each<br/>step happen?"| SME["sequence_match_events"]
    Q -->|"What happened<br/>next/before?"| SNN["sequence_next_node"]

    style Q fill:#ffffff,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style S fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style R fill:#d9d9d9,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style WF fill:#f0f0f0,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style SM fill:#ffffff,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style SEQ fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style SC fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style SME fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style SNN fill:#d9d9d9,stroke:#333333,stroke-width:2px,color:#1a1a1a

Seven functions covering the full spectrum of behavioral analytics:

| Function | Type | Returns | Description |
|---|---|---|---|
| sessionize | Window | BIGINT | Assigns session IDs based on inactivity gaps |
| retention | Aggregate | BOOLEAN[] | Cohort retention analysis |
| window_funnel | Aggregate | INTEGER | Conversion funnel step tracking |
| sequence_match | Aggregate | BOOLEAN | Pattern matching over event sequences |
| sequence_count | Aggregate | BIGINT | Count non-overlapping pattern matches |
| sequence_match_events | Aggregate | LIST(TIMESTAMP) | Return matched condition timestamps |
| sequence_next_node | Aggregate | VARCHAR | Next event value after pattern match |

All functions support 2 to 32 boolean conditions, matching ClickHouse's limit. See the ClickHouse Compatibility page for the full parity matrix.


Performance

All functions are engineered for large-scale analytical workloads. Every performance claim below is backed by Criterion.rs benchmarks with 95% confidence intervals, validated across multiple runs.

| Function | Scale | Wall Clock | Throughput |
|---|---|---|---|
| sessionize | 1 billion rows | 1.20 s | 830 Melem/s |
| retention | 100 million rows | 274 ms | 365 Melem/s |
| window_funnel | 100 million rows | 791 ms | 126 Melem/s |
| sequence_match | 100 million rows | 1.05 s | 95 Melem/s |
| sequence_count | 100 million rows | 1.18 s | 85 Melem/s |
| sequence_match_events | 100 million rows | 1.07 s | 93 Melem/s |
| sequence_next_node | 10 million rows | 546 ms | 18 Melem/s |

Key design choices that enable this performance:

  • 16-byte Copy events with u32 bitmask conditions -- four events per cache line, zero heap allocation per event
  • O(1) combine for sessionize and retention via boundary tracking and bitmask OR
  • In-place combine for event-collecting functions -- O(N) amortized instead of O(N^2) from repeated allocation
  • NFA fast paths -- common pattern shapes dispatch to specialized O(n) linear scans instead of full NFA backtracking
  • Presorted detection -- O(n) check skips O(n log n) sort when events arrive in timestamp order (common for ORDER BY queries)
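The last point, for example, is a one-pass check before falling back to a sort. A minimal sketch of that idea (illustrative Python, not the extension's Rust code):

```python
def ensure_sorted(events):
    """events: list of (timestamp, payload) tuples. Returns them in
    timestamp order, sorting only when a single O(n) scan finds them
    out of order."""
    if all(events[i][0] <= events[i + 1][0] for i in range(len(events) - 1)):
        return events  # already ordered: skip the O(n log n) sort
    return sorted(events, key=lambda e: e[0])
```

When the upstream query plan already emits rows in event_time order, the linear scan is the only cost paid.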

Full methodology, per-element cost analysis, and optimization history are documented in the Performance section.


Engineering Highlights

This project demonstrates depth across systems programming, database internals, algorithm design, performance engineering, and software quality practices. For a comprehensive technical overview, see the Engineering Overview.

| Area | Highlights |
|---|---|
| Language & Safety | Pure Rust core with unsafe confined to 6 FFI files. Zero clippy warnings under pedantic, nursery, and cargo lint groups. |
| Testing Rigor | 434 unit tests, 27 E2E tests against real DuckDB, 26 property-based tests (proptest), 88.4% mutation testing kill rate (cargo-mutants). |
| Performance | Fifteen sessions of measured optimization with Criterion.rs. Billion-row benchmarks with 95% confidence intervals. Five negative results documented honestly. |
| Algorithm Design | Custom NFA pattern engine with recursive descent parser, fast-path classification, and lazy backtracking. Bitmask-based retention with O(1) combine. |
| Database Internals | Raw DuckDB C API integration via custom entry point. 31 function set overloads per variadic function. Correct combine semantics for segment tree windowing. |
| CI/CD | 13 CI jobs, 4-platform release builds, SemVer validation, artifact attestation, MSRV verification. |
| Feature Completeness | Complete ClickHouse behavioral analytics parity: 7 functions, 6 combinable funnel modes, 32-condition support, time-constrained pattern syntax. |

Documentation

| Section | Contents |
|---|---|
| Engineering Overview | Technical depth, architecture, quality standards, domain significance |
| Getting Started | Installation, loading, troubleshooting, your first analysis |
| Function Reference | Detailed docs for all 7 functions with examples |
| Use Cases | Five complete real-world examples with sample data and queries |
| FAQ | Common questions about loading, patterns, modes, NULLs |
| Architecture | Module structure, design decisions, FFI bridge |
| Performance | Benchmarks, algorithmic complexity, optimization history |
| ClickHouse Compatibility | Syntax mapping, semantic parity matrix |
| Operations | CI/CD, security and supply chain, benchmarking methodology |
| Contributing | Development setup, testing expectations, PR process |

Requirements

  • DuckDB 1.4.4 (C API version v1.2.0)
  • Rust 1.80+ (MSRV) for building from source
  • A C compiler for DuckDB system bindings

Source Code

The source code is available at github.com/tomtom215/duckdb-behavioral.

License

MIT

Getting Started

This guide walks you through installing the extension, loading it into DuckDB, verifying it works, and running your first behavioral analysis from start to finish.


Installation

Option 1: Community Extension

The extension is listed in the DuckDB Community Extensions repository. Install with a single command:

INSTALL behavioral FROM community;
LOAD behavioral;

No build tools, compilation, or -unsigned flag required. This works with any DuckDB client (CLI, Python, Node.js, Java, etc.).

Option 2: Build from Source

Building from source gives you the latest development version and works on any platform where Rust and DuckDB are available.

Prerequisites:

  • Rust 1.80 or later (rustup recommended)
  • A C compiler (gcc, clang, or MSVC -- needed for DuckDB system bindings)
  • DuckDB CLI v1.4.4 (for running queries)

Build steps:

# Clone the repository
git clone https://github.com/tomtom215/duckdb-behavioral.git
cd duckdb-behavioral

# Build in release mode (required for loading into DuckDB)
cargo build --release

The loadable extension will be produced at:

  • Linux: target/release/libbehavioral.so
  • macOS: target/release/libbehavioral.dylib

Preparing the extension for loading:

DuckDB loadable extensions require metadata appended to the binary. The repository includes the necessary tooling:

# Initialize the submodule (first time only)
git submodule update --init --recursive

# Copy the built library
cp target/release/libbehavioral.so /tmp/behavioral.duckdb_extension

# Append extension metadata
python3 extension-ci-tools/scripts/append_extension_metadata.py \
  -l /tmp/behavioral.duckdb_extension -n behavioral \
  -p linux_amd64 -dv v1.2.0 -ev v0.2.0 --abi-type C_STRUCT \
  -o /tmp/behavioral.duckdb_extension

Platform note: Replace linux_amd64 with your platform identifier (linux_arm64, osx_amd64, osx_arm64) and .so with .dylib on macOS.


Loading the Extension

-- No special flags needed
INSTALL behavioral FROM community;
LOAD behavioral;

This works in any DuckDB client:

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL behavioral FROM community")
conn.execute("LOAD behavioral")

From a Local Build

# The -unsigned flag is required for locally-built extensions
duckdb -unsigned

Then inside the DuckDB prompt:

LOAD '/tmp/behavioral.duckdb_extension';

One-liner

duckdb -unsigned -c "LOAD '/tmp/behavioral.duckdb_extension'; SELECT 'behavioral loaded';"

From a DuckDB Client Library

import duckdb

conn = duckdb.connect(config={"allow_unsigned_extensions": "true"})
conn.execute("LOAD '/tmp/behavioral.duckdb_extension'")

Once loaded, all seven functions are available in the current session: sessionize, retention, window_funnel, sequence_match, sequence_count, sequence_match_events, and sequence_next_node.


Verifying the Installation

Run these minimal queries to confirm each function category is working:

-- Sessionize: should return session ID 1
SELECT sessionize(TIMESTAMP '2024-01-01 10:00:00', INTERVAL '30 minutes')
  OVER () as session_id;

-- Retention: should return [true, false]
SELECT retention(true, false);

-- Window funnel: should return 1
SELECT window_funnel(INTERVAL '1 hour', TIMESTAMP '2024-01-01', true, false);

-- Sequence match: should return true
SELECT sequence_match('(?1).*(?2)', TIMESTAMP '2024-01-01', true, true);

-- Sequence count: should return 1
SELECT sequence_count('(?1).*(?2)', TIMESTAMP '2024-01-01', true, true);

You can also verify all functions registered correctly by querying DuckDB's function catalog:

SELECT function_name FROM duckdb_functions()
WHERE function_name IN (
  'sessionize', 'retention', 'window_funnel',
  'sequence_match', 'sequence_count',
  'sequence_match_events', 'sequence_next_node'
)
GROUP BY function_name
ORDER BY function_name;

This should return all seven function names.


Your First Analysis

This walkthrough creates sample e-commerce event data and demonstrates four core use cases: sessions, funnels, retention, and pattern matching.

Step 1: Create Sample Data

-- Create an events table with typical e-commerce data
CREATE TABLE events AS SELECT * FROM (VALUES
  (1, TIMESTAMP '2024-01-15 09:00:00', 'page_view',    'Home'),
  (1, TIMESTAMP '2024-01-15 09:05:00', 'page_view',    'Product'),
  (1, TIMESTAMP '2024-01-15 09:08:00', 'add_to_cart',  'Product'),
  (1, TIMESTAMP '2024-01-15 09:12:00', 'checkout',     'Cart'),
  (1, TIMESTAMP '2024-01-15 09:15:00', 'purchase',     'Checkout'),
  (2, TIMESTAMP '2024-01-15 10:00:00', 'page_view',    'Home'),
  (2, TIMESTAMP '2024-01-15 10:10:00', 'page_view',    'Product'),
  (2, TIMESTAMP '2024-01-15 10:20:00', 'add_to_cart',  'Product'),
  (2, TIMESTAMP '2024-01-15 14:00:00', 'page_view',    'Home'),
  (2, TIMESTAMP '2024-01-15 14:05:00', 'page_view',    'Product'),
  (3, TIMESTAMP '2024-01-15 11:00:00', 'page_view',    'Home'),
  (3, TIMESTAMP '2024-01-15 11:30:00', 'page_view',    'Blog'),
  (3, TIMESTAMP '2024-01-15 12:00:00', 'page_view',    'Home'),
  (3, TIMESTAMP '2024-01-16 09:00:00', 'page_view',    'Home'),
  (3, TIMESTAMP '2024-01-16 09:10:00', 'page_view',    'Product'),
  (3, TIMESTAMP '2024-01-16 09:15:00', 'add_to_cart',  'Product'),
  (3, TIMESTAMP '2024-01-16 09:20:00', 'checkout',     'Cart'),
  (3, TIMESTAMP '2024-01-16 09:25:00', 'purchase',     'Checkout')
) AS t(user_id, event_time, event_type, page);

Step 2: Identify User Sessions

Break events into sessions using a 30-minute inactivity threshold:

SELECT user_id, event_time, event_type,
  sessionize(event_time, INTERVAL '30 minutes') OVER (
    PARTITION BY user_id ORDER BY event_time
  ) as session_id
FROM events
ORDER BY user_id, event_time;

What to expect: User 1 has a single session (no gap between consecutive events exceeds 30 minutes). User 2 has two sessions (the 3h 40m gap between 10:20 and 14:00 starts a new session). User 3's events split into three sessions (the 30-minute gaps during the morning and the overnight break each start a new session).

Step 3: Analyze the Conversion Funnel

Track how far each user progresses through the purchase funnel within a 1-hour window:

SELECT user_id,
  window_funnel(INTERVAL '1 hour', event_time,
    event_type = 'page_view',
    event_type = 'add_to_cart',
    event_type = 'checkout',
    event_type = 'purchase'
  ) as furthest_step
FROM events
GROUP BY user_id
ORDER BY user_id;

What to expect:

| user_id | furthest_step | Interpretation |
|---|---|---|
| 1 | 4 | Completed all steps (page_view -> add_to_cart -> checkout -> purchase) |
| 2 | 2 | Reached add_to_cart but never checked out |
| 3 | 4 | Completed all steps (on the second day's session) |

Step 4: Detect Purchase Patterns

Find which users viewed a product and then purchased (with any events in between):

SELECT user_id,
  sequence_match('(?1).*(?2)', event_time,
    event_type = 'page_view',
    event_type = 'purchase'
  ) as viewed_then_purchased
FROM events
GROUP BY user_id
ORDER BY user_id;

What to expect: Users 1 and 3 return true (they both viewed and purchased). User 2 returns false (viewed but never purchased).

Step 5: User Journey Flow

Discover where users navigate after viewing the Home page then the Product page:

SELECT user_id,
  sequence_next_node('forward', 'first_match', event_time, page,
    page = 'Home',
    page = 'Home',
    page = 'Product'
  ) as next_page_after_product
FROM events
GROUP BY user_id
ORDER BY user_id;

What to expect: For user 1, the event immediately following the matched Home -> Product sequence is the add_to_cart event, which occurs on the Product page, so next_page_after_product is Product. Note that the function returns the page value of the next event, not its event type.


Troubleshooting

Extension fails to load

"file was built for DuckDB C API version '...' but we can only load extensions built for DuckDB C API '...'"

The extension is built against DuckDB C API version v1.2.0 (used by DuckDB v1.4.4). You must use a DuckDB CLI version that matches. Check your version with:

duckdb --version

If you see a different version, either install DuckDB v1.4.4 or rebuild the extension against your DuckDB version (this requires updating the libduckdb-sys dependency in Cargo.toml).

"Extension ... is not signed!"

This only applies to locally-built extensions. If you installed via INSTALL behavioral FROM community, the extension is already signed and this error should not occur.

For locally-built extensions, DuckDB rejects unsigned extensions by default. Use one of these approaches:

# CLI flag
duckdb -unsigned

# Python client (allow_unsigned_extensions must be supplied when the
# connection is created; it cannot be changed mid-session)
conn = duckdb.connect(config={"allow_unsigned_extensions": "true"})

"IO Error: Cannot open file"

The path to the extension must be an absolute path or a path relative to the DuckDB working directory. Verify the file exists:

ls -la /tmp/behavioral.duckdb_extension

If you skipped the metadata step, the file may exist but fail to load. Make sure you ran append_extension_metadata.py after copying the built library.

Platform mismatch

An extension built on Linux cannot be loaded on macOS, and vice versa. The extension must be built on the same platform and architecture where DuckDB is running.

Functions not found after loading

If LOAD succeeds but functions are not available, verify registration by querying the function catalog:

SELECT function_name, function_type
FROM duckdb_functions()
WHERE function_name LIKE 'session%'
   OR function_name LIKE 'retention%'
   OR function_name LIKE 'window_funnel%'
   OR function_name LIKE 'sequence%';

All seven functions should appear. If some are missing, this may indicate a version mismatch between the extension and DuckDB. Rebuild the extension from source against the DuckDB version you are running.

Query errors

"Binder Error: No function matches the given name and argument types"

This usually means the argument types do not match any registered overload. Common causes:

  • Wrong argument order: Each function has a specific parameter order. See the Function Reference for exact signatures.
  • Using INTEGER instead of INTERVAL: The window/gap parameter for sessionize and window_funnel must be a DuckDB INTERVAL, not an integer. Use INTERVAL '1 hour', not 3600.
  • Fewer than 2 boolean conditions: All condition-based functions require at least 2 boolean parameters.
  • More than 32 boolean conditions: The maximum is 32, matching ClickHouse's limit.

"NULL results when expecting values"

  • Rows with NULL timestamps are silently ignored during aggregation.
  • NULL boolean conditions are treated as false.
  • sequence_next_node returns NULL when no pattern match is found or when no adjacent event exists after the match.

Build errors

"error: linker 'cc' not found"

Install a C compiler. On Ubuntu/Debian: sudo apt install build-essential. On macOS: xcode-select --install.

"failed to run custom build command for libduckdb-sys"

The libduckdb-sys crate needs a C compiler and CMake to build DuckDB from source (for the test suite). Install CMake: sudo apt install cmake (Linux) or brew install cmake (macOS).


Running Tests

The extension includes 434 unit tests and 1 doc-test:

cargo test

All tests run in under one second. Zero clippy warnings are enforced:

cargo clippy --all-targets

Running Benchmarks

Criterion.rs benchmarks cover all functions at scales from 100 to 1 billion elements:

# Run all benchmarks
cargo bench

# Run a specific benchmark group
cargo bench -- sessionize
cargo bench -- window_funnel
cargo bench -- sequence_match

Results are stored in target/criterion/ and automatically compared against previous runs by Criterion.

Project Structure

src/
  lib.rs                  # Custom C entry point (behavioral_init_c_api)
  common/
    event.rs              # Shared Event type (16-byte bitmask)
    timestamp.rs          # Interval-to-microseconds conversion
  pattern/
    parser.rs             # Recursive descent pattern parser
    executor.rs           # NFA-based pattern matcher with fast paths
  sessionize.rs           # Session boundary tracking
  retention.rs            # Bitmask-based cohort retention
  window_funnel.rs        # Greedy forward scan with mode flags
  sequence.rs             # Pattern matching state management
  sequence_next_node.rs   # Next event value after pattern match
  ffi/
    mod.rs                # register_all_raw() dispatcher
    sessionize.rs         # Sessionize FFI callbacks
    retention.rs          # Retention FFI callbacks
    window_funnel.rs      # Window funnel FFI callbacks
    sequence.rs           # Sequence match/count FFI callbacks
    sequence_match_events.rs  # Sequence match events FFI callbacks
    sequence_next_node.rs     # Sequence next node FFI callbacks

For a detailed discussion of the architecture, see Architecture.


Next Steps

  • Function Reference -- detailed documentation for each function, including all parameters, modes, and edge case behavior
  • FAQ -- answers to common questions about patterns, modes, NULL handling, and performance
  • ClickHouse Compatibility -- how each function maps to its ClickHouse equivalent, with syntax translation examples
  • Contributing -- development setup, testing expectations, and the PR process for contributing changes

Use Cases

This page presents five real-world use cases for duckdb-behavioral, each with a problem description, sample data, the analytical query, and interpretation of results. All examples are self-contained -- you can copy and paste each section into a DuckDB session with the extension loaded.


E-Commerce Conversion Funnel Analysis

Problem

An e-commerce team wants to measure how far users progress through the purchase funnel: product page view, add to cart, begin checkout, and complete purchase. Understanding where users drop off helps prioritize UX improvements. The team needs per-user funnel progress within a 1-hour conversion window, and an aggregate drop-off report across all users.

Sample Data

CREATE TABLE ecommerce_events (
    user_id    VARCHAR NOT NULL,
    event_time TIMESTAMP NOT NULL,
    event_type VARCHAR NOT NULL,
    product_id VARCHAR,
    revenue    DECIMAL(10,2)
);

INSERT INTO ecommerce_events VALUES
    -- User A: completes full funnel
    ('user_a', '2024-01-15 10:00:00', 'page_view',    'prod_1', NULL),
    ('user_a', '2024-01-15 10:05:00', 'add_to_cart',  'prod_1', NULL),
    ('user_a', '2024-01-15 10:12:00', 'checkout',     'prod_1', NULL),
    ('user_a', '2024-01-15 10:15:00', 'purchase',     'prod_1', 49.99),

    -- User B: views and adds to cart, but abandons
    ('user_b', '2024-01-15 11:00:00', 'page_view',    'prod_2', NULL),
    ('user_b', '2024-01-15 11:03:00', 'add_to_cart',  'prod_2', NULL),
    ('user_b', '2024-01-15 11:30:00', 'page_view',    'prod_3', NULL),

    -- User C: views only, never adds to cart
    ('user_c', '2024-01-15 14:00:00', 'page_view',    'prod_1', NULL),
    ('user_c', '2024-01-15 14:10:00', 'page_view',    'prod_4', NULL),

    -- User D: completes funnel but checkout is outside window
    ('user_d', '2024-01-15 09:00:00', 'page_view',    'prod_5', NULL),
    ('user_d', '2024-01-15 09:10:00', 'add_to_cart',  'prod_5', NULL),
    ('user_d', '2024-01-15 10:30:00', 'checkout',     'prod_5', NULL),
    ('user_d', '2024-01-15 10:35:00', 'purchase',     'prod_5', 29.99);

Analytical Query

-- Step 1: Per-user funnel progress
WITH user_funnels AS (
    SELECT
        user_id,
        window_funnel(
            INTERVAL '1 hour',
            event_time,
            event_type = 'page_view',
            event_type = 'add_to_cart',
            event_type = 'checkout',
            event_type = 'purchase'
        ) AS furthest_step
    FROM ecommerce_events
    GROUP BY user_id
)
-- Step 2: Aggregate into a drop-off report
SELECT
    furthest_step,
    COUNT(*) AS user_count,
    ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct
FROM user_funnels
GROUP BY furthest_step
ORDER BY furthest_step;

Expected Results

| furthest_step | user_count | pct |
|---|---|---|
| 1 | 1 | 25.0 |
| 2 | 2 | 50.0 |
| 4 | 1 | 25.0 |

Interpretation

  • User A reached step 4 (purchase) -- the full funnel was completed within 1 hour.
  • User B reached step 2 (add to cart). The second page view did not advance the funnel further because it matched step 1 again, not step 3.
  • User C reached step 1 (page view only). No add-to-cart event occurred.
  • User D reached step 2 (add to cart). Although checkout and purchase events exist, they occurred more than 1 hour after the initial page view (09:00 to 10:30 is 90 minutes), so they fall outside the window.

The drop-off report shows 50% of users stalling at the add-to-cart stage, which suggests the checkout flow needs attention. To additionally require that each funnel step occur at a strictly later timestamp than the previous one, use strict_increase mode:

window_funnel(
    INTERVAL '1 hour',
    'strict_increase',
    event_time,
    event_type = 'page_view',
    event_type = 'add_to_cart',
    event_type = 'checkout',
    event_type = 'purchase'
)

SaaS User Retention Cohort Analysis

Problem

A SaaS product team wants to measure weekly retention: of users who were active in week 0, what fraction returned in week 1, week 2, and week 3? This cohort analysis helps track product stickiness and identify retention trends across signup cohorts.

Sample Data

CREATE TABLE saas_activity (
    user_id       VARCHAR NOT NULL,
    activity_date DATE NOT NULL,
    cohort_week   DATE NOT NULL  -- the Monday of the user's signup week
);

INSERT INTO saas_activity VALUES
    -- User A: active in weeks 0, 1, and 3
    ('user_a', '2024-01-01', '2024-01-01'),
    ('user_a', '2024-01-09', '2024-01-01'),
    ('user_a', '2024-01-22', '2024-01-01'),

    -- User B: active in weeks 0 and 1 only
    ('user_b', '2024-01-02', '2024-01-01'),
    ('user_b', '2024-01-08', '2024-01-01'),

    -- User C: active in week 0 only
    ('user_c', '2024-01-03', '2024-01-01'),

    -- User D: active in weeks 0, 1, 2, and 3
    ('user_d', '2024-01-01', '2024-01-01'),
    ('user_d', '2024-01-10', '2024-01-01'),
    ('user_d', '2024-01-15', '2024-01-01'),
    ('user_d', '2024-01-23', '2024-01-01'),

    -- User E: different cohort (week of Jan 8), active in weeks 0 and 2
    ('user_e', '2024-01-08', '2024-01-08'),
    ('user_e', '2024-01-22', '2024-01-08');

Analytical Query

-- Per-user retention array
WITH user_retention AS (
    SELECT
        user_id,
        cohort_week,
        retention(
            activity_date >= cohort_week
                AND activity_date < cohort_week + INTERVAL '7 days',
            activity_date >= cohort_week + INTERVAL '7 days'
                AND activity_date < cohort_week + INTERVAL '14 days',
            activity_date >= cohort_week + INTERVAL '14 days'
                AND activity_date < cohort_week + INTERVAL '21 days',
            activity_date >= cohort_week + INTERVAL '21 days'
                AND activity_date < cohort_week + INTERVAL '28 days'
        ) AS retained
    FROM saas_activity
    GROUP BY user_id, cohort_week
)
-- Aggregate retention rates per cohort
SELECT
    cohort_week,
    COUNT(*) AS cohort_size,
    SUM(CASE WHEN retained[1] THEN 1 ELSE 0 END) AS week_0,
    SUM(CASE WHEN retained[2] THEN 1 ELSE 0 END) AS week_1,
    SUM(CASE WHEN retained[3] THEN 1 ELSE 0 END) AS week_2,
    SUM(CASE WHEN retained[4] THEN 1 ELSE 0 END) AS week_3,
    ROUND(100.0 * SUM(CASE WHEN retained[2] THEN 1 ELSE 0 END) /
        NULLIF(SUM(CASE WHEN retained[1] THEN 1 ELSE 0 END), 0), 1)
        AS week_1_pct,
    ROUND(100.0 * SUM(CASE WHEN retained[3] THEN 1 ELSE 0 END) /
        NULLIF(SUM(CASE WHEN retained[1] THEN 1 ELSE 0 END), 0), 1)
        AS week_2_pct,
    ROUND(100.0 * SUM(CASE WHEN retained[4] THEN 1 ELSE 0 END) /
        NULLIF(SUM(CASE WHEN retained[1] THEN 1 ELSE 0 END), 0), 1)
        AS week_3_pct
FROM user_retention
GROUP BY cohort_week
ORDER BY cohort_week;

Expected Results

| cohort_week | cohort_size | week_0 | week_1 | week_2 | week_3 | week_1_pct | week_2_pct | week_3_pct |
|---|---|---|---|---|---|---|---|---|
| 2024-01-01 | 4 | 4 | 3 | 1 | 2 | 75.0 | 25.0 | 50.0 |
| 2024-01-08 | 1 | 1 | 0 | 1 | 0 | 0.0 | 100.0 | 0.0 |

Interpretation

For the January 1 cohort (4 users):

  • All 4 users were active in week 0 (by definition, since they have activity in the cohort week).
  • 3 of 4 (75%) returned in week 1 (users A, B, D).
  • Only 1 of 4 (25%) was active in week 2 (user D).
  • 2 of 4 (50%) returned in week 3 (users A, D).

The retention curve shows a steep drop after week 1, with partial recovery in week 3. This "smile curve" pattern suggests users who survive week 1 become long-term engaged, while week 2 is a critical churn risk period.

Note that retention checks whether the anchor condition (week 0) and each subsequent condition were both satisfied somewhere in the group. The conditions do not need to be satisfied by the same row.


Web Analytics Session Analysis

Problem

A web analytics team needs to segment user activity into sessions based on a 30-minute inactivity threshold, then compute session-level metrics: page count per session, session duration, and sessions per user. This is foundational for engagement analysis, bounce rate computation, and session-level attribution.

Sample Data

CREATE TABLE page_views (
    user_id    VARCHAR NOT NULL,
    event_time TIMESTAMP NOT NULL,
    page_url   VARCHAR NOT NULL,
    referrer   VARCHAR
);

INSERT INTO page_views VALUES
    -- User A: two distinct sessions
    ('user_a', '2024-01-15 10:00:00', '/home',     'google.com'),
    ('user_a', '2024-01-15 10:05:00', '/products',  NULL),
    ('user_a', '2024-01-15 10:08:00', '/product/1', NULL),
    ('user_a', '2024-01-15 10:25:00', '/cart',       NULL),
    -- 2-hour gap
    ('user_a', '2024-01-15 12:30:00', '/home',     'email-campaign'),
    ('user_a', '2024-01-15 12:35:00', '/products',  NULL),

    -- User B: single long session
    ('user_b', '2024-01-15 09:00:00', '/home',     'direct'),
    ('user_b', '2024-01-15 09:10:00', '/about',     NULL),
    ('user_b', '2024-01-15 09:15:00', '/pricing',   NULL),
    ('user_b', '2024-01-15 09:20:00', '/signup',    NULL),

    -- User C: three short sessions (bounce-like)
    ('user_c', '2024-01-15 08:00:00', '/home',     'google.com'),
    -- 45-minute gap
    ('user_c', '2024-01-15 08:45:00', '/home',     'google.com'),
    -- 1-hour gap
    ('user_c', '2024-01-15 09:45:00', '/blog/1',   'twitter.com');

Analytical Query

-- Step 1: Assign session IDs
WITH sessionized AS (
    SELECT
        user_id,
        event_time,
        page_url,
        referrer,
        sessionize(event_time, INTERVAL '30 minutes') OVER (
            PARTITION BY user_id ORDER BY event_time
        ) AS session_id
    FROM page_views
),
-- Step 2: Session-level metrics
session_metrics AS (
    SELECT
        user_id,
        session_id,
        COUNT(*) AS pages_viewed,
        MIN(event_time) AS session_start,
        MAX(event_time) AS session_end,
        EXTRACT(EPOCH FROM MAX(event_time) - MIN(event_time)) AS duration_seconds,
        FIRST(referrer) AS entry_referrer
    FROM sessionized
    GROUP BY user_id, session_id
)
-- Step 3: User-level summary
SELECT
    user_id,
    COUNT(*) AS total_sessions,
    ROUND(AVG(pages_viewed), 1) AS avg_pages_per_session,
    ROUND(AVG(duration_seconds), 0) AS avg_duration_seconds,
    SUM(CASE WHEN pages_viewed = 1 THEN 1 ELSE 0 END) AS bounce_sessions
FROM session_metrics
GROUP BY user_id
ORDER BY user_id;

Expected Results

user_id | total_sessions | avg_pages_per_session | avg_duration_seconds | bounce_sessions
--------|----------------|-----------------------|----------------------|----------------
user_a  | 2              | 3.0                   | 900                  | 0
user_b  | 1              | 4.0                   | 1200                 | 0
user_c  | 3              | 1.0                   | 0                    | 3

Interpretation

  • User A had 2 sessions. The first session (4 pages, 25 minutes from 10:00 to 10:25) shows engaged browsing. The 2-hour gap started a new session (2 pages, 5 minutes) from an email campaign.
  • User B had 1 session spanning 20 minutes across 4 pages, ending at the signup page -- a high-intent user.
  • User C had 3 bounce sessions (1 page each). Each visit had a gap exceeding 30 minutes, so each was a separate session. All sessions have zero duration because there is only one page view per session.

This analysis shows that User C is a repeat but non-engaged visitor, while Users A and B demonstrate meaningful engagement. The entry_referrer field from the session CTE can be used for channel attribution at the session level.
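
The session boundary rule is simple to state: a new session begins whenever the gap to the previous event exceeds the threshold. A minimal Python sketch of that semantics (illustrative only, not the extension's implementation; session IDs are 0-based per user here, and the extension's exact numbering may differ):

```python
from datetime import datetime, timedelta

def sessionize(timestamps, gap):
    """Assign 0-based session IDs: a new session starts when the
    gap to the previous event exceeds the threshold."""
    session_id, last_ts, out = 0, None, []
    for ts in sorted(timestamps):
        if last_ts is not None and ts - last_ts > gap:
            session_id += 1
        out.append(session_id)
        last_ts = ts
    return out

# user_c's three page views from the sample data
user_c = [datetime(2024, 1, 15, 8, 0),
          datetime(2024, 1, 15, 8, 45),   # 45-minute gap -> new session
          datetime(2024, 1, 15, 9, 45)]   # 1-hour gap    -> new session
print(sessionize(user_c, timedelta(minutes=30)))  # [0, 1, 2]
```

Applied to user_a's six events, the same rule yields two sessions: the 2-hour gap after the /cart view is the only one exceeding 30 minutes.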


User Journey and Flow Analysis

Problem

A product analytics team wants to understand user navigation flow: after visiting the Home page and then the Product page, what page do users visit next? This "next node" analysis reveals common user journeys and helps identify navigation bottlenecks. The team also wants to understand what pages lead users to the Checkout page (backward analysis).

Sample Data

CREATE TABLE navigation_events (
    user_id    VARCHAR NOT NULL,
    event_time TIMESTAMP NOT NULL,
    page       VARCHAR NOT NULL
);

INSERT INTO navigation_events VALUES
    -- User A: Home -> Product -> Cart -> Checkout -> Confirmation
    ('user_a', '2024-01-15 10:00:00', 'Home'),
    ('user_a', '2024-01-15 10:02:00', 'Product'),
    ('user_a', '2024-01-15 10:05:00', 'Cart'),
    ('user_a', '2024-01-15 10:08:00', 'Checkout'),
    ('user_a', '2024-01-15 10:10:00', 'Confirmation'),

    -- User B: Home -> Product -> Product -> Home (browsing, no conversion)
    ('user_b', '2024-01-15 11:00:00', 'Home'),
    ('user_b', '2024-01-15 11:03:00', 'Product'),
    ('user_b', '2024-01-15 11:07:00', 'Product'),
    ('user_b', '2024-01-15 11:10:00', 'Home'),

    -- User C: Home -> Product -> Cart -> Home (cart abandonment)
    ('user_c', '2024-01-15 14:00:00', 'Home'),
    ('user_c', '2024-01-15 14:05:00', 'Product'),
    ('user_c', '2024-01-15 14:08:00', 'Cart'),
    ('user_c', '2024-01-15 14:15:00', 'Home'),

    -- User D: Home -> Product -> Checkout (skipped cart)
    ('user_d', '2024-01-15 15:00:00', 'Home'),
    ('user_d', '2024-01-15 15:02:00', 'Product'),
    ('user_d', '2024-01-15 15:05:00', 'Checkout');

Analytical Query: Forward Flow

-- What page do users visit after Home -> Product?
SELECT
    user_id,
    sequence_next_node(
        'forward',
        'first_match',
        event_time,
        page,
        page = 'Home',      -- base_condition: start from Home
        page = 'Home',      -- event1: match Home
        page = 'Product'    -- event2: then match Product
    ) AS next_page_after_product
FROM navigation_events
GROUP BY user_id
ORDER BY user_id;

Expected Results (Forward)

user_id | next_page_after_product
--------|------------------------
user_a  | Cart
user_b  | Product
user_c  | Cart
user_d  | Checkout

Analytical Query: Backward Flow

-- What page leads users to arrive at Checkout?
SELECT
    user_id,
    sequence_next_node(
        'backward',
        'first_match',
        event_time,
        page,
        page = 'Checkout',     -- base_condition: anchor on Checkout
        page = 'Checkout'      -- event1: match Checkout
    ) AS page_before_checkout
FROM navigation_events
WHERE user_id IN ('user_a', 'user_d')  -- only users who reached Checkout
GROUP BY user_id
ORDER BY user_id;

Expected Results (Backward)

user_id | page_before_checkout
--------|---------------------
user_a  | Cart
user_d  | Product

Aggregate Flow Distribution

-- Distribution: what do users do after Home -> Product?
WITH next_pages AS (
    SELECT
        sequence_next_node(
            'forward',
            'first_match',
            event_time,
            page,
            page = 'Home',
            page = 'Home',
            page = 'Product'
        ) AS next_page
    FROM navigation_events
    GROUP BY user_id
)
SELECT
    COALESCE(next_page, '(end of session)') AS next_page,
    COUNT(*) AS user_count,
    ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct
FROM next_pages
GROUP BY next_page
ORDER BY user_count DESC;

Interpretation

The forward analysis reveals that after the Home-to-Product sequence:

  • 50% of users proceed to Cart (users A and C) -- the intended happy path.
  • 25% stay on Product pages (user B) -- likely comparing products.
  • 25% go directly to Checkout (user D) -- possible express checkout behavior.

The backward analysis shows that the Cart page is the most common predecessor to Checkout, while some users skip the Cart entirely. This information helps the product team understand whether the Cart step adds friction or value to the conversion flow.
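
The forward/'first_match' behavior above boils down to: find the first position where the event chain matches consecutively, then return the value of the event immediately after it. A simplified Python sketch of that idea (it ignores the base_condition and timestamp arguments, and the extension's full semantics include other base modes such as 'head' and 'tail'):

```python
def next_node_forward_first_match(pages, chain):
    """Find the first position where the chain of page names matches
    consecutively, and return the value of the event right after it."""
    k = len(chain)
    for i in range(len(pages) - k + 1):
        if pages[i:i + k] == chain:
            return pages[i + k] if i + k < len(pages) else None
    return None

# user_b's navigation from the sample data
pages_b = ['Home', 'Product', 'Product', 'Home']
print(next_node_forward_first_match(pages_b, ['Home', 'Product']))  # Product
```

Run against the other users' page lists, this reproduces the forward-flow results above: Cart for user_a and user_c, Checkout for user_d.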


A/B Test Behavioral Analysis

Problem

A growth team is running an A/B test on a new onboarding flow. They need to compare behavioral patterns between the control and treatment groups. Beyond simple conversion rates, they want to measure: funnel depth, event sequence completion, and the number of times users repeat certain actions. This multi-dimensional behavioral comparison reveals whether the new flow changes user behavior patterns, not just outcomes.

Sample Data

CREATE TABLE ab_test_events (
    user_id     VARCHAR NOT NULL,
    event_time  TIMESTAMP NOT NULL,
    event_type  VARCHAR NOT NULL,
    test_group  VARCHAR NOT NULL   -- 'control' or 'treatment'
);

INSERT INTO ab_test_events VALUES
    -- Control group: User A - completes onboarding slowly
    ('user_a', '2024-01-15 10:00:00', 'signup',          'control'),
    ('user_a', '2024-01-15 10:05:00', 'profile_setup',   'control'),
    ('user_a', '2024-01-15 10:40:00', 'tutorial_start',  'control'),
    ('user_a', '2024-01-15 10:45:00', 'tutorial_end',    'control'),
    ('user_a', '2024-01-15 11:00:00', 'first_action',    'control'),

    -- Control group: User B - drops off after profile
    ('user_b', '2024-01-15 11:00:00', 'signup',          'control'),
    ('user_b', '2024-01-15 11:10:00', 'profile_setup',   'control'),
    ('user_b', '2024-01-15 11:15:00', 'tutorial_start',  'control'),

    -- Control group: User C - signs up only
    ('user_c', '2024-01-15 12:00:00', 'signup',          'control'),

    -- Treatment group: User D - completes onboarding quickly
    ('user_d', '2024-01-15 10:00:00', 'signup',          'treatment'),
    ('user_d', '2024-01-15 10:02:00', 'profile_setup',   'treatment'),
    ('user_d', '2024-01-15 10:05:00', 'tutorial_start',  'treatment'),
    ('user_d', '2024-01-15 10:08:00', 'tutorial_end',    'treatment'),
    ('user_d', '2024-01-15 10:10:00', 'first_action',    'treatment'),

    -- Treatment group: User E - completes onboarding
    ('user_e', '2024-01-15 13:00:00', 'signup',          'treatment'),
    ('user_e', '2024-01-15 13:03:00', 'profile_setup',   'treatment'),
    ('user_e', '2024-01-15 13:06:00', 'tutorial_start',  'treatment'),
    ('user_e', '2024-01-15 13:10:00', 'tutorial_end',    'treatment'),
    ('user_e', '2024-01-15 13:12:00', 'first_action',    'treatment'),

    -- Treatment group: User F - drops off after tutorial
    ('user_f', '2024-01-15 14:00:00', 'signup',          'treatment'),
    ('user_f', '2024-01-15 14:02:00', 'profile_setup',   'treatment'),
    ('user_f', '2024-01-15 14:04:00', 'tutorial_start',  'treatment'),
    ('user_f', '2024-01-15 14:07:00', 'tutorial_end',    'treatment');

Analytical Query: Multi-Dimensional Comparison

-- Dimension 1: Funnel depth (how far into onboarding?)
WITH funnel_analysis AS (
    SELECT
        user_id,
        test_group,
        window_funnel(
            INTERVAL '2 hours',
            event_time,
            event_type = 'signup',
            event_type = 'profile_setup',
            event_type = 'tutorial_start',
            event_type = 'tutorial_end',
            event_type = 'first_action'
        ) AS funnel_step
    FROM ab_test_events
    GROUP BY user_id, test_group
),
-- Dimension 2: Did each step follow within 30 minutes of the previous one?
sequence_analysis AS (
    SELECT
        user_id,
        test_group,
        sequence_match(
            '(?1).*(?t<=1800)(?2).*(?t<=1800)(?3).*(?t<=1800)(?4).*(?t<=1800)(?5)',
            event_time,
            event_type = 'signup',
            event_type = 'profile_setup',
            event_type = 'tutorial_start',
            event_type = 'tutorial_end',
            event_type = 'first_action'
        ) AS completed_within_30min
    FROM ab_test_events
    GROUP BY user_id, test_group
),
-- Dimension 3: Event timestamps for completed users
timing_analysis AS (
    SELECT
        user_id,
        test_group,
        sequence_match_events(
            '(?1).*(?2).*(?3).*(?4).*(?5)',
            event_time,
            event_type = 'signup',
            event_type = 'profile_setup',
            event_type = 'tutorial_start',
            event_type = 'tutorial_end',
            event_type = 'first_action'
        ) AS step_timestamps
    FROM ab_test_events
    GROUP BY user_id, test_group
)
-- Combined report per test group
SELECT
    f.test_group,
    COUNT(*) AS users,
    ROUND(AVG(f.funnel_step), 2) AS avg_funnel_depth,
    SUM(CASE WHEN f.funnel_step = 5 THEN 1 ELSE 0 END) AS completed_onboarding,
    ROUND(100.0 * SUM(CASE WHEN f.funnel_step = 5 THEN 1 ELSE 0 END) /
        COUNT(*), 1) AS completion_rate_pct,
    SUM(CASE WHEN s.completed_within_30min THEN 1 ELSE 0 END)
        AS completed_fast,
    ROUND(100.0 * SUM(CASE WHEN s.completed_within_30min THEN 1 ELSE 0 END) /
        NULLIF(SUM(CASE WHEN f.funnel_step = 5 THEN 1 ELSE 0 END), 0), 1)
        AS fast_completion_rate_pct
FROM funnel_analysis f
JOIN sequence_analysis s ON f.user_id = s.user_id
JOIN timing_analysis t ON f.user_id = t.user_id
GROUP BY f.test_group
ORDER BY f.test_group;

Expected Results

test_group | users | avg_funnel_depth | completed_onboarding | completion_rate_pct | completed_fast | fast_completion_rate_pct
-----------|-------|------------------|----------------------|---------------------|----------------|-------------------------
control    | 3     | 3.00             | 1                    | 33.3                | 0              | 0.0
treatment  | 3     | 4.67             | 2                    | 66.7                | 2              | 100.0

Step-by-Step Funnel Comparison

-- Detailed funnel step distribution per group
WITH user_funnels AS (
    SELECT
        user_id,
        test_group,
        window_funnel(
            INTERVAL '2 hours',
            event_time,
            event_type = 'signup',
            event_type = 'profile_setup',
            event_type = 'tutorial_start',
            event_type = 'tutorial_end',
            event_type = 'first_action'
        ) AS step
    FROM ab_test_events
    GROUP BY user_id, test_group
)
SELECT
    test_group,
    step AS reached_step,
    COUNT(*) AS user_count
FROM user_funnels
GROUP BY test_group, step
ORDER BY test_group, step;

test_group | reached_step | user_count
-----------|--------------|-----------
control    | 1            | 1
control    | 3            | 1
control    | 5            | 1
treatment  | 4            | 1
treatment  | 5            | 2

Interpretation

The treatment group shows stronger behavioral metrics across all dimensions:

  • Funnel depth: Average 4.67 steps vs. 3.00 -- users progress further through onboarding.
  • Completion rate: 66.7% vs. 33.3% -- twice as many users complete the full onboarding.
  • Speed: All treatment completers finished within 30 minutes between each step. The single control completer (user A) took 60 minutes overall, with a 35-minute gap between profile setup and tutorial start that breaks the 30-minute step constraint.
  • Drop-off pattern: In the control group, one user (C) dropped off at step 1 (signup only), suggesting the immediate post-signup experience needs work. In the treatment group, the only non-completing user (F) still reached step 4 (tutorial_end), indicating better engagement even among non-completers.

The sequence_match with time constraints ((?t<=1800)) specifically identifies users who moved through each step within 30 minutes, measuring momentum rather than just completion. The sequence_match_events output can be used for further analysis of time-between-steps to identify specific bottleneck transitions.
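
A chain of (?t<=N) constraints reduces to one check per transition: every gap between consecutive matched steps must be at most N seconds. A small Python illustration using treatment user D's timestamps from the sample data:

```python
from datetime import datetime, timedelta

def steps_within_limit(step_times, limit):
    """True when every gap between consecutive matched steps is <= limit,
    which is what a chain of (?t<=N) constraints requires."""
    return all(b - a <= limit for a, b in zip(step_times, step_times[1:]))

# Treatment user D: signup .. first_action, gaps of 2-3 minutes each
user_d = [datetime(2024, 1, 15, 10, 0), datetime(2024, 1, 15, 10, 2),
          datetime(2024, 1, 15, 10, 5), datetime(2024, 1, 15, 10, 8),
          datetime(2024, 1, 15, 10, 10)]
print(steps_within_limit(user_d, timedelta(seconds=1800)))  # True
```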

FAQ

Frequently asked questions about duckdb-behavioral.

Loading the Extension

How do I install the extension?

The extension is available in the DuckDB Community Extensions repository. Install and load with:

INSTALL behavioral FROM community;
LOAD behavioral;

No build tools, compilation, or -unsigned flag required.

Can I build from source instead?

Yes. Build in release mode, then load from the DuckDB CLI or any DuckDB client:

cargo build --release
-- Linux
LOAD 'target/release/libbehavioral.so';

-- macOS
LOAD 'target/release/libbehavioral.dylib';

Locally-built extensions require the -unsigned flag:

duckdb -unsigned -c "LOAD 'target/release/libbehavioral.so'; SELECT ..."

The extension fails to load. What should I check?

  1. DuckDB version mismatch: The community extension is built for DuckDB v1.4.4. If you are using a different DuckDB version, the extension may not be available for that version yet. For locally-built extensions, the C API version must match (currently v1.2.0).

  2. Missing -unsigned flag (local builds only): DuckDB rejects unsigned extensions by default. Use duckdb -unsigned or set allow_unsigned_extensions=true. This does not apply when installing via INSTALL behavioral FROM community.

  3. Wrong file path (local builds only): Ensure the path points to the actual .so or .dylib file produced by cargo build --release.

  4. Platform mismatch (local builds only): An extension built on Linux cannot be loaded on macOS, and vice versa. The community extension handles platform selection automatically.

Functions

What is the difference between sequence_match, sequence_count, and sequence_match_events?

All three use the same pattern syntax and NFA engine, but return different results:

Function              | Returns         | Use Case
----------------------|-----------------|--------------------------------------------------------
sequence_match        | BOOLEAN         | Did the pattern occur at all?
sequence_count        | BIGINT          | How many times did the pattern occur (non-overlapping)?
sequence_match_events | LIST(TIMESTAMP) | At what timestamps did each step match?

Use sequence_match for filtering, sequence_count for frequency analysis, and sequence_match_events for diagnostic investigation of when each step was satisfied.

How many boolean conditions can I use?

All functions support 2 to 32 boolean condition parameters. This matches ClickHouse's limit. The conditions are stored internally as a u32 bitmask, so the 32-condition limit is a hard constraint of the data type.
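
The bitmask representation is easy to picture: each event's conditions collapse into one 32-bit value, with bit i set when condition i+1 is true. A hedged Python sketch of that encoding (the extension's actual struct layout may differ):

```python
def encode_conditions(conds):
    """Pack a list of booleans (conditions 1..N, N <= 32) into one
    integer bitmask: bit i is set when condition i+1 is true."""
    if len(conds) > 32:
        raise ValueError("at most 32 conditions fit in a u32 bitmask")
    mask = 0
    for i, c in enumerate(conds):
        if c:
            mask |= 1 << i
    return mask

# An event that satisfies condition 1 and condition 3
print(bin(encode_conditions([True, False, True])))  # 0b101
```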

How are NULL values handled?

  • NULL timestamps: Rows with NULL timestamps are ignored during update.
  • NULL conditions: NULL boolean conditions are treated as false.
  • NULL pattern: An empty or NULL pattern string results in no match.
  • sequence_next_node: NULL event column values are stored and can be returned as the result. The function returns NULL when no match is found or no adjacent event exists.

When should I use window_funnel vs. sequence_match?

Both functions analyze multi-step user journeys, but they serve different purposes:

  • window_funnel answers: "How far did the user get through a specific ordered funnel within a time window?" It returns a step count (0 to N) and supports behavioral modes (strict, strict_once, etc.). Best for conversion funnel analysis where you want drop-off rates between steps.

  • sequence_match answers: "Did a specific pattern of events occur?" It uses a flexible regex-like syntax supporting wildcards (., .*) and time constraints ((?t<=3600)). Best for detecting complex behavioral patterns that go beyond simple ordered steps.

Use window_funnel when you need step-by-step drop-off metrics. Use sequence_match when you need flexible pattern matching with arbitrary gaps and time constraints.
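
The core window_funnel idea can be sketched in a few lines: walk events in time order, advance the step counter whenever the next condition fires, and only accept steps whose timestamps fall within the window opened by step 1. This is a simplified greedy Python sketch of the default mode (the real function also considers later chain starts and the strict_* modes add further constraints):

```python
def window_funnel(window_seconds, events, conditions):
    """Greedy sketch: events is a list of (timestamp_seconds, event_type)
    sorted by time; conditions is one predicate per funnel step.
    Returns how many steps matched in order within the window."""
    step, window_start = 0, None
    for ts, etype in events:
        if step < len(conditions) and conditions[step](etype):
            if step == 0:
                window_start = ts
            elif ts - window_start > window_seconds:
                continue  # outside the window opened by step 1
            step += 1
    return step

events = [(0, 'view'), (600, 'cart'), (5000, 'purchase')]
conds = [lambda e: e == 'view', lambda e: e == 'cart',
         lambda e: e == 'purchase']
print(window_funnel(3600, events, conds))  # 2: purchase falls outside 1 hour
```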

Pattern Syntax

Quick reference

Pattern | Description
--------|-----------------------------------------------------
(?N)    | Match an event where condition N (1-indexed) is true
.       | Match exactly one event (any conditions)
.*      | Match zero or more events (any conditions)
(?t>=N) | At least N seconds since previous match
(?t<=N) | At most N seconds since previous match
(?t>N)  | More than N seconds since previous match
(?t<N)  | Less than N seconds since previous match
(?t==N) | Exactly N seconds since previous match
(?t!=N) | Not exactly N seconds since previous match

Common patterns

-- User viewed then purchased (any events in between)
'(?1).*(?2)'

-- User viewed then purchased with no intervening events
'(?1)(?2)'

-- User viewed, then purchased within 1 hour (3600 seconds)
'(?1).*(?t<=3600)(?2)'

-- Three-step funnel with time constraints
'(?1).*(?t<=3600)(?2).*(?t<=7200)(?3)'

Window Funnel Modes

What modes are available and how do they combine?

Six modes are available, each adding an independent constraint:

Mode                 | Effect
---------------------|--------------------------------------------------------------
strict               | Condition for step N-1 must not refire before step N matches
strict_order         | No earlier conditions may fire between matched steps
strict_deduplication | Skip events with the same timestamp as the previous step
strict_increase      | Require strictly increasing timestamps between steps
strict_once          | Each event can advance the funnel by at most one step
allow_reentry        | Reset the funnel when condition 1 fires again

Modes are independently combinable via a comma-separated string:

window_funnel(INTERVAL '1 hour', 'strict_increase, strict_once',
  ts, cond1, cond2, cond3)

Differences from ClickHouse

How does the syntax differ from ClickHouse?

ClickHouse uses a two-level call syntax where configuration parameters are separated from data parameters by an extra set of parentheses. duckdb-behavioral uses a flat parameter list, consistent with standard SQL function syntax.

-- ClickHouse syntax
windowFunnel(3600)(timestamp, cond1, cond2, cond3)
sequenceMatch('(?1).*(?2)')(timestamp, cond1, cond2)
sequenceNextNode('forward', 'head')(timestamp, value, base_cond, ev1, ev2)

-- duckdb-behavioral syntax
window_funnel(INTERVAL '1 hour', timestamp, cond1, cond2, cond3)
sequence_match('(?1).*(?2)', timestamp, cond1, cond2)
sequence_next_node('forward', 'head', timestamp, value, base_cond, ev1, ev2)

What are the naming convention differences?

ClickHouse       | duckdb-behavioral
-----------------|-------------------
windowFunnel     | window_funnel
sequenceMatch    | sequence_match
sequenceCount    | sequence_count
sequenceNextNode | sequence_next_node
retention        | retention

ClickHouse uses camelCase. duckdb-behavioral uses snake_case, following DuckDB's naming conventions.

Are there differences in parameter types?

Parameter                    | ClickHouse                             | duckdb-behavioral
-----------------------------|----------------------------------------|-----------------------------------
Window size                  | Integer (seconds)                      | DuckDB INTERVAL type
Mode string                  | Second parameter in the parameter list | Optional VARCHAR before timestamp
Time constraints in patterns | Seconds (integer)                      | Seconds (integer) -- same
Condition limit              | 32                                     | 32 -- same

The INTERVAL type is more expressive than raw seconds. You can write INTERVAL '1 hour', INTERVAL '30 minutes', or INTERVAL '2 days' instead of computing the equivalent number of seconds.

Does this extension have a sessionize equivalent in ClickHouse?

No. The sessionize function has no direct equivalent in ClickHouse's behavioral analytics function set. ClickHouse provides session analysis through different mechanisms (e.g., sessionTimeoutSeconds in windowFunnel). The sessionize window function is a DuckDB-specific addition for assigning session IDs based on inactivity gaps.

Do I need to set any experimental flags?

No. In ClickHouse, sequenceNextNode requires SET allow_experimental_funnel_functions = 1. In duckdb-behavioral, all seven functions are available immediately after loading the extension, with no experimental flags required.

Data Preparation

What columns does my data need?

The required columns depend on which function you use:

Function                                                | Required Columns
--------------------------------------------------------|------------------------------------------------------------------------------------
sessionize                                              | A TIMESTAMP column for event time
retention                                               | Boolean expressions for each cohort period
window_funnel                                           | A TIMESTAMP column and boolean expressions for each funnel step
sequence_match / sequence_count / sequence_match_events | A TIMESTAMP column and boolean expressions for conditions
sequence_next_node                                      | A TIMESTAMP column, a VARCHAR value column, and boolean expressions for conditions

The boolean "condition" parameters are typically inline expressions rather than stored columns. For example:

-- Conditions are computed inline from existing columns
window_funnel(INTERVAL '1 hour', event_time,
  event_type = 'page_view',       -- cond1: computed from event_type
  event_type = 'add_to_cart',     -- cond2: computed from event_type
  event_type = 'purchase'         -- cond3: computed from event_type
)

What is the ideal table structure for behavioral analytics?

A single event-level table with one row per user action works best. The minimum structure is:

CREATE TABLE events (
    user_id    VARCHAR NOT NULL,   -- or INTEGER, UUID, etc.
    event_time TIMESTAMP NOT NULL,
    event_type VARCHAR NOT NULL    -- the action taken
);

For richer analysis, add context columns:

CREATE TABLE events (
    user_id      VARCHAR NOT NULL,
    event_time   TIMESTAMP NOT NULL,
    event_type   VARCHAR NOT NULL,
    page_url     VARCHAR,            -- for web analytics
    product_id   VARCHAR,            -- for e-commerce
    revenue      DECIMAL(10,2),      -- for purchase events
    device_type  VARCHAR,            -- for segmentation
    campaign     VARCHAR             -- for attribution
);

All behavioral functions operate on this flat event stream. There is no need for pre-aggregated or pivoted data.

Do events need to be sorted before calling these functions?

No. All event-collecting functions (window_funnel, sequence_match, sequence_count, sequence_match_events, sequence_next_node) sort events by timestamp internally during the finalize phase. You do not need an ORDER BY clause for these aggregate functions.

However, an ORDER BY on the timestamp column can still improve performance. The extension includes a presorted detection optimization: if events arrive already sorted (which happens when DuckDB's query planner pushes down an ORDER BY), the O(n log n) sort is skipped entirely, reducing finalize to O(n).
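
The presorted detection amounts to one linear scan before sorting. A minimal sketch of the idea (illustrative Python, not the extension's Rust code):

```python
def finalize_sort(timestamps):
    """Skip the O(n log n) sort when the collected events are already
    in non-decreasing timestamp order -- one O(n) scan decides."""
    presorted = all(a <= b for a, b in zip(timestamps, timestamps[1:]))
    return timestamps if presorted else sorted(timestamps)

print(finalize_sort([1, 2, 2, 5]))  # already sorted: returned as-is
print(finalize_sort([3, 1, 2]))     # falls back to sorting: [1, 2, 3]
```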

The sessionize window function does require ORDER BY in the OVER clause because it is a window function, not an aggregate:

sessionize(event_time, INTERVAL '30 minutes') OVER (
    PARTITION BY user_id ORDER BY event_time
)

Can I use these functions with Parquet, CSV, or other file formats?

Yes. DuckDB can read from Parquet, CSV, JSON, and many other formats directly. The behavioral functions operate on DuckDB's internal columnar representation, so the source format is irrelevant:

-- Query Parquet files directly
SELECT user_id,
  window_funnel(INTERVAL '1 hour', event_time,
    event_type = 'view', event_type = 'cart', event_type = 'purchase')
FROM read_parquet('events/*.parquet')
GROUP BY user_id;

-- Query CSV files
SELECT user_id,
  retention(
    event_date = '2024-01-01',
    event_date = '2024-01-02',
    event_date = '2024-01-03')
FROM read_csv('events.csv')
GROUP BY user_id;

Scaling and Memory

How much data can the extension handle?

The extension has been benchmarked at scale with Criterion.rs:

Function              | Tested Scale     | Throughput  | Memory Model
----------------------|------------------|-------------|----------------------------
sessionize            | 1 billion rows   | 830 Melem/s | O(1) per partition segment
retention             | 100 million rows | 365 Melem/s | O(1) -- single u32 bitmask
window_funnel         | 100 million rows | 126 Melem/s | O(n) -- 16 bytes per event
sequence_match        | 100 million rows | 95 Melem/s  | O(n) -- 16 bytes per event
sequence_count        | 100 million rows | 85 Melem/s  | O(n) -- 16 bytes per event
sequence_match_events | 100 million rows | 93 Melem/s  | O(n) -- 16 bytes per event
sequence_next_node    | 10 million rows  | 18 Melem/s  | O(n) -- 32 bytes per event

In practice, real-world datasets with billions of rows are easily handled because the functions operate on partitioned groups (e.g., per-user), not the entire table at once. A table with 1 billion rows across 10 million users has only 100 events per user on average.

How much memory do the functions use?

Memory usage depends on the function category:

O(1)-state functions (constant memory regardless of group size):

  • sessionize: Tracks only first_ts, last_ts, and boundaries count. Requires a few dozen bytes per partition segment, regardless of how many events exist.
  • retention: Stores a single u32 bitmask (4 bytes) per group. One billion rows with retention requires effectively zero additional memory.

Event-collecting functions (linear memory proportional to group size):

  • window_funnel, sequence_match, sequence_count, sequence_match_events: Store every event as a 16-byte Event struct. For a group with 10,000 events, this requires approximately 160 KB.
  • sequence_next_node: Stores every event as a 32-byte NextNodeEvent struct (includes an Arc<str> reference to the value column). For a group with 10,000 events, this requires approximately 320 KB plus string storage.

Rule of thumb: For event-collecting functions, estimate memory as 16 bytes * (events in largest group) for most functions, or 32 bytes * (events in largest group) for sequence_next_node. The group is defined by GROUP BY for aggregate functions or PARTITION BY for sessionize.
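
The rule of thumb is just multiplication; a quick helper makes the estimate explicit (per-event sizes taken from the figures above, string storage for sequence_next_node excluded):

```python
def estimate_state_bytes(events_in_largest_group, function_name):
    """Estimate peak aggregate-state memory for one group, using the
    documented per-event struct sizes (string storage excluded)."""
    per_event = 32 if function_name == 'sequence_next_node' else 16
    return events_in_largest_group * per_event

# A group with 10,000 events
print(estimate_state_bytes(10_000, 'window_funnel'))       # 160000 bytes, ~160 KB
print(estimate_state_bytes(10_000, 'sequence_next_node'))  # 320000 bytes, ~320 KB
```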

What happens if a single user has millions of events?

The extension will process it correctly, but memory usage scales linearly with the number of events in that group. A single user with 10 million events will require approximately 160 MB of memory for event-collecting functions (16 bytes times 10M).

If you have users with extremely large event counts, consider pre-filtering to a relevant time window before applying behavioral functions:

-- Pre-filter to last 90 days before funnel analysis
SELECT user_id,
  window_funnel(INTERVAL '1 hour', event_time,
    event_type = 'view', event_type = 'cart', event_type = 'purchase')
FROM events
WHERE event_time >= CURRENT_TIMESTAMP - INTERVAL '90 days'
GROUP BY user_id;

Does the extension support out-of-core processing?

The extension itself does not implement out-of-core (disk-spilling) strategies. However, DuckDB's query engine handles memory management for the overall query pipeline. The aggregate state for each group is held in memory during execution. For most workloads where GROUP BY partitions data into reasonably sized groups (thousands to tens of thousands of events per user), this is not a concern.

Combine and Aggregate Behavior

How does GROUP BY work with these functions?

The aggregate functions (retention, window_funnel, sequence_match, sequence_count, sequence_match_events, sequence_next_node) follow standard SQL GROUP BY semantics. Each group produces one output row:

-- One result per user
SELECT user_id,
  window_funnel(INTERVAL '1 hour', event_time,
    event_type = 'view', event_type = 'purchase') as steps
FROM events
GROUP BY user_id;

-- One result per user per day
SELECT user_id, event_time::DATE as day,
  sequence_match('(?1).*(?2)', event_time,
    event_type = 'view', event_type = 'purchase') as converted
FROM events
GROUP BY user_id, day;

If you omit GROUP BY, the function aggregates over the entire table (all rows form a single group):

-- Single result for the entire table
SELECT retention(
    event_date = '2024-01-01',
    event_date = '2024-01-02'
  ) as overall_retention
FROM events;

What is the combine operation and why does it matter?

DuckDB evaluates aggregates in parallel and must merge partial aggregate states -- both during parallel hash aggregation and in the segment tree it uses for windowed aggregation. The combine operation merges two states into one. For event-collecting functions, this means appending one state's events to another's.

This is an internal implementation detail that you do not need to worry about for correctness. However, it affects performance:

  • sessionize and retention have O(1) combine (constant time regardless of data size), making them extremely fast even at billion-row scale.
  • Event-collecting functions have O(m) combine where m is the number of events in the source state. This is still fast -- 100 million events process in under 1 second -- but it is the dominant cost at scale.
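
The difference between O(1) and O(m) combine is easy to see side by side. A hedged sketch of the two merge shapes (illustrative Python, not the extension's Rust implementation):

```python
def combine_retention(mask_a, mask_b):
    """retention state is one bitmask per group: combining two partial
    states is a single bitwise OR -- O(1) regardless of row count."""
    return mask_a | mask_b

def combine_events(events_a, events_b):
    """Event-collecting state is a list of (timestamp, condition_mask)
    pairs: combining appends one list to the other -- O(m)."""
    return events_a + events_b

print(bin(combine_retention(0b0101, 0b0011)))   # 0b111
print(combine_events([(1, 0b1)], [(2, 0b10)]))  # [(1, 1), (2, 2)]
```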

Can I use these functions with PARTITION BY (window functions)?

Only sessionize is a window function. All other functions are aggregate functions.

-- Correct: sessionize is a window function
SELECT sessionize(event_time, INTERVAL '30 minutes') OVER (
    PARTITION BY user_id ORDER BY event_time
) as session_id
FROM events;

-- Correct: window_funnel is an aggregate function
SELECT user_id,
  window_funnel(INTERVAL '1 hour', event_time, cond1, cond2)
FROM events
GROUP BY user_id;

You can use GROUP BY with the aggregate functions to partition results by any column or expression -- user ID, date, campaign, device type, or any combination.

Can I nest these functions or use them in subqueries?

Yes. The results of behavioral functions can be used in subqueries, CTEs, and outer queries like any other SQL expression:

-- Use window_funnel results in a downstream aggregation
WITH user_funnels AS (
    SELECT user_id,
      window_funnel(INTERVAL '1 hour', event_time,
        event_type = 'view', event_type = 'cart', event_type = 'purchase'
      ) as steps
    FROM events
    GROUP BY user_id
)
SELECT steps, COUNT(*) as user_count
FROM user_funnels
GROUP BY steps
ORDER BY steps;

Integration

How do I use this extension with Python?

Use the duckdb Python package. Install and load the extension from the community repository:

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL behavioral FROM community")
conn.execute("LOAD behavioral")

# Run behavioral queries and get results as a DataFrame
df = conn.execute("""
    SELECT user_id,
      window_funnel(INTERVAL '1 hour', event_time,
        event_type = 'view',
        event_type = 'cart',
        event_type = 'purchase') as steps
    FROM events
    GROUP BY user_id
""").fetchdf()

You can also pass pandas DataFrames directly to DuckDB:

import pandas as pd

events_df = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'event_time': pd.to_datetime([
        '2024-01-01 10:00', '2024-01-01 10:30', '2024-01-01 11:00',
        '2024-01-01 09:00', '2024-01-01 09:15'
    ]),
    'event_type': ['view', 'cart', 'purchase', 'view', 'cart']
})

result = conn.execute("""
    SELECT user_id,
      window_funnel(INTERVAL '1 hour', event_time,
        event_type = 'view', event_type = 'cart', event_type = 'purchase'
      ) as steps
    FROM events_df
    GROUP BY user_id
""").fetchdf()

How do I use this extension with Node.js?

Use the duckdb-async or duckdb npm package:

const duckdb = require('duckdb');
const db = new duckdb.Database(':memory:');

db.run("INSTALL behavioral FROM community", (err) => {
    if (err) throw err;
    db.run("LOAD behavioral", (err) => {
        if (err) throw err;

        db.all(`
            SELECT user_id,
              window_funnel(INTERVAL '1 hour', event_time,
                event_type = 'view',
                event_type = 'cart',
                event_type = 'purchase') as steps
            FROM read_parquet('events.parquet')
            GROUP BY user_id
        `, (err, rows) => {
            console.log(rows);
        });
    });
});

How do I use this extension with dbt?

dbt-duckdb supports loading community extensions. Add the extension to your profiles.yml:

my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ':memory:'
      extensions:
        - name: behavioral
          repo: community

Then use the behavioral functions directly in your dbt models:

-- models/staging/stg_user_funnels.sql
SELECT
    user_id,
    window_funnel(
        INTERVAL '1 hour',
        event_time,
        event_type = 'view',
        event_type = 'cart',
        event_type = 'purchase'
    ) as furthest_step
FROM {{ ref('stg_events') }}
GROUP BY user_id

-- models/staging/stg_user_sessions.sql
SELECT
    user_id,
    event_time,
    sessionize(event_time, INTERVAL '30 minutes') OVER (
        PARTITION BY user_id ORDER BY event_time
    ) as session_id
FROM {{ ref('stg_events') }}

Can I use the extension in DuckDB's WASM or HTTP client?

No. The extension is a native loadable binary (.so on Linux, .dylib on macOS) and requires the native DuckDB runtime. It cannot be loaded in DuckDB-WASM (browser) or through the DuckDB HTTP API without a native DuckDB server process backing the connection.

Performance

How fast is the extension?

Headline benchmarks (Criterion.rs, 95% CI):

Function               Scale          Throughput
sessionize             1 billion      830 Melem/s
retention              100 million    365 Melem/s
window_funnel          100 million    126 Melem/s
sequence_match         100 million    95 Melem/s
sequence_count         100 million    85 Melem/s
sequence_match_events  100 million    93 Melem/s
sequence_next_node     10 million     18 Melem/s

See the Performance page for full methodology and optimization history.

Can I run the benchmarks myself?

Yes:

cargo bench                    # Run all benchmarks
cargo bench -- sessionize      # Run a specific group
cargo bench -- sequence_match  # Another specific group

Results are stored in target/criterion/ and automatically compared against previous runs.

What can I do to improve query performance?

  1. Pre-filter by time range. Reduce the number of events before applying behavioral functions. A WHERE event_time >= '2024-01-01' clause can dramatically reduce the working set.

  2. Use GROUP BY to keep group sizes reasonable. Partitioning by user_id ensures each group contains only one user's events rather than the entire table.

  3. Order by timestamp. While not required for correctness, an ORDER BY on the timestamp column enables the presorted detection optimization, which skips the internal O(n log n) sort.

  4. Use Parquet format. Parquet's columnar storage and predicate pushdown work well with DuckDB's query optimizer, reducing I/O for behavioral queries that typically read only a few columns.

sessionize

Window function that assigns monotonically increasing session IDs. A new session begins when the gap between consecutive events exceeds a configurable threshold.

Signature

sessionize(timestamp TIMESTAMP, gap INTERVAL) -> BIGINT

Parameters:

Parameter  Type       Description
timestamp  TIMESTAMP  Event timestamp
gap        INTERVAL   Maximum allowed inactivity gap between events in the same session

Returns: BIGINT -- the session ID (1-indexed, monotonically increasing within each partition).

Usage

sessionize is used as a window function with OVER (PARTITION BY ... ORDER BY ...).

SELECT user_id, event_time,
  sessionize(event_time, INTERVAL '30 minutes') OVER (
    PARTITION BY user_id ORDER BY event_time
  ) as session_id
FROM events;

Behavior

  • The first event in each partition is assigned session ID 1.
  • Each subsequent event is compared to the previous event's timestamp.
  • If the gap exceeds the threshold, the session ID increments.
  • A gap exactly equal to the threshold does not start a new session; the gap must strictly exceed the threshold.
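
The boundary rule above can be sketched in pure Python (an illustrative model of the semantics with timestamps as integer minutes, not the extension's Rust implementation):

```python
def sessionize(timestamps, gap):
    """Assign 1-indexed session IDs over timestamps sorted ascending.
    A new session starts only when the gap STRICTLY exceeds the
    threshold; a gap exactly equal to the threshold stays in-session."""
    session_ids = []
    current = 1
    for i, ts in enumerate(timestamps):
        if i > 0 and ts - timestamps[i - 1] > gap:
            current += 1
        session_ids.append(current)
    return session_ids

# Events at minute offsets 0, 15, 25, 90, 105, 180 with a 30-minute gap
print(sessionize([0, 15, 25, 90, 105, 180], 30))  # [1, 1, 1, 2, 2, 3]
```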

Example

Given events for a single user with a 30-minute threshold:

event_time  session_id  Reason
10:00       1           First event
10:15       1           15 min gap (within threshold)
10:25       1           10 min gap (within threshold)
11:30       2           65 min gap (exceeds threshold)
11:45       2           15 min gap (within threshold)
13:00       3           75 min gap (exceeds threshold)

Implementation

The state tracks the first timestamp, last timestamp, and the number of session boundaries (gaps exceeding the threshold). The combine operation is O(1), which enables efficient evaluation via DuckDB's segment tree windowing machinery.
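
The O(1) combine described above can be modeled with a minimal state triple. This is a hedged sketch (timestamps as integer minutes, `None` as the empty state), not the extension's Rust code:

```python
def sessionize_update(state, ts, gap):
    """Fold one event into a (first_ts, last_ts, boundaries) state.
    `boundaries` counts gaps that strictly exceed the threshold."""
    if state is None:
        return (ts, ts, 0)
    first, last, boundaries = state
    if ts - last > gap:
        boundaries += 1
    return (first, ts, boundaries)

def sessionize_combine(left, right, gap):
    """O(1) merge of two adjacent segment states: only the seam between
    left's last timestamp and right's first can add a new boundary."""
    if left is None:
        return right
    if right is None:
        return left
    seam = 1 if right[0] - left[1] > gap else 0
    return (left[0], right[1], left[2] + right[2] + seam)

# Folding [0, 15] and [25, 90] separately (30-minute gap), then combining,
# yields the same state as folding the whole stream left to right.
```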

Operation  Complexity
Update     O(1)
Combine    O(1)
Finalize   O(1)
Space      O(1) per partition segment

At benchmark scale, sessionize processes 1 billion rows in 1.20 seconds (830 Melem/s).

See Also

  • retention -- cohort retention analysis using boolean conditions
  • window_funnel -- conversion funnel step tracking within time windows

retention

Aggregate function for cohort retention analysis. Takes N boolean conditions and returns an array indicating which conditions co-occurred with the anchor condition.

Signature

retention(cond1 BOOLEAN, cond2 BOOLEAN [, ...]) -> BOOLEAN[]

Parameters:

Parameter     Type     Description
cond1         BOOLEAN  Anchor condition (e.g., user appeared in cohort)
cond2..condN  BOOLEAN  Retention conditions for subsequent periods

Supports 2 to 32 boolean parameters.

Returns: BOOLEAN[] -- an array of length N where element i indicates whether condition i was satisfied alongside the anchor condition.

Usage

SELECT cohort_month,
  retention(
    activity_date = cohort_month,
    activity_date = cohort_month + INTERVAL '1 month',
    activity_date = cohort_month + INTERVAL '2 months',
    activity_date = cohort_month + INTERVAL '3 months'
  ) as retained
FROM user_activity
GROUP BY user_id, cohort_month;

Behavior

  • result[0] is true if the anchor condition (cond1) was satisfied by any row in the group.
  • result[i] (for i > 0) is true if both the anchor condition and condition i were satisfied by some row(s) in the group. The conditions do not need to be satisfied by the same row.
  • If the anchor condition was never satisfied, all results are false. A user who never appeared in the cohort cannot be retained.

Example

Given three rows for a user in a monthly cohort analysis:

Row  cond1 (month 0)  cond2 (month 1)  cond3 (month 2)
1    true             false            false
2    false            true             false
3    false            false            false

Result: [true, true, false]

  • The anchor was met (row 1).
  • Month 1 retention is confirmed (row 2 satisfies cond2, anchor was met by row 1).
  • Month 2 was never satisfied.

Implementation

Conditions are tracked as a u32 bitmask, where bit i is set when condition i evaluates to true for any row. The combine operation is a single bitwise OR.
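
A minimal Python sketch of this bitmask scheme, including the anchor gating described under Behavior (illustrative only, not the extension's Rust code):

```python
def retention_update(mask, conds):
    """Set bit i when condition i is true for this row."""
    for i, cond in enumerate(conds):
        if cond:
            mask |= 1 << i
    return mask

def retention_combine(a, b):
    """Merging two partial states is a single bitwise OR."""
    return a | b

def retention_finalize(mask, n):
    """result[0] mirrors the anchor bit; result[i] (i > 0) requires both
    the anchor bit and bit i -- a never-seen anchor forces all-false."""
    anchor = bool(mask & 1)
    return [anchor] + [anchor and bool(mask & (1 << i)) for i in range(1, n)]

# The three-row example from this page: rows set bits 0 and 1 only.
rows = [(True, False, False), (False, True, False), (False, False, False)]
mask = 0
for conds in rows:
    mask = retention_update(mask, conds)
print(retention_finalize(mask, 3))  # [True, True, False]
```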

Operation  Complexity
Update     O(k) where k = number of conditions
Combine    O(1)
Finalize   O(k)
Space      O(1) -- a single u32 bitmask

At benchmark scale, retention combines 100 million states in 274 ms (365 Melem/s).

See Also

  • sessionize -- session assignment based on timestamp gaps
  • window_funnel -- conversion funnel step tracking within time windows

window_funnel

Aggregate function for conversion funnel analysis. Searches for the longest chain of sequential conditions within a time window. Returns the maximum funnel step reached.

Signature

window_funnel(window INTERVAL, timestamp TIMESTAMP,
              cond1 BOOLEAN, cond2 BOOLEAN [, ...]) -> INTEGER

window_funnel(window INTERVAL, mode VARCHAR, timestamp TIMESTAMP,
              cond1 BOOLEAN, cond2 BOOLEAN [, ...]) -> INTEGER

Parameters:

Parameter     Type       Description
window        INTERVAL   Maximum time window from the first step
mode          VARCHAR    Optional comma-separated mode string
timestamp     TIMESTAMP  Event timestamp
cond1..condN  BOOLEAN    Funnel step conditions (2 to 32)

Returns: INTEGER -- the number of matched funnel steps (0 to N). A return value of 0 means the entry condition was never satisfied.

Usage

Default Mode

SELECT user_id,
  window_funnel(INTERVAL '1 hour', event_time,
    event_type = 'page_view',
    event_type = 'add_to_cart',
    event_type = 'checkout',
    event_type = 'purchase'
  ) as furthest_step
FROM events
GROUP BY user_id;

With Mode String

SELECT user_id,
  window_funnel(INTERVAL '1 hour', 'strict_increase, strict_once',
    event_time,
    event_type = 'page_view',
    event_type = 'add_to_cart',
    event_type = 'purchase'
  ) as furthest_step
FROM events
GROUP BY user_id;

Behavior

  1. Events are sorted by timestamp.
  2. For each event matching the entry condition (cond1), a forward scan begins.
  3. The scan attempts to match cond2, cond3, ..., condN in order, subject to the time window constraint: each subsequent event must occur within window of the entry event.
  4. The maximum step reached across all entry points is returned.

A single event can advance multiple funnel steps in default mode. For example, if an event satisfies both cond2 and cond3, it advances the funnel by two steps in a single pass.
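
The default-mode scan can be sketched in Python. This is a simplified illustration under the assumptions that events are pre-sorted, timestamps are integer minutes, and the entry event itself only satisfies step 1 -- not the extension's actual algorithm:

```python
def window_funnel_steps(window, events):
    """Greedy default-mode scan: from each event satisfying the entry
    condition, match the remaining conditions in order within `window`.
    A single later event may satisfy several consecutive conditions
    and advance multiple steps.
    events: list of (timestamp, [cond1, ..., condN]) sorted by timestamp."""
    if not events:
        return 0
    n_conds = len(events[0][1])
    best = 0
    for i, (t0, conds0) in enumerate(events):
        if not conds0[0]:
            continue  # not an entry point
        step = 1
        for t, conds in events[i + 1:]:
            if t - t0 > window:
                break  # outside the window anchored at the entry event
            while step < n_conds and conds[step]:
                step += 1  # one event can advance several steps
        best = max(best, step)
    return best
```

With a 60-minute window and the 3-step funnel from the Example section below, the purchase at 11:30 falls outside the window and the result is 2.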

Example

Given events for a user with a 1-hour window and 3-step funnel:

event_time  page_view  add_to_cart  purchase
10:00       true       false        false
10:20       false      true         false
11:30       false      false        true

Result: 2

  • Step 1 matched at 10:00 (page_view).
  • Step 2 matched at 10:20 (add_to_cart, within 1 hour of 10:00).
  • Step 3 at 11:30 is outside the 1-hour window from the entry at 10:00.

Modes

Modes are independently combinable via a comma-separated string parameter. Each mode adds an additional constraint on top of the default greedy scan.

Mode                  Description
strict                If condition i fires, condition i-1 must not fire again before condition i+1. Prevents backwards movement.
strict_order          Events must satisfy conditions in exact sequential order. No events matching earlier conditions are allowed between matched steps.
strict_deduplication  Events with the same timestamp as the previously matched step are skipped for the next step.
strict_increase       Requires strictly increasing timestamps between consecutive matched steps. Same-timestamp events cannot advance the funnel.
strict_once           Each event can advance the funnel by at most one step, even if it satisfies multiple consecutive conditions.
allow_reentry         If the entry condition fires again after step 1, the funnel resets from that new entry point.

Mode Combinations

Modes can be combined freely:

-- Require strictly increasing timestamps and one step per event
window_funnel(INTERVAL '1 hour', 'strict_increase, strict_once',
  ts, cond1, cond2, cond3)

-- Strict order with reentry
window_funnel(INTERVAL '1 hour', 'strict_order, allow_reentry',
  ts, cond1, cond2, cond3)

Implementation

Events are collected during the update phase and sorted by timestamp during finalize. A greedy forward scan from each entry point finds the longest chain.

Operation  Complexity
Update     O(1) amortized (event append)
Combine    O(m) where m = events in other state
Finalize   O(n * k) where n = events, k = conditions
Space      O(n) -- all collected events

At benchmark scale, window_funnel processes 100 million events in 791 ms (126 Melem/s).

See Also

  • sequence_match -- event sequence pattern matching with a mini-regex syntax
  • retention -- cohort retention analysis using boolean conditions

sequence_match

Aggregate function that checks whether a sequence of events matches a pattern. Uses a mini-regex syntax over condition references, executed by an NFA engine.

Signature

sequence_match(pattern VARCHAR, timestamp TIMESTAMP,
               cond1 BOOLEAN, cond2 BOOLEAN [, ...]) -> BOOLEAN

Parameters:

Parameter     Type       Description
pattern       VARCHAR    Pattern string using the syntax described below
timestamp     TIMESTAMP  Event timestamp
cond1..condN  BOOLEAN    Event conditions (2 to 32)

Returns: BOOLEAN -- true if the event stream contains a subsequence matching the pattern, false otherwise.

Usage

-- Did the user view a product and then purchase?
SELECT user_id,
  sequence_match('(?1).*(?2)', event_time,
    event_type = 'view',
    event_type = 'purchase'
  ) as converted
FROM events
GROUP BY user_id;

Pattern Syntax

Patterns are composed of the following elements:

Pattern  Description
(?N)     Match an event where condition N (1-indexed) is true
.        Match exactly one event (any conditions)
.*       Match zero or more events (any conditions)
(?t>=N)  Time constraint: at least N seconds since previous match
(?t<=N)  Time constraint: at most N seconds since previous match
(?t>N)   Time constraint: more than N seconds since previous match
(?t<N)   Time constraint: less than N seconds since previous match
(?t==N)  Time constraint: exactly N seconds since previous match
(?t!=N)  Time constraint: not exactly N seconds since previous match

Pattern Examples

-- View then purchase (any events in between)
'(?1).*(?2)'

-- View then purchase with no intervening events
'(?1)(?2)'

-- View, exactly one event, then purchase
'(?1).(?2)'

-- View, then purchase within 1 hour
'(?1).*(?t<=3600)(?2)'

-- Three-step sequence with time constraints
'(?1).*(?t<=3600)(?2).*(?t<=7200)(?3)'

Behavior

  1. Events are sorted by timestamp.
  2. The pattern is compiled into an NFA (nondeterministic finite automaton).
  3. The NFA is executed against the event stream using lazy matching semantics for .* (prefer advancing the pattern over consuming additional events).
  4. Returns true if any accepting state is reached.

Time constraints are evaluated relative to the timestamp of the previously matched condition step. The time difference is computed in seconds, matching ClickHouse semantics.
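
A simplified Python sketch of these semantics, covering only the common pattern shape of (?N) steps separated by .* with an optional maximum-gap constraint per step. This is a hedged illustration with assumed data shapes; the extension compiles the full pattern grammar into an NFA instead:

```python
def sequence_match_simple(steps, events):
    """Match patterns shaped like '(?a).*(?t<=G)(?b).*(?c)'.
    steps:  list of (condition_index, max_gap_seconds_or_None), where the
            gap is measured from the previously matched step
    events: list of (timestamp, [cond1, cond2, ...]) sorted by timestamp."""
    step = 0
    prev_ts = None
    for ts, conds in events:
        idx, max_gap = steps[step]
        if not conds[idx]:
            continue
        if max_gap is not None and prev_ts is not None and ts - prev_ts > max_gap:
            continue  # time constraint violated; keep scanning
        prev_ts = ts
        step += 1
        if step == len(steps):
            return True
    return False

# '(?1).*(?t<=3600)(?2)': view, then purchase within one hour
hour_funnel = [(0, None), (1, 3600)]
```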

Implementation

Operation  Complexity
Update     O(1) amortized (event append)
Combine    O(m) where m = events in other state
Finalize   O(n * s) NFA execution, where n = events, s = pattern steps
Space      O(n) -- all collected events

At benchmark scale, sequence_match processes 100 million events in 1.05 s (95 Melem/s).

See Also

  • sequence_count -- counting non-overlapping occurrences of a pattern
  • sequence_match_events -- returning the timestamps of matched pattern steps

sequence_count

Aggregate function that counts the number of non-overlapping occurrences of a pattern in the event stream.

Signature

sequence_count(pattern VARCHAR, timestamp TIMESTAMP,
               cond1 BOOLEAN, cond2 BOOLEAN [, ...]) -> BIGINT

Parameters:

Parameter     Type       Description
pattern       VARCHAR    Pattern string (same syntax as sequence_match)
timestamp     TIMESTAMP  Event timestamp
cond1..condN  BOOLEAN    Event conditions (2 to 32)

Returns: BIGINT -- the number of non-overlapping matches of the pattern in the event stream.

Usage

-- Count how many times a user viewed then purchased
SELECT user_id,
  sequence_count('(?1)(?2)', event_time,
    event_type = 'view',
    event_type = 'purchase'
  ) as conversion_count
FROM events
GROUP BY user_id;

Behavior

  1. Events are sorted by timestamp.
  2. The pattern is compiled and executed using the NFA engine.
  3. Each time the pattern matches, the count increments and the NFA restarts from the event following the last matched event.
  4. Matches are non-overlapping: once a set of events is consumed by a match, those events cannot participate in another match.
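
The non-overlapping restart rule can be sketched for the adjacent pattern (?1)(?2) in pure Python. This is an illustration of the counting rule only, not the extension's NFA engine:

```python
def sequence_count_adjacent(events):
    """Count non-overlapping matches of '(?1)(?2)': a cond1 event
    immediately followed by a cond2 event. After a match, scanning
    restarts at the event AFTER the consumed pair.
    events: list of (cond1, cond2) flags in timestamp order."""
    count = 0
    i = 0
    while i + 1 < len(events):
        if events[i][0] and events[i + 1][1]:
            count += 1
            i += 2  # both events are consumed; matches cannot overlap
        else:
            i += 1
    return count
```

Applied to the five-event stream in the example on this page, the function counts the (10:00, 10:10) and (10:20, 10:30) pairs and returns 2.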

Example

Given events for a user with pattern (?1)(?2):

event_time  cond1 (view)  cond2 (purchase)
10:00       true          false
10:10       false         true
10:20       true          false
10:30       false         true
10:40       true          false

Result: 2

  • First match: events at 10:00 and 10:10.
  • Second match: events at 10:20 and 10:30.
  • The event at 10:40 has no subsequent cond2 event.

Pattern Syntax

Uses the same pattern syntax as sequence_match. Refer to that page for the full syntax reference.

Implementation

Operation  Complexity
Update     O(1) amortized (event append)
Combine    O(m) where m = events in other state
Finalize   O(n * s) NFA execution, where n = events, s = pattern steps
Space      O(n) -- all collected events

At benchmark scale, sequence_count processes 100 million events in 1.18 s (85 Melem/s).

See Also

  • sequence_match -- boolean pattern matching over event sequences
  • sequence_match_events -- returning the timestamps of matched pattern steps

sequence_match_events

Aggregate function that returns the timestamps of each matched condition step in a pattern. This is the diagnostic counterpart to sequence_match: instead of returning a boolean, it returns the actual timestamps where each (?N) step was satisfied.

Signature

sequence_match_events(pattern VARCHAR, timestamp TIMESTAMP,
                      cond1 BOOLEAN, cond2 BOOLEAN [, ...]) -> LIST(TIMESTAMP)

Parameters:

Parameter     Type       Description
pattern       VARCHAR    Pattern string (same syntax as sequence_match)
timestamp     TIMESTAMP  Event timestamp
cond1..condN  BOOLEAN    Event conditions (2 to 32)

Returns: LIST(TIMESTAMP) -- a list of timestamps corresponding to each matched (?N) step in the pattern. Returns an empty list if no match is found.

Usage

-- Get the timestamps where view and purchase occurred
SELECT user_id,
  sequence_match_events('(?1).*(?2)', event_time,
    event_type = 'view',
    event_type = 'purchase'
  ) as match_timestamps
FROM events
GROUP BY user_id;
-- Returns e.g. [2024-01-01 10:00:00, 2024-01-01 11:30:00]

Behavior

  1. Events are sorted by timestamp.
  2. The pattern is compiled and executed using a timestamp-collecting variant of the NFA engine.
  3. When a (?N) condition step matches, the event's timestamp is recorded.
  4. If the full pattern matches, the collected timestamps are returned as a list.
  5. If no match is found, an empty list is returned.

The returned list contains one timestamp per (?N) step in the pattern. Wildcard steps (. and .*) and time constraint steps do not contribute timestamps to the output.
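
The collection rule can be sketched for the common shape of (?N) steps separated by .* -- a hedged Python illustration with assumed data shapes, not the extension's timestamp-collecting NFA:

```python
def sequence_match_events_simple(step_conds, events):
    """Collect the timestamp of each matched (?N) step for patterns
    shaped like '(?1).*(?2)'.
    step_conds: list of condition indices, one per (?N) step
    events:     list of (timestamp, [cond1, ...]) sorted by timestamp
    Returns [] when the full pattern never matches; wildcard elements
    contribute no timestamps, matching the rule described above."""
    collected = []
    step = 0
    for ts, conds in events:
        if step < len(step_conds) and conds[step_conds[step]]:
            collected.append(ts)
            step += 1
    return collected if step == len(step_conds) else []
```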

Example

Given events for a user with pattern (?1).*(?2):

event_time  cond1 (view)  cond2 (purchase)
10:00       true          false
10:15       false         false
10:30       false         true

Result: [10:00, 10:30]

The timestamp of the (?1) match (10:00) and the (?2) match (10:30).

Pattern Syntax

Uses the same pattern syntax as sequence_match. Refer to that page for the full syntax reference.

Implementation

The NFA executor uses a separate NfaStateWithTimestamps type that tracks collected: Vec<i64> alongside the standard state index. This keeps the timestamp collection path separate from the performance-critical execute_pattern and count_pattern code paths.

Operation  Complexity
Update     O(1) amortized (event append)
Combine    O(m) where m = events in other state
Finalize   O(n * s) NFA execution, where n = events, s = pattern steps
Space      O(n) -- all collected events

At benchmark scale, sequence_match_events processes 100 million events in 1.07 s (93 Melem/s).

See Also

  • sequence_match -- boolean pattern matching over event sequences
  • sequence_count -- counting non-overlapping occurrences of a pattern

sequence_next_node

Aggregate function that returns the value of the next event after a matched sequential pattern. Implements ClickHouse's sequenceNextNode for flow analysis.

Signature

sequence_next_node(direction VARCHAR, base VARCHAR, timestamp TIMESTAMP,
                   event_column VARCHAR, base_condition BOOLEAN,
                   event1 BOOLEAN [, event2 BOOLEAN, ...]) -> VARCHAR

Parameters:

Parameter       Type       Description
direction       VARCHAR    'forward' or 'backward'
base            VARCHAR    'head', 'tail', 'first_match', or 'last_match'
timestamp       TIMESTAMP  Event timestamp
event_column    VARCHAR    Value column (returned as result)
base_condition  BOOLEAN    Condition for the base/anchor event
event1..eventN  BOOLEAN    Sequential event conditions (1 to 32)

Returns: VARCHAR (nullable) -- the value of the adjacent event after a successful sequential match, or NULL if no match or no adjacent event exists.

Direction

Controls which direction to scan for the next event:

Direction   Behavior
'forward'   Match events earliest-to-latest, return the event after the last matched step
'backward'  Match events latest-to-earliest, return the event before the earliest matched step

Base

Controls which starting point to use when multiple matches exist:

Base           Forward behavior                           Backward behavior
'head'         Start from the first base_condition event  Start from the first base_condition event
'tail'         Start from the last base_condition event   Start from the last base_condition event
'first_match'  Return the first complete match result     Return the first complete match result (scanning right-to-left)
'last_match'   Return the last complete match result      Return the last complete match result (scanning right-to-left)

Usage

-- What page do users visit after Home → Product?
SELECT user_id,
  sequence_next_node('forward', 'first_match', event_time, page,
    page = 'Home',        -- base_condition
    page = 'Home',        -- event1
    page = 'Product'      -- event2
  ) as next_page
FROM events
GROUP BY user_id;

-- What page did users come from before reaching Checkout?
SELECT user_id,
  sequence_next_node('backward', 'tail', event_time, page,
    page = 'Checkout',    -- base_condition
    page = 'Checkout'     -- event1
  ) as previous_page
FROM events
GROUP BY user_id;

-- Flow analysis: what happens after the first Home → Product → Cart sequence?
SELECT user_id,
  sequence_next_node('forward', 'first_match', event_time, page,
    page = 'Home',        -- base_condition
    page = 'Home',        -- event1
    page = 'Product',     -- event2
    page = 'Cart'         -- event3
  ) as next_after_cart
FROM events
GROUP BY user_id;

Behavior

  1. Events are sorted by timestamp.
  2. The function scans in the specified direction to find a sequential chain of events matching event1, event2, ..., eventN.
  3. The starting event must satisfy base_condition.
  4. For forward: returns the value of the event immediately after the last matched step.
  5. For backward: event1 matches at the starting position (later timestamp), then event2 matches at an earlier position, etc. Returns the value of the event immediately before the earliest matched step.
  6. Returns NULL if no complete match is found, or if no adjacent event exists.
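
The 'forward'/'head' case can be sketched in Python over a page-visit stream. The helper name, predicate-based conditions, and consecutive-match assumption are illustrative only, not the extension's implementation:

```python
def next_node_forward_head(pages, base, chain):
    """'forward'/'head' sketch: the starting event is the FIRST page
    satisfying `base`; the predicates in `chain` must then hold on
    consecutive events beginning at that start; return the value of the
    event immediately after the last matched step, or None when the
    chain fails or no adjacent event exists.
    pages: list of page values already in timestamp order."""
    for i, page in enumerate(pages):
        if not base(page):
            continue
        if all(i + j < len(pages) and chain[j](pages[i + j])
               for j in range(len(chain))):
            after = i + len(chain)
            return pages[after] if after < len(pages) else None
        return None  # 'head': only the first base event is considered
    return None      # base condition never satisfied

# Home -> Product, then what? For ['Home', 'Product', 'Cart', 'Checkout']
# the chain matches at index 0 and the next value is 'Cart'.
```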

Differences from ClickHouse

Aspect             ClickHouse                                                       duckdb-behavioral
Syntax             sequenceNextNode(direction, base)(ts, val, base_cond, ev1, ...)  sequence_next_node(direction, base, ts, val, base_cond, ev1, ...)
Function name      camelCase                                                        snake_case
Parameters         Two-level call syntax                                            Flat parameter list
Return type        Nullable(String)                                                 VARCHAR (nullable)
Experimental flag  Requires allow_experimental_funnel_functions = 1                 Always available

Implementation

Operation  Complexity
Update     O(1) amortized (event append)
Combine    O(m) where m = events in other state
Finalize   O(n * k) sequential scan, where n = events, k = event conditions
Space      O(n) -- all events stored (each includes an Arc<str> value)

Note: Unlike other event-collecting functions where the Event struct is Copy (16 bytes), sequence_next_node uses a dedicated NextNodeEvent struct (32 bytes) that stores an Arc<str> value per event. The Arc<str> enables O(1) clone via reference counting, which significantly reduces combine overhead compared to per-event deep string copying.

See Also

  • sequence_match -- boolean pattern matching over event sequences
  • window_funnel -- conversion funnel step tracking within time windows

Engineering Overview

This page provides a technical summary of the engineering decisions, quality standards, and design trade-offs that underpin duckdb-behavioral. It is intended for technical evaluators, contributors, and anyone seeking to understand the depth and rigor of this project.


Project Scope

duckdb-behavioral is a loadable extension for DuckDB that implements the complete set of behavioral analytics functions found in ClickHouse. The extension is written entirely in Rust, compiles to a shared library (.so / .dylib), and integrates with DuckDB via the C extension API.

The project spans several distinct engineering disciplines:

Discipline                     Scope
Systems programming            Rust FFI, raw C API callbacks, memory-safe aggregate state management, unsafe code confinement
Database internals             DuckDB's segment tree windowing, aggregate function lifecycle (init, update, combine, finalize, destroy), data chunk format
Algorithm design               NFA-based pattern matching, recursive descent parsing, greedy funnel search, bitmask-based retention analysis
Performance engineering        Cache-aware data structures, algorithmic complexity analysis, Criterion.rs benchmarking with confidence intervals, negative result documentation
Software quality               434 unit tests, 27 E2E tests, property-based testing (proptest), mutation testing (cargo-mutants, 88.4% kill rate), zero clippy warnings under pedantic lints
CI/CD and release engineering  Multi-platform builds (Linux x86/ARM, macOS x86/ARM), SemVer validation, artifact attestation, reproducible builds
Technical writing              mdBook documentation site, function reference pages, optimization history with measured data, ClickHouse compatibility matrix

Architecture

Separation of Concerns

The codebase enforces a strict separation between business logic and FFI integration:

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#ffffff', 'primaryTextColor': '#1a1a1a', 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'secondaryColor': '#f5f5f5', 'tertiaryColor': '#e0e0e0', 'textColor': '#1a1a1a', 'mainBkg': '#ffffff', 'clusterBkg': '#f5f5f5', 'clusterBorder': '#333333'}}}%%
graph TB
    subgraph "DuckDB Process"
        DDB[DuckDB Engine]
        SEG[Segment Tree<br/>Windowing]
    end

    subgraph "Extension Entry Point"
        EP[behavioral_init_c_api<br/>src/lib.rs]
    end

    subgraph "FFI Bridge — src/ffi/"
        REG[register_all_raw]
        FS[sessionize.rs]
        FR[retention.rs]
        FW[window_funnel.rs]
        FQ[sequence.rs]
        FE[sequence_match_events.rs]
        FN[sequence_next_node.rs]
    end

    subgraph "Business Logic — Pure Safe Rust"
        SS[sessionize.rs]
        SR[retention.rs]
        SW[window_funnel.rs]
        SQ[sequence.rs]
        SN[sequence_next_node.rs]
        subgraph "Pattern Engine"
            PP[parser.rs]
            PE[executor.rs]
        end
        subgraph "Common"
            EV[event.rs — 16-byte Copy]
            TS[timestamp.rs]
        end
    end

    DDB -->|LOAD extension| EP
    EP --> REG
    REG --> FS & FR & FW & FQ & FE & FN
    SEG -->|init/update/combine/finalize| FS & FR & FW & FQ & FE & FN
    FS --> SS
    FR --> SR
    FW --> SW
    FQ --> SQ
    FE --> SQ
    FN --> SN
    SQ --> PP & PE
    SW --> EV
    SQ --> EV
    SN --> EV

    style EP fill:#d9d9d9,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style DDB fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style SEG fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style REG fill:#f0f0f0,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style FS fill:#e0e0e0,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style FR fill:#e0e0e0,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style FW fill:#e0e0e0,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style FQ fill:#e0e0e0,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style FE fill:#e0e0e0,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style FN fill:#e0e0e0,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style SS fill:#f5f5f5,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style SR fill:#f5f5f5,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style SW fill:#f5f5f5,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style SQ fill:#f5f5f5,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style SN fill:#f5f5f5,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style PP fill:#ffffff,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style PE fill:#ffffff,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style EV fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style TS fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a

  • Business logic (src/*.rs, src/common/, src/pattern/): Pure safe Rust. No FFI types, no unsafe blocks. All algorithmic work -- pattern parsing, NFA execution, funnel search, retention bitmask logic -- lives here. This layer is independently testable via cargo test with no DuckDB dependency.

  • FFI bridge (src/ffi/): Contains all unsafe code. Each function has a dedicated FFI module implementing the five DuckDB aggregate callbacks (state_size, init, update, combine, finalize). Every unsafe block has a // SAFETY: comment documenting the invariants it relies on.

  • Entry point (src/lib.rs): A custom C entry point (behavioral_init_c_api) that bypasses the duckdb Rust crate's connection wrapper entirely, using duckdb_connect directly from the extension access struct. This eliminates struct layout fragility across DuckDB versions.

Why This Matters

This architecture enables:

  • Independent unit testing: Business logic tests run in < 1 second with no DuckDB instance. All 434 tests exercise Rust structs directly.
  • Safe evolution: Updating the DuckDB version only requires updating libduckdb-sys in Cargo.toml and re-running E2E tests. Business logic is decoupled from the database.
  • Auditable unsafe scope: The unsafe boundary is confined to src/ffi/ (6 files). Reviewers can audit the safety-critical code without reading the entire codebase.

Testing Philosophy

The Test Pyramid

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#ffffff', 'primaryTextColor': '#1a1a1a', 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'secondaryColor': '#f5f5f5', 'tertiaryColor': '#e0e0e0', 'textColor': '#1a1a1a', 'clusterBkg': '#f5f5f5', 'clusterBorder': '#333333'}}}%%
graph TB
    subgraph "Complementary Test Levels"
        L3["Mutation Testing<br/>88.4% kill rate (130/147)<br/>cargo-mutants"]
        L2["E2E Tests (27)<br/>Real DuckDB CLI, SQL execution<br/>Extension load, registration, results"]
        L1["Unit Tests (434)<br/>State lifecycle, edge cases, combine correctness<br/>Property-based (26 proptest), mutation-guided (51)"]
    end

    style L1 fill:#f5f5f5,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style L2 fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style L3 fill:#d9d9d9,stroke:#333333,stroke-width:2px,color:#1a1a1a

This project implements a rigorous multi-level testing strategy:

Level 1: Unit Tests (434 tests)

Organized by category within each module:

  • State lifecycle tests -- empty state, single update, multiple updates, finalize
  • Edge cases -- threshold boundaries, NULL handling, empty inputs, overflow at type boundaries (u32::MAX, i64::MIN)
  • Combine correctness -- empty-into-empty, empty-into-populated, populated-into-empty, associativity verification, configuration propagation
  • Property-based tests (26 proptest) -- algebraic properties required by DuckDB's segment tree: combine associativity, commutativity, identity element, idempotency, monotonicity
  • Mutation-testing-guided tests (51) -- tests written specifically to kill mutants that survived initial test suites

Level 2: E2E Tests (27 tests)

Integration tests against a real DuckDB CLI instance that validate the complete chain: extension loading, function registration, SQL execution, and result correctness. These tests caught three critical bugs that all 434 unit tests missed:

  1. A segmentation fault on extension load (incorrect pointer arithmetic)
  2. Six of seven functions silently failing to register (missing API call)
  3. window_funnel returning incorrect results (combine not propagating configuration)

Level 3: Mutation Testing (88.4% kill rate)

cargo-mutants systematically replaces operators, removes branches, and changes return values across the codebase. Of 147 generated mutants, 130 were caught by the test suite. The 17 survivors are documented and represent code paths where mutations produce semantically equivalent behavior (e.g., OR vs XOR on non-overlapping bitmasks).

Key Insight

Unit tests validate algorithmic correctness in isolation. E2E tests validate the integration boundary. Neither alone is sufficient. The three bugs found during E2E validation were entirely in the FFI layer -- a domain that unit tests by definition cannot exercise.


Performance Engineering

Methodology

Every performance claim in this project is backed by:

  • Criterion.rs benchmarks with 95% confidence intervals
  • Multiple runs (minimum 3) to establish baselines
  • Throughput reporting (elements per second) alongside wall clock time
  • Negative results documented honestly -- five attempted optimizations were measured, found to be regressions, reverted, and documented

Scale

Function Class                                Max Benchmark Scale   Constraint
O(1) state (sessionize, retention)            1 billion elements    Compute-bound; no per-event memory
Event-collecting (window_funnel, sequence_*)  100 million elements  1.6 GB working set at 16 bytes/event
String-carrying (sequence_next_node)          10 million elements   32 bytes/event with Arc<str> allocation

Optimization History

Fifteen rounds of measured optimization, each following a documented protocol:

  1. Establish baseline with 3 Criterion runs
  2. Implement one optimization per commit
  3. Measure before/after with confidence intervals
  4. Accept only when confidence intervals do not overlap
  5. Revert and document if the improvement is not statistically significant

Selected highlights:

Optimization                            Improvement   Scale
Event bitmask (Vec<bool> to u32)        5--13x        All event functions
In-place combine (O(N^2) to O(N))       Up to 2,436x  Combine operations at 10K states
NFA lazy matching (greedy to lazy)      1,961x        sequence_match at 1M events
Arc<str> (String to reference counted)  2.1--5.8x     sequence_next_node
NFA fast paths (linear scan)            39--61%       sequence_count

Five attempted optimizations were negative results:

  • LSD radix sort: 4.3x slower than pdqsort for 16-byte embedded-key structs
  • Branchless sessionize: 5--10% regression; branch predictor handles 90/10 patterns efficiently
  • String pool with index vectors: 10--55% slower than Arc<str> at most scales
  • Compiled pattern preservation: Caching parsed patterns did not improve performance
  • First-condition pre-check: Adding an early-exit branch for non-matching first conditions increased overhead

Full optimization history with confidence intervals: Performance


Domain Significance

Why Behavioral Analytics Matters

Behavioral analytics answers a class of questions that traditional SQL aggregations cannot express efficiently:

  • Sessionization: Given a stream of timestamped events, partition them into logical sessions based on inactivity gaps. This is fundamental to web analytics, mobile app analysis, and IoT telemetry.

  • Retention cohorts: Given users who first appeared at time T, what fraction returned at T+1, T+2, T+N? This is the core metric for SaaS businesses, subscription services, and any product measuring long-term engagement.

  • Conversion funnels: Given a multi-step process (browse, add to cart, checkout, purchase), how far does each user progress within a time window? This directly drives revenue optimization in e-commerce.

  • Event sequence patterns: Given a temporal sequence of heterogeneous events, does a specific pattern occur? How many times? What happened next? This generalizes to fraud detection, clinical event monitoring, user journey analysis, and process mining.

Who Needs This

| Audience | Application |
|---|---|
| Product analytics teams | Funnel analysis, retention metrics, session analysis, user journey mapping |
| Data engineers | Replacing external analytics services (Amplitude, Mixpanel) with in-database SQL |
| Security analysts | Detecting temporal attack patterns in log data |
| Clinical researchers | Analyzing sequential treatment events in patient data |
| DevOps/SRE | Sessionizing log streams, identifying incident escalation patterns |
| Academic researchers | Process mining, sequential pattern mining on large datasets |

Scale Context

These problems are inherently large-scale. A mid-sized SaaS application generates millions of events per day. An e-commerce platform during a sale event can produce hundreds of millions of events per hour. The ability to process one billion sessionize events in 1.20 seconds on a single core means these analyses can run as interactive queries rather than batch jobs.


Implementation Challenges

DuckDB Aggregate Function Registration

DuckDB's Rust crate does not provide high-level aggregate function registration. This project uses the raw C API (libduckdb-sys) directly, implementing five callback functions per aggregate (plus a destroy callback that frees the state):

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#f5f5f5', 'primaryTextColor': '#1a1a1a', 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'secondaryColor': '#e8e8e8', 'tertiaryColor': '#e0e0e0', 'textColor': '#1a1a1a', 'noteBkgColor': '#e8e8e8', 'noteTextColor': '#1a1a1a', 'noteBorderColor': '#333333', 'labelTextColor': '#1a1a1a', 'transitionColor': '#333333'}}}%%
stateDiagram-v2
    [*] --> state_size: DuckDB queries state byte size

    state_size --> init: Allocate state buffer
    init --> update: Process input rows

    state "update" as update {
        [*] --> ProcessRow
        ProcessRow --> AccumulateState
        AccumulateState --> ProcessRow: Next row
    }

    update --> combine: Segment tree merge
    note right of combine
        Called O(n log n) times.
        Target state is zero-initialized.
        Must propagate ALL config fields.
    end note

    combine --> finalize: Produce result
    finalize --> destroy: Free resources
    destroy --> [*]

| Callback | Role |
|---|---|
| state_size | Returns the byte size of the aggregate state |
| init | Initializes a new state (called by DuckDB's segment tree) |
| update | Processes one row of input into the state |
| combine | Merges two states (called O(n log n) times by segment tree) |
| finalize | Produces the final result from the accumulated state |

Because duckdb_aggregate_function_set_varargs does not exist, each function that accepts a variable number of boolean conditions (retention, window_funnel, sequence_match, sequence_count, sequence_match_events, sequence_next_node) must register 31 overloads (2--32 parameters) via a function set.

NFA Pattern Engine

The sequence functions use a custom NFA (Nondeterministic Finite Automaton) pattern engine:

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#ffffff', 'primaryTextColor': '#1a1a1a', 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'secondaryColor': '#f5f5f5', 'tertiaryColor': '#e0e0e0', 'textColor': '#1a1a1a'}}}%%
flowchart LR
    SQL["Pattern String<br/>'(?1).*(?t<=3600)(?2)'"] --> PARSE["Recursive Descent<br/>Parser"]
    PARSE --> STEPS["CompiledPattern<br/>Condition(1), AnyEvents,<br/>TimeConstraint(<=, 3600),<br/>Condition(2)"]
    STEPS --> CLASS{Classify}
    CLASS -->|"All adjacent"| FAST1["O(n) Sliding<br/>Window"]
    CLASS -->|"Wildcard-separated"| FAST2["O(n) Linear<br/>Scan"]
    CLASS -->|"Time constraints<br/>or mixed"| NFA["NFA<br/>Backtracking"]

    style SQL fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style PARSE fill:#f5f5f5,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style STEPS fill:#e0e0e0,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style CLASS fill:#ffffff,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style FAST1 fill:#f5f5f5,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style FAST2 fill:#f5f5f5,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style NFA fill:#d9d9d9,stroke:#333333,stroke-width:2px,color:#1a1a1a
  • A recursive descent parser that compiles pattern strings (e.g., (?1).*(?t<=3600)(?2)) into an intermediate representation of typed steps
  • An NFA executor that evaluates the pattern against a sorted event stream using lazy backtracking with optional time constraints
  • Fast-path classifiers that detect common pattern shapes (adjacent conditions, wildcard-separated conditions) and dispatch to specialized O(n) linear scans, avoiding the full NFA for the majority of real-world patterns
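The classification step can be sketched in a few lines of plain Rust. This is an illustrative reconstruction, not the actual code: the `Step` and `Shape` types here stand in for the real `PatternStep`/classifier machinery in `src/pattern/`, and time constraints are simplified to a single variant.

```rust
// Hypothetical sketch of the fast-path classifier idea.
#[derive(Debug, PartialEq)]
#[allow(dead_code)]
enum Step {
    Cond(usize), // (?N)
    AnyEvents,   // .*
    OneEvent,    // .
    Time(i64),   // (?t<=N) and friends, collapsed to one variant here
}

#[derive(Debug, PartialEq)]
enum Shape {
    Adjacent,          // (?1)(?2)(?3)      -> O(n) sliding window
    WildcardSeparated, // (?1).*(?2).*(?3)  -> O(n) linear scan
    Complex,           // everything else   -> full NFA with backtracking
}

fn classify(steps: &[Step]) -> Shape {
    if steps.iter().all(|s| matches!(s, Step::Cond(_))) {
        return Shape::Adjacent;
    }
    // Wildcard-separated: conditions alternating with `.*`, starting
    // and ending on a condition.
    let wildcard = steps.len() % 2 == 1
        && steps.iter().enumerate().all(|(i, s)| match i % 2 {
            0 => matches!(s, Step::Cond(_)),
            _ => matches!(s, Step::AnyEvents),
        });
    if wildcard {
        Shape::WildcardSeparated
    } else {
        Shape::Complex
    }
}

fn main() {
    use Step::*;
    assert_eq!(classify(&[Cond(1), Cond(2)]), Shape::Adjacent);
    assert_eq!(classify(&[Cond(1), AnyEvents, Cond(2)]), Shape::WildcardSeparated);
    assert_eq!(classify(&[Cond(1), Time(3600), Cond(2)]), Shape::Complex);
}
```

Any pattern containing a time constraint falls through to the `Complex` arm, which is why the NFA remains the general-purpose backstop.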

Combine Semantics in DuckDB's Segment Tree

DuckDB's windowing and grouping machinery uses a segment tree that creates fresh zero-initialized states and combines source states into them. This creates a subtle correctness requirement: combine must propagate all configuration fields (window size, mode flags, pattern string, direction/base parameters), not just the accumulated data. Missing this caused silent incorrect results that passed all unit tests but failed E2E validation.
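The requirement can be illustrated with a minimal sketch. The field names below (`window_size_us`, `events`) follow the prose; the real state carries more configuration than this, so treat it as a shape, not the implementation:

```rust
// Minimal sketch of the combine-propagation requirement, assuming a
// simplified state struct.
#[derive(Default)]
struct FunnelState {
    window_size_us: i64,     // configuration (zero in a fresh segment-tree state)
    events: Vec<(i64, u32)>, // accumulated data: (timestamp_us, condition bitmask)
}

impl FunnelState {
    /// Merge `source` into `self`. Because `self` may be a fresh
    /// zero-initialized target created by the segment tree, configuration
    /// must be copied along with the event data.
    fn combine_in_place(&mut self, source: &FunnelState) {
        if self.window_size_us == 0 {
            self.window_size_us = source.window_size_us; // propagate config
        }
        self.events.extend_from_slice(&source.events); // append data
    }
}

fn main() {
    let source = FunnelState { window_size_us: 3_600_000_000, events: vec![(0, 1)] };
    let mut target = FunnelState::default(); // fresh segment-tree target
    target.combine_in_place(&source);
    // Without the config propagation, window_size_us would still be 0 here
    // and finalize would silently produce wrong results.
    assert_eq!(target.window_size_us, 3_600_000_000);
    assert_eq!(target.events.len(), 1);
}
```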


Quality Standards

| Metric | Value |
|---|---|
| Unit tests | 434 |
| Doc-tests | 1 |
| E2E tests | 27 |
| Property-based tests | 26 (proptest) |
| Mutation-guided tests | 51 |
| Mutation kill rate | 88.4% (130/147) |
| Clippy warnings | 0 (pedantic + nursery + cargo) |
| Unsafe block count | Confined to src/ffi/ (6 files) |
| MSRV | Rust 1.80 |
| Criterion benchmark files | 7 |
| Max benchmark scale | 1 billion elements |
| CI jobs | 13 (check, test, clippy, fmt, doc, MSRV, bench, deny, semver, coverage, cross-platform, extension-build) |
| Documented negative results | 5 (radix sort, branchless, string pool, compiled pattern, first-condition pre-check) |
| ClickHouse parity | Complete (7/7 functions, all modes, 32 conditions) |

Technology Stack

| Layer | Technology | Purpose |
|---|---|---|
| Language | Rust (stable, MSRV 1.80) | Memory safety, zero-cost abstractions, unsafe confinement |
| Database | DuckDB 1.4.4 | Analytical SQL engine, segment tree windowing |
| FFI | libduckdb-sys (C API) | Raw aggregate function registration |
| Benchmarking | Criterion.rs | Statistical benchmarking with confidence intervals |
| Property testing | proptest | Algebraic property verification |
| Mutation testing | cargo-mutants | Test suite effectiveness measurement |
| Documentation | mdBook | Static site generation for GitHub Pages |
| CI/CD | GitHub Actions | Multi-platform build, test, release, attestation |
| Dependency audit | cargo-deny | License compliance, advisory database checks |
| SemVer validation | cargo-semver-checks | API compatibility verification |

Architecture

This page describes the internal architecture of the duckdb-behavioral extension, including the module structure, design decisions, and the FFI bridge between Rust business logic and DuckDB's C API.

Module Structure

src/
  lib.rs                       Custom C entry point (behavioral_init_c_api)
  common/
    mod.rs
    event.rs                   Event type (u32 bitmask, Copy, 16 bytes)
    timestamp.rs               Interval-to-microseconds conversion
  pattern/
    mod.rs
    parser.rs                  Recursive descent parser for pattern strings
    executor.rs                NFA-based pattern matcher with fast paths
  sessionize.rs                Session boundary tracking (O(1) combine)
  retention.rs                 Bitmask-based cohort retention (O(1) combine)
  window_funnel.rs             Greedy forward scan with combinable mode bitflags
  sequence.rs                  Sequence match/count/events state management
  sequence_next_node.rs        Next event value after pattern match (Arc<str>)
  ffi/
    mod.rs                     register_all_raw() dispatcher
    sessionize.rs              FFI callbacks for sessionize
    retention.rs               FFI callbacks for retention
    window_funnel.rs           FFI callbacks for window_funnel
    sequence.rs                FFI callbacks for sequence_match + sequence_count
    sequence_match_events.rs   FFI callbacks for sequence_match_events
    sequence_next_node.rs      FFI callbacks for sequence_next_node

Design Principles

Pure Rust Core with FFI Bridge

All business logic lives in the top-level modules (sessionize.rs, retention.rs, window_funnel.rs, sequence.rs, sequence_next_node.rs, pattern/) with zero FFI dependencies. These modules are fully testable in isolation using standard Rust unit tests.

The ffi/ submodules handle DuckDB C API registration exclusively. This separation ensures that:

  • Business logic can be tested without a DuckDB instance.
  • The unsafe FFI boundary is confined to a single directory.
  • Every unsafe block has a // SAFETY: comment documenting its invariants.

Aggregate Function Lifecycle

DuckDB requires five callbacks for aggregate function registration:

| Callback | Purpose |
|---|---|
| state_size | Returns the byte size of the aggregate state |
| state_init | Initializes a new state (heap-allocates the Rust struct) |
| state_update | Processes input rows, updating the state |
| state_combine | Merges two states (used by DuckDB's segment tree) |
| state_finalize | Produces the output value from a state |

Each FFI module wraps a FfiState struct containing a pointer to the heap-allocated Rust state. The state_destroy callback reclaims memory via Box::from_raw.
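The heap-state pattern can be sketched in plain Rust. Names are simplified (the real callbacks receive raw DuckDB state pointers, not `&mut` references), but the `Box::into_raw` / `Box::from_raw` pairing is the core idea:

```rust
// Illustrative sketch of the FfiState wrapper pattern.
struct RustState {
    count: u64,
}

/// What DuckDB sees: a fixed-size struct holding a pointer to the
/// heap-allocated Rust state.
#[repr(C)]
struct FfiState {
    inner: *mut RustState,
}

fn state_init(ffi: &mut FfiState) {
    // Heap-allocate the real state and store the raw pointer.
    ffi.inner = Box::into_raw(Box::new(RustState { count: 0 }));
}

fn state_update(ffi: &mut FfiState) {
    // SAFETY: `inner` was produced by Box::into_raw in state_init.
    unsafe { (*ffi.inner).count += 1 };
}

fn state_destroy(ffi: &mut FfiState) {
    // SAFETY: reclaims the allocation exactly once; Box's Drop frees it.
    unsafe { drop(Box::from_raw(ffi.inner)) };
    ffi.inner = std::ptr::null_mut();
}

fn main() {
    let mut s = FfiState { inner: std::ptr::null_mut() };
    state_init(&mut s);
    state_update(&mut s);
    state_update(&mut s);
    // SAFETY: initialized by state_init above.
    assert_eq!(unsafe { (*s.inner).count }, 2);
    state_destroy(&mut s);
}
```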

Function Sets for Variadic Signatures

DuckDB does not provide a duckdb_aggregate_function_set_varargs API. To support variable numbers of boolean condition parameters (2 to 32), each function is registered as a function set containing 31 overloads, one for each arity.

This is necessary for retention, window_funnel, sequence_match, sequence_count, sequence_match_events, and sequence_next_node. The sessionize function has a fixed two-parameter signature and does not require a function set.

Custom C Entry Point

The extension uses a hand-written behavioral_init_c_api entry point that obtains duckdb_connection directly via duckdb_connect from the duckdb_extension_access struct. This bypasses duckdb::Connection entirely, eliminating fragile struct layout assumptions that caused SEGFAULTs in earlier versions.

Only libduckdb-sys (with the loadable-extension feature) is needed at runtime. The duckdb crate is used only in #[cfg(test)] for unit tests.

Event Representation

The Event struct is shared across window_funnel, sequence_match, sequence_count, and sequence_match_events:

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Event {
    pub timestamp_us: i64,   // 8 bytes
    pub conditions: u32,     // 4 bytes
    // 4 bytes padding (alignment to 8)
}                            // Total: 16 bytes

Design choices:

  • u32 bitmask instead of Vec<bool> eliminates per-event heap allocation and enables O(1) condition checks via bitwise AND. Supports up to 32 conditions, matching ClickHouse's limit.
  • Copy semantics enable zero-cost event duplication in combine operations.
  • 16 bytes packs four events per 64-byte cache line, maximizing spatial locality during sorting and scanning.
  • Expanding from u8 (8 conditions) to u32 (32 conditions) was a zero-cost change: the struct is 16 bytes in both cases due to alignment padding from the i64 field.
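These properties are easy to verify in a self-contained sketch that mirrors the struct above; the `condition` helper is illustrative, not the extension's actual API:

```rust
// Sketch mirroring the Event struct, showing the O(1) bitmask check.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Event {
    pub timestamp_us: i64,
    pub conditions: u32,
}

impl Event {
    /// O(1) check: is 1-indexed condition `n` true for this event?
    fn condition(&self, n: usize) -> bool {
        self.conditions & (1u32 << (n - 1)) != 0
    }
}

fn main() {
    // Alignment padding from the i64 field keeps the struct at 16 bytes,
    // so four events fit in one 64-byte cache line.
    assert_eq!(std::mem::size_of::<Event>(), 16);

    // Conditions 1 and 3 true, stored as bits 0 and 2.
    let e = Event { timestamp_us: 0, conditions: 0b101 };
    assert!(e.condition(1) && !e.condition(2) && e.condition(3));
}
```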

Pattern Engine

The pattern engine (src/pattern/) compiles pattern strings into a structured AST and executes them via an NFA.

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#ffffff', 'primaryTextColor': '#1a1a1a', 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'secondaryColor': '#f5f5f5', 'tertiaryColor': '#e0e0e0', 'textColor': '#1a1a1a'}}}%%
flowchart LR
    SQL["SQL Pattern String<br/>'(?1).*(?t<=3600)(?2)'"]
    PARSE["Recursive Descent<br/>Parser"]
    CP["CompiledPattern<br/>Vec of PatternStep"]
    CLASS["Pattern<br/>Classifier"]

    SQL --> PARSE --> CP --> CLASS

    CLASS -->|Adjacent| ADJ["O(n) Sliding Window"]
    CLASS -->|Wildcard-separated| WC["O(n) Linear Scan"]
    CLASS -->|Complex| NFA["NFA Backtracking"]

    ADJ --> RES["MatchResult"]
    WC --> RES
    NFA --> RES

    style SQL fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style PARSE fill:#f5f5f5,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style CP fill:#e0e0e0,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style CLASS fill:#ffffff,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style ADJ fill:#f5f5f5,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style WC fill:#f5f5f5,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style NFA fill:#d9d9d9,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style RES fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a

Parser

The recursive descent parser (parser.rs) converts pattern strings like (?1).*(?t<=3600)(?2) into a CompiledPattern containing a sequence of PatternStep variants:

  • Condition(usize) -- match an event where condition N is true
  • AnyEvents -- match zero or more events (.*)
  • OneEvent -- match exactly one event (.)
  • TimeConstraint(TimeOp, i64) -- time constraint between steps

NFA Executor

The executor (executor.rs) classifies patterns at execution time and dispatches to specialized code paths:

  • Adjacent conditions ((?1)(?2)(?3)): O(n) sliding window scan
  • Wildcard-separated ((?1).*(?2).*(?3)): O(n) single-pass linear scan
  • Complex patterns: Full NFA with backtracking

Two execution modes are supported:

| Mode | Function | Returns |
|---|---|---|
| execute_pattern | sequence_match / sequence_count | MatchResult (match + count) |
| execute_pattern_events | sequence_match_events | Vec<i64> (matched timestamps) |

The NFA uses lazy matching for .*: it prefers advancing the pattern over consuming additional events. This is critical for performance -- greedy matching causes O(n^2) behavior at scale because the NFA consumes all events before backtracking to try advancing the pattern.
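The lazy principle is easiest to see in the wildcard-separated fast path. The sketch below is illustrative (the real executor handles arbitrary patterns and time constraints): for a pattern like `(?1).*(?2).*(?3)`, advance the pattern as soon as the next required condition appears, in a single O(n) pass, instead of consuming the whole stream and backtracking.

```rust
// Lazy single-pass match for a wildcard-separated condition sequence.
// `pattern` lists 1-indexed condition numbers; events carry bitmasks.
fn lazy_match(pattern: &[u32], events: &[(i64, u32)]) -> bool {
    let mut step = 0;
    for &(_ts, mask) in events {
        if step < pattern.len() && mask & (1u32 << (pattern[step] - 1)) != 0 {
            step += 1; // prefer advancing the pattern over consuming events
        }
    }
    step == pattern.len()
}

fn main() {
    // Events carrying conditions 1, 2 (noise), 2, 3 as bitmasks.
    let events = [(0, 0b001), (1, 0b010), (2, 0b010), (3, 0b100)];
    assert!(lazy_match(&[1, 2, 3], &events));
    assert!(!lazy_match(&[3, 1], &events)); // order matters
}
```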

Combine Strategy

DuckDB's segment tree windowing calls combine O(n log n) times. The combine implementation is the dominant cost for event-collecting functions.

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#ffffff', 'primaryTextColor': '#1a1a1a', 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'secondaryColor': '#f5f5f5', 'tertiaryColor': '#e0e0e0', 'textColor': '#1a1a1a', 'clusterBkg': '#f5f5f5', 'clusterBorder': '#333333'}}}%%
flowchart TB
    subgraph "DuckDB Segment Tree"
        R["Root State"]
        L1["State A"]
        L2["State B"]
        L3["State C"]
        L4["State D"]
        R --- L1 & L2
        L1 --- L3 & L4
    end

    subgraph "Combine Operation"
        T["Target<br/>(zero-initialized)"]
        S["Source State"]
        T -->|"extend events<br/>propagate config"| M["Merged State"]
        S -->|"append events"| M
    end

    subgraph "Finalize"
        M -->|"sort events"| SORT["Sorted Events"]
        SORT -->|"scan / match"| OUT["Result"]
    end

    style R fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style L1 fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style L2 fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style L3 fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style L4 fill:#e8e8e8,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style T fill:#f0f0f0,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style S fill:#f0f0f0,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style M fill:#d9d9d9,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style SORT fill:#e0e0e0,stroke:#333333,stroke-width:2px,color:#1a1a1a
    style OUT fill:#f5f5f5,stroke:#333333,stroke-width:2px,color:#1a1a1a

| Function | Combine Strategy | Complexity |
|---|---|---|
| sessionize | Compare boundary timestamps | O(1) |
| retention | Bitwise OR of bitmasks | O(1) |
| window_funnel | In-place event append | O(m) |
| sequence_* | In-place event append | O(m) |
| sequence_next_node | In-place event append | O(m) |

The in-place combine (combine_in_place) extends self.events directly instead of allocating a new Vec, reducing a left-fold chain from O(N^2) total copies to O(N) amortized. Events are sorted once during finalize rather than at every combine, so the same growing prefix is never re-sorted.
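The difference between the two strategies can be sketched with plain integers standing in for events; the function names here are illustrative:

```rust
// Allocates a new Vec and re-copies the whole accumulated prefix on
// every combine: O(N^2) total copies across a left-fold of N states.
fn combine_alloc(a: &[i64], b: &[i64]) -> Vec<i64> {
    let mut merged = Vec::with_capacity(a.len() + b.len());
    merged.extend_from_slice(a);
    merged.extend_from_slice(b);
    merged
}

// Copies each source exactly once: O(N) amortized. Sorting is deferred
// to finalize, so insertion order does not matter here.
fn combine_in_place(acc: &mut Vec<i64>, b: &[i64]) {
    acc.extend_from_slice(b);
}

fn main() {
    let states = vec![vec![3], vec![1], vec![2]];

    let folded = states.iter().fold(Vec::new(), |acc, s| combine_alloc(&acc, s));

    let mut in_place = Vec::new();
    for s in &states {
        combine_in_place(&mut in_place, s);
    }

    assert_eq!(folded, in_place); // same result, O(N) vs O(N^2) copies
}
```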

Sorting

Events are sorted by timestamp using sort_unstable_by_key (pdqsort):

  1. An O(n) presorted check (slice::windows(2)) detects already-sorted input and skips the sort entirely. This is the common case for ORDER BY queries.
  2. For unsorted input, pdqsort provides O(n log n) worst-case with excellent cache locality for 16-byte Copy types.

Stable sort is not used because same-timestamp event order has no defined semantics in ClickHouse's behavioral analytics functions.
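The whole strategy fits in a few lines; this is a sketch of the approach described above, not the extension's exact code:

```rust
// Finalize-time sort with a presorted fast path.
fn sort_events(events: &mut Vec<(i64, u32)>) {
    // slice::windows(2) yields adjacent pairs; if every pair is already
    // ordered (the common case for ORDER BY input), skip the sort.
    if events.windows(2).all(|w| w[0].0 <= w[1].0) {
        return;
    }
    // sort_unstable_by_key uses pdqsort: O(n log n) worst case and cache
    // friendly for 16-byte Copy elements. Stability is unnecessary here.
    events.sort_unstable_by_key(|e| e.0);
}

fn main() {
    let mut sorted: Vec<(i64, u32)> = vec![(1, 0), (2, 0), (3, 0)];
    sort_events(&mut sorted); // fast path: no sort performed

    let mut unsorted: Vec<(i64, u32)> = vec![(3, 0), (1, 0), (2, 0)];
    sort_events(&mut unsorted);
    assert_eq!(unsorted, vec![(1, 0), (2, 0), (3, 0)]);
}
```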

Performance

All performance claims in this project are backed by measured data from Criterion.rs benchmarks with 95% confidence intervals, validated across three or more consecutive runs. Improvements are accepted only when confidence intervals do not overlap.

Current Baseline

Measured on the development machine. Absolute numbers vary by hardware; relative improvements and algorithmic scaling are hardware-independent.

Headline Numbers

| Function | Scale | Wall Clock | Throughput |
|---|---|---|---|
| sessionize | 1 billion | 1.20 s | 830 Melem/s |
| retention (combine) | 100 million | 274 ms | 365 Melem/s |
| window_funnel | 100 million | 791 ms | 126 Melem/s |
| sequence_match | 100 million | 1.05 s | 95 Melem/s |
| sequence_count | 100 million | 1.18 s | 85 Melem/s |
| sequence_match_events | 100 million | 1.07 s | 93 Melem/s |
| sequence_next_node | 10 million | 546 ms | 18 Melem/s |

Per-Element Cost

| Function | Cost per Element | Bound |
|---|---|---|
| sessionize | 1.2 ns | CPU-bound (register-only loop) |
| retention | 2.7 ns | CPU-bound (bitmask OR) |
| window_funnel | 7.9 ns | Memory-bound at 100M (1.6 GB working set) |
| sequence_match | 10.5 ns | Memory-bound (NFA + event traversal) |
| sequence_count | 11.8 ns | Memory-bound (NFA restart overhead) |

Algorithmic Complexity

| Function | Update | Combine | Finalize | Space |
|---|---|---|---|---|
| sessionize | O(1) | O(1) | O(1) | O(1) |
| retention | O(k) | O(1) | O(k) | O(1) |
| window_funnel | O(1) | O(m) | O(n*k) | O(n) |
| sequence_match | O(1) | O(m) | O(n*s) | O(n) |
| sequence_count | O(1) | O(m) | O(n*s) | O(n) |
| sequence_match_events | O(1) | O(m) | O(n*s) | O(n) |
| sequence_next_node | O(1) | O(m) | O(n*k) | O(n) |

Where n = events, m = events in other state, k = conditions (up to 32), s = pattern steps.

Optimization History

Fifteen engineering sessions have produced the current performance profile. Each optimization below was measured independently with before/after Criterion data and non-overlapping confidence intervals (except where noted as negative results).

Event Bitmask Optimization

Replaced Vec<bool> per event with a u32 bitmask, eliminating per-event heap allocation and enabling Copy semantics. The Event struct shrank from 32+ bytes with heap allocation to 16 bytes with zero heap allocation.

| Function | Scale | Before | After | Speedup |
|---|---|---|---|---|
| window_funnel | 1,000 events | 18.6 us | 1.65 us | 11.3x |
| window_funnel | 100,000 events | 2.35 ms | 188 us | 12.5x |
| sequence_match | 10,000 events | 260 us | 28.0 us | 9.3x |

In-Place Combine

Replaced O(N^2) merge-allocate combine with O(N) in-place extend. Switched from stable sort (TimSort) to unstable sort (pdqsort). Eliminated redundant pattern cloning.

| Function | Scale | Before | After | Speedup |
|---|---|---|---|---|
| window_funnel_combine | 100 states | 10.9 us | 466 ns | 23x |
| window_funnel_combine | 1,000 states | 781 us | 3.03 us | 258x |
| window_funnel_combine | 10,000 states | 75.3 ms | 30.9 us | 2,436x |

NFA Lazy Matching

Fixed .* exploration order in the NFA executor. The NFA was using greedy matching (consume event first), causing O(n^2) backtracking. Switching to lazy matching (advance pattern first) reduced complexity to O(n * pattern_len).

| Function | Scale | Before | After | Speedup |
|---|---|---|---|---|
| sequence_match | 100 events | 750 ns | 454 ns | 1.76x |
| sequence_match | 1,000,000 events | 4.43 s | 2.31 ms | 1,961x |

Billion-Row Benchmarks

Expanded benchmarks to 100M for event-collecting functions and 1B for O(1)-state functions. Validated that all functions scale as documented with no hidden bottlenecks at extreme scale.

| Function | Scale | Wall Clock | Throughput |
|---|---|---|---|
| sessionize_update | 1 billion | 1.21 s | 824 Melem/s |
| retention_combine | 1 billion | 3.06 s | 327 Melem/s |
| window_funnel | 100 million | 898 ms | 111 Melem/s |
| sequence_match | 100 million | 1.08 s | 92.7 Melem/s |
| sequence_count | 100 million | 1.57 s | 63.5 Melem/s |

ClickHouse Feature Parity

Architectural refactoring session. Implemented combinable FunnelMode bitflags, three new window_funnel modes (strict_increase, strict_once, allow_reentry), multi-step advancement per event, and the sequence_match_events function. No performance optimization targets.

Presorted Detection + 32-Condition Support

Added an O(n) is_sorted check before sorting. When events arrive in timestamp order (the common case for ORDER BY queries), the O(n log n) sort is skipped entirely. The O(n) overhead for unsorted input is negligible.

Also expanded condition support from u8 (8 conditions) to u32 (32 conditions). This was a zero-cost change: the Event struct remains 16 bytes due to alignment padding from the i64 field.

Negative Results

Three optimization hypotheses were tested and all produced negative results. These are documented to prevent redundant investigation:

  • LSD Radix Sort: 8-bit radix, 8 passes. Measured 4.3x slower at 100M elements. The scatter pattern has poor cache locality for 16-byte structs. pdqsort's in-place partitioning is superior for embedded-key structs.

  • Branchless Sessionize: Replaced conditional branches with i64::from(bool) and max/min. Measured 5-10% slower. The branch predictor handles the 90/10 gap-exceeds-threshold pattern with near-perfect accuracy. CMOV adds fixed overhead that exceeds the near-zero cost of correctly predicted branches.

  • Retention Update Micro-optimization: Analysis showed the benchmark bottleneck is Vec<bool> allocation in the test harness, not the bitmask update loop. No optimization possible without changing the benchmark structure.

sequence_next_node Implementation

Implemented sequence_next_node with full direction (forward/backward) and base (head/tail/first_match/last_match) support. Uses a dedicated NextNodeEvent struct with per-event string storage (separate from the Copy Event struct). Simple sequential matching algorithm.

Added a dedicated benchmark (benches/sequence_next_node_bench.rs) measuring update + finalize throughput and combine operations at multiple scales. No performance optimization targets -- this session focused on feature completeness, achieving complete ClickHouse behavioral analytics parity.

Arc<str> Optimization

Replaced Option<String> with Option<Arc<str>> in NextNodeEvent, reducing struct size from 40 to 32 bytes and enabling O(1) clone via reference counting.

| Function | Scale | Improvement |
|---|---|---|
| sequence_next_node update+finalize | all scales | 2.1x-5.8x |
| sequence_next_node combine | all scales | 2.8x-6.4x |

A string pool attempt (PooledEvent with Copy semantics, 24 bytes) was measured 10-55% slower at most scales due to dual-vector overhead. The key insight: below one cache line (64 bytes), clone cost matters more than absolute struct size. Arc::clone (~1ns atomic increment) consistently beats String::clone (copies len bytes).

Also added a realistic cardinality benchmark using a pool of 100 distinct Arc<str> values, showing 35% improvement over unique strings at 1M events.
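The clone-cost argument is directly observable in standard-library Rust: cloning an `Arc<str>` bumps a reference count and shares the allocation, while cloning a `String` copies every byte.

```rust
use std::sync::Arc;

fn main() {
    let s: Arc<str> = Arc::from("purchase_confirmation_page");

    // O(1) clone: an atomic reference-count increment, no byte copy.
    let t = Arc::clone(&s);
    assert!(Arc::ptr_eq(&s, &t));     // both handles share one allocation
    assert_eq!(Arc::strong_count(&s), 2);

    // String::clone would instead copy all `len` bytes on every event
    // duplication performed by combine.
    drop(t);
    assert_eq!(Arc::strong_count(&s), 1);
}
```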

E2E Validation + Custom C Entry Point

Discovered and fixed three critical bugs that the full unit test suite (375 passing tests at the time; now 434) had missed:

  1. SEGFAULT on extension load: extract_raw_connection used incorrect pointer arithmetic. Replaced with a custom C entry point (behavioral_init_c_api) that obtains duckdb_connection via duckdb_connect directly from duckdb_extension_access.

  2. Silent function registration failure: duckdb_aggregate_function_set_name must be called on each function in a set. Without this, 6 of 7 functions silently failed to register.

  3. Incorrect combine propagation: combine_in_place did not propagate window_size_us or mode. DuckDB's segment tree creates fresh zero-initialized target states, so these fields remained zero at finalize time.

No performance optimization targets. 11 E2E tests added against real DuckDB v1.4.4 CLI.

NFA Fast Paths

Two Criterion-validated optimizations for the NFA pattern executor:

  1. NFA reusable stack: Pre-allocate the NFA state Vec once and reuse across all starting positions via clear(). Eliminates O(N) alloc/free pairs in sequence_count.

  2. Fast-path linear scan: Pattern classification dispatches common shapes to specialized O(n) scans:

    • Adjacent conditions ((?1)(?2)): sliding window scan
    • Wildcard-separated ((?1).*(?2)): single-pass step counter
    • Complex patterns: fall back to NFA

| Function | Scale | Improvement |
|---|---|---|
| sequence_count | 100-1M events | 33-47% (NFA reusable stack) |
| sequence_count | 100-1M events | 39-40% additional (fast-path linear scan) |
| sequence_count | 100-1M events | 56-61% combined improvement |

A first-condition pre-check was tested and produced negative results (0-8% regression). The NFA already handles non-matching starts efficiently (~5ns per check), so the extra branch in the hot loop costs more than it saves. Reverted.

Benchmark Inventory

| Benchmark | Function | Input Sizes |
|---|---|---|
| sessionize_update | sessionize | 100 to 1 billion |
| sessionize_combine | sessionize | 100 to 1 million |
| retention_update | retention | 4, 8, 16, 32 conditions |
| retention_combine | retention | 100 to 1 billion |
| window_funnel_finalize | window_funnel | 100 to 100 million |
| window_funnel_combine | window_funnel | 100 to 1 million |
| sequence_match | sequence_match | 100 to 100 million |
| sequence_count | sequence_count | 100 to 100 million |
| sequence_combine | sequence_* | 100 to 1 million |
| sequence_match_events | sequence_match_events | 100 to 100 million |
| sequence_match_events_combine | sequence_match_events | 100 to 1 million |
| sort_events | (isolated) | 100 to 100 million |
| sort_events_presorted | (isolated) | 100 to 100 million |
| sequence_next_node | sequence_next_node | 100 to 10 million |
| sequence_next_node_combine | sequence_next_node | 100 to 1 million |
| sequence_next_node_realistic | sequence_next_node | 1,000 to 1 million |

Reproduction

# Run all benchmarks
cargo bench

# Run a specific group
cargo bench -- sessionize
cargo bench -- window_funnel
cargo bench -- sequence_match_events

# Results stored in target/criterion/ for comparison

Criterion automatically compares against previous runs and reports confidence intervals. Run benchmarks three or more times before drawing conclusions from the data.

Methodology

  • Framework: Criterion.rs 0.8, 100 samples per benchmark, 95% CI
  • Large-scale benchmarks: 10 samples with 60-second measurement time for 100M and 1B element counts
  • Validation: Every claim requires three consecutive runs with non-overlapping confidence intervals
  • Negative results: Documented with the same rigor as positive results, including measured data and root cause analysis

Full optimization history with raw Criterion data, confidence intervals, and detailed analysis is maintained in PERF.md in the repository.

ClickHouse Compatibility

duckdb-behavioral implements behavioral analytics functions inspired by ClickHouse's parametric aggregate functions. This page documents the compatibility status, semantic differences, and remaining gaps.

Compatibility Matrix

| ClickHouse Function | duckdb-behavioral | Status |
|---|---|---|
| retention(cond1, cond2, ...) | retention(cond1, cond2, ...) | Complete |
| windowFunnel(window)(timestamp, cond1, ...) | window_funnel(window, timestamp, cond1, ...) | Complete |
| windowFunnel(window, 'strict')(...) | window_funnel(window, 'strict', ...) | Complete |
| windowFunnel(window, 'strict_order')(...) | window_funnel(window, 'strict_order', ...) | Complete |
| windowFunnel(window, 'strict_deduplication')(...) | window_funnel(window, 'strict_deduplication', ...) | Complete |
| windowFunnel(window, 'strict_increase')(...) | window_funnel(window, 'strict_increase', ...) | Complete |
| windowFunnel(window, 'strict_once')(...) | window_funnel(window, 'strict_once', ...) | Complete |
| windowFunnel(window, 'allow_reentry')(...) | window_funnel(window, 'allow_reentry', ...) | Complete |
| sequenceMatch(pattern)(timestamp, cond1, ...) | sequence_match(pattern, timestamp, cond1, ...) | Complete |
| sequenceCount(pattern)(timestamp, cond1, ...) | sequence_count(pattern, timestamp, cond1, ...) | Complete |
| N/A (duckdb-behavioral extension) | sequence_match_events(pattern, timestamp, cond1, ...) | Complete |
| sequenceNextNode(dir, base)(ts, val, base_cond, ev1, ...) | sequence_next_node(dir, base, ts, val, base_cond, ev1, ...) | Complete |

Syntax Differences

ClickHouse uses a two-level call syntax for parametric aggregate functions:

-- ClickHouse syntax
windowFunnel(3600)(timestamp, cond1, cond2, cond3)

-- duckdb-behavioral syntax
window_funnel(INTERVAL '1 hour', timestamp, cond1, cond2, cond3)

Key differences:

| Aspect | ClickHouse | duckdb-behavioral |
|---|---|---|
| Window parameter | Seconds as integer | DuckDB INTERVAL type |
| Mode parameter | Second argument in parameter list | Optional VARCHAR before timestamp |
| Function name | camelCase | snake_case |
| Session function | Not a built-in behavioral function | sessionize (window function) |
| Condition limit | 32 | 32 |

The sessionize function has no direct ClickHouse equivalent. ClickHouse provides session analysis through different mechanisms.

Semantic Compatibility

retention

Fully compatible. The anchor condition semantics match ClickHouse: result[0] reflects the anchor, result[i] requires both the anchor and condition i to have been true in the group.
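These semantics can be sketched with the bitmask representation the extension uses internally: update ORs condition bits into a per-group mask, combine ORs two masks, and finalize derives the retained flags. The `retention_finalize` name is illustrative.

```rust
// Derive the retention result from an accumulated condition bitmask.
// Bit i is set if 1-indexed condition i+1 ever fired in the group;
// combine between two states is simply `mask_a | mask_b`.
fn retention_finalize(mask: u32, k: usize) -> Vec<bool> {
    let anchor = mask & 1 != 0; // did the anchor condition fire?
    (0..k)
        // result[0] reflects the anchor alone; result[i] requires the
        // anchor AND condition i to have fired.
        .map(|i| anchor && mask & (1u32 << i) != 0)
        .collect()
}

fn main() {
    // Conditions 1 and 3 fired somewhere in the group (bits 0 and 2).
    assert_eq!(retention_finalize(0b101, 3), vec![true, false, true]);
    // Without the anchor, every position is false.
    assert_eq!(retention_finalize(0b110, 3), vec![false, false, false]);
}
```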

window_funnel

All six modes are implemented and match ClickHouse behavior:

  • Default: Greedy forward scan, multi-step advancement per event.
  • strict: Previous step condition must not refire before next step matches.
  • strict_order: No earlier conditions may fire between matched steps.
  • strict_deduplication: Same-timestamp events deduplicated per condition.
  • strict_increase: Strictly increasing timestamps required between steps.
  • strict_once: At most one step advancement per event.
  • allow_reentry: Entry condition resets the funnel mid-chain.

Modes are independently combinable (e.g., 'strict_increase, strict_once'), matching ClickHouse semantics.
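The default-mode scan can be sketched as follows. This is a deliberately simplified illustration (no strict flags, no deduplication, no combine handling) of the greedy forward scan with multi-step advancement over timestamp-sorted events:

```rust
// For each event satisfying condition 1, greedily advance through later
// conditions within `window_us`; one event may advance several steps.
// Events are (timestamp_us, condition bitmask); `k` is the step count.
fn window_funnel(window_us: i64, events: &[(i64, u32)], k: usize) -> usize {
    let mut best = 0;
    for (i, &(start_ts, start_mask)) in events.iter().enumerate() {
        if start_mask & 1 == 0 {
            continue; // a funnel attempt must begin at condition 1
        }
        let mut step = 1;
        for &(ts, mask) in &events[i..] {
            if ts - start_ts > window_us {
                break; // outside the time window for this attempt
            }
            // Multi-step advancement per event (default-mode behavior).
            while step < k && mask & (1u32 << step) != 0 {
                step += 1;
            }
        }
        best = best.max(step);
    }
    best
}

fn main() {
    // view -> cart -> checkout inside the window; purchase arrives too late.
    let events = [(0, 0b0001), (10, 0b0010), (20, 0b0100), (500, 0b1000)];
    assert_eq!(window_funnel(100, &events, 4), 3);
    assert_eq!(window_funnel(1_000, &events, 4), 4);
}
```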

sequence_match and sequence_count

Pattern syntax matches ClickHouse:

  • (?N) condition references (1-indexed)
  • . single event wildcard
  • .* zero-or-more event wildcard
  • (?t>=N), (?t<=N), (?t>N), (?t<N), (?t==N), (?t!=N) time constraints (seconds)

The NFA executor uses lazy matching for .*, consistent with ClickHouse semantics.

sequence_match_events

Analogous to ClickHouse's pattern matching with timestamp extraction. Returns the timestamps of each (?N) step as a LIST(TIMESTAMP).

sequence_next_node

Implements ClickHouse's sequenceNextNode for flow analysis. Uses simple sequential matching (not NFA patterns) with direction and base parameters.

The direction parameter ('forward' / 'backward') controls scan direction. The base parameter ('head' / 'tail' / 'first_match' / 'last_match') controls which starting point to use.

Uses a dedicated NextNodeEvent struct with per-event Arc<str> storage (separate from the Copy Event struct used by other functions).

Feature Parity

All ClickHouse behavioral analytics functions are implemented. There are no remaining feature gaps.

CI/CD

The project uses GitHub Actions for continuous integration, end-to-end testing, and release management. All workflows are defined in .github/workflows/.

Workflows

CI (ci.yml)

Runs on every push to main and every pull request. 13 independent jobs ensure code quality across multiple dimensions.

| Job | Purpose | Tool |
|---|---|---|
| check | Verify compilation | cargo check --all-targets |
| test | Run 434 unit tests + 1 doc-test | cargo test |
| clippy | Zero-warning lint enforcement | cargo clippy with -D warnings |
| fmt | Formatting verification | cargo fmt --check |
| doc | Documentation builds without warnings | cargo doc with -Dwarnings |
| msrv | Minimum Supported Rust Version (1.80) | cargo check with pinned toolchain |
| bench-compile | Benchmarks compile (no execution) | cargo bench --no-run |
| deny | License, advisory, and source auditing | cargo-deny |
| semver | Semver compatibility check | cargo-semver-checks |
| coverage | Code coverage reporting | cargo-tarpaulin + Codecov |
| cross-platform | Linux + macOS test matrix | cargo test on both OSes |
| extension-build | Community extension packaging | make configure && make release |

CodeQL (codeql.yml)

Runs GitHub's CodeQL static analysis for Rust on every push to main, every pull request, and on a weekly schedule (Monday 06:00 UTC). Uses the security-and-quality query suite for comprehensive coverage.

  • Triggers: push to main, PRs, weekly cron
  • Language: Rust
  • Action version: github/codeql-action v4.32.3 (SHA-pinned)
  • Permissions: security-events: write (required to upload SARIF results)

Prerequisite — Disable Default Setup:

This workflow uses CodeQL's "advanced setup" (explicit workflow file). GitHub does not allow both Default Setup and advanced setup to be active simultaneously. If Default Setup is enabled, the SARIF upload will fail with:

CodeQL analyses from advanced configurations cannot be processed when the default setup is enabled

The workflow includes a pre-flight check that detects this conflict and fails fast with actionable remediation steps. To resolve:

  1. Go to Settings → Code security → Code scanning → CodeQL analysis
  2. Click the menu → Disable CodeQL
  3. Or via CLI: gh api --method PATCH repos/OWNER/REPO/code-scanning/default-setup -f state=not-configured

E2E Tests (e2e.yml)

Runs on every push to main and every pull request. Builds the extension using the community extension Makefile and tests it against a real DuckDB instance.

Test coverage:

  • Extension loading verification
  • All 7 functions (sessionize, retention, window_funnel, sequence_match, sequence_count, sequence_match_events, sequence_next_node)
  • Mode parameters (strict_increase)
  • GROUP BY aggregation
  • Load test with 100K events across all aggregate functions
  • No-match and empty-input edge cases

Platforms tested: Linux x86_64, macOS ARM64

Release (release.yml)

Triggered on git tag push (v*) or manual dispatch. Builds the extension for 4 platform targets, runs SQL integration tests, and creates a GitHub release with SHA256 checksums and build provenance attestations.

Build targets:

| Platform | Runner | Target |
|---|---|---|
| Linux x86_64 | ubuntu-latest | x86_64-unknown-linux-gnu |
| Linux ARM64 | ubuntu-22.04 | aarch64-unknown-linux-gnu (cross-compiled) |
| macOS x86_64 | macos-latest | x86_64-apple-darwin |
| macOS ARM64 | macos-latest | aarch64-apple-darwin |

Supply chain security:

  • SHA256 checksums for all artifacts
  • GitHub artifact attestation via actions/attest-build-provenance@v2
  • Immutable artifacts with 30-day retention
  • Build provenance tied to specific git commit

Community Submission (community-submission.yml)

On-demand workflow for preparing updates to the extension listing in the DuckDB Community Extensions repository. The extension was accepted via PR #1306 (merged 2026-02-15). This workflow is used for subsequent version updates. Triggered via workflow_dispatch with a dry_run toggle.

Phases:

| Phase | Purpose |
|---|---|
| Validate | description.yml schema, version consistency (Cargo.toml vs description.yml), required files |
| Quality Gate | cargo test, cargo clippy, cargo fmt, cargo doc |
| Build & Test | make configure && make release && make test_release (community Makefile toolchain) |
| Pin Ref | Updates description.yml ref to the validated commit SHA (skipped in dry run) |
| Submission Package | Uploads description.yml artifact, generates step-by-step PR commands |

Usage:

# Dry run — validate everything without making changes
gh workflow run community-submission.yml -f dry_run=true

# Full run — validate, build, test, pin ref, generate submission package
gh workflow run community-submission.yml -f dry_run=false

After a full run, the workflow summary contains the exact gh CLI commands to create a branch in the duckdb/community-extensions fork, update the ref, and open a PR — ensuring deterministic, repeatable version updates.

Pages (pages.yml)

Deploys mdBook documentation to GitHub Pages on push to main. Uses mdBook v0.4.40 with custom CSS styling.

Reproducing CI Locally

# Run the same checks as CI
cargo check --all-targets
cargo test --all-targets && cargo test --doc
cargo clippy --all-targets -- -D warnings
cargo fmt --all -- --check
RUSTDOCFLAGS=-Dwarnings cargo doc --no-deps --document-private-items
cargo deny check advisories bans licenses sources

# Build extension (requires submodule)
git submodule update --init
make configure
make release

# Run SQL integration tests
make test_release

Security & Supply Chain

This page documents the security model and supply chain integrity measures for the duckdb-behavioral extension.

Versioning

This project follows Semantic Versioning (SemVer).

Version format: MAJOR.MINOR.PATCH (e.g., 0.1.0, 1.0.0, 1.2.3)

| Component | Bumped when |
|---|---|
| MAJOR | Breaking changes to SQL function signatures, return types, or behavior |
| MINOR | New functions, new modes, new pattern syntax operators |
| PATCH | Bug fixes, performance improvements, documentation updates |

Pre-1.0 policy: While the version is 0.x.y, MINOR bumps may include breaking changes per SemVer section 4.

Version sources: The version is maintained in three files that must stay in sync:

  • Cargo.toml (version = "X.Y.Z")
  • description.yml (version: X.Y.Z)
  • Git tag (vX.Y.Z)

The release workflow validates that all three match before building.
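The check the release workflow performs amounts to the following (a hypothetical helper, not the actual workflow code):

```rust
/// All three version sources must agree: Cargo.toml's version,
/// description.yml's version, and the git tag (which carries a `v` prefix).
fn versions_match(cargo_toml: &str, description_yml: &str, git_tag: &str) -> bool {
    git_tag.strip_prefix('v') == Some(cargo_toml) && cargo_toml == description_yml
}
```

A mismatch in any one of the three fails the release before artifacts are built.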

Dependency Audit

Runtime Dependencies

The extension has exactly one runtime dependency:

| Crate | Version | Purpose |
|---|---|---|
| libduckdb-sys | =1.4.4 | DuckDB C API bindings for aggregate function registration |

The version is pinned exactly (=1.4.4) to prevent silent dependency updates. The loadable-extension feature provides runtime function pointer stubs via global atomic statics.

Dev-Only Dependencies

These are used only in #[cfg(test)] modules and are NOT linked into the release extension binary:

| Crate | Purpose |
|---|---|
| duckdb | In-memory connection for unit tests |
| criterion | Benchmarking framework |
| proptest | Property-based testing |
| rand | Random data generation for benchmarks |

License Compliance

All dependencies are audited via cargo-deny in CI:

Allowed licenses: MIT, Apache-2.0, BSD-2-Clause, BSD-3-Clause,
CC0-1.0, CDLA-Permissive-2.0, ISC, Unicode-3.0, Zlib

Build Integrity

Reproducible Builds

Release builds use the following Cargo profile for deterministic output:

[profile.release]
opt-level = 3
lto = true          # Full link-time optimization
codegen-units = 1   # Single codegen unit
panic = "abort"     # No unwinding overhead
strip = true        # Strip debug symbols

Using codegen-units = 1 and lto = true reduces non-determinism from parallel compilation. The Cargo.lock file is committed to the repository to pin all transitive dependency versions.

Build Provenance

Release artifacts include:

  1. SHA256 checksums (SHA256SUMS.txt) for every release artifact
  2. GitHub artifact attestations via actions/attest-build-provenance@v2, which provides a cryptographic link between the artifact and the GitHub Actions build that produced it
  3. Immutable build logs in GitHub Actions with full command output

Verification

# Verify downloaded artifact checksum
sha256sum -c SHA256SUMS.txt

# Verify GitHub attestation (requires gh CLI)
gh attestation verify behavioral-v0.2.0-linux_amd64.tar.gz \
  --repo tomtom215/duckdb-behavioral

Code Safety

Unsafe Code Confinement

All unsafe code is confined to src/ffi/ modules. Business logic in src/sessionize.rs, src/retention.rs, src/window_funnel.rs, src/sequence.rs, src/sequence_next_node.rs, and src/pattern/ is 100% safe Rust.

Every unsafe block has a // SAFETY: documentation comment explaining why the invariants are upheld.

No Network Access

The extension makes zero network calls at runtime. It operates purely on data provided by DuckDB through the C API.

No File System Access

The extension does not read from or write to the file system. All state is managed in-memory through DuckDB's aggregate function lifecycle.

Extension Loading Security

DuckDB requires the -unsigned flag to load locally-built extensions. The community extension distribution path uses DuckDB's built-in extension signing and verification system, which the build system handles automatically.

Reporting Vulnerabilities

Report security issues via GitHub Issues at github.com/tomtom215/duckdb-behavioral/issues.

Benchmarking

This page documents the benchmark infrastructure, methodology, and reproduction instructions for the duckdb-behavioral extension.

Methodology

All benchmarks use Criterion.rs with the following configuration:

  • Sample size: 100 samples (default), reduced to 10 for inputs >= 100M
  • Confidence level: 95%
  • Measurement time: Extended to 30-60 seconds for large inputs
  • Outlier detection: Criterion's built-in outlier classification
  • Iterations: Criterion automatically determines iteration count
  • Warm-up: Criterion's default warm-up period

Results are reported with confidence intervals. A result is considered statistically significant only when 95% confidence intervals do not overlap.
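The significance rule reduces to a disjoint-interval check; a trivial sketch:

```rust
/// Two measurements differ significantly only when their 95% confidence
/// intervals are disjoint. Each interval is (lower, upper).
fn significantly_different(a: (f64, f64), b: (f64, f64)) -> bool {
    a.1 < b.0 || b.1 < a.0
}
```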

Benchmark Suite

Seven benchmark files cover all functions:

| Benchmark | Functions | Max Scale | Memory Limit |
|---|---|---|---|
| sessionize_bench | update, combine | 1 billion | O(1) state, no limit |
| retention_bench | update, combine | 1 billion | O(1) state, no limit |
| window_funnel_bench | finalize, combine | 100 million | 16 bytes/event |
| sequence_bench | match, count, combine | 100 million | 16 bytes/event |
| sort_bench | random, presorted | 100 million | 16 bytes/event + clone |
| sequence_next_node_bench | update, combine, realistic | 10 million | 32 bytes/event |
| sequence_match_events_bench | update, combine | 100 million | 16 bytes/event |

Scale Limits

Event-collecting functions store events in a Vec<Event> (16 bytes per event). At 100 million events, this requires 1.6 GB of memory. Criterion clones the data per iteration, requiring approximately 3.2 GB total. On a 21 GB system, 100 million is the practical maximum for event-collecting benchmarks.

O(1)-state functions (sessionize, retention) store no per-event data, enabling benchmarks up to 1 billion elements.

sequence_next_node uses 32-byte NextNodeEvent structs with Arc<str> string storage, further reducing the maximum feasible scale.
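The memory budget can be estimated with a back-of-the-envelope helper (assuming, as described above, that Criterion keeps roughly one clone of the base data live per iteration):

```rust
/// Rough peak memory needed to benchmark an event-collecting function:
/// the base Vec plus Criterion's per-iteration clone, so about 2x the
/// event count times the per-event size (16 bytes for the Copy Event).
fn bench_bytes(events: u64, bytes_per_event: u64) -> u64 {
    2 * events * bytes_per_event
}
```

At 100 million 16-byte events this yields the 3.2 GB figure quoted above.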

Running Benchmarks

# Run all benchmarks (may require several hours for full suite)
cargo bench

# Run a specific benchmark
cargo bench --bench sessionize_bench
cargo bench --bench retention_bench
cargo bench --bench window_funnel_bench
cargo bench --bench sequence_bench
cargo bench --bench sort_bench
cargo bench --bench sequence_next_node_bench
cargo bench --bench sequence_match_events_bench

# Run a specific benchmark group
cargo bench --bench sequence_bench -- sequence_match

# Compile benchmarks without running (for CI)
cargo bench --no-run

Reading Results

Criterion generates HTML reports in target/criterion/. Open target/criterion/report/index.html in a browser for interactive charts.

Command-line output includes:

sequence_match/100000000
    time:   [1.0521 s 1.0789 s 1.1043 s]
    thrpt:  [90.551 Melem/s 92.687 Melem/s 95.049 Melem/s]

The three values in brackets represent the lower bound, point estimate, and upper bound of the 95% confidence interval.
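Throughput is simply the element count divided by elapsed time, which is why the bracket order is mirrored between the two lines: the fastest time bound corresponds to the highest throughput bound. A sketch of the conversion:

```rust
/// Convert an elapsed time into Criterion-style throughput,
/// in millions of elements per second.
fn throughput_melem_per_s(elements: u64, seconds: f64) -> f64 {
    elements as f64 / seconds / 1e6
}
```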

Baseline Validation Protocol

Before any optimization work:

  1. Run cargo bench 3 times to establish a stable baseline
  2. Compare against the documented baseline in PERF.md
  3. Expect absolute numbers to differ across environments (CPU, memory, OS); only relative improvements are comparable between machines

After optimization:

  1. Run cargo bench 3 times with the optimization
  2. Compare 95% CIs against the pre-optimization baseline
  3. Accept only when CIs do not overlap (statistically significant)
  4. If CIs overlap, revert and document as a negative result

Benchmark Data Generation

Each benchmark generates synthetic data with realistic patterns:

  • Sessionize: Events every 5 minutes with a 33-minute gap every 10 events (simulates real session boundaries)
  • Retention: Rotating true conditions per event
  • Window funnel: Conditions fire for 1/3 of events in round-robin
  • Sequence: Alternating condition patterns for match/count testing
  • Sort: Reverse-order timestamps with jitter (worst case), and forward-order (best case for presorted detection)
  • Sequence next node: Unique strings per event (worst case) and pooled 100-value cardinality (realistic case)

Contributing

Guidelines for contributing to duckdb-behavioral.

Development Setup

Prerequisites

  • Rust 1.80+ (the project's MSRV)
  • A C compiler (for DuckDB system bindings)
  • DuckDB CLI v1.4.4 (for E2E testing)

Building

git clone https://github.com/tomtom215/duckdb-behavioral.git
cd duckdb-behavioral
cargo build

Running Tests

# Unit tests (434 tests + 1 doc-test, runs in <1 second)
cargo test

# Clippy (zero warnings required)
cargo clippy --all-targets

# Format check
cargo fmt -- --check

All three checks must pass before submitting changes.

Code Style

Lint Standards

The project enforces zero clippy warnings with pedantic, nursery, and cargo lint groups enabled. See Cargo.toml [lints.clippy] for the full configuration and allowed exceptions (FFI callbacks, analytics math casts).

Architecture

  • Pure Rust core: Business logic in top-level modules (sessionize.rs, retention.rs, etc.) with zero FFI dependencies.
  • FFI bridge: DuckDB C API registration confined to src/ffi/. Every unsafe block has a // SAFETY: comment.
  • All public items documented: Functions, structs, and modules must have doc comments.

Adding a New Function

  1. Create src/new_function.rs with state struct implementing new(), update(), combine(), combine_in_place(), and finalize().
  2. Create src/ffi/new_function.rs with FFI callbacks.
    • If using function sets, call duckdb_aggregate_function_set_name on each function in the set.
    • combine_in_place must propagate all configuration fields from the source state (not just events).
  3. Register in src/ffi/mod.rs register_all_raw().
  4. Add pub mod new_function; to src/lib.rs.
  5. Add a benchmark in benches/.
  6. Write unit tests in the module.
  7. E2E test: Build release, load in DuckDB CLI, verify SQL produces correct results.
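As a shape reference, the state half of step 1 might look like the following sketch (names, fields, and the finalize result are placeholders, not the extension's actual API):

```rust
/// Skeleton for a new aggregate state. The FFI layer in src/ffi/ would
/// drive these methods through DuckDB's aggregate function lifecycle.
#[derive(Default)]
struct NewFunctionState {
    window: i64,      // configuration, captured on update
    events: Vec<i64>, // collected per-event data
}

impl NewFunctionState {
    fn new() -> Self {
        Self::default()
    }

    /// Absorb one input row.
    fn update(&mut self, window: i64, timestamp: i64) {
        self.window = window;
        self.events.push(timestamp);
    }

    /// Merge a source state into self. Configuration fields must be
    /// propagated too, not just events: DuckDB combines sources into
    /// zero-initialized target states.
    fn combine_in_place(&mut self, source: &Self) {
        if self.window == 0 {
            self.window = source.window;
        }
        self.events.extend_from_slice(&source.events);
    }

    /// Produce the final aggregate value (here just the event count).
    fn finalize(&self) -> i64 {
        self.events.len() as i64
    }
}
```

Note how combining into a fresh default state still yields the source's configuration, mirroring the zero-initialized target combine pattern exercised by the unit tests.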

Testing Expectations

Unit Tests

  • Cover state lifecycle: empty state, single update, multiple updates, finalize.
  • Cover edge cases: threshold boundaries, NULL handling, empty inputs.
  • Cover combine correctness: empty combine, boundary detection, associativity.
  • Cover the DuckDB zero-initialized target combine pattern (DuckDB creates fresh states and combines source states into them).
  • Property-based tests (proptest) for algebraic properties where applicable.

E2E Tests

Unit tests validate business logic in isolation. E2E tests validate the FFI boundary, DuckDB's data chunk format, state lifecycle management, and the extension loading mechanism. Both levels are mandatory.

# Build release
cargo build --release

# Copy and append metadata
cp target/release/libbehavioral.so /tmp/behavioral.duckdb_extension
python3 extension-ci-tools/scripts/append_extension_metadata.py \
  -l /tmp/behavioral.duckdb_extension -n behavioral \
  -p linux_amd64 -dv v1.2.0 -ev v0.2.0 --abi-type C_STRUCT \
  -o /tmp/behavioral.duckdb_extension

# Load and test
duckdb -unsigned -c "LOAD '/tmp/behavioral.duckdb_extension'; SELECT ..."

Benchmark Protocol

Performance claims must follow the Session Improvement Protocol documented in PERF.md:

  1. Baseline first: Run cargo bench 3 times before any changes. Record baseline with confidence intervals.
  2. One optimization per commit: Each optimization must be independently measurable.
  3. Non-overlapping CIs required: Accept only when 95% confidence intervals do not overlap between before and after measurements.
  4. Negative results documented: If an optimization produces overlapping CIs or regression, revert it and document the result honestly in PERF.md.
  5. Update documentation: Add an optimization history entry to PERF.md and update the current baseline table.

Running Benchmarks

# Run all benchmarks
cargo bench

# Run a specific group
cargo bench -- sessionize
cargo bench -- sequence_match_events

# Results stored in target/criterion/

PR Process

  1. Create a feature branch from main.
  2. Make changes following the code style and testing expectations above.
  3. Ensure all checks pass: cargo test, cargo clippy --all-targets, cargo fmt -- --check.
  4. If performance-related, include Criterion benchmark data with confidence intervals.
  5. Update documentation if function signatures, test counts, or benchmark numbers changed.
  6. Submit a pull request with a clear description of changes.

Documentation

  • CLAUDE.md: AI assistant context. Update test counts, benchmark counts, session entries, and lessons learned when making significant changes.
  • PERF.md: Performance optimization history. Follow the session protocol.
  • README.md: Public-facing documentation. Keep performance tables and test counts current.
  • docs/: GitHub Pages documentation (mdBook). Keep function signatures and performance claims in sync with the source code.