Architecture

Overview

Mallard Metrics is a single Rust binary that handles the complete analytics lifecycle: event ingestion, storage, querying, authentication, and dashboard serving. There are no external services, no message queues, and no separate database process.

flowchart TD
    TS["Tracking Script\nmallard.js <1KB"]
    DASH["Dashboard SPA\nPreact + HTM"]

    TS -->|"POST /api/event"| AXUM
    DASH <-->|"GET /api/stats/*\nGET /api/keys/*"| AXUM

    subgraph BINARY["Single Binary — Single Process"]
        AXUM["Axum HTTP Server\nport 8000"]

        subgraph INGEST["Ingestion Pipeline"]
            direction LR
            OC["Origin Check\nRate Limiter"] --> BF["Bot Filter\nUA Parser"]
            BF --> GEO["GeoIP Lookup\nVisitor ID Hash"]
            GEO --> BUF["In-Memory\nEvent Buffer"]
        end

        subgraph STORE["Two-Tier Storage"]
            direction LR
            DB["DuckDB disk-based\nmallard.duckdb\nWAL durability"]
            PQ["Parquet Files\nsite_id=*/date=*/*.parquet\nZSTD-compressed"]
            VIEW["events_all VIEW\nhot union cold"]
            DB -->|"COPY TO"| PQ
            DB --> VIEW
            PQ -->|"read_parquet()"| VIEW
        end

        subgraph QUERY["Query Engine"]
            direction LR
            CACHE["TTL Query Cache"] --> QH["Stats\nSessions\nFunnels\nRetention\nSequences\nFlow"]
            EXT["behavioral extension\nsessionize\nwindow_funnel\nretention\nsequence_match"] -.->|"optional"| CACHE
        end

        AUTH["Auth Layer\nArgon2id passwords\n256-bit session tokens\nAPI keys SHA-256"] -.->|"guards"| AXUM

        AXUM --> OC
        BUF -->|"flush"| DB
        VIEW --> CACHE
        QH --> AXUM
    end
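The TTL query cache in the Query Engine subgraph can be sketched in std-only Rust. This is an illustrative shape only, assuming string keys and string results; the real query/cache.rs caches analytics query results keyed by request parameters:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Std-only sketch of a TTL cache (illustrative shape, not the actual API).
pub struct TtlCache {
    ttl: Duration,
    entries: HashMap<String, (Instant, String)>, // key -> (inserted_at, result)
}

impl TtlCache {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    pub fn insert(&mut self, key: &str, value: String, now: Instant) {
        self.entries.insert(key.to_string(), (now, value));
    }

    /// Return the cached value unless its TTL has elapsed.
    pub fn get(&self, key: &str, now: Instant) -> Option<&String> {
        self.entries.get(key).and_then(|(at, value)| {
            if now.duration_since(*at) < self.ttl { Some(value) } else { None }
        })
    }
}
```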

Event Ingestion Pipeline

Every POST /api/event request passes through a sequential pipeline of validation and enrichment steps before being buffered.

flowchart TD
    START(["POST /api/event\nJSON body"])

    START --> SZ{"Body size\n&le; 64 KB?"}
    SZ -->|"No"| R413["413 Request\nEntity Too Large"]
    SZ -->|"Yes"| OC

    OC{"Origin in\nallowlist?"}
    OC -->|"No (if configured)"| R403["403 Forbidden"]
    OC -->|"Yes"| RL

    RL{"Rate limit\nexceeded?"}
    RL -->|"Yes"| R429["429 Too Many Requests\nRetry-After header"]
    RL -->|"No"| SITEID

    SITEID{"site_id valid?\na-z A-Z 0-9 .-: max 256 chars"}
    SITEID -->|"No"| R400["400 Bad Request"]
    SITEID -->|"Yes"| BOT

    BOT{"Bot\nUser-Agent?"}
    BOT -->|"Yes"| DISCARD["Silently discarded\n202 Accepted"]
    BOT -->|"No"| UA

    UA["Parse User-Agent\nbrowser, OS, device type"]
    UA --> GEO

    GEO["GeoIP Lookup\ncountry, region, city\nGraceful fallback if no DB"]
    GEO --> VID

    VID["Compute Visitor ID\nHMAC-SHA256\nIP plus UA plus daily-salt\nDiscard IP immediately"]
    VID --> URL

    URL["Parse URL\npathname, hostname\nUTM parameters"]
    URL --> BUF

    BUF["Push to In-Memory Buffer"]
    BUF --> THR{"Buffer count\n>= flush_event_count?"}
    THR -->|"Yes"| FLUSH["Flush to DuckDB\nAppender API batch insert"]
    THR -->|"No"| R202
    FLUSH --> R202

    R202(["202 Accepted"])
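The buffer-and-flush step at the end of the pipeline can be sketched in std-only Rust. A minimal sketch only: `EventBuffer` and the `String` event type are illustrative stand-ins, and per the module map the real ingest/buffer.rs also flushes periodically and writes batches through DuckDB's Appender API:

```rust
/// Illustrative in-memory event buffer with a count-based flush threshold.
pub struct EventBuffer {
    events: Vec<String>,       // stand-in for the real enriched-event struct
    flush_event_count: usize,  // threshold from configuration
}

impl EventBuffer {
    pub fn new(flush_event_count: usize) -> Self {
        Self { events: Vec::new(), flush_event_count }
    }

    /// Push one enriched event; once the threshold is reached, drain the
    /// buffer and hand the batch back so the caller can flush it to DuckDB.
    pub fn push(&mut self, event: String) -> Option<Vec<String>> {
        self.events.push(event);
        if self.events.len() >= self.flush_event_count {
            Some(std::mem::take(&mut self.events))
        } else {
            None
        }
    }
}
```

Either way the handler answers 202 Accepted; the flush is not on the client's critical path beyond the batch insert itself.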

Two-Tier Storage Model

Mallard Metrics stores events in two complementary tiers, always queried together via the events_all VIEW.

flowchart LR
    INGEST["Ingestion\nEvent Buffer"]

    subgraph HOT["Hot Tier — DuckDB (mallard.duckdb)"]
        EVENTS["events table\nrecently arrived events\nWAL-backed, survives SIGKILL"]
    end

    subgraph COLD["Cold Tier — Parquet on Disk"]
        P1["site_id=example.com/\ndate=2024-01-15/\n0001.parquet"]
        P2["site_id=example.com/\ndate=2024-01-16/\n0001.parquet"]
        P3["site_id=other.org/\ndate=2024-01-15/\n0001.parquet"]
    end

    subgraph UNIFIED["Unified Query Layer"]
        VIEW["events_all VIEW\nSELECT * FROM events\nUNION ALL\nSELECT * FROM read_parquet(...)"]
    end

    INGEST -->|"flush"| EVENTS
    EVENTS -->|"COPY TO ZSTD"| P1
    EVENTS -->|"COPY TO ZSTD"| P2
    EVENTS -->|"COPY TO ZSTD"| P3
    EVENTS -->|"hot events"| VIEW
    P1 -->|"read_parquet()"| VIEW
    P2 -->|"read_parquet()"| VIEW
    P3 -->|"read_parquet()"| VIEW
    VIEW --> ANALYTICS["Analytics Queries\nGET /api/stats/*"]

Hot tier (data/mallard.duckdb): Stores events that have been flushed from the in-memory buffer but not yet archived to Parquet. Events here are immediately queryable. The DuckDB WAL provides durability: hot events survive a SIGKILL (crash), not just a graceful SIGTERM.

Cold tier (.parquet files): Events are archived out of the hot tier (COPY TO) as ZSTD-compressed Parquet files partitioned by site and date. These files are the primary durability layer for historical data and can be queried independently with any Parquet-compatible tool (DuckDB CLI, pandas, Apache Spark).

The events_all VIEW is created at startup and refreshed after each flush. It transparently unions the hot and cold tiers so all analytics queries work correctly regardless of which tier the data resides in.
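Based on the storage diagram, the view is conceptually equivalent to the following sketch; the exact path, glob, and partitioning options are assumptions, not the actual DDL:

```sql
-- Illustrative reconstruction of the events_all view (paths assumed).
CREATE OR REPLACE VIEW events_all AS
SELECT * FROM events                                       -- hot tier
UNION ALL
SELECT * FROM read_parquet('data/events/site_id=*/date=*/*.parquet',
                           hive_partitioning = true);      -- cold tier
```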

The cold-tier directory layout:

data/events/
├── site_id=example.com/
│   ├── date=2024-01-15/
│   │   ├── 0001.parquet
│   │   └── 0002.parquet
│   └── date=2024-01-16/
│       └── 0001.parquet
└── site_id=other-site.org/
    └── date=2024-01-15/
        └── 0001.parquet

Authentication Architecture

flowchart TD
    subgraph CREDS["Credentials at Rest"]
        HASH["Admin Password\nArgon2id hash PHC defaults\nmemory-only at runtime"]
        KEYS["API Keys\nmm_ prefix plus 256-bit random\nSHA-256 hash on disk\nJSON file in data_dir"]
        SESS["Session Tokens\n256-bit OS CSPRNG\nHashMap with TTL expiry\nHttpOnly SameSite=Strict"]
    end

    BROWSER["Browser"] -->|"POST /api/auth/login\npassword"| ARGON
    ARGON["Argon2id verify"] -->|"match"| SESS
    SESS -->|"session cookie\nHttpOnly Secure SameSite=Strict"| BROWSER

    APICLIENT["API Client"] -->|"Authorization: Bearer mm_xxx\nor X-API-Key: mm_xxx"| KEYCHECK
    KEYCHECK["SHA-256 hash\nconstant-time compare"] -->|"valid"| SCOPE

    SCOPE{"Scope check"}
    SCOPE -->|"ReadOnly key"| READONLY["GET /api/stats/*\nGET /api/keys/*"]
    SCOPE -->|"Admin key"| ADMIN["All routes\nincluding POST /api/keys\nDELETE /api/keys/*"]

    BROWSER -->|"GET /api/stats/*\nauto-sent cookie"| SESSMW
    SESSMW["Session middleware\nTTL check"] -->|"valid"| ROUTE

    ROUTE["Route Handler"]

    CSRF["CSRF check\nOrigin vs dashboard_origin"] -.->|"state-mutating\nroutes only"| ROUTE
    BF["Brute-force check\nper-IP attempt counting\nconfigurable lockout"] -.->|"login endpoint"| ARGON

Key Security Properties

| Property | Implementation |
| --- | --- |
| Password storage | Argon2id hash (PHC defaults), never stored in plaintext |
| Session tokens | 256-bit OS CSPRNG; HashMap with TTL; cleared on restart |
| API key storage | SHA-256 hash on disk; plaintext returned only at creation |
| Timing attacks | Constant-time comparison for API key validation |
| Session cookies | HttpOnly; Secure; SameSite=Strict |
| CSRF | Origin/Referer validation on all state-mutating session-auth routes |
| Brute force | Per-IP attempt counting; configurable lockout and Retry-After |
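The constant-time comparison used for API key validation can be illustrated with a std-only sketch. This shows the principle only; a real implementation would typically rely on a vetted crate:

```rust
/// Illustrative constant-time byte comparison for API key hashes.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    // XOR-accumulate so the loop always visits every byte, regardless of
    // where the first mismatch occurs, avoiding an early-exit timing signal.
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}
```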

Behavioral Extension

Advanced analytics rely on the DuckDB behavioral extension, which provides window aggregate functions purpose-built for clickstream analysis.

flowchart LR
    subgraph EXT["behavioral extension"]
        SESS_F["sessionize()\nGroup events into sessions\nby visitor and time gap"]
        FUNNEL_F["window_funnel()\nMulti-step ordered\nconversion funnel"]
        RET_F["retention()\nWeekly cohort\nretention grid"]
        SEQ_F["sequence_match()\nBehavioral pattern\ndetection"]
        FLOW_F["sequence_next_node()\nNext-page\nflow analysis"]
    end

    subgraph API["Behavioral Endpoints"]
        direction TB
        S["/api/stats/sessions"]
        FU["/api/stats/funnel"]
        R["/api/stats/retention"]
        SQ["/api/stats/sequences"]
        FL["/api/stats/flow"]
    end

    SESS_F --> S
    FUNNEL_F --> FU
    RET_F --> R
    SEQ_F --> SQ
    FLOW_F --> FL

    CORE["Core analytics\n/api/stats/main\n/api/stats/timeseries\n/api/stats/breakdown/*"] -.->|"no extension\nrequired"| ALWAYS["Always available"]

The extension is loaded at startup:

INSTALL behavioral FROM community;
LOAD behavioral;

If loading fails (network unavailable, air-gapped environment), all extension-dependent endpoints return graceful defaults (zeroes or empty arrays). Core analytics continue working normally. The GET /health/detailed JSON response and GET /metrics Prometheus output both report whether the extension loaded successfully.
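The fallback pattern can be sketched as follows. `ExtensionStatus` and `funnel_or_default` are illustrative names, not the actual API; the idea is that endpoints consult a flag set once at startup:

```rust
/// Sketch of the graceful-fallback pattern (assumed shape).
pub struct ExtensionStatus {
    pub behavioral_loaded: bool,
}

/// Run an extension-backed query, or return the graceful default
/// (an empty result) when the extension failed to load at startup.
pub fn funnel_or_default<F>(status: &ExtensionStatus, run: F) -> Vec<u64>
where
    F: FnOnce() -> Result<Vec<u64>, String>,
{
    if !status.behavioral_loaded {
        return Vec::new(); // graceful default: empty funnel counts
    }
    run().unwrap_or_default()
}
```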


Module Map

| Module | Purpose |
| --- | --- |
| config.rs | TOML + environment variable configuration |
| server.rs | Axum router with CORS configuration and middleware stack |
| ingest/handler.rs | POST /api/event ingestion handler |
| ingest/buffer.rs | In-memory event buffer with periodic flush |
| ingest/visitor_id.rs | HMAC-SHA256 privacy-safe visitor ID |
| ingest/useragent.rs | User-Agent parsing |
| ingest/geoip.rs | MaxMind GeoIP reader with graceful fallback |
| ingest/ratelimit.rs | Per-site token-bucket rate limiter |
| storage/schema.rs | DuckDB table definitions and events_all view |
| storage/parquet.rs | Parquet write/read/partitioning |
| storage/migrations.rs | Schema versioning |
| query/metrics.rs | Core metric calculations |
| query/breakdowns.rs | Dimension breakdown queries |
| query/timeseries.rs | Time-bucketed aggregations |
| query/sessions.rs | sessionize-based session queries |
| query/funnel.rs | window_funnel query builder |
| query/retention.rs | Retention cohort query execution |
| query/sequences.rs | sequence_match query execution |
| query/flow.rs | sequence_next_node flow analysis |
| query/cache.rs | TTL-based query result cache |
| api/stats.rs | All analytics API handlers |
| api/errors.rs | API error types |
| api/auth.rs | Origin validation, session auth, API key management |
| dashboard/ | Embedded SPA (Preact + HTM) |
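The per-site token-bucket limiter (ingest/ratelimit.rs) can be sketched in std-only Rust. This is an illustrative shape under assumed parameters; in the real limiter, capacity and refill rate come from configuration:

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Sketch of a per-site token bucket: each site_id gets `capacity` tokens
/// that refill continuously at `refill_per_sec`; a request is allowed if
/// a whole token can be taken.
pub struct RateLimiter {
    capacity: f64,
    refill_per_sec: f64,
    buckets: HashMap<String, (f64, Instant)>, // site_id -> (tokens, last refill)
}

impl RateLimiter {
    pub fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, refill_per_sec, buckets: HashMap::new() }
    }

    pub fn check(&mut self, site_id: &str, now: Instant) -> bool {
        let (tokens, last) = self
            .buckets
            .entry(site_id.to_string())
            .or_insert((self.capacity, now));
        // Refill proportionally to elapsed time, capped at capacity.
        let elapsed = now.duration_since(*last).as_secs_f64();
        *tokens = (*tokens + elapsed * self.refill_per_sec).min(self.capacity);
        *last = now;
        if *tokens >= 1.0 {
            *tokens -= 1.0;
            true // allowed
        } else {
            false // over the limit: the handler answers 429 with Retry-After
        }
    }
}
```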