Mallard Metrics

Self-hosted, privacy-focused web analytics powered by DuckDB and the behavioral extension.

Single binary. Single process. Zero external dependencies.


What is Mallard Metrics?

Mallard Metrics is a lightweight, GDPR/CCPA-compliant alternative to cloud analytics platforms. It runs entirely on your infrastructure, stores no personally identifiable information, and requires no cookies or consent banners.

Built in Rust for predictable, low resource usage. The embedded DuckDB database — combined with the behavioral extension — provides SQL-native behavioral analytics: funnels, retention cohorts, session analysis, sequence matching, and flow analysis. No third-party services involved.

[Screenshot: Mallard Metrics dashboard]

Core Properties

| Property | Value |
| --- | --- |
| Language | Rust (MSRV 1.94.0) |
| Web framework | Axum 0.8.x |
| Database | DuckDB (disk-based, embedded, in-process) |
| Analytics | behavioral extension (loaded at runtime) |
| Storage | Date-partitioned Parquet files (ZSTD-compressed) |
| Frontend | Preact + HTM (no build step, embedded in binary) |
| Deployment | Static musl binary, FROM scratch Docker image |
| Tests | 333 passing (262 unit + 71 integration) |

Key Features

Privacy by Design

  • No cookies — Visitor identification uses a daily-rotating HMAC-SHA256 hash of IP + User-Agent + daily salt.
  • No PII storage — IP addresses are hashed and discarded; they are never written to disk.
  • Daily salt rotation — Visitor IDs change every 24 hours, preventing long-term tracking.
  • Privacy-preserving — pseudonymous visitor IDs that cannot be linked across days. See Security & Privacy for details.

Single Binary Deployment

  • One process handles ingestion, storage, querying, authentication, and the dashboard.
  • DuckDB is embedded — no separate database to install or operate.
  • FROM scratch Docker image: the binary is the only file in the container.
  • WAL-based durability: disk-backed DuckDB preserves hot events through crashes.

Analytical Power

| Category | Capabilities |
| --- | --- |
| Core metrics | Unique visitors, pageviews, bounce rate, pages/session |
| Breakdowns | Pages, referrers, browsers, OS, devices, countries |
| Time-series | Hourly and daily aggregations |
| Funnel analysis | Multi-step conversion funnels via window_funnel() |
| Retention cohorts | Weekly retention grids via retention() |
| Session analytics | Duration, depth via sessionize() |
| Sequence matching | Behavioral patterns via sequence_match() |
| Flow analysis | Next-page navigation via sequence_next_node() |

Production Ready

  • Argon2id authentication — Password-protected dashboard with cryptographic session tokens.
  • API key management — Programmatic access with SHA-256 hashed keys (mm_ prefix), disk-persisted.
  • Rate limiting — Per-site token-bucket rate limiter on the ingestion endpoint.
  • Query caching — TTL-based in-memory cache for analytics queries.
  • Bot filtering — Automatic filtering of known bot User-Agents.
  • GeoIP — MaxMind GeoLite2 integration with graceful fallback.
  • Data retention — Configurable automatic cleanup of old Parquet partitions.
  • Graceful shutdown — Buffered events are flushed before process exit.
  • Prometheus metrics — GET /metrics for scraping, with counters for ingestion, auth, cache, and rate limiting.
  • OWASP security headers — Including HSTS, CSP, Permissions-Policy, and X-Request-ID.
  • CSRF protection — Origin/Referer validation on all state-mutating endpoints.
  • Brute-force protection — Per-IP login lockout with configurable threshold and lockout duration.
  • GDPR-friendly mode — Single MALLARD_GDPR_MODE=true toggle strips referrers, rounds timestamps, reduces GeoIP precision, and enables the Art. 17 data-erasure API.

When Should You Use Mallard Metrics?

Mallard Metrics is a good fit when you:

  • Want full control over your analytics data on your own server.
  • Need GDPR/CCPA compliance without third-party data processors.
  • Are running a small-to-medium website and want low operational overhead.
  • Need advanced behavioral analytics (funnels, retention, sequences) without a SaaS subscription.
  • Want to demonstrate the power of DuckDB's behavioral extension in a real-world production context.

It is not designed for:

  • Multi-region distributed analytics at very high volume (millions of events/minute).
  • Real-time dashboards with sub-second latency requirements.
  • Replacing a full data warehouse.

Project Status

Mallard Metrics is actively developed and production-ready. See GitHub for the latest releases and issue tracker.

The behavioral extension powering advanced analytics is developed at github.com/tomtom215/duckdb-behavioral.

Quick Start

This guide gets Mallard Metrics running and collecting events in a few minutes.

Prerequisites

  • Docker (recommended), or a Linux/macOS host with Rust 1.94+ for building from source.
  • A web property you want to track.

Step 1: Run the Server

Option 1: Docker

docker run -p 8000:8000 \
  -v mallard-data:/data \
  -e MALLARD_SECRET=your-random-32-char-secret \
  -e MALLARD_ADMIN_PASSWORD=your-dashboard-password \
  ghcr.io/tomtom215/mallard-metrics

Open http://localhost:8000 to access the dashboard.

Option 2: Docker Compose

Download docker-compose.yml from the repository root and run:

docker compose up -d

The compose file includes persistent storage, restart policy, and environment variable configuration. Set MALLARD_SECRET and MALLARD_ADMIN_PASSWORD in your shell or a .env file before running.

Option 3: Build from Source

git clone https://github.com/tomtom215/mallardmetrics
cd mallardmetrics
cargo build --release
./target/release/mallard-metrics mallard-metrics.toml.example

Note: DuckDB is built with the bundled crate feature, so no external libduckdb is required. The first build takes a few minutes while DuckDB is compiled from source.


Step 2: Embed the Tracking Script

Add the tracking script to every page you want to track. Place it in the <head> or at the end of <body>:

<script
  async
  defer
  src="https://your-mallard-instance.com/mallard.js"
  data-domain="your-site.com">
</script>

Replace:

  • https://your-mallard-instance.com with the URL of your Mallard Metrics instance.
  • your-site.com with the domain you configured in site_ids (or any domain if site_ids is empty).

The script is under 1 KB, loads asynchronously, sets no cookies, and automatically tracks pageview events including URL, referrer, UTM parameters, screen size, and User-Agent.

See Tracking Script for the full API including custom events and revenue tracking.


Step 3: Verify Events Are Arriving

Check the health endpoint:

curl http://localhost:8000/health
# ok

curl http://localhost:8000/health/detailed
# {"status":"ok","version":"0.1.0","buffered_events":3,...}

Events are held in a memory buffer before being flushed to disk. You can query the dashboard immediately — the events_all view unions the hot buffer and all persisted Parquet data automatically.


Step 4: Dashboard

Navigate to http://localhost:8000 in your browser.

If you set MALLARD_ADMIN_PASSWORD, you will be prompted to log in. The dashboard shows:

  • Overview — Unique visitors, pageviews, bounce rate, session metrics.
  • Timeseries — Visitors and pageviews charted over your selected period.
  • Breakdowns — Top pages, referrer sources, browsers, OS, devices, countries.
  • Funnel — Define a conversion funnel with up to N steps.
  • Retention — Weekly cohort retention grid.
  • Sequences — Behavioral pattern matching and conversion rates.
  • Flow — Next-page navigation from any starting page.

What's Next?

Tracking Script

The Mallard Metrics tracking script (mallard.js) is served by the server at GET /mallard.js. It is under 1 KB, sets no cookies, and loads asynchronously.

Basic Embed

<script
  async
  defer
  src="https://your-instance.com/mallard.js"
  data-domain="your-site.com">
</script>

Attributes:

| Attribute | Required | Description |
| --- | --- | --- |
| data-domain | Yes | The site ID to record events under. Must match an entry in site_ids if that config option is set. |

Automatic Tracking

Once embedded, the script automatically fires a pageview event on every page load with the following data:

| Field | Source |
| --- | --- |
| pathname | window.location.pathname + search + hash |
| referrer | document.referrer |
| screen_width | window.innerWidth (viewport width in pixels) |
| User-Agent | Sent in request header, parsed server-side |
| UTM parameters | Extracted from URL query string |

Custom Events

Use window.mallard(eventName, options) to track custom actions:

// Simple event
window.mallard('signup');

// Event with custom properties
window.mallard('purchase', {
  props: { plan: 'pro', coupon: 'SAVE20' }
});

// Revenue event
window.mallard('checkout', {
  revenue: 99.00,
  currency: 'USD'
});

// Event with callback
window.mallard('form_submit', {
  props: { form: 'contact' },
  callback: function() {
    console.log('Event recorded');
  }
});

Options

| Option | Type | Description |
| --- | --- | --- |
| props | object | Custom properties stored as JSON in the props column. Queryable via json_extract. |
| revenue | number | Revenue amount (stored as DECIMAL(12,2)). |
| currency | string | ISO 4217 currency code (3 characters, e.g. "USD"). |
| callback | function | Called after the event is successfully recorded. |

To track outbound link clicks, call window.mallard before navigating:

document.querySelectorAll('a[href^="http"]').forEach(function(link) {
  link.addEventListener('click', function(e) {
    window.mallard('outbound_link', {
      props: { url: link.href },
      callback: function() { window.location = link.href; }
    });
    e.preventDefault();
  });
});

Single-Page App Support

For SPAs, call window.mallard('pageview') manually after each route change:

// Example with a router
router.afterEach(function(to) {
  window.mallard('pageview');
});

Server-Side Events (No Script)

You can also send events directly to the API without the browser script. This is useful for server-rendered pages or background jobs:

curl -X POST https://your-instance.com/api/event \
  -H 'Content-Type: application/json' \
  -d '{
    "d": "your-site.com",
    "n": "signup",
    "u": "https://your-site.com/signup"
  }'

See Event Ingestion API for the full request schema.

Configuration

Mallard Metrics is configured through a TOML file and environment variables. All settings have sensible defaults; you can start without any configuration file.

Loading Configuration

Pass the path to a TOML file as the first command-line argument:

mallard-metrics /etc/mallard-metrics/config.toml

If no argument is provided, defaults are used.

Environment Variables

The first two values (MALLARD_SECRET and MALLARD_ADMIN_PASSWORD) are secrets and must not be stored in files committed to source control. Set them in your shell or a .env file:

| Variable | Required | Description |
| --- | --- | --- |
| MALLARD_SECRET | Recommended | HMAC key for visitor ID hashing. If unset, a UUID is auto-generated on first start and persisted to data_dir/.secret (survives restarts). Set explicitly in production for portability across hosts. |
| MALLARD_ADMIN_PASSWORD | Recommended | Dashboard password. If unset, the dashboard is unauthenticated. |
| MALLARD_MAX_LOGIN_ATTEMPTS | Optional | Override max_login_attempts at runtime. |
| MALLARD_LOGIN_LOCKOUT | Optional | Override login_lockout_secs at runtime. |
| MALLARD_LOG_FORMAT | Optional | Set to json for structured JSON log output. Omit or set to any other value for human-readable text logs. |
| MALLARD_SECURE_COOKIES | Optional | Set to true to add the Secure flag to session cookies (required behind TLS). |
| MALLARD_METRICS_TOKEN | Optional | Bearer token protecting the /metrics endpoint. |
| MALLARD_GEOIP_DB | Optional | Path to MaxMind GeoLite2-City .mmdb file. |
| MALLARD_DASHBOARD_ORIGIN | Optional | Restrict dashboard CORS and enable CSRF protection. |
| MALLARD_MAX_CONCURRENT_QUERIES | Optional | Max concurrent analytical queries (default 10). Returns 429 when exhausted. |
| MALLARD_CACHE_MAX_ENTRIES | Optional | Max query cache entries (default 10000). |
| MALLARD_GDPR_MODE | Optional | Enable GDPR-friendly preset (see PRIVACY.md). |
| MALLARD_GEOIP_PRECISION | Optional | GeoIP precision: city, region, country, or none. |
| MALLARD_HOST | Optional | Server bind address (default 0.0.0.0). |
| MALLARD_PORT | Optional | Server listen port (default 8000). |
| MALLARD_DATA_DIR | Optional | Data directory for Parquet files and DuckDB (default data). |
| MALLARD_FLUSH_COUNT | Optional | Events buffered before flushing to disk (default 1000). |
| MALLARD_FLUSH_INTERVAL | Optional | Seconds between periodic buffer flushes (default 60). |
| MALLARD_FILTER_BOTS | Optional | Filter known bot User-Agents (default true). |
| MALLARD_RETENTION_DAYS | Optional | Auto-delete data older than N days; 0 = unlimited (default 0). |
| MALLARD_SESSION_TTL | Optional | Dashboard session TTL in seconds (default 86400). |
| MALLARD_SHUTDOWN_TIMEOUT | Optional | Graceful shutdown timeout in seconds (default 30). |
| MALLARD_RATE_LIMIT | Optional | Max events/sec per site; 0 = unlimited (default 0). |
| MALLARD_CACHE_TTL | Optional | Query cache TTL in seconds (default 60). |
| MALLARD_STRIP_REFERRER_QUERY | Optional | Strip query/fragment from stored referrers (default false). |
| MALLARD_ROUND_TIMESTAMPS | Optional | Round timestamps to the nearest hour (default false). |
| MALLARD_SUPPRESS_VISITOR_ID | Optional | Replace HMAC hash with per-request UUID (default false). |
| MALLARD_SUPPRESS_BROWSER_VERSION | Optional | Store browser name only (default false). |
| MALLARD_SUPPRESS_OS_VERSION | Optional | Store OS name only (default false). |
| MALLARD_SUPPRESS_SCREEN_SIZE | Optional | Omit screen width and device type (default false). |

TOML Configuration Reference

A complete example is shipped as mallard-metrics.toml.example. Every field has a default and is optional.

# Network binding
host = "0.0.0.0"   # default
port = 8000         # default

# Storage
data_dir = "data"   # relative or absolute path; events and Parquet files are stored here

# Event buffer
flush_event_count = 1000   # flush buffer to Parquet when this many events accumulate
flush_interval_secs = 60   # also flush on this interval (seconds)

# Site allowlist — leave empty to accept events from any origin
# site_ids = ["example.com", "other-site.org"]
site_ids = []

# GeoIP database (optional — gracefully skipped if missing)
# geoip_db_path = "/path/to/GeoLite2-City.mmdb"

# Dashboard CORS origin (optional — set when dashboard is on a different origin)
# dashboard_origin = "https://analytics.example.com"

# Bot filtering (default: true — filters known bot User-Agents from event ingestion)
filter_bots = true

# Data retention: delete Parquet partitions older than this many days
# Set to 0 for unlimited retention (default)
retention_days = 0

# Session authentication TTL in seconds (default: 86400 = 24 hours)
session_ttl_secs = 86400

# Brute-force protection: lock out an IP after this many failed login attempts (0 = disabled)
max_login_attempts = 5

# Duration in seconds to lock out an IP after exceeding max_login_attempts
login_lockout_secs = 300

# Graceful shutdown timeout in seconds (default: 30)
shutdown_timeout_secs = 30

# Ingestion rate limit per site_id (events/second, 0 = unlimited)
rate_limit_per_site = 0

# Query cache TTL in seconds (0 = no caching, default: 60)
cache_ttl_secs = 60

# Log format: "text" (default) or "json"
log_format = "text"

# Query cache max entries (0 = unlimited, default: 10000)
cache_max_entries = 10000

# Max concurrent analytics queries (0 = unlimited, default: 10)
# Excess requests receive HTTP 429
max_concurrent_queries = 10

# Cookie Secure flag (set to true when behind TLS)
secure_cookies = false

# ── GDPR / Privacy Flags ──────────────────────────────────────────────
# gdpr_mode = false            # convenience preset — enables all flags below
# strip_referrer_query = false  # strip ?query and #fragment from referrers
# round_timestamps = false      # round timestamps to the nearest hour
# suppress_visitor_id = false   # replace HMAC hash with per-request UUID
# suppress_browser_version = false
# suppress_os_version = false
# suppress_screen_size = false
# geoip_precision = "city"      # city | region | country | none

Configuration Field Details

host / port

The address and port the HTTP server listens on.

  • Default: 0.0.0.0:8000
  • To restrict to localhost: host = "127.0.0.1"

data_dir

Root directory for all persistent data. Mallard Metrics creates subdirectories:

data/
└── events/
    └── site_id=example.com/
        └── date=2024-01-15/
            ├── 0001.parquet
            └── 0002.parquet

Parquet files are ZSTD-compressed. The directory is created automatically.

flush_event_count / flush_interval_secs

Events arrive into a memory buffer before being flushed to Parquet. Flushing happens when either threshold is reached. The buffer is also flushed on graceful shutdown.

  • Lower values reduce data loss on crash; higher values reduce I/O.
  • Queries always see both buffered (hot) and persisted (cold) data via the events_all view.
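The dual-trigger flush policy can be sketched in a few lines. This is a simplified Python model of the behavior described above (the server itself is Rust, and EventBuffer is a hypothetical name for illustration):

```python
import time

class EventBuffer:
    """Simplified model of the hot-event buffer: flush on count OR interval."""

    def __init__(self, flush_event_count=1000, flush_interval_secs=60, now=time.monotonic):
        self.flush_event_count = flush_event_count
        self.flush_interval_secs = flush_interval_secs
        self.now = now
        self.events = []
        self.last_flush = now()

    def push(self, event):
        """Accept an event; flush immediately if the count threshold is reached."""
        self.events.append(event)
        if len(self.events) >= self.flush_event_count:
            return self.flush()
        return None

    def tick(self):
        """Called periodically; flushes if the interval has elapsed."""
        if self.events and self.now() - self.last_flush >= self.flush_interval_secs:
            return self.flush()
        return None

    def flush(self):
        batch, self.events = self.events, []
        self.last_flush = self.now()
        return batch  # in the real server this batch is written to a Parquet file
```

Either trigger returns the drained batch; in the real server that batch becomes a new Parquet file under the date partition.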

site_ids

An allowlist of site identifiers. If non-empty, the Origin header of each ingestion request must exactly match one of the listed values. Requests from unlisted origins receive a 403 Forbidden response.

The comparison is against the Origin header's host: example.com matches https://example.com and http://example.com:8080 (scheme and port are ignored), but not example.com.other.io.
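The matching rule can be sketched as follows (an illustrative Python model of the behavior described above, not the server's code):

```python
from urllib.parse import urlparse

def origin_allowed(origin_header: str, site_ids: list) -> bool:
    """Allow a request when site_ids is empty or the Origin's host is listed."""
    if not site_ids:
        return True  # empty allowlist accepts events from any origin
    host = urlparse(origin_header).hostname or ""
    return host in site_ids  # exact host match; subdomain lookalikes fail
```

Requests failing this check receive 403 Forbidden.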

geoip_db_path

Path to a MaxMind GeoLite2-City .mmdb file. GeoLite2 databases are free for non-commercial use and available at maxmind.com.

If the file is not specified or does not exist, country/region/city fields are stored as NULL. This is the default behavior and does not cause any errors.

rate_limit_per_site

Maximum events per second accepted per site_id. Uses a token-bucket algorithm. Set to 0 (default) for no limit.
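A token bucket of this kind works roughly as follows (an illustrative sketch; the server's Rust implementation may differ in detail):

```python
import time

class TokenBucket:
    """Per-site token bucket: burst up to capacity, refilled at rate tokens/sec."""

    def __init__(self, rate, capacity=None, now=time.monotonic):
        self.rate = rate                   # tokens added per second
        self.capacity = capacity or rate   # burst size defaults to one second of traffic
        self.tokens = float(self.capacity)
        self.now = now
        self.last = now()

    def allow(self):
        t = self.now()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller responds 429 Too Many Requests
```

The server keeps one bucket per site_id, so a noisy site cannot starve the others.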

cache_ttl_secs

Query results for /api/stats/main and /api/stats/timeseries are cached in memory for this duration. Setting to 0 disables caching (useful for development). Default is 60 seconds.

retention_days

Parquet partition directories older than retention_days days are deleted automatically by a background task that runs daily. Set to 0 (default) for unlimited retention.

max_login_attempts / login_lockout_secs

Brute-force protection for the dashboard login endpoint. After max_login_attempts consecutive failures from the same IP, that IP is blocked for login_lockout_secs seconds. The server responds with 429 Too Many Requests and a Retry-After header during the lockout period.

  • max_login_attempts: Default 5. Set to 0 to disable brute-force protection entirely.
  • login_lockout_secs: Default 300 (5 minutes).

These can also be set via MALLARD_MAX_LOGIN_ATTEMPTS and MALLARD_LOGIN_LOCKOUT environment variables.

API Reference

Mallard Metrics exposes a JSON HTTP API. All endpoints are served by the same process as the dashboard.

Base URL

http://your-instance.com

Authentication

Most /api/stats/* and /api/keys/* endpoints require authentication. Provide one of:

  1. Session cookie — Set after POST /api/auth/login. Sent automatically by browsers.
  2. Bearer token — An API key in the Authorization: Bearer mm_... header.
  3. X-API-Key header — An API key in the X-API-Key: mm_... header.

Endpoints that do not require authentication:

  • POST /api/event — Event ingestion (uses Origin allowlist instead).
  • GET /api/event — Pixel tracking (same parameters as POST via query string; returns 1×1 GIF).
  • POST /api/auth/login, POST /api/auth/setup, GET /api/auth/status, POST /api/auth/logout
  • GET /health, GET /health/ready, GET /health/detailed
  • GET /metrics — optionally protected by MALLARD_METRICS_TOKEN bearer token.
  • GET /robots.txt, GET /.well-known/security.txt
  • GET / (dashboard)

Content Type

All request bodies are application/json. All responses are application/json unless otherwise noted.

Error Responses

Errors are returned as JSON objects:

{
  "error": "human-readable description"
}

HTTP Status Codes

| Code | Meaning |
| --- | --- |
| 200 | Success |
| 202 | Event accepted (ingestion only) |
| 400 | Bad request — missing or invalid parameters |
| 401 | Unauthenticated — no valid session or API key |
| 403 | Forbidden — origin not in allowlist, or CSRF check failed |
| 404 | Not found |
| 408 | Request timeout (30-second server-side limit) |
| 409 | Conflict — resource already exists (e.g. password already set) |
| 413 | Request body too large (limit: 64 KB on ingestion routes) |
| 422 | Unprocessable — JSON validation failed |
| 429 | Rate limited or concurrent query limit — includes Retry-After header |
| 500 | Internal server error |
| 503 | Service unavailable — database not ready |

Sections

Event Ingestion

POST /api/event

Records a single analytics event. This endpoint is called by the tracking script automatically and can also be called directly for server-side event recording.

Authentication: None required. The Origin header is validated against site_ids if that config option is set.

CORS: Fully permissive (Access-Control-Allow-Origin: *) to allow cross-origin calls from the tracking script.

Request Body

{
  "d": "example.com",
  "n": "pageview",
  "u": "https://example.com/pricing",
  "r": "https://google.com/",
  "w": 1920,
  "p": "{\"plan\": \"pro\"}",
  "ra": 99.00,
  "rc": "USD"
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| d | string | Yes | Domain / site identifier. Max 256 chars; alphanumeric plus ., -, _, : only. |
| n | string | Yes | Event name (e.g. "pageview", "signup", "purchase"). |
| u | string | Yes | Full URL of the page where the event occurred. |
| r | string | No | Referrer URL. |
| w | number | No | Screen width in pixels (for device-type detection). |
| p | string | No | Custom properties as a JSON-encoded string. Stored in the props column and queryable via json_extract. |
| ra | number | No | Revenue amount (stored as DECIMAL(12,2)). |
| rc | string | No | ISO 4217 currency code (e.g. "USD", "EUR"). Maximum 3 characters. |

Response

HTTP/1.1 202 Accepted

The response body is empty. 202 means the event was accepted into the buffer. It will be flushed to Parquet on the next flush cycle or when the buffer threshold is reached.

Validation Errors

| Condition | Status |
| --- | --- |
| Missing required field (d, n, or u) | 422 Unprocessable Entity |
| Empty d, n, or u | 400 Bad Request |
| d contains invalid characters or exceeds 256 chars | 400 Bad Request |
| Field exceeds length limit (n > 256, u > 2048, r > 2048, p > 4096) | 400 Bad Request |
| Request body exceeds 64 KB | 413 Payload Too Large |
| Origin header does not match site_ids | 403 Forbidden |
| Rate limit exceeded for this site_id | 429 Too Many Requests |
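The validation rules above can be modeled as follows. This is a sketch returning the status codes from the table; validate_event is a hypothetical helper, not the server's code:

```python
import re

SITE_ID_RE = re.compile(r'^[A-Za-z0-9._:-]+$')
LENGTH_LIMITS = {"n": 256, "u": 2048, "r": 2048, "p": 4096}

def validate_event(body: dict) -> int:
    """Return the HTTP status code for an ingestion payload per the table above."""
    for field in ("d", "n", "u"):
        if field not in body:
            return 422  # missing required field
        if body[field] == "":
            return 400  # required field present but empty
    d = body["d"]
    if len(d) > 256 or not SITE_ID_RE.match(d):
        return 400      # invalid site identifier
    for field, limit in LENGTH_LIMITS.items():
        if field in body and len(str(body[field])) > limit:
            return 400  # field exceeds its length limit
    return 202          # accepted into the buffer
```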

GET /api/event

Pixel-tracking endpoint for environments where JavaScript is unavailable (email, AMP pages, RSS readers). Returns a 1×1 transparent GIF (43 bytes, Content-Type: image/gif).

Authentication: None required.

Query Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| d | string | Yes | Domain / site identifier. |
| n | string | No | Event name (defaults to "pageview"). |
| u | string | Yes | Full URL of the page. |
| r | string | No | Referrer URL. |
| w | number | No | Screen width in pixels. |

Revenue (ra, rc) and custom properties (p) are not supported on the GET endpoint.

Response

HTTP/1.1 200 OK
Content-Type: image/gif
Content-Length: 43

Usage

<img src="https://analytics.example.com/api/event?d=example.com&u=https://example.com/page" width="1" height="1" alt="">

Bot Filtering

When filter_bots = true (default), the server inspects the User-Agent header and discards the event if it matches known bot patterns. A 202 is still returned — the event is silently dropped rather than returning an error.
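A minimal model of this check looks like the following (the pattern list is an illustrative subset, not the server's actual list):

```python
# Illustrative subset of bot markers; the real list is larger.
BOT_PATTERNS = ("bot", "crawler", "spider", "curl", "headless")

def is_bot(user_agent: str) -> bool:
    """Case-insensitive substring match against known bot markers."""
    ua = user_agent.lower()
    return any(pattern in ua for pattern in BOT_PATTERNS)
```

Matched events are dropped silently; the client still receives 202 so crawlers cannot detect the filtering.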

Privacy Processing

Before the event is stored:

  1. The client IP address is extracted from the request.
  2. A daily-rotating HMAC-SHA256 visitor_id is computed from IP + User-Agent + today's UTC date + MALLARD_SECRET.
  3. The IP address is discarded. It is never written to disk or the database.
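The steps above can be sketched as follows (the exact message layout and secret handling are assumptions for illustration):

```python
import hmac
import hashlib
from datetime import datetime, timezone

def visitor_id(ip: str, user_agent: str, secret: bytes) -> str:
    """Daily-rotating pseudonymous visitor ID; the raw IP never leaves this function."""
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")  # daily salt component
    message = f"{ip}|{user_agent}|{day}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()
```

Because the UTC date is part of the keyed message, the same visitor produces a different ID every day, and without MALLARD_SECRET the hash cannot be linked back to an IP.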

Server-Side Example

curl -X POST https://your-instance.com/api/event \
  -H 'Content-Type: application/json' \
  -d '{
    "d": "example.com",
    "n": "server_signup",
    "u": "https://example.com/signup"
  }'

Analytics Stats API

All stats endpoints require authentication (session cookie, Authorization: Bearer API key, or X-API-Key header).

Query results for /api/stats/main and /api/stats/timeseries are cached per (site_id, period) for cache_ttl_secs seconds (default 60).
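The cache behavior can be modeled as follows (a simplified sketch; QueryCache is a hypothetical name):

```python
import time

class QueryCache:
    """In-memory TTL cache keyed by (site_id, period)."""

    def __init__(self, ttl_secs=60, now=time.monotonic):
        self.ttl = ttl_secs
        self.now = now
        self.entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, site_id, period, compute):
        key = (site_id, period)
        hit = self.entries.get(key)
        if hit and hit[0] > self.now():
            return hit[1]     # fresh cache hit
        value = compute()     # run the analytical query
        if self.ttl > 0:      # ttl 0 disables caching entirely
            self.entries[key] = (self.now() + self.ttl, value)
        return value
```

Expired entries are simply recomputed on the next request, so a short TTL bounds staleness without any invalidation protocol.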


Common Query Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| site_id | string | Required. The site to query. |
| period | string | Optional. One of day, today, 7d, 30d, 90d. Defaults to 30d. |
| start_date | string | Optional. Explicit start date (YYYY-MM-DD). Both start_date and end_date must be provided together; a lone date is ignored. Overrides period. |
| end_date | string | Optional. Explicit end date (YYYY-MM-DD, exclusive). Maximum range: 366 days. |
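The parameter precedence can be sketched as follows. Only the precedence, defaulting, and range-limit rules come from the table above; the exact boundary handling here is an assumption:

```python
from datetime import date, timedelta

PERIOD_DAYS = {"day": 1, "today": 1, "7d": 7, "30d": 30, "90d": 90}

def resolve_range(period=None, start_date=None, end_date=None, today=None):
    """Resolve query params to a [start, end) date range per the rules above."""
    today = today or date.today()
    if start_date and end_date:  # explicit range overrides period
        start = date.fromisoformat(start_date)
        end = date.fromisoformat(end_date)
        if (end - start).days > 366:
            raise ValueError("range exceeds 366 days")
        return start, end
    # A lone start_date or end_date is ignored; period defaults to 30d.
    days = PERIOD_DAYS.get(period or "30d", 30)
    return today - timedelta(days=days), today + timedelta(days=1)
```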

site_id Validation

All endpoints validate site_id and return 400 Bad Request if any of the following conditions are not met:

  • Non-empty string.
  • At most 256 characters.
  • ASCII alphanumeric characters plus ., -, _, and : only.

// 400 response for invalid site_id
{"error": "Invalid site_id"}

GET /api/stats/main

Returns core aggregate metrics.

Response

{
  "unique_visitors": 1423,
  "total_pageviews": 5812,
  "bounce_rate": 0.42,
  "avg_visit_duration_secs": 0.0,
  "pages_per_visit": 4.08
}

| Field | Type | Notes |
| --- | --- | --- |
| unique_visitors | integer | Distinct visitor_id values in the period. |
| total_pageviews | integer | Events where event_name = 'pageview'. |
| bounce_rate | float | Sessions with exactly one pageview / total sessions. Requires behavioral extension; returns 0.0 if unavailable. |
| avg_visit_duration_secs | float | Always 0.0 in this version (requires behavioral extension integration; computed separately via /api/stats/sessions). |
| pages_per_visit | float | total_pageviews / unique_visitors. |

GET /api/stats/timeseries

Returns visitors and pageviews bucketed by time.

Granularity is determined automatically from the period: day/today returns hourly buckets; all other periods return daily buckets.

Response

[
  {"date": "2024-01-15", "visitors": 142, "pageviews": 518},
  {"date": "2024-01-16", "visitors": 167, "pageviews": 603}
]

For period=day the date field includes the hour (e.g. "2024-01-15 10:00").


GET /api/stats/breakdown/{dimension}

Returns visitor and pageview counts grouped by a single dimension.

Dimensions

| Path | Grouped by |
| --- | --- |
| /breakdown/pages | pathname |
| /breakdown/sources | referrer_source |
| /breakdown/browsers | browser |
| /breakdown/os | os |
| /breakdown/devices | device_type |
| /breakdown/countries | country_code |

Additional Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| limit | integer | Maximum rows to return. Default 10, maximum 1000. Returns 400 if exceeded. |

Response

[
  {"value": "/pricing", "visitors": 312, "pageviews": 489},
  {"value": "/about",   "visitors": 201, "pageviews": 247}
]

Unknown/null dimension values are represented as "(unknown)".


GET /api/stats/sessions

Returns session-level aggregates using the sessionize behavioral function.

Requires the behavioral extension. Returns zeroes if the extension is not loaded.

Response

{
  "total_sessions": 892,
  "avg_session_duration_secs": 124.7,
  "avg_pages_per_session": 3.2
}

GET /api/stats/funnel

Returns a conversion funnel where each step is a filter condition.

Additional Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| steps | string | Comma-separated list of steps. Format: page:/path or event:name. |
| window | string | Session window duration. Default "1 day". Must be of the form N unit (e.g. "30 minutes", "2 hours"). |

Step Format

| Format | Meaning |
| --- | --- |
| page:/pricing | pathname = '/pricing' |
| event:signup | event_name = 'signup' |
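Parsing this step format takes only a few lines (an illustrative sketch, not the server's Rust code):

```python
def parse_steps(steps_param: str):
    """Parse 'page:/pricing,event:signup' into (column, value) filter pairs."""
    conditions = []
    for raw in steps_param.split(","):
        kind, _, value = raw.partition(":")
        if kind == "page":
            conditions.append(("pathname", value))
        elif kind == "event":
            conditions.append(("event_name", value))
        else:
            raise ValueError(f"unknown step format: {raw!r}")
    return conditions
```

Each pair becomes one filter condition of the funnel, in order.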

Example Request

GET /api/stats/funnel?site_id=example.com&steps=page:/pricing,event:signup&window=1+hour

Response

[
  {"step": 1, "visitors": 500},
  {"step": 2, "visitors": 120}
]

Requires behavioral extension. Returns empty array if unavailable.


GET /api/stats/retention

Returns weekly retention cohorts using the retention behavioral function.

Additional Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| weeks | integer | Number of cohort weeks to compute. Range: 1–52. Default 4. |

Response

[
  {
    "cohort_date": "2024-01-08",
    "retained": [true, true, false, true]
  }
]

Each retained boolean corresponds to one cohort week.

Requires behavioral extension. Returns empty array if unavailable.


GET /api/stats/sequences

Returns conversion metrics for a sequence of behavioral steps using sequence_match.

Additional Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| steps | string | Comma-separated steps in page:/path or event:name format. Minimum 2 steps required. |

Response

{
  "converting_visitors": 89,
  "total_visitors": 500,
  "conversion_rate": 0.178
}

Requires behavioral extension. Returns zeroes if unavailable.


GET /api/stats/flow

Returns the most common next pages after a given starting page using sequence_next_node.

Additional Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| page | string | The target page path to start from (e.g. /pricing). |

Response

[
  {"next_page": "/signup",  "visitors": 234},
  {"next_page": "/contact", "visitors": 89}
]

Returns up to 10 results. Requires behavioral extension.


GET /api/stats/export

Exports daily aggregated stats as CSV or JSON.

Additional Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| format | string | csv (default) or json. Any other value returns 400. |

CSV Response

date,visitors,pageviews,top_page,top_source
2024-01-15,142,518,/pricing,(direct)
2024-01-16,167,603,/pricing,google

CSV fields that might trigger formula injection (start with =, +, -, @) are prefixed with a single quote.
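The escaping rule can be expressed as (a sketch of the rule described above):

```python
def escape_csv_field(value: str) -> str:
    """Neutralize spreadsheet formula injection by prefixing a single quote."""
    if value and value[0] in "=+-@":
        return "'" + value
    return value
```

The leading quote makes spreadsheet applications treat the cell as text instead of evaluating it as a formula.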

Content-Disposition: attachment; filename="export.csv" is set so browsers prompt a download.

JSON Response

[
  {
    "date": "2024-01-15",
    "visitors": 142,
    "pageviews": 518,
    "top_page": "/pricing",
    "top_source": "(direct)"
  }
]

top_page and top_source reflect the single highest-traffic page and referrer source for the entire queried period, not per-day.

Authentication API

Mallard Metrics supports two forms of authentication:

  1. Session cookies — For human dashboard users.
  2. API keys — For programmatic access (CI/CD, integrations, monitoring).

Dashboard Authentication

POST /api/auth/setup

Sets the admin password for the first time. Returns 409 Conflict if a password is already configured.

No authentication required.

// Request — password must be at least 8 characters
{"password": "your-secure-password"}

// Response 200 — also sets HttpOnly, SameSite=Strict cookie mm_session
{"token": "<session-token>"}

// Response 400 — password too short
{"error": "Password must be at least 8 characters"}

// Response 409 — password already configured
{"error": "Admin password already configured"}

Passwords are hashed with Argon2id before storage. The plaintext password is never persisted.


POST /api/auth/login

Authenticates with the admin password and creates a session.

No authentication required.

// Request
{"password": "your-secure-password"}

// Response 200 — sets HttpOnly, SameSite=Strict cookie mm_session
{"token": "<session-token>"}

// Response 400 — no password configured yet
{"error": "No admin password configured. Use /api/auth/setup first."}

// Response 401 — wrong password
{"error": "Invalid password"}

// Response 429 — Too Many Requests (IP locked out after max failed attempts)
// Retry-After header contains the remaining lockout seconds
{"error": "Too many failed login attempts. Try again later."}

Sessions are stored in memory and expire after session_ttl_secs (default 24 hours). Sessions are cleared on server restart.

Brute-force protection: After max_login_attempts (default 5) consecutive failures from the same IP, the IP is locked out for login_lockout_secs (default 300 seconds). A successful login clears the failure count. Configure via MALLARD_MAX_LOGIN_ATTEMPTS and MALLARD_LOGIN_LOCKOUT environment variables, or the corresponding TOML fields. Set max_login_attempts = 0 to disable.


POST /api/auth/logout

Invalidates the current session.

No authentication required. If a valid session cookie is present, it is invalidated. Otherwise the endpoint is a no-op (always returns 200).

// Response 200 — clears mm_session cookie
{"status": "logged_out"}

GET /api/auth/status

Returns the current authentication state.

// No password configured (open access mode)
{"setup_required": true, "authenticated": true}

// Password configured, not logged in
{"setup_required": false, "authenticated": false}

// Password configured, logged in
{"setup_required": false, "authenticated": true}

| Field | Type | Notes |
| --- | --- | --- |
| setup_required | boolean | true when no admin password has been set. System is in open-access mode. |
| authenticated | boolean | true when the request carries a valid session or API key, or when setup_required is true. |

API Key Management

API keys are prefixed with mm_ and are SHA-256 hashed before storage. The plaintext key is only returned once at creation time.
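The key lifecycle can be sketched as follows (illustrative only; the token length and encoding here are assumptions):

```python
import hashlib
import secrets

def generate_api_key():
    """Create an mm_-prefixed key; only the SHA-256 hash is stored server-side."""
    plaintext = "mm_" + secrets.token_urlsafe(32)
    key_hash = hashlib.sha256(plaintext.encode()).hexdigest()
    return plaintext, key_hash  # plaintext is shown once, hash is persisted

def verify_api_key(presented: str, stored_hashes: set) -> bool:
    """Authenticate a presented key by hashing it and looking up the hash."""
    return hashlib.sha256(presented.encode()).hexdigest() in stored_hashes
```

Because only the hash is persisted, a leaked key store cannot be replayed against the API.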

All key management endpoints require authentication.

POST /api/keys

Creates a new API key.

// Request
{"name": "ci-pipeline", "scope": "ReadOnly"}

// Response 201
{
  "key": "mm_abc123...",
  "key_hash": "a1b2c3...",
  "name": "ci-pipeline",
  "scope": "ReadOnly"
}

The key field is the only time the plaintext key is returned. Store it securely.

Scopes:

| Value | Access |
|---|---|
| ReadOnly | Read-only access to stats queries. |
| Admin | Full admin access (key management, config). |

GET /api/keys

Lists all API keys (without plaintext values).

[
  {
    "key_hash": "a1b2c3...",
    "name": "ci-pipeline",
    "scope": "ReadOnly",
    "created_at": "2024-01-15T10:00:00Z",
    "revoked": false
  }
]

DELETE /api/keys/{key_hash}

Revokes an API key by its SHA-256 hex hash.

// Response 200
{"status": "revoked"}

// Response 404 if hash not found
{"error": "Key not found"}

Using API Keys

API keys can be passed in two ways:

Authorization header (Bearer token):

curl "https://your-instance.com/api/stats/main?site_id=example.com&period=30d" \
  -H "Authorization: Bearer mm_abc123..."

X-API-Key header:

curl "https://your-instance.com/api/stats/main?site_id=example.com&period=30d" \
  -H "X-API-Key: mm_abc123..."

Both headers are accepted on all stats and admin endpoints. ReadOnly keys can access stats endpoints; all key management endpoints (GET /api/keys, POST /api/keys, DELETE /api/keys/{hash}) require an Admin-scoped key.

Health & Metrics Endpoints

These endpoints are publicly accessible (no authentication required) and are designed for monitoring and orchestration systems.


GET /health

Simple liveness check. Returns HTTP 200 when the server process is running.

HTTP/1.1 200 OK
Content-Type: text/plain

ok

Use this with your load balancer or container orchestrator liveness probe.


GET /health/ready

Readiness probe. Executes a lightweight DuckDB query to verify the database is operational.

Success (200):

HTTP/1.1 200 OK
Content-Type: text/plain

ready

Not ready (503):

HTTP/1.1 503 Service Unavailable
Content-Type: text/plain

database not ready

Use this as your Kubernetes readiness probe or Docker health check. Do not use it as a liveness probe — a 503 here means the database is temporarily unavailable, not that the process is dead.


GET /health/detailed

Detailed system status in JSON. Returns component-level health information.

{
  "status": "ok",
  "version": "0.1.0",
  "buffered_events": 42,
  "auth_configured": true,
  "geoip_loaded": false,
  "behavioral_extension_loaded": true,
  "filter_bots": true,
  "cache_entries": 3,
  "cache_empty": false
}

| Field | Type | Description |
|---|---|---|
| status | string | Always "ok" when the server is running. |
| version | string | Binary version from Cargo.toml. |
| buffered_events | integer | Events in the in-memory buffer, not yet flushed to Parquet. |
| auth_configured | boolean | Whether an admin password has been set. |
| geoip_loaded | boolean | Whether a MaxMind GeoLite2 database was successfully loaded. |
| behavioral_extension_loaded | boolean | Whether the DuckDB behavioral extension loaded successfully at startup. |
| filter_bots | boolean | Whether bot filtering is active. |
| cache_entries | integer | Number of cached query results currently in memory. |
| cache_empty | boolean | true if the query cache is empty. |

GET /metrics

Prometheus-compatible metrics in text exposition format (text/plain; version=0.0.4).

If MALLARD_METRICS_TOKEN is set, this endpoint requires Authorization: Bearer <token>. Returns 401 Unauthorized without a valid token.

Gauges

# HELP mallard_buffered_events Number of events in the in-memory buffer
# TYPE mallard_buffered_events gauge
mallard_buffered_events 42

# HELP mallard_cache_entries Number of cached query results
# TYPE mallard_cache_entries gauge
mallard_cache_entries 3

# HELP mallard_auth_configured Whether admin password is set
# TYPE mallard_auth_configured gauge
mallard_auth_configured 1

# HELP mallard_geoip_loaded Whether GeoIP database is loaded
# TYPE mallard_geoip_loaded gauge
mallard_geoip_loaded 0

# HELP mallard_filter_bots Whether bot filtering is enabled
# TYPE mallard_filter_bots gauge
mallard_filter_bots 1

# HELP mallard_behavioral_extension Whether behavioral extension is loaded
# TYPE mallard_behavioral_extension gauge
mallard_behavioral_extension 1

Counters

# HELP mallard_events_ingested_total Total events ingested via POST /api/event
# TYPE mallard_events_ingested_total counter
mallard_events_ingested_total 158432

# HELP mallard_flush_failures_total Total buffer flush failures
# TYPE mallard_flush_failures_total counter
mallard_flush_failures_total 0

# HELP mallard_rate_limit_rejections_total Total requests rejected by per-site rate limiter
# TYPE mallard_rate_limit_rejections_total counter
mallard_rate_limit_rejections_total 17

# HELP mallard_login_failures_total Total failed login attempts
# TYPE mallard_login_failures_total counter
mallard_login_failures_total 3

# HELP mallard_cache_hits_total Total query cache hits
# TYPE mallard_cache_hits_total counter
mallard_cache_hits_total 9871

# HELP mallard_cache_misses_total Total query cache misses
# TYPE mallard_cache_misses_total counter
mallard_cache_misses_total 1204

Prometheus Scrape Configuration

scrape_configs:
  - job_name: mallard_metrics
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: /metrics
    scrape_interval: 30s
    # If MALLARD_METRICS_TOKEN is set:
    authorization:
      credentials: your-metrics-bearer-token

Architecture

Overview

Mallard Metrics is a single Rust binary that handles the complete analytics lifecycle: event ingestion, storage, querying, authentication, and dashboard serving. There are no external services, no message queues, and no separate database process.

flowchart TD
    TS["Tracking Script\nmallard.js &lt;1KB"]
    DASH["Dashboard SPA\nPreact + HTM"]

    TS -->|"POST /api/event"| AXUM
    DASH <-->|"GET /api/stats/*\nGET /api/keys/*"| AXUM

    subgraph BINARY["Single Binary — Single Process"]
        AXUM["Axum HTTP Server\nport 8000"]

        subgraph INGEST["Ingestion Pipeline"]
            direction LR
            OC["Origin Check\nRate Limiter"] --> BF["Bot Filter\nUA Parser"]
            BF --> GEO["GeoIP Lookup\nVisitor ID Hash"]
            GEO --> BUF["In-Memory\nEvent Buffer"]
        end

        subgraph STORE["Two-Tier Storage"]
            direction LR
            DB["DuckDB disk-based\nmallard.duckdb\nWAL durability"]
            PQ["Parquet Files\nsite_id=*/date=*/*.parquet\nZSTD-compressed"]
            VIEW["events_all VIEW\nhot union cold"]
            DB -->|"COPY TO"| PQ
            DB --> VIEW
            PQ -->|"read_parquet()"| VIEW
        end

        subgraph QUERY["Query Engine"]
            direction LR
            CACHE["TTL Query Cache"] --> QH["Stats\nSessions\nFunnels\nRetention\nSequences\nFlow"]
            EXT["behavioral extension\nsessionize\nwindow_funnel\nretention\nsequence_match"] -.->|"optional"| CACHE
        end

        AUTH["Auth Layer\nArgon2id passwords\n256-bit session tokens\nAPI keys SHA-256"] -.->|"guards"| AXUM

        AXUM --> OC
        BUF -->|"flush"| DB
        VIEW --> CACHE
        QH --> AXUM
    end

Event Ingestion Pipeline

Every POST /api/event request passes through a sequential pipeline of validation and enrichment steps before being buffered.

flowchart TD
    START(["POST /api/event\nJSON body"])

    START --> SZ{"Body size\n&le; 64 KB?"}
    SZ -->|"No"| R413["413 Request\nEntity Too Large"]
    SZ -->|"Yes"| OC

    OC{"Origin in\nallowlist?"}
    OC -->|"No (if configured)"| R403["403 Forbidden"]
    OC -->|"Yes"| RL

    RL{"Rate limit\nexceeded?"}
    RL -->|"Yes"| R429["429 Too Many Requests\nRetry-After header"]
    RL -->|"No"| SITEID

    SITEID{"site_id valid?\na-z A-Z 0-9 .-: max 256 chars"}
    SITEID -->|"No"| R400["400 Bad Request"]
    SITEID -->|"Yes"| BOT

    BOT{"Bot\nUser-Agent?"}
    BOT -->|"Yes"| DISCARD["Silently discarded\n202 Accepted"]
    BOT -->|"No"| UA

    UA["Parse User-Agent\nbrowser, OS, device type"]
    UA --> GEO

    GEO["GeoIP Lookup\ncountry, region, city\nGraceful fallback if no DB"]
    GEO --> VID

    VID["Compute Visitor ID\nHMAC-SHA256\nIP plus UA plus daily-salt\nDiscard IP immediately"]
    VID --> URL

    URL["Parse URL\npathname, hostname\nUTM parameters"]
    URL --> BUF

    BUF["Push to In-Memory Buffer"]
    BUF --> THR{"Buffer count\n>= flush_event_count?"}
    THR -->|"Yes"| FLUSH["Flush to DuckDB\nAppender API batch insert"]
    THR -->|"No"| R202
    FLUSH --> R202

    R202(["202 Accepted"])

Two-Tier Storage Model

Mallard Metrics stores events in two complementary tiers, always queried together via the events_all VIEW.

flowchart LR
    INGEST["Ingestion\nEvent Buffer"]

    subgraph HOT["Hot Tier — DuckDB (mallard.duckdb)"]
        EVENTS["events table\nrecently arrived events\nWAL-backed, survives SIGKILL"]
    end

    subgraph COLD["Cold Tier — Parquet on Disk"]
        P1["site_id=example.com/\ndate=2024-01-15/\n0001.parquet"]
        P2["site_id=example.com/\ndate=2024-01-16/\n0001.parquet"]
        P3["site_id=other.org/\ndate=2024-01-15/\n0001.parquet"]
    end

    subgraph UNIFIED["Unified Query Layer"]
        VIEW["events_all VIEW\nSELECT * FROM events\nUNION ALL\nSELECT * FROM read_parquet(...)"]
    end

    INGEST -->|"flush"| EVENTS
    EVENTS -->|"COPY TO ZSTD"| P1
    EVENTS -->|"COPY TO ZSTD"| P2
    EVENTS -->|"COPY TO ZSTD"| P3
    EVENTS -->|"hot events"| VIEW
    P1 -->|"read_parquet()"| VIEW
    P2 -->|"read_parquet()"| VIEW
    P3 -->|"read_parquet()"| VIEW
    VIEW --> ANALYTICS["Analytics Queries\nGET /api/stats/*"]

Hot tier (data/mallard.duckdb): Stores events that have been buffered but not yet flushed. Events here are immediately queryable. The DuckDB WAL provides durability — hot events survive a SIGKILL (crash), not just a graceful SIGTERM.

Cold tier (.parquet files): After flushing, events are written as ZSTD-compressed Parquet files partitioned by site and date. These files are the primary durability layer for historical data and can be queried independently with any Parquet-compatible tool (DuckDB CLI, pandas, Apache Spark).

The events_all VIEW is created at startup and refreshed after each flush. It transparently unions the hot and cold tiers so all analytics queries work correctly regardless of which tier the data resides in.

The cold-tier directory layout:

data/events/
├── site_id=example.com/
│   ├── date=2024-01-15/
│   │   ├── 0001.parquet
│   │   └── 0002.parquet
│   └── date=2024-01-16/
│       └── 0001.parquet
└── site_id=other-site.org/
    └── date=2024-01-15/
        └── 0001.parquet
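The hive-style partition path shown above can be assembled as follows (a sketch with a hypothetical helper name; the real logic lives in storage/parquet.rs):

```python
from datetime import date
from pathlib import PurePosixPath

def partition_path(data_dir, site_id, day, seq):
    """Build site_id=<id>/date=<YYYY-MM-DD>/<NNNN>.parquet under data_dir/events."""
    return str(
        PurePosixPath(data_dir) / "events"
        / f"site_id={site_id}"
        / f"date={day.isoformat()}"
        / f"{seq:04}.parquet"
    )
```

Because the layout follows hive partitioning conventions, any Parquet-aware tool can prune partitions by site_id and date when scanning the directory.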

Authentication Architecture

flowchart TD
    subgraph CREDS["Credentials at Rest"]
        HASH["Admin Password\nArgon2id hash PHC defaults\nmemory-only at runtime"]
        KEYS["API Keys\nmm_ prefix plus 256-bit random\nSHA-256 hash on disk\nJSON file in data_dir"]
        SESS["Session Tokens\n256-bit OS CSPRNG\nHashMap with TTL expiry\nHttpOnly SameSite=Strict"]
    end

    BROWSER["Browser"] -->|"POST /api/auth/login\npassword"| ARGON
    ARGON["Argon2id verify"] -->|"match"| SESS
    SESS -->|"session cookie\nHttpOnly Secure SameSite=Strict"| BROWSER

    APICLIENT["API Client"] -->|"Authorization: Bearer mm_xxx\nor X-API-Key: mm_xxx"| KEYCHECK
    KEYCHECK["SHA-256 hash\nconstant-time compare"] -->|"valid"| SCOPE

    SCOPE{"Scope check"}
    SCOPE -->|"ReadOnly key"| READONLY["GET /api/stats/*\nGET /api/keys/*"]
    SCOPE -->|"Admin key"| ADMIN["All routes\nincluding POST /api/keys\nDELETE /api/keys/*"]

    BROWSER -->|"GET /api/stats/*\nauto-sent cookie"| SESSMW
    SESSMW["Session middleware\nTTL check"] -->|"valid"| ROUTE

    ROUTE["Route Handler"]

    CSRF["CSRF check\nOrigin vs dashboard_origin"] -.->|"state-mutating\nroutes only"| ROUTE
    BF["Brute-force check\nper-IP attempt counting\nconfigurable lockout"] -.->|"login endpoint"| ARGON

Key Security Properties

| Property | Implementation |
|---|---|
| Password storage | Argon2id hash (PHC defaults), never stored in plaintext |
| Session tokens | 256-bit OS CSPRNG; HashMap with TTL; cleared on restart |
| API key storage | SHA-256 hash on disk; plaintext returned only at creation |
| Timing attacks | Constant-time comparison for API key validation |
| Session cookies | HttpOnly; Secure; SameSite=Strict |
| CSRF | Origin/Referer validation on all state-mutating session-auth routes |
| Brute force | Per-IP attempt counting; configurable lockout and Retry-After |

Behavioral Extension

Advanced analytics rely on the DuckDB behavioral extension, which provides window aggregate functions purpose-built for clickstream analysis.

flowchart LR
    subgraph EXT["behavioral extension"]
        SESS_F["sessionize()\nGroup events into sessions\nby visitor and time gap"]
        FUNNEL_F["window_funnel()\nMulti-step ordered\nconversion funnel"]
        RET_F["retention()\nWeekly cohort\nretention grid"]
        SEQ_F["sequence_match()\nBehavioral pattern\ndetection"]
        FLOW_F["sequence_next_node()\nNext-page\nflow analysis"]
    end

    subgraph API["Behavioral Endpoints"]
        direction TB
        S["/api/stats/sessions"]
        FU["/api/stats/funnel"]
        R["/api/stats/retention"]
        SQ["/api/stats/sequences"]
        FL["/api/stats/flow"]
    end

    SESS_F --> S
    FUNNEL_F --> FU
    RET_F --> R
    SEQ_F --> SQ
    FLOW_F --> FL

    CORE["Core analytics\n/api/stats/main\n/api/stats/timeseries\n/api/stats/breakdown/*"] -.->|"no extension\nrequired"| ALWAYS["Always available"]

The extension is loaded at startup:

INSTALL behavioral FROM community;
LOAD behavioral;

If loading fails (network unavailable, air-gapped environment), all extension-dependent endpoints return graceful defaults (zeroes or empty arrays). Core analytics continue working normally. The GET /health/detailed JSON response and GET /metrics Prometheus output both report whether the extension loaded successfully.


Module Map

| Module | Purpose |
|---|---|
| config.rs | TOML + environment variable configuration |
| server.rs | Axum router with CORS configuration and middleware stack |
| ingest/handler.rs | POST /api/event ingestion handler |
| ingest/buffer.rs | In-memory event buffer with periodic flush |
| ingest/visitor_id.rs | HMAC-SHA256 privacy-safe visitor ID |
| ingest/useragent.rs | User-Agent parsing |
| ingest/geoip.rs | MaxMind GeoIP reader with graceful fallback |
| ingest/ratelimit.rs | Per-site token-bucket rate limiter |
| storage/schema.rs | DuckDB table definitions and events_all view |
| storage/parquet.rs | Parquet write/read/partitioning |
| storage/migrations.rs | Schema versioning |
| query/metrics.rs | Core metric calculations |
| query/breakdowns.rs | Dimension breakdown queries |
| query/timeseries.rs | Time-bucketed aggregations |
| query/sessions.rs | sessionize-based session queries |
| query/funnel.rs | window_funnel query builder |
| query/retention.rs | Retention cohort query execution |
| query/sequences.rs | sequence_match query execution |
| query/flow.rs | sequence_next_node flow analysis |
| query/cache.rs | TTL-based query result cache |
| api/stats.rs | All analytics API handlers |
| api/errors.rs | API error types |
| api/auth.rs | Origin validation, session auth, API key management |
| dashboard/ | Embedded SPA (Preact + HTM) |

Security & Privacy

Privacy Model

Mallard Metrics is built with privacy as a hard constraint, not an afterthought.

No Cookies

The tracking script sets no cookies. There is no cookie-based session tracking of any kind.

No PII Storage

The client IP address is the only potentially identifying value that reaches the server. It is:

  1. Used to compute the visitor ID (see below).
  2. Used for a GeoIP lookup (if configured).
  3. Discarded immediately. It is never written to the database, log files, or Parquet files.

No names, email addresses, or device fingerprints are collected or stored.

Privacy-Safe Visitor ID

To count unique visitors without storing PII, Mallard Metrics uses a two-step HMAC-SHA256 derivation:

flowchart LR
    SECRET["MALLARD_SECRET\nenvironment variable"]
    DATE["Today UTC date\n2024-01-15"]

    SECRET --> H1["HMAC-SHA256\nkey = 'mallard-metrics-salt'\nmsg = SECRET + ':' + DATE"]
    DATE --> H1
    H1 --> SALT["daily_salt\nrotates every 24 h"]

    IP["Client IP address"]
    UA["User-Agent header"]

    IP --> H2["HMAC-SHA256\nkey = daily_salt\nmsg = IP + '|' + UA"]
    UA --> H2
    SALT --> H2
    H2 --> VID["visitor_id\nstored in database"]

    IP -->|"discarded\nimmediately after"| TRASH["not stored"]

Properties of this approach:

  • Deterministic within a day — The same visitor from the same browser produces the same ID throughout the day, enabling accurate unique-visitor counts.
  • Rotates daily — The UTC date rotates the effective key every 24 hours, so IDs cannot be correlated across days.
  • Not reversible — Without MALLARD_SECRET, the IP address cannot be recovered from the stored hash.
  • No IP storage — The IP address is discarded immediately after hashing.
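The two-step derivation in the diagram can be reproduced with standard-library Python (a sketch mirroring the labeled key and separator choices, ':' and '|'; the production code is Rust in ingest/visitor_id.rs):

```python
import hashlib
import hmac
from datetime import datetime, timezone
from typing import Optional

def daily_salt(secret, today):
    # Step 1: derive a salt that rotates with the UTC date.
    return hmac.new(b"mallard-metrics-salt",
                    f"{secret}:{today}".encode(), hashlib.sha256).digest()

def visitor_id(secret, ip, user_agent, today: Optional[str] = None):
    # Step 2: hash IP + '|' + User-Agent under the daily salt.
    # The IP is discarded immediately after this call; only the digest is stored.
    if today is None:
        today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    return hmac.new(daily_salt(secret, today),
                    f"{ip}|{user_agent}".encode(), hashlib.sha256).hexdigest()
```

The same visitor yields the same ID all day, a different ID the next day, and the ID cannot be inverted without MALLARD_SECRET.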

GDPR/CCPA Compliance

Mallard Metrics stores pseudonymous visitor IDs (daily-rotating HMAC-SHA256 hashes), which are personal data under GDPR Recital 26. Operators must establish a lawful basis for processing — typically Art. 6(1)(f) legitimate interests for aggregate analytics, especially when combined with GDPR mode. See PRIVACY.md for the full legal analysis, DPIA guidance, and operator obligations.

Key points:

  • No cookies are set for tracking — no ePrivacy consent banner is needed for the tracking script itself.
  • Data subject erasure is supported via DELETE /api/gdpr/erase (Admin API key required).
  • No third-party data sharing — all processing is first-party, no data processor agreements needed.

Authentication Security

Dashboard Password

Passwords are hashed with Argon2id using PHC default parameters before any comparison. The plaintext password is never stored. The hash is held in memory and loaded from the MALLARD_ADMIN_PASSWORD environment variable at startup.

Session Tokens

Dashboard sessions use 256-bit cryptographically random tokens generated with the OS CSPRNG. Tokens are delivered as HttpOnly; SameSite=Strict cookies to prevent JavaScript access and CSRF.

Sessions are stored in an in-memory HashMap with TTL expiry (default 24 hours, configurable via session_ttl_secs). Sessions are cleared on server restart.

When MALLARD_SECURE_COOKIES=true is set (required when behind a TLS reverse proxy), the Secure flag is added to the cookie, preventing transmission over plain HTTP.

API Keys

| Property | Value |
|---|---|
| Entropy | 256 bits of randomness |
| Prefix | mm_ — easy to identify in logs and secret scanners |
| Storage | SHA-256 hash stored in a JSON file in data_dir/. Plaintext returned only at creation. |
| Comparison | Constant-time equality to prevent timing side-channel attacks |
| Scopes | ReadOnly (GET stats only) or Admin (full access including key management) |
| Persistence | Disk-persisted; survive server restarts |

Input Validation and SQL Injection Prevention

Parameterized Queries

All user-supplied values (site IDs, date ranges, event names) are bound to SQL statements as parameters using DuckDB's prepared statement API. Raw string interpolation is used only where DuckDB's API does not support parameters (e.g., COPY TO file paths), and those values are explicitly validated and escaped before use.

Path Traversal Prevention

The site_id value is validated by is_safe_path_component() before being used in any filesystem path. The following are rejected:

  • Empty strings
  • Strings containing .. (directory traversal)
  • Strings containing / or \ (path separators)
  • Strings containing null bytes (\0)
  • Strings longer than 256 characters
  • Characters outside [a-zA-Z0-9._\-:]
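A Python sketch of these checks (illustrative; the actual is_safe_path_component() is Rust):

```python
import re

_SAFE = re.compile(r"^[a-zA-Z0-9._\-:]+$")

def is_safe_path_component(s):
    """Reject anything that could escape the data directory when used in a path."""
    if not s or len(s) > 256:
        return False
    if ".." in s or "/" in s or "\\" in s or "\0" in s:
        return False
    return bool(_SAFE.fullmatch(s))
```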

Funnel and Sequence Step Validation

User-supplied funnel and sequence steps (from ?steps= query parameters) are parsed from a safe page:/path or event:name format. Raw SQL expressions are never accepted from the API. Single quotes in path values are escaped by doubling.
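The translation from step syntax to SQL condition can be sketched as (hypothetical helper name; quote-doubling as described above):

```python
def step_to_sql(step):
    """Translate a page:/path or event:name step into a SQL condition.

    Single quotes are escaped by doubling; raw SQL is never accepted.
    """
    if step.startswith("page:"):
        value = step[len("page:"):].replace("'", "''")
        return f"pathname = '{value}'"
    if step.startswith("event:"):
        value = step[len("event:"):].replace("'", "''")
        return f"event_name = '{value}'"
    raise ValueError(f"invalid step: {step!r}")
```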

Date Range Validation

The start_date and end_date parameters are validated as YYYY-MM-DD format, checked for logical consistency (end >= start), and capped at a maximum 366-day span.
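A sketch of the validation rules (shown here as rejections; whether the server rejects or clamps an over-long span is an implementation detail not stated above):

```python
from datetime import date

def validate_range(start_date, end_date):
    """Parse YYYY-MM-DD bounds, require end >= start, enforce a 366-day maximum span."""
    start = date.fromisoformat(start_date)  # raises ValueError on bad format
    end = date.fromisoformat(end_date)
    if end < start:
        raise ValueError("end_date must not precede start_date")
    if (end - start).days > 366:
        raise ValueError("date range exceeds 366 days")
    return start, end
```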

Breakdown Limit

The limit parameter for breakdown queries is capped at 1000 to prevent unbounded result sets.

Origin Validation

When site_ids is configured, the Origin header is validated with exact host matching:

  • https://example.com → passes (if "example.com" is in site_ids).
  • http://example.com:8080 → passes (explicit port suffix allowed).
  • https://example.com.evil.com → rejected (prefix match is explicitly disallowed).
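Exact host matching can be sketched like this (hypothetical helper; the server's Rust code in api/auth.rs is authoritative):

```python
from urllib.parse import urlparse

def origin_allowed(origin, site_ids):
    """Exact-host match: the Origin hostname must equal a configured site_id."""
    parsed = urlparse(origin)
    if parsed.scheme not in ("http", "https") or parsed.hostname is None:
        return False
    return parsed.hostname in site_ids  # ports are ignored; prefixes never match
```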

CSV Injection Prevention

The CSV export endpoint escapes fields starting with formula-triggering characters (=, +, -, @) by prefixing them with a single quote, preventing formula injection when the CSV is opened in spreadsheet software.
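The escaping rule is simple enough to show inline (a sketch of the documented behavior):

```python
def escape_csv_field(field):
    """Neutralize spreadsheet formula triggers by prefixing a single quote."""
    if field and field[0] in ("=", "+", "-", "@"):
        return "'" + field
    return field
```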


Brute-Force Protection

Login attempts are tracked per client IP address. After max_login_attempts consecutive failures (default 5), the IP is locked out for login_lockout_secs seconds (default 300). The server returns 429 Too Many Requests with a Retry-After header containing the remaining lockout duration.

A successful login clears the failure count for that IP. Failure counts are stored in memory and reset on server restart.

Configure via TOML fields max_login_attempts and login_lockout_secs, or the environment variables MALLARD_MAX_LOGIN_ATTEMPTS and MALLARD_LOGIN_LOCKOUT. Set max_login_attempts = 0 to disable.
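The lockout logic described above can be sketched in Python (illustrative only; function and state names are hypothetical, and the real implementation is Rust):

```python
import time

MAX_ATTEMPTS = 5     # max_login_attempts
LOCKOUT_SECS = 300   # login_lockout_secs

# Per-IP state: (consecutive_failures, locked_until_epoch_secs)
attempts = {}

def check_login(ip, password_ok, now=None):
    """Return (allowed, retry_after_secs), mirroring the documented lockout rules."""
    now = time.time() if now is None else now
    failures, locked_until = attempts.get(ip, (0, 0.0))
    if now < locked_until:
        return False, int(locked_until - now)   # 429 with Retry-After
    if password_ok:
        attempts.pop(ip, None)                  # success clears the counter
        return True, 0
    failures += 1
    if failures >= MAX_ATTEMPTS:
        attempts[ip] = (0, now + LOCKOUT_SECS)  # lock the IP
        return False, LOCKOUT_SECS
    attempts[ip] = (failures, 0.0)
    return False, 0
```

Because the state is an in-memory map, a restart resets all counters, matching the documented behavior.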


Security Headers

All HTTP responses include these OWASP-recommended security headers:

| Header | Value | Purpose |
|---|---|---|
| X-Content-Type-Options | nosniff | Prevents MIME-type sniffing |
| X-Frame-Options | DENY | Prevents clickjacking via iframe embedding |
| Referrer-Policy | strict-origin-when-cross-origin | Limits referrer leakage |
| Content-Security-Policy | (HTML responses only) | Restricts scripts and resources to same origin |
| Permissions-Policy | geolocation=(), microphone=(), camera=() | Disables browser feature APIs |
| Strict-Transport-Security | max-age=31536000; includeSubDomains; preload | Instructs browsers to enforce HTTPS for 1 year; eligible for preload lists |
| Cache-Control | no-store, no-cache | JSON API responses only; prevents analytics data caching |
| X-Request-ID | UUID per request | Injected by the server, propagated through tracing spans for log correlation |

HTTP Timeout

All requests have a 30-second server-side timeout. Connections that do not complete within this window are closed with 408 Request Timeout. This prevents Slowloris-style attacks that hold connections open indefinitely.


CSRF Protection

State-mutating endpoints authenticated via session cookie (login, logout, setup, key creation, key revocation) validate the Origin or Referer header against the configured dashboard_origin. Requests with a mismatched or missing origin receive 403 Forbidden.

When dashboard_origin is not set, CSRF checks are bypassed (all origins allowed). Set dashboard_origin in production to enable CSRF protection.


Network Security

CORS Policy

Mallard Metrics uses separate CORS policies for ingestion and dashboard routes:

Ingestion (POST /api/event):

Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: POST

Dashboard / Stats / Admin (when dashboard_origin is set):

Access-Control-Allow-Origin: <configured origin>
Access-Control-Allow-Methods: GET, POST, DELETE
Access-Control-Allow-Credentials: true

If dashboard_origin is not configured, the dashboard routes use a permissive policy that allows any origin (explicitly, not same-origin-only). Set dashboard_origin in production to restrict cross-origin access.

TLS

Mallard Metrics does not handle TLS directly. In production, place it behind a TLS-terminating reverse proxy (nginx, Caddy, Traefik, etc.). Set MALLARD_SECURE_COOKIES=true once the proxy is in place.

Request Concurrency

The four heavy behavioral analytics endpoints (/api/stats/funnel, /api/stats/retention, /api/stats/sequences, /api/stats/flow) are protected by a semaphore. The maximum number of concurrent heavy queries is configurable via MALLARD_MAX_CONCURRENT_QUERIES (default 10). Requests that exceed this limit receive 429 Too Many Requests with a Retry-After header.


Supply Chain

  • All Rust dependencies are audited with cargo-deny in CI.
  • GitHub Actions steps are pinned to exact commit SHAs (no floating version tags).
  • The bundled DuckDB feature compiles DuckDB from source as part of the build; no pre-built DuckDB binaries are downloaded at runtime.
  • cargo build --locked is used in CI to ensure reproducible builds from Cargo.lock.

Threat Model Summary

| Threat | Mitigation |
|---|---|
| SQL injection | Parameterized queries throughout; site_id character validation |
| Path traversal | is_safe_path_component() on all filesystem paths |
| CSRF | Origin/Referer validation on state-mutating session-auth routes |
| Brute force (login) | Per-IP lockout, Argon2id hashing |
| Brute force (API) | Per-site rate limiting |
| Session hijacking | HttpOnly; Secure; SameSite=Strict cookies |
| Timing attacks | Constant-time comparison for API keys |
| Clickjacking | X-Frame-Options: DENY |
| Protocol downgrade | Strict-Transport-Security (HSTS, 1 year) |
| MIME sniffing | X-Content-Type-Options: nosniff |
| Data exfiltration | No outbound network calls; embedded DB; IP discarded after hash |
| PII leakage | IPs hashed then discarded; daily ID rotation; no cookies |
| CSV injection | Formula character escaping in export output |
| Dependency vulnerabilities | cargo-deny in CI; Cargo.lock committed and enforced |

Behavioral Analytics

Mallard Metrics integrates the DuckDB behavioral extension to provide advanced analytics that go beyond simple counts. The integration demonstrates that DuckDB behavioral analytics is not just an academic exercise — it can power real-world, production analytics with a homelab-friendly footprint.

Prerequisites

The behavioral extension is loaded at startup:

INSTALL behavioral FROM community;
LOAD behavioral;

If the extension cannot be loaded (e.g., network unavailable or air-gapped environment), all behavioral endpoints return graceful defaults (zeroes or empty arrays). Core analytics (visitors, pageviews, breakdowns, timeseries) are unaffected.

The GET /health/detailed JSON response includes "behavioral_extension_loaded": true/false, and GET /metrics exposes the mallard_behavioral_extension gauge (1 = loaded, 0 = unavailable).


Session Analytics

Endpoint: GET /api/stats/sessions

Uses sessionize(timestamp, INTERVAL '30 minutes') to group events into sessions per visitor. A new session begins when there is a gap of more than 30 minutes between events from the same visitor.

Metrics returned:

| Field | Description |
|---|---|
| total_sessions | Total number of distinct sessions |
| avg_session_duration_secs | Mean session duration in seconds |
| avg_pages_per_session | Mean pageviews per session |
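The 30-minute gap rule can be illustrated with a small Python sketch (not the extension's implementation — sessionize() runs inside DuckDB):

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=30)

def sessionize(timestamps):
    """Split one visitor's event timestamps into sessions.

    A new session starts when the gap since the previous event exceeds 30 minutes.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= GAP:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions
```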

Funnel Analysis

Endpoint: GET /api/stats/funnel

Uses window_funnel(interval, timestamp, step1, step2, ...) to find visitors who completed a sequence of steps within a time window.

Example — Pricing to Signup funnel:

GET /api/stats/funnel?site_id=example.com&steps=page:/pricing,event:signup&window=1+day

Step format:

| Input | SQL condition |
|---|---|
| page:/pricing | pathname = '/pricing' |
| event:signup | event_name = 'signup' |

Response: Array of {step, visitors} showing how many visitors reached each step.

Notes:

  • Steps must be ordered (each step must follow the previous in time).
  • The window parameter controls the maximum elapsed time between the first and last step (e.g., 1 day, 2 hours).
  • At least 1 step is required; 2+ steps produce a meaningful funnel chart.

Retention Cohorts

Endpoint: GET /api/stats/retention?weeks=N

Uses retention(condition1, condition2, ...) to compute weekly cohort retention. Each cohort is defined by a visitor's first-seen week. Subsequent weeks show whether they returned.

Example response (4-week retention):

[
  {"cohort_date": "2024-01-08", "retained": [true, true, false, true]},
  {"cohort_date": "2024-01-15", "retained": [true, false, true, false]}
]

Each boolean in retained corresponds to one week: retained[0] is always true (the cohort week itself), and subsequent values indicate whether the visitor was seen in weeks +1, +2, +3, etc.

| Parameter | Default | Range | Description |
|---|---|---|---|
| weeks | 4 | 1–52 | Number of weeks to include in the cohort grid |
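One way to read the retained array: for a single visitor, each week slot is true if any visit fell in that week relative to the cohort start (an illustrative sketch only; the server computes this with the extension's retention() aggregate, and how per-visitor results roll up into the cohort row is not specified here):

```python
from datetime import date

def retained_grid(first_seen, visit_dates, weeks):
    """Per-visitor weekly grid: True if any visit fell in that week of the cohort."""
    grid = [False] * weeks
    for d in visit_dates + [first_seen]:
        idx = (d - first_seen).days // 7
        if 0 <= idx < weeks:
            grid[idx] = True
    return grid
```

Note that grid[0] is always true, matching the statement above that the cohort week itself is always retained.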

Sequence Matching

Endpoint: GET /api/stats/sequences

Uses sequence_match(pattern, timestamp, cond1, cond2, ...) to find visitors who performed a specific behavioral pattern. Returns overall conversion metrics.

Example — Pricing → Signup conversion:

GET /api/stats/sequences?site_id=example.com&steps=page:/pricing,event:signup

Response:

{
  "converting_visitors": 89,
  "total_visitors": 500,
  "conversion_rate": 0.178
}

Minimum 2 steps required. Steps use the same page:/path and event:name format as the funnel endpoint.


Flow Analysis

Endpoint: GET /api/stats/flow?page=/pricing

Uses sequence_next_node('forward', 'first_match', ...) to find the most common pages visitors navigate to after a given page.

Response:

[
  {"next_page": "/signup",  "visitors": 234},
  {"next_page": "/contact", "visitors": 89},
  {"next_page": "/",        "visitors": 67}
]

Returns up to 10 next-page destinations ordered by visitor count. Useful for understanding user navigation patterns and identifying high-exit pages.


Dashboard Views

The dashboard includes interactive views for all behavioral analytics:

  • Sessions — Cards showing total sessions, average duration, and pages per session.
  • Funnel — Horizontal bar chart with configurable steps and conversion percentages.
  • Retention — Cohort grid table showing Y (returned) / - (not returned) per week.
  • Sequences — Conversion metrics cards with converting visitors, total visitors, and rate.
  • Flow — Next-page table with visitor counts.

Graceful Degradation

All behavioral endpoints degrade gracefully when the extension is not available:

| Endpoint | Without extension |
|---|---|
| GET /api/stats/sessions | Returns zeros for all fields |
| GET /api/stats/funnel | Returns empty array |
| GET /api/stats/retention | Returns empty array |
| GET /api/stats/sequences | Returns zeros for all fields |
| GET /api/stats/flow | Returns empty array |

Core analytics (/api/stats/main, /api/stats/timeseries, /api/stats/breakdown/*) do not use the extension and are always available.

Deployment

Production Checklist

Before going to production:

  • Set MALLARD_SECRET to a random 32+ character string and keep it constant across restarts.
  • Set MALLARD_ADMIN_PASSWORD to a strong password.
  • Set MALLARD_SECURE_COOKIES=true when behind a TLS-terminating reverse proxy so session cookies carry the Secure flag.
  • Set MALLARD_METRICS_TOKEN to a secret token if the /metrics endpoint is publicly reachable.
  • Configure a TLS-terminating reverse proxy (nginx, Caddy, Traefik).
  • Mount a persistent volume for data_dir (contains mallard.duckdb and Parquet files).
  • Set site_ids to restrict event ingestion to your domains.
  • Configure retention_days to match your data retention policy.
  • Set dashboard_origin to your dashboard URL to enable CSRF protection.
  • Use /health/ready as your container or load-balancer readiness probe.

EU / GDPR deployments — additional steps:

  • Set MALLARD_GDPR_MODE=true (or enable individual flags) to reduce data collection surface.
  • Set MALLARD_RETENTION_DAYS=30 (or your DPA-approved retention period) for Art. 5(1)(e) storage limitation compliance.
  • Set MALLARD_GEOIP_PRECISION=country (already forced by gdpr_mode; document it explicitly in your DPIA).
  • Document your legal basis for processing in a DPIA or privacy notice. See PRIVACY.md for the full analysis.
  • Use DELETE /api/gdpr/erase?site_id=...&start_date=...&end_date=... (Admin API key required) to honour Art. 17 erasure requests.

Pull and Run

docker run -d \
  --name mallard-metrics \
  --restart unless-stopped \
  -p 127.0.0.1:8000:8000 \
  -v mallard-data:/data \
  -e MALLARD_SECRET=your-random-32-char-secret \
  -e MALLARD_ADMIN_PASSWORD=your-dashboard-password \
  -e MALLARD_SECURE_COOKIES=true \
  -e MALLARD_METRICS_TOKEN=your-prometheus-token \
  ghcr.io/tomtom215/mallard-metrics

The image is built FROM scratch with a static musl binary. It has no shell, no package manager, and no runtime dependencies.

With a Config File

docker run -d \
  --name mallard-metrics \
  -v mallard-data:/data \
  -v /etc/mallard-metrics/config.toml:/config.toml:ro \
  -e MALLARD_SECRET=... \
  -e MALLARD_ADMIN_PASSWORD=... \
  ghcr.io/tomtom215/mallard-metrics /config.toml

Docker Compose

Save the following as docker-compose.yml:

services:
  mallard-metrics:
    image: ghcr.io/tomtom215/mallard-metrics:latest
    restart: unless-stopped
    ports:
      - "127.0.0.1:8000:8000"
    volumes:
      - mallard-data:/data
    environment:
      MALLARD_SECRET: "${MALLARD_SECRET}"
      MALLARD_ADMIN_PASSWORD: "${MALLARD_ADMIN_PASSWORD}"
      MALLARD_SECURE_COOKIES: "true"
      MALLARD_METRICS_TOKEN: "${MALLARD_METRICS_TOKEN}"
      MALLARD_LOG_FORMAT: "json"

volumes:
  mallard-data:

Create a .env file (do not commit to source control):

MALLARD_SECRET=your-random-32-char-secret
MALLARD_ADMIN_PASSWORD=your-dashboard-password
MALLARD_METRICS_TOKEN=your-prometheus-bearer-token

Start:

docker compose up -d
docker compose logs -f

Behind a Reverse Proxy

Mallard Metrics binds to 0.0.0.0:8000 by default (all interfaces). Set MALLARD_HOST=127.0.0.1 to restrict to localhost when behind a reverse proxy.

nginx

server {
    listen 443 ssl;
    server_name analytics.example.com;

    ssl_certificate     /etc/ssl/certs/analytics.example.com.crt;
    ssl_certificate_key /etc/ssl/private/analytics.example.com.key;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Important: Mallard Metrics reads the client IP for visitor ID hashing. If behind a proxy, the X-Forwarded-For or X-Real-IP header must be set correctly. Configure your proxy to send the real client IP.

Caddy

analytics.example.com {
    reverse_proxy 127.0.0.1:8000
}

Caddy sets X-Forwarded-For automatically.

After-Proxy Configuration

Once behind a TLS reverse proxy, set these environment variables:

# Enables Secure flag on session cookies
MALLARD_SECURE_COOKIES=true

# Restricts dashboard CORS and enables CSRF protection
MALLARD_DASHBOARD_ORIGIN=https://analytics.example.com

Health and Readiness Probes

Endpoint              Purpose
GET /health           Liveness probe — returns ok if the process is alive
GET /health/ready     Readiness probe — queries DuckDB; returns 503 if the database is not ready
GET /health/detailed  JSON health report — version, buffer, auth, GeoIP, behavioral extension, cache status

Kubernetes Example

livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3

Docker Compose Health Check

The FROM scratch image has no shell or utilities (wget, curl), so Docker's HEALTHCHECK cannot run a command inside the container. Check from the host instead, or rely on your reverse proxy or orchestrator's health probes:

# External health check from the host
curl -sf http://localhost:8000/health/ready || exit 1

Build from Source (Static musl Binary)

To build a FROM scratch-compatible static binary:

# Install the musl target
rustup target add x86_64-unknown-linux-musl

# Build
cargo build --release --target x86_64-unknown-linux-musl

# The binary
ls -lh target/x86_64-unknown-linux-musl/release/mallard-metrics

The resulting binary has no dynamic library dependencies:

ldd target/x86_64-unknown-linux-musl/release/mallard-metrics
# not a dynamic executable

GeoIP Setup

Mallard Metrics supports optional IP geolocation via MaxMind GeoLite2.

  1. Create a free account at maxmind.com.
  2. Download the GeoLite2-City.mmdb database.
  3. Configure the path:
# config.toml
geoip_db_path = "/data/GeoLite2-City.mmdb"

Or with Docker:

docker run ... \
  -v /path/to/GeoLite2-City.mmdb:/data/GeoLite2-City.mmdb:ro \
  -e ... \
  ghcr.io/tomtom215/mallard-metrics

If the file is missing or unreadable, country/region/city fields are stored as NULL. No error is raised.

Note: MaxMind updates the GeoLite2 databases twice weekly, on Tuesdays and Fridays. Automate downloads with geoipupdate.


GDPR-Friendly Deployment

Mallard Metrics provides a configurable privacy mode designed to reduce the data-collection surface to a level that makes aggregate analytics possible under GDPR Art. 6(1)(f) legitimate interests (no consent required) for many EU operators. Consult your legal team; requirements vary by context and member-state law.

Activate GDPR Mode

The quickest path is the MALLARD_GDPR_MODE=true preset, which bundles the recommended privacy settings:

docker run -d \
  --name mallard-metrics \
  --restart unless-stopped \
  -p 127.0.0.1:8000:8000 \
  -v mallard-data:/data \
  -e MALLARD_SECRET=your-random-32-char-secret \
  -e MALLARD_ADMIN_PASSWORD=your-dashboard-password \
  -e MALLARD_SECURE_COOKIES=true \
  -e MALLARD_GDPR_MODE=true \
  -e MALLARD_RETENTION_DAYS=30 \
  ghcr.io/tomtom215/mallard-metrics

Or via TOML config:

gdpr_mode      = true
retention_days = 30

What GDPR Mode Does

Field               Standard                        GDPR Mode
Referrer stored as  Full URL (with query/fragment)  Path only — ?q=... and #... stripped
Timestamps          Millisecond precision           Rounded to nearest hour
Browser info        Name + version                  Name only (e.g. "Chrome")
OS info             Name + version                  Name only (e.g. "Windows")
Screen / device     Stored                          Omitted
GeoIP               City-level                      Country-level only
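
For illustration only (this is not Mallard's implementation), the referrer and timestamp transforms amount to:

```shell
# Strip ?query and #fragment from a referrer URL
ref='https://example.com/search?q=secret#results'
stripped=$(printf '%s' "$ref" | sed 's/[?#].*$//')
echo "$stripped"      # https://example.com/search

# Round an epoch-seconds timestamp to the nearest hour
ts=1700003456
rounded=$(( (ts + 1800) / 3600 * 3600 ))
echo "$rounded"       # 1700002800
```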

Fine-Grained Privacy Flags

Each setting can be controlled independently via environment variable or TOML key:

Env var                           TOML key                  Default  Effect
MALLARD_GDPR_MODE                 gdpr_mode                 false    Enable all flags below (except suppress_visitor_id)
MALLARD_STRIP_REFERRER_QUERY      strip_referrer_query      false    Strip ?query and #fragment from referrers
MALLARD_ROUND_TIMESTAMPS          round_timestamps          false    Round timestamps to nearest hour
MALLARD_SUPPRESS_BROWSER_VERSION  suppress_browser_version  false    Store browser name only
MALLARD_SUPPRESS_OS_VERSION       suppress_os_version       false    Store OS name only
MALLARD_SUPPRESS_SCREEN_SIZE      suppress_screen_size      false    Omit screen size and device type
MALLARD_GEOIP_PRECISION           geoip_precision           "city"   "city" / "region" / "country" / "none"
MALLARD_SUPPRESS_VISITOR_ID       suppress_visitor_id       false    Replace HMAC hash with random UUID per request (breaks unique-visitor counting)

Note on suppress_visitor_id: This flag is intentionally not activated by gdpr_mode because it eliminates unique-visitor metrics entirely. The default HMAC-SHA256 visitor ID is pseudonymous personal data under GDPR Recital 26. Most operators can rely on Art. 6(1)(f) legitimate interests for aggregate analytics without suppressing visitor IDs.
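
For illustration only (not Mallard's actual code), a daily-rotating pseudonymous ID in this spirit can be sketched in shell:

```shell
# HMAC-SHA256 over IP + User-Agent, keyed with a salt that changes daily.
# Input values here are made up for the example.
salt='2024-06-01:server-secret'
visitor_id=$(printf '%s' '203.0.113.7|Mozilla/5.0' \
  | openssl dgst -sha256 -hmac "$salt" -r | cut -d' ' -f1)
echo "${#visitor_id}"    # 64 hex characters; the raw IP is never stored
```

The same inputs produce the same ID within a day, which is what makes unique-visitor counting possible; once the salt rotates, the ID changes and long-term linkage breaks.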

Right to Erasure (Art. 17)

Mallard Metrics supports data erasure requests via an authenticated API endpoint:

# Requires an Admin API key
curl -X DELETE \
  "https://analytics.example.com/api/gdpr/erase?site_id=mysite.com&start_date=2024-01-01&end_date=2024-12-31" \
  -H "X-API-Key: mm_your_admin_key"

Response:

{
  "site_id": "mysite.com",
  "start_date": "2024-01-01",
  "end_date": "2024-12-31",
  "db_records_deleted": 1423,
  "parquet_partitions_deleted": 8
}

Important limitations:

  • Erasure is by site and date range, not by individual visitor ID (visitor IDs are pseudonymous hashes and cannot be reverse-mapped to individuals).
  • After erasure, the events_all VIEW is refreshed automatically.
  • Consider setting MALLARD_RETENTION_DAYS=30 for automated data minimisation under Art. 5(1)(e) in place of manual erasure requests.

Graceful Shutdown

Mallard Metrics handles SIGINT (Ctrl+C) and SIGTERM (Docker stop, systemd stop). On receiving either signal:

  1. The server stops accepting new connections.
  2. In-flight requests are completed.
  3. Buffered events are flushed to DuckDB (persisted via WAL).

The flush is bounded by shutdown_timeout_secs (default 30). If flushing takes longer, a warning is logged and the process exits.
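
When stopping via Docker, note that docker stop sends SIGTERM and then SIGKILL after a grace period that defaults to 10 seconds, which is shorter than the default 30-second flush bound. A compose-level override (a sketch, not part of the shipped compose file):

```yaml
# docker-compose.yml: give Mallard headroom beyond shutdown_timeout_secs
# so the final flush is never cut short by SIGKILL
services:
  mallard-metrics:
    stop_grace_period: 35s
```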


Systemd Service

For non-Docker deployments:

[Unit]
Description=Mallard Metrics
After=network.target

[Service]
Type=simple
User=mallard
ExecStart=/usr/local/bin/mallard-metrics /etc/mallard-metrics/config.toml
Restart=on-failure
RestartSec=5s
Environment=MALLARD_SECRET=...
Environment=MALLARD_ADMIN_PASSWORD=...

[Install]
WantedBy=multi-user.target

Save the unit to /etc/systemd/system/mallard-metrics.service, then reload and enable:

systemctl daemon-reload
systemctl enable --now mallard-metrics

VPS Deployment Guide

Zero to production in one command — deploy Mallard Metrics on any generic Linux VPS with full TLS, LUKS-encrypted data at rest, Cloudflare DNS, and an automated security audit.


Overview

This guide deploys Mallard Metrics on a bare VPS using:

Component             Role
Caddy (custom build)  TLS termination, reverse proxy, HTTP/3, ACME DNS-01
Cloudflare DNS        DNS-01 ACME challenge — no port 80 required
LUKS                  Full encryption of the analytics data volume at rest
Docker Compose        Container orchestration
vps-audit             Automated security assessment and weekly re-audit
UFW + fail2ban        Host-level firewall and brute-force protection

The FROM scratch Mallard binary runs with no shell, no OS utilities, read-only root filesystem, all Linux capabilities dropped, and no network port exposed to the host — all traffic flows through Caddy on the internal Docker network.


Architecture

Internet
    │
    ▼
┌──────────────────────────────────────────────────┐
│  VPS Host (Ubuntu/Debian)                        │
│                                                  │
│  UFW Firewall: 22, 80, 443 (tcp+udp/QUIC)        │
│                                                  │
│  ┌────────────────────────────────────────────┐  │
│  │  Docker network: mallard-production_proxy  │  │
│  │                                            │  │
│  │  ┌─────────────┐    ┌────────────────┐     │  │
│  │  │   Caddy     │───▶│ mallard:8000   │     │  │
│  │  │ :80/:443    │    │ (FROM scratch) │     │  │
│  │  │ TLS + proxy │    │                │     │  │
│  │  └─────────────┘    └───────┬────────┘     │  │
│  └─────────────────────────────┬──────────────┘  │
│                                │                 │
│  ┌─────────────────────────────▼──────────────┐  │
│  │  LUKS encrypted volume (/srv/mallard/data) │  │
│  │  mallard.duckdb  data/YYYY/MM/DD/*.parquet │  │
│  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘

Prerequisites

VPS requirements

Resource      Minimum       Recommended
CPU           1 vCPU        2 vCPU
RAM           512 MB        1 GB
Disk          10 GB         40 GB
OS            Ubuntu 22.04  Ubuntu 24.04 LTS
Architecture  x86-64        x86-64

Mallard Metrics is a single static binary. Under light to medium traffic (< 50k daily events) the minimum spec is adequate. The disk budget is dominated by Parquet data growth and the LUKS image pre-allocation.

Domain and DNS

You need a domain whose DNS is managed in Cloudflare. The domain can be:

  • A subdomain: analytics.example.com (recommended — keeps the apex clean)
  • An apex domain: example.com

Create an A record pointing to your VPS IP before running setup. Caddy validates DNS during certificate issuance.

analytics.example.com.  A  203.0.113.42

If you're using Cloudflare's proxy (orange cloud), set it to DNS only (grey cloud) for the analytics subdomain. Caddy manages TLS itself and Cloudflare's proxy can interfere with HTTP/3 and certificate validation.

Cloudflare API token

Caddy uses the Cloudflare API to create DNS TXT records for ACME DNS-01 challenges. Create a scoped token:

  1. Log in to dash.cloudflare.com → My Profile → API Tokens
  2. Click Create Token → Custom Token
  3. Set permissions:
    • Zone → Zone → Read (for all zones or just the specific zone)
    • Zone → DNS → Edit (for the specific zone containing your domain)
  4. Restrict to Zone Resources → Specific zone → your zone
  5. Copy the generated token — you will not see it again

SSH key access

setup.sh disables SSH password authentication as part of hardening. You must have an SSH public key installed on the server before running the script, or you will be locked out.

# On your local machine — copy your public key to the server
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@your-vps-ip

# Verify it works before running setup
ssh -i ~/.ssh/id_ed25519 user@your-vps-ip echo "Key access confirmed"

One-Command Deployment

Review the script before running it. Once you trust it, this does everything:

# 1. SSH into the VPS
ssh user@your-vps-ip

# 2. Clone the repository
git clone https://github.com/tomtom215/mallardmetrics.git
cd mallardmetrics

# 3. Run the setup script
sudo bash deploy/setup.sh

The script is interactive — it will prompt for your domain, email, and Cloudflare API token, then generate and display the admin password.

Pre-set values to run non-interactively (e.g., for CI/cloud-init):

export MM_DOMAIN=analytics.example.com
export MM_EMAIL=admin@example.com
export MM_CF_TOKEN=your-cloudflare-token
sudo -E bash deploy/setup.sh

Step-by-Step Manual Deployment

Step 1 — Provision the VPS

Choose a provider (any KVM/XEN VPS works):

Use Ubuntu 22.04 LTS or 24.04 LTS as the OS image. Enable backups at the provider level for an additional safety net.

After provisioning:

# Note your VPS IP address, then SSH in
ssh root@<VPS-IP>

# Immediately create a non-root user with sudo
adduser deploy
usermod -aG sudo deploy

# Add your SSH key to the new user
mkdir -p /home/deploy/.ssh
cp /root/.ssh/authorized_keys /home/deploy/.ssh/
chown -R deploy:deploy /home/deploy/.ssh
chmod 700 /home/deploy/.ssh
chmod 600 /home/deploy/.ssh/authorized_keys

# Switch to the non-root user for the rest
su - deploy

Step 2 — Clone the repository

git clone https://github.com/tomtom215/mallardmetrics.git
cd mallardmetrics

Step 3 — Run setup.sh

sudo bash deploy/setup.sh

The script will:

  1. Detect your OS and verify prerequisites
  2. Ask you to confirm SSH key access before hardening SSH
  3. Update packages and install tooling
  4. Harden SSH, enable UFW firewall, configure fail2ban
  5. Apply kernel hardening sysctl settings
  6. Install Docker CE and the Compose plugin
  7. Create a 20 GB LUKS-encrypted image at /srv/mallard/data.img and mount it
  8. Download and run vps-audit — saving the report
  9. Prompt you to configure deploy/.env (or auto-generate secrets)
  10. Build the Docker images and start the stack
  11. Install weekly vps-audit and daily backup cron jobs
  12. Print your admin password and a post-setup checklist

Step 4 — Verify deployment

# Check container status
docker compose -f deploy/docker-compose.production.yml ps

# Check Caddy got a certificate (look for "TLS certificate obtained")
docker compose -f deploy/docker-compose.production.yml logs caddy | grep -i cert

# Test the health endpoint (replace with your domain)
curl -s https://analytics.example.com/health/ready

# Expected: ready

Open https://<your-domain> in a browser. You should see the Mallard Metrics dashboard login page.


What setup.sh Does

Here is the complete sequence of operations setup.sh performs, with the rationale for each:

Step  Operation                               Why
1     OS detection and SSH key check          Prevents lockout before hardening
2     apt upgrade + unattended-upgrades       Patches known CVEs immediately
3     SSH drop-in config in sshd_config.d/    Non-destructive; preserves original config
4     UFW: deny-all ingress, allow 22/80/443  Minimal attack surface
5     fail2ban for SSH                        Blocks brute-force login attempts
6     Kernel sysctl hardening                 Disables TCP redirects, restricts dmesg/BPF
7     Docker CE from official repo            Ensures a current, vendor-supported version
8     LUKS encrypted image + keyfile          Analytics data encrypted at rest
9     vps-audit + weekly cron                 Ongoing visibility into security posture
10    Secret generation + deploy/.env         Strong random credentials without manual work
11    docker compose build && up -d           Brings the stack live
12    Backup cron (rsync)                     Daily snapshot of DuckDB + Parquet

LUKS Encrypted Volume

How it works

setup.sh creates a file-backed LUKS2 container at /srv/mallard/data.img using AES-XTS-PLAIN64 with a 512-bit key. A random keyfile is stored at /etc/mallard-data.key (read-only by root) so the volume auto-unlocks on boot without a passphrase prompt.

The decrypted volume is formatted ext4 and mounted at /srv/mallard/data. The Mallard Metrics container bind-mounts this path as /data.

/srv/mallard/data.img   ← LUKS2 container (AES-256 XTS, file on host disk)
        ↓ cryptsetup luksOpen
/dev/mapper/mallard-data  ← Decrypted block device
        ↓ ext4 mount
/srv/mallard/data/        ← Plaintext filesystem (only visible to root while mounted)
        ↓ Docker bind mount
/data/ (inside container) ← mallard.duckdb, data/YYYY/MM/DD/*.parquet

If an attacker gains access to the raw disk image (e.g., by stealing a disk or snapshot), the data is unreadable without the keyfile.

After reboot

The LUKS volume is configured in /etc/crypttab and /etc/fstab to auto-mount on boot using the keyfile. No manual intervention is required after a planned reboot.
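
The entries setup.sh writes look roughly like this (a sketch; field values are inferred from the paths above and exact options may differ):

```
# /etc/crypttab — mapping name, backing image, keyfile, options
mallard-data  /srv/mallard/data.img  /etc/mallard-data.key  luks

# /etc/fstab — mount the decrypted mapper device
/dev/mapper/mallard-data  /srv/mallard/data  ext4  defaults,nofail  0  2
```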

# To verify the volume mounted after a reboot:
mountpoint /srv/mallard/data && echo "mounted" || echo "NOT mounted"

# If it did not mount (e.g., keyfile missing), mount manually:
sudo cryptsetup luksOpen --key-file /etc/mallard-data.key \
    /srv/mallard/data.img mallard-data
sudo mount /dev/mapper/mallard-data /srv/mallard/data

# Then restart the stack
sudo docker compose -f /path/to/mallardmetrics/deploy/docker-compose.production.yml up -d

Resizing the volume

# 1. Stop the stack
docker compose -f deploy/docker-compose.production.yml down

# 2. Unmount and close
sudo umount /srv/mallard/data
sudo cryptsetup luksClose mallard-data

# 3. Grow the image file (+10 GB example)
sudo fallocate -l 30G /srv/mallard/data.img          # change to new total size

# 4. Grow the LUKS container
sudo cryptsetup luksOpen --key-file /etc/mallard-data.key \
    /srv/mallard/data.img mallard-data
sudo cryptsetup resize mallard-data

# 5. Grow the filesystem
sudo e2fsck -f /dev/mapper/mallard-data
sudo resize2fs /dev/mapper/mallard-data

# 6. Re-mount and restart
sudo mount /dev/mapper/mallard-data /srv/mallard/data
docker compose -f deploy/docker-compose.production.yml up -d

Caddy and TLS

Cloudflare DNS challenge

The Caddyfile is configured for the ACME DNS-01 challenge using the Cloudflare provider. This means:

  • Port 80 does not need to be accessible — challenge is completed via DNS API
  • Wildcard certificates (*.example.com) are supported
  • Certificates are obtained before the first request arrives

Caddy stores its ACME account and certificates in the caddy-data Docker volume. Certificates are renewed automatically, typically 30 days before expiry.

Certificate renewal

No action is required — Caddy handles renewal entirely. To check certificate status:

# Confirm the Cloudflare DNS module is compiled in, and inspect Caddy's runtime environment
docker exec mallard-caddy caddy list-modules | grep dns
docker exec mallard-caddy caddy environ

# Check cert expiry
echo | openssl s_client -connect analytics.example.com:443 -servername analytics.example.com 2>/dev/null \
    | openssl x509 -noout -dates

Custom domain configurations

Subdomain (most common):

# In .env
DOMAIN=analytics.example.com

Apex domain:

# In .env
DOMAIN=example.com

Multiple domains (edit deploy/Caddyfile directly):

analytics.example.com, stats.myothersite.io {
    import security_headers
    reverse_proxy mallard:8000 { ... }
}

Security Hardening

vps-audit integration

vps-audit performs 40+ security checks across SSH, firewall, kernel, authentication, file permissions, and services.

# Run a fresh audit at any time
sudo vps-audit

# Run with JSON output for automation
sudo vps-audit --format json > /tmp/audit.json

# View the initial audit report
cat /srv/mallard/vps-audit-initial-$(date +%Y%m%d).log

# View weekly audit logs
tail -100 /var/log/vps-audit.log

The weekly cron runs every Sunday at 03:00 UTC. Review WARN and FAIL items and address them using the audit's built-in guidance (vps-audit --guide).
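
The installed cron entry looks roughly like this (a sketch; the binary path and log destination are assumptions based on the text above):

```
# /etc/cron.d/vps-audit — weekly audit, Sunday 03:00 UTC
0 3 * * 0  root  /usr/local/bin/vps-audit >> /var/log/vps-audit.log 2>&1
```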

SSH hardening

setup.sh installs a hardening drop-in at /etc/ssh/sshd_config.d/99-mallard-hardening.conf:

PermitRootLogin no           # Root cannot SSH in at all
PasswordAuthentication no    # Only public key authentication
MaxAuthTries 3               # Lock after 3 failed attempts
LoginGraceTime 30            # 30s window to authenticate
ClientAliveInterval 300      # 5-minute keepalive
AllowAgentForwarding no      # No agent forwarding
AllowTcpForwarding no        # No tunnel forwarding
X11Forwarding no             # No graphical forwarding

fail2ban bans IPs after 5 failed SSH attempts for 1 hour.
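
That policy corresponds to a jail configuration along these lines (a sketch, not necessarily the exact file setup.sh installs):

```
# /etc/fail2ban/jail.local
[sshd]
enabled  = true
maxretry = 5
findtime = 10m
bantime  = 1h
```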

Firewall (UFW)

# View current rules
sudo ufw status numbered

# Default policy after setup.sh
# Default incoming: deny
# Default outgoing: allow
# 22/tcp  — SSH
# 80/tcp  — HTTP (Caddy redirects to HTTPS)
# 443/tcp — HTTPS
# 443/udp — HTTP/3 QUIC

Kernel parameters

Applied via /etc/sysctl.d/99-mallard-hardening.conf:

Setting                    Value  Effect
tcp_syncookies             1      SYN flood protection
rp_filter                  1      Spoofed packet rejection
accept_redirects           0      ICMP redirect attacks blocked
dmesg_restrict             1      Kernel log visible only to root
unprivileged_bpf_disabled  1      BPF restricted to privileged users
bpf_jit_harden             2      JIT hardening against side-channel attacks
suid_dumpable              0      No core dumps from setuid programs

Configuration Reference

All configuration is in deploy/.env. The file is created by setup.sh from deploy/.env.example. Here are the settings most commonly adjusted post-deployment:

Variable                        Default        Description
DOMAIN                          (required)     Hostname Caddy serves
MALLARD_RETENTION_DAYS          365            Delete Parquet partitions older than N days
MALLARD_RATE_LIMIT              0 (unlimited)  Max events/sec per site_id
MALLARD_CACHE_TTL               60             Query result cache TTL (seconds)
MALLARD_MAX_CONCURRENT_QUERIES  10             DuckDB concurrency cap
MALLARD_MAX_LOGIN_ATTEMPTS      5              Failed logins before IP lockout
MALLARD_LOGIN_LOCKOUT           300            Lockout duration (seconds)
MALLARD_GEOIP_DB                (blank)        Path to MaxMind GeoLite2-City.mmdb (inside container)

After editing .env, restart the stack:

docker compose -f deploy/docker-compose.production.yml up -d

Adding the Tracking Script

Add this to every page you want to track:

<script
  defer
  src="https://analytics.example.com/mallard.js"
  data-domain="example.com">
</script>

Replace analytics.example.com with your deployment domain and example.com with the site_id you want to use for this site.

Custom events:

window.mallard('Purchase', {
  revenue: 49.99,
  currency: 'USD',
  props: { plan: 'pro' }
});

Embed on GitHub Pages docs (static site):

Simply paste the <script> tag into your mdBook layout template or into individual markdown pages using HTML passthrough. The script is < 1 KB and has zero external dependencies.
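
For mdBook specifically, the snippet can go in theme/head.hbs, which mdBook injects into the <head> of every page (domain values below are the placeholders from this guide):

```html
<!-- theme/head.hbs -->
<script
  defer
  src="https://analytics.example.com/mallard.js"
  data-domain="example.com">
</script>
```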


Accessing the Dashboard Remotely

The dashboard is served at the root URL of your Mallard Metrics instance (e.g. https://analytics.example.com). It requires authentication when MALLARD_ADMIN_PASSWORD is set.

Note: The server sets X-Frame-Options: DENY to prevent clickjacking, so the dashboard cannot be embedded in an iframe. Access it directly in a browser tab instead.


Post-Deployment Operations

View logs

# All services (follow)
docker compose -f deploy/docker-compose.production.yml logs -f

# Mallard only (JSON structured logs)
docker compose -f deploy/docker-compose.production.yml logs mallard | jq .

# Caddy access log (on the LUKS volume)
tail -f /srv/mallard/data/logs/caddy-access.log | jq .

Update Mallard Metrics

cd ~/mallardmetrics

# Pull latest changes
git pull origin main

# Rebuild and restart (zero downtime if only Mallard changes)
docker compose -f deploy/docker-compose.production.yml build mallard
docker compose -f deploy/docker-compose.production.yml up -d mallard

# Or rebuild everything
docker compose -f deploy/docker-compose.production.yml build --no-cache
docker compose -f deploy/docker-compose.production.yml up -d

Caddy only needs to be rebuilt if you change deploy/Dockerfile.caddy or deploy/Caddyfile.

Backup and restore

Backup (done automatically daily by the cron job):

# Manual backup
rsync -a --delete /srv/mallard/data/ /srv/mallard/backup/

# Copy off-server (replace with your backup destination)
rsync -az /srv/mallard/data/ backup-server:/backups/mallard/$(date +%Y%m%d)/

Restore:

# Stop the stack
docker compose -f deploy/docker-compose.production.yml down

# Restore data files
rsync -a /srv/mallard/backup/ /srv/mallard/data/

# Restart
docker compose -f deploy/docker-compose.production.yml up -d

GeoIP setup

Mallard supports MaxMind GeoLite2-City for country/region/city resolution.

  1. Create a free MaxMind account at maxmind.com
  2. Download GeoLite2-City.mmdb
  3. Copy it to the data volume:
    cp GeoLite2-City.mmdb /srv/mallard/data/GeoLite2-City.mmdb
    
  4. Update deploy/.env:
    MALLARD_GEOIP_DB=/data/GeoLite2-City.mmdb
    
  5. Restart Mallard:
    docker compose -f deploy/docker-compose.production.yml restart mallard
    

Set up weekly automatic updates (MaxMind databases are updated Tuesdays and Fridays):

# Install geoipupdate
apt-get install -y geoipupdate

# Configure with your MaxMind account ID and licence key
# /etc/GeoIP.conf:
# AccountID YOUR_ACCOUNT_ID
# LicenseKey YOUR_LICENSE_KEY
# EditionIDs GeoLite2-City

# Run update
geoipupdate

# Link to the data volume
ln -sf /usr/share/GeoIP/GeoLite2-City.mmdb /srv/mallard/data/GeoLite2-City.mmdb
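
A cron entry to automate this (a sketch; the repository path is an assumption, adjust it to wherever you cloned the repo):

```
# /etc/cron.d/geoipupdate — refresh after MaxMind's Tuesday/Friday releases,
# then restart Mallard so it reopens the database
0 4 * * 3,6  root  geoipupdate && docker compose -f /home/deploy/mallardmetrics/deploy/docker-compose.production.yml restart mallard
```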

Monitoring

The detailed health endpoint returns rich status JSON:

curl -s https://analytics.example.com/health/detailed | jq .

Example response:

{
  "status": "ok",
  "version": "0.1.0",
  "buffered_events": 0,
  "auth_configured": true,
  "geoip_loaded": false,
  "behavioral_extension_loaded": true,
  "filter_bots": true,
  "cache_entries": 0,
  "cache_empty": true
}

Prometheus metrics (requires MALLARD_METRICS_TOKEN):

curl -H "Authorization: Bearer $MALLARD_METRICS_TOKEN" \
  https://analytics.example.com/metrics

Available metrics:

  • mallard_events_ingested_total — cumulative event count
  • mallard_flush_failures_total — Parquet flush failures
  • mallard_rate_limit_rejections_total — rate-limited requests
  • mallard_login_failures_total — failed dashboard logins
  • mallard_cache_hits_total / mallard_cache_misses_total — query cache
  • mallard_behavioral_extension — 1 if the behavioral extension is loaded

UptimeRobot / Better Uptime:

Monitor https://<domain>/health/ready with a 1-minute interval. It returns HTTP 200 when the database is reachable, 503 otherwise.


Troubleshooting

Caddy shows "certificate error" or HTTP instead of HTTPS

# Check Caddy logs for ACME errors
docker compose -f deploy/docker-compose.production.yml logs caddy | grep -i "acme\|cert\|error"

# Common causes:
# 1. CLOUDFLARE_API_TOKEN is wrong or lacks Zone:DNS:Edit permission
# 2. DNS A record not yet propagated (allow up to 10 minutes)
# 3. You hit Let's Encrypt rate limits — wait 1 hour or switch to staging
#    (uncomment the acme_ca staging line in deploy/Caddyfile)

Mallard container exits immediately

docker compose -f deploy/docker-compose.production.yml logs mallard

# Common cause: MALLARD_SECRET is blank (required at startup)
# Check deploy/.env has MALLARD_SECRET set to a non-empty value

Data volume not mounted after reboot

# Check if LUKS device is open
ls -la /dev/mapper/mallard-data || echo "LUKS device not open"

# Check mount
mountpoint /srv/mallard/data || echo "Not mounted"

# Manually open and mount
sudo cryptsetup luksOpen --key-file /etc/mallard-data.key \
    /srv/mallard/data.img mallard-data
sudo mount /dev/mapper/mallard-data /srv/mallard/data

# Restart stack
docker compose -f deploy/docker-compose.production.yml up -d

Health check returns 503

# Mallard is running but the DuckDB VIEW rebuild failed
docker compose -f deploy/docker-compose.production.yml logs mallard | tail -50

# Try restarting Mallard only (Caddy stays up, no TLS interruption)
docker compose -f deploy/docker-compose.production.yml restart mallard

Port 443 already in use

sudo ss -tlnp | grep :443
# If another process (nginx, apache) is listening:
sudo systemctl stop nginx apache2 2>/dev/null || true
docker compose -f deploy/docker-compose.production.yml up -d caddy

Out of disk space

df -h /srv/mallard/data   # Check LUKS volume usage
df -h /var/lib/docker     # Check Docker overlay usage

# Trim old Docker layers
docker system prune -f

# Enable data retention if not already set
# In deploy/.env: MALLARD_RETENTION_DAYS=365
# Then restart mallard

Frequently Asked Questions

Q: Can I deploy without Cloudflare?

Yes — use any DNS provider that Caddy supports. The DNS-01 plugin ecosystem includes Route53, GoDaddy, Namecheap, Gandi, and many others. See caddyserver.com/docs/modules/dns for the full list. Alternatively, if port 80 is accessible from the internet, change the Caddyfile global block to remove acme_dns and Caddy will use the HTTP-01 challenge automatically.

Q: Can I run Mallard Metrics on a Raspberry Pi or ARM server?

The current Dockerfile targets x86_64-unknown-linux-musl. To build for ARM64, change the target to aarch64-unknown-linux-musl in the Dockerfile and add platform: linux/arm64 to the compose service. The rest of the stack (Caddy, LUKS) is architecture-agnostic.

Q: How do I add multiple sites?

Mallard Metrics handles multiple sites with a single deployment. Each site uses a different data-domain in the tracking script. All data is partitioned by site_id at the Parquet layer. Dashboard queries are filtered per site.

Q: Is the LUKS keyfile approach secure?

The keyfile provides encryption at rest — protection against an attacker who obtains the raw disk image (e.g., a stolen drive or a cloud snapshot). It does not protect against an attacker who has live root access to a running server, because the decrypted volume is mounted and readable. For higher threat models, use a passphrase-protected LUKS setup with manual unlock after reboot, or consider a dedicated HSM.

Q: How do I change the admin password?

# Set the new password in deploy/.env
sed -i 's/^MALLARD_ADMIN_PASSWORD=.*/MALLARD_ADMIN_PASSWORD=new-password-here/' deploy/.env

# Restart mallard to pick it up
docker compose -f deploy/docker-compose.production.yml restart mallard

Q: Can I use a wildcard certificate?

Yes. DNS-01 challenge (which this setup uses) supports wildcards. Change your domain to *.example.com in the Caddyfile and the certificate will cover all subdomains.

Q: How do I run Mallard Metrics on a private/internal network with no public IP?

Since we use the DNS-01 challenge, the server does not need to be reachable on port 80 from the internet. Any server that can make outbound HTTPS requests to Cloudflare's API can get a certificate — including servers on private VPNs, home labs, and internal networks.

Q: What happens to data if the LUKS container runs out of space?

Mallard will return errors on write (DuckDB INSERT and Parquet COPY TO will fail). Flush failures are counted in the mallard_flush_failures_total Prometheus metric. In-memory buffered events are preserved and retried. To prevent this, monitor disk usage and enable MALLARD_RETENTION_DAYS to automatically delete old partitions.

Q: Can I enable Let's Encrypt staging to test without hitting rate limits?

Yes. In deploy/Caddyfile, uncomment:

acme_ca https://acme-staging-v02.api.letsencrypt.org/directory

Your browser will show a certificate warning (staging certs aren't trusted), but you can verify the issuance flow. Remove the line and docker compose restart caddy to switch back to production.

Q: How do I integrate this with Grafana or another dashboard?

Use the Prometheus /metrics endpoint as a data source. For detailed analytics data, the JSON export endpoint (GET /api/stats/export?format=json) produces daily rollups that can be ingested into any TSDB.
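
A matching Prometheus scrape job might look like this (a sketch; the hostname and token are the placeholders used throughout this guide):

```yaml
# prometheus.yml
scrape_configs:
  - job_name: mallard
    scheme: https
    metrics_path: /metrics
    authorization:
      credentials: your-prometheus-token   # MALLARD_METRICS_TOKEN
    static_configs:
      - targets: ['analytics.example.com']
```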


Index

Term                  Section
A record              Domain and DNS
ACME                  Caddy and TLS
Admin password        FAQ — change password
Backup                Backup and restore
Caddy                 Architecture, Caddy and TLS
Certificate renewal   Certificate renewal
Cloudflare API token  Cloudflare API token
Configuration         Configuration Reference
crypttab              After reboot
DNS-01 challenge      Cloudflare DNS challenge
Docker Compose        One-Command Deployment
fail2ban              SSH hardening
Firewall              Firewall (UFW)
GeoIP                 GeoIP setup
Health check          Monitoring
HTTP/3 QUIC           Architecture
Kernel hardening      Kernel parameters
LUKS encryption       LUKS Encrypted Volume
Logging               View logs
Metrics (Prometheus)  Monitoring
Multi-site            FAQ — multiple sites
Resize volume         Resizing the volume
setup.sh              What setup.sh Does
SSH key               SSH key access
TLS                   Caddy and TLS
Tracking script       Adding the Tracking Script
UFW                   Firewall (UFW)
Updates               Update Mallard Metrics
vps-audit             vps-audit integration
Wildcard certificate  FAQ — wildcard certificate

Fly.io Deployment

Fly.io is a managed application platform that runs Docker containers in hardware-isolated micro-VMs (Firecracker) across a global network. It is not a "free tier" service — it requires a credit card. However, its Hobby plan includes enough free allowances to run Mallard Metrics at low-to-medium traffic volumes at little or no monthly cost.


Overview

Fly.io runs your Docker image as a Firecracker micro-VM. Mallard Metrics deploys well because:

  • The FROM scratch musl-static binary has no OS dependencies
  • Fly.io provides persistent volumes for DuckDB and Parquet data
  • Fly.io terminates TLS automatically — no Caddy or certbot needed
  • The Fly.io edge network handles HTTP/2 and HTTPS globally
  • Machines auto-start on traffic and can auto-stop when idle

Limitations compared to a dedicated VPS:

  • No LUKS encryption (volume encryption is managed by Fly.io's infrastructure)
  • Auto-stop means cold-start latency if traffic is infrequent
  • Volume size and I/O throughput are lower than a dedicated disk
  • Scaling beyond a single machine requires paid plan upgrades

Fly.io vs VPS: When to Choose Each

| Criterion | Fly.io | Dedicated VPS |
|---|---|---|
| Setup time | < 15 minutes | 30–60 minutes |
| Monthly cost (light traffic) | ~$0–$5 | $4–$10 |
| TLS management | Automatic | Caddy (setup.sh handles) |
| Data encryption at rest | Platform-managed | LUKS (user-managed) |
| Cold-start latency | Yes (if auto-stop) | No |
| Custom kernel tuning | No | Yes |
| Multi-region | Yes | Manual |
| Persistent storage | Volumes (3 GB included) | LUKS image (you size it) |
| SSH access | fly ssh console | Direct SSH |

Choose Fly.io if you want zero infrastructure maintenance and are comfortable with platform-managed data storage.

Choose a VPS if you need full control, LUKS encryption, or higher data volumes.


Pricing and Allowances

Fly.io's Hobby plan (requires a payment method) includes monthly allowances:

| Resource | Included free |
|---|---|
| Shared-CPU-1x 256 MB VMs | 3 VMs |
| Persistent volume storage | 3 GB |
| Outbound data transfer | 160 GB |
| TLS certificates | Unlimited |

Mallard Metrics needs:

  • 1 VM — shared-cpu-1x with 256 MB RAM is sufficient for up to ~10k daily events. Scale to 512 MB or a performance-1x CPU for higher loads.
  • 1 Volume — minimum 1 GB (DuckDB grows with data). 3 GB is comfortable for a year of moderate traffic.

At low traffic, your deployment may fit entirely within the free allowances. At higher traffic or with a large data volume, expect $1–5/month.
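As a back-of-envelope check on those sizing claims (the ~200 bytes per stored event after ZSTD compression is an illustrative assumption, not a measured figure):

```shell
# Rough volume sizing: events/day × bytes/event × 365 days, reported in GB
# BYTES_PER_EVENT=200 is an assumed average for ZSTD-compressed Parquet rows
EVENTS_PER_DAY=10000
BYTES_PER_EVENT=200
awk -v e="$EVENTS_PER_DAY" -v b="$BYTES_PER_EVENT" \
    'BEGIN { printf "%.2f GB/year\n", e * b * 365 / 1e9 }'
# → 0.73 GB/year — comfortably within the 3 GB volume allowance
```

Re-run with your own event volume and a measured bytes-per-event figure (total Parquet size divided by event count) before committing to a volume size.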


Prerequisites

  • A Fly.io account — sign up at fly.io (credit card required)
  • flyctl installed on your local machine
  • The mallardmetrics repository cloned locally
  • A domain name (optional — Fly.io provides a .fly.dev subdomain for free)

Initial Setup

Install flyctl

macOS:

brew install flyctl

Linux:

curl -L https://fly.io/install.sh | sh
# Add to PATH (add this to ~/.bashrc or ~/.zshrc)
export PATH="$HOME/.fly/bin:$PATH"

Windows:

iwr https://fly.io/install.ps1 -useb | iex

Verify:

fly version

Authenticate

fly auth login
# Opens a browser — log in to your Fly.io account

Configure the Application

fly.toml

Create fly.toml in the repository root:

# Mallard Metrics — Fly.io configuration
# Replace "mallard-metrics-YOURNAME" with a globally unique app name.

app            = "mallard-metrics-YOURNAME"
primary_region = "ord"    # Chicago. See: fly platform regions

[build]
  # Use the existing Dockerfile (FROM scratch, musl binary)
  dockerfile = "Dockerfile"

[env]
  # Non-secret configuration — secrets go in fly secrets (see below)
  # IMPORTANT: env var names must match config.rs exactly:
  #   MALLARD_RATE_LIMIT  → config.rate_limit_per_site  (not MALLARD_RATE_LIMIT_PER_SITE)
  #   MALLARD_CACHE_TTL   → config.cache_ttl_secs       (not MALLARD_CACHE_TTL_SECS)
  #   MALLARD_GEOIP_DB    → config.geoip_db_path        (not MALLARD_GEOIP_DB_PATH)
  MALLARD_DATA_DIR           = "/data"
  MALLARD_HOST               = "0.0.0.0"
  MALLARD_PORT               = "8080"
  MALLARD_LOG_FORMAT         = "json"
  MALLARD_FILTER_BOTS        = "true"
  MALLARD_SECURE_COOKIES     = "true"
  MALLARD_RETENTION_DAYS     = "365"
  MALLARD_RATE_LIMIT         = "200"
  MALLARD_CACHE_TTL          = "300"
  MALLARD_MAX_LOGIN_ATTEMPTS = "5"
  MALLARD_LOGIN_LOCKOUT      = "300"
  RUST_LOG                   = "mallard_metrics=info,tower_http=warn"

[http_service]
  internal_port       = 8080
  force_https         = true       # Fly.io handles TLS; redirect HTTP → HTTPS
  auto_stop_machines  = "stop"     # Stop idle machines to save cost
  auto_start_machines = true       # Auto-start on new traffic
  min_machines_running = 1         # Keep at least 1 machine alive (prevents cold starts)
  processes            = ["app"]

  [http_service.concurrency]
    type       = "requests"
    soft_limit = 200
    hard_limit = 250

[[vm]]
  cpu_kind = "shared"
  cpus     = 1
  memory   = "256mb"     # Increase to "512mb" for >50k daily events

[mounts]
  source      = "mallard_data"    # Volume name (created below)
  destination = "/data"
  initial_size = "3gb"

[checks]
  [checks.health]
    grace_period = "10s"
    interval     = "30s"
    method       = "GET"
    path         = "/health/ready"
    port         = 8080
    timeout      = "5s"
    type         = "http"

Choose your region (primary_region):

fly platform regions
# Pick the region closest to your users or your DNS provider
# Common choices: ord (Chicago), iad (Virginia), lax (Los Angeles),
#                 lhr (London), fra (Frankfurt), nrt (Tokyo), sin (Singapore)

Dockerfile note

The existing Dockerfile targets x86_64-unknown-linux-musl. Fly.io runs on x86-64 by default — no changes to the Dockerfile are needed.

If you want to build for Fly.io's ARM machines (--vm-cpu-kind performance), change the target to aarch64-unknown-linux-musl and update the rust-toolchain.toml accordingly.


Create a Persistent Volume

The Fly.io volume stores DuckDB and Parquet data between deployments and machine restarts.

# Create a 3 GB volume in your primary region (included in Hobby allowances)
fly volumes create mallard_data \
  --size 3 \
  --region ord \
  --app mallard-metrics-YOURNAME

# Verify
fly volumes list --app mallard-metrics-YOURNAME

Important: Volumes are single-region and single-machine by default. If you scale to multiple machines, each machine needs its own volume — but Mallard Metrics is a single-instance application (DuckDB is embedded). Do not scale to more than 1 machine without understanding the data consistency implications.


Set Secrets

Fly.io secrets are encrypted at rest and injected as environment variables at runtime. Never put secrets in fly.toml.

APP=mallard-metrics-YOURNAME

# Required secrets — generate strong values:
fly secrets set \
  MALLARD_SECRET="$(openssl rand -base64 48)" \
  MALLARD_ADMIN_PASSWORD="$(openssl rand -base64 24 | tr -d '=+/' | head -c 32)" \
  MALLARD_METRICS_TOKEN="$(openssl rand -hex 32)" \
  --app "$APP"

Save the admin password before running the above — it is not retrievable after setting:

# Generate and save before setting:
ADMIN_PASS="$(openssl rand -base64 24 | tr -d '=+/' | head -c 32)"
echo "Admin password: $ADMIN_PASS"  # Save this!
fly secrets set MALLARD_ADMIN_PASSWORD="$ADMIN_PASS" --app "$APP"

To update a secret later:

fly secrets set MALLARD_ADMIN_PASSWORD="new-password" --app "$APP"
# Fly.io triggers a rolling restart automatically

To view which secrets are set (names only — values are never shown):

fly secrets list --app "$APP"

Deploy

# From the repository root directory
fly deploy --app mallard-metrics-YOURNAME

# Or launch for the first time (creates app + prompts for config):
fly launch
# Answer the prompts; Fly.io will detect the Dockerfile and suggest settings.
# Review the generated fly.toml and adjust as described above.

Fly.io will:

  1. Build the Docker image remotely (using Fly's build infrastructure)
  2. Push it to Fly.io's container registry
  3. Create a Firecracker micro-VM from the image
  4. Mount the mallard_data volume at /data
  5. Inject secrets as environment variables
  6. Start the machine and run health checks

Deployment typically takes 2–4 minutes. Watch progress:

fly deploy --app mallard-metrics-YOURNAME 2>&1 | tee deploy.log

Configure a Custom Domain

By default your app is available at https://mallard-metrics-YOURNAME.fly.dev.

To use a custom domain:

# 1. Add the domain to your Fly.io app
fly certs add analytics.example.com --app mallard-metrics-YOURNAME

# 2. Fly.io will show you the DNS records to create:
fly certs show analytics.example.com --app mallard-metrics-YOURNAME

Create the DNS records shown (usually a CNAME to <app>.fly.dev or an A/AAAA to Fly's IPs). Fly.io obtains a Let's Encrypt certificate automatically via the HTTP-01 or DNS-01 challenge.

Update the app to know its domain:

fly secrets set \
  MALLARD_DASHBOARD_ORIGIN="https://analytics.example.com" \
  --app mallard-metrics-YOURNAME

Verify the Deployment

# Check machine status
fly status --app mallard-metrics-YOURNAME

# View machine health
fly checks list --app mallard-metrics-YOURNAME

# Quick smoke test
curl -s https://mallard-metrics-YOURNAME.fly.dev/health/ready
# Expected: ready

# View all available endpoints
curl -s https://mallard-metrics-YOURNAME.fly.dev/health/detailed | jq .

Open https://mallard-metrics-YOURNAME.fly.dev (or your custom domain) in a browser. Log in with the admin password you set.


Logs and Monitoring

# Stream live logs
fly logs --app mallard-metrics-YOURNAME

# Historical logs (last N lines)
fly logs --app mallard-metrics-YOURNAME -n 200

# Parse JSON structured logs
fly logs --app mallard-metrics-YOURNAME | jq 'select(.fields.uri != "/health/ready")'

# Machine console (SSH equivalent — note: FROM scratch has no shell)
# Use this to inspect the volume contents:
fly ssh console --app mallard-metrics-YOURNAME
# > ls /data/

Prometheus metrics:

# Secret values are not retrievable via fly secrets list (it shows names only);
# use the MALLARD_METRICS_TOKEN value you saved when you set it:
curl -H "Authorization: Bearer $MALLARD_METRICS_TOKEN" \
  https://mallard-metrics-YOURNAME.fly.dev/metrics

Fly.io built-in monitoring:

The Fly.io dashboard at fly.io/apps/YOUR-APP shows:

  • Machine CPU and memory graphs
  • HTTP request rate and latency
  • Health check pass/fail history

Scaling and Regions

Increase VM memory (if DuckDB queries are slow or OOMing):

# Edit fly.toml:
# [[vm]]
#   memory = "512mb"   # or "1gb"

fly deploy  # Apply the change

Prevent cold starts (machine auto-stops when idle):

# In fly.toml, ensure:
# [http_service]
#   min_machines_running = 1

This keeps 1 machine always running, eliminating cold-start latency at the cost of ~1 machine's worth of compute (within Hobby allowances).

Multi-region (advanced):

Fly.io supports deploying machines in multiple regions for lower global latency. However, Mallard Metrics uses an embedded single-file DuckDB database — volumes cannot be shared across regions. Multi-region deployment is not recommended without a replication strategy.


Updating Mallard Metrics

# Pull latest changes
git pull origin main

# Deploy (Fly.io builds the new image and does a rolling restart)
fly deploy --app mallard-metrics-YOURNAME

# Monitor the deploy
fly status --app mallard-metrics-YOURNAME
fly logs --app mallard-metrics-YOURNAME

Fly.io performs a blue/green-style deploy — it starts the new machine, runs health checks, and only terminates the old machine once the new one is healthy. Downtime is typically < 5 seconds.


Backup and Restore

Fly.io volumes are not automatically backed up. Back up the DuckDB file and Parquet data regularly.

Export via API (for structured backup):

# JSON export of all data
curl -H "Authorization: Bearer $API_KEY" \
  "https://mallard-metrics-YOURNAME.fly.dev/api/stats/export?site_id=example.com&format=json" \
  > backup-$(date +%Y%m%d).json

Volume snapshot (Fly.io feature):

# List volumes
fly volumes list --app mallard-metrics-YOURNAME

# Create a snapshot (may cause brief I/O pause)
fly volumes snapshots create <VOLUME_ID> --app mallard-metrics-YOURNAME

# List snapshots
fly volumes snapshots list <VOLUME_ID> --app mallard-metrics-YOURNAME

Restore from snapshot:

# Create a new volume from snapshot
fly volumes create mallard_data_restore \
  --snapshot-id <SNAPSHOT_ID> \
  --size 3 \
  --region ord \
  --app mallard-metrics-YOURNAME

Troubleshooting

Machine fails to start

fly logs --app mallard-metrics-YOURNAME | tail -50

# Common causes:
# 1. MALLARD_SECRET not set — run: fly secrets list
# 2. Volume not found — run: fly volumes list
# 3. Port mismatch — ensure MALLARD_PORT=8080 matches fly.toml internal_port=8080

Health checks failing

fly checks list --app mallard-metrics-YOURNAME

# Test the endpoint manually
fly ssh console --app mallard-metrics-YOURNAME
# Inside the console (if you have a shell):
wget -qO- http://localhost:8080/health/ready
# Note: FROM scratch has no shell — use fly proxy instead:
fly proxy 8080 --app mallard-metrics-YOURNAME
# Then in another terminal: curl http://localhost:8080/health/ready

Volume not mounted / data missing after update

# Check the mount
fly ssh console --app mallard-metrics-YOURNAME
ls /data/

# If /data is empty, the volume may have been detached
# Verify volume attachment in fly.toml [mounts] section matches the volume name
fly volumes list --app mallard-metrics-YOURNAME

Out of disk space on volume

# Extend the volume (Fly.io allows online resize)
fly volumes extend <VOLUME_ID> --size 10 --app mallard-metrics-YOURNAME

# Enable retention to prune old data
fly secrets set MALLARD_RETENTION_DAYS=180 --app mallard-metrics-YOURNAME

Machine auto-stopped unexpectedly

# Check if auto_stop_machines is enabled in fly.toml
# Ensure min_machines_running = 1 to prevent full auto-stop

# Or disable auto-stop entirely:
# [http_service]
#   auto_stop_machines = "off"

Frequently Asked Questions

Q: Does Fly.io encrypt volume data at rest?

Yes — Fly.io encrypts all volume data at rest using AES-256. You do not need to manage LUKS yourself. For compliance requirements, consult Fly.io's security documentation.

Q: Do I need a credit card?

Yes. Fly.io requires a payment method for all accounts, including those that stay within the free allowances. There is no truly card-free free tier.

Q: What is the cold-start latency?

When auto_stop_machines = "stop" and min_machines_running = 0, an idle machine is stopped after ~5 minutes. The first request after that triggers a cold start — typically 2–5 seconds for the Firecracker VM to boot. For an analytics ingestion endpoint, this means some requests may be delayed or dropped during cold start. Set min_machines_running = 1 to keep the machine always warm.

Q: Can I use Fly.io without a custom domain?

Yes. Fly.io provides a free *.fly.dev subdomain with a valid TLS certificate. Use it in your tracking script and dashboard URL.

Q: How do I SSH into the machine?

fly ssh console --app mallard-metrics-YOURNAME

Note that the Mallard container is FROM scratch and has no shell. The fly ssh console command connects to the VM's outer shell (not the container), so you can run ls / but not exec into the container.

To inspect the data volume:

fly ssh console --app mallard-metrics-YOURNAME
ls /data/         # See DuckDB and Parquet files
du -sh /data/     # Check usage

Q: Can I run Mallard Metrics alongside other services?

Fly.io apps are isolated. You can deploy other services as separate Fly apps in the same organization; they share the same billing account, and each service gets its own machine(s) and volume(s).

Q: How do I migrate from Fly.io to a VPS?

  1. Export your data via the API (/api/stats/export)
  2. Or copy the volume contents: create a volume snapshot, restore it locally
  3. Copy mallard.duckdb and the Parquet data directory to your VPS LUKS volume
  4. Follow the VPS Deployment Guide

Q: Does the behavioral extension work on Fly.io?

Yes, if the behavioral extension binary is included in the build. Check GET /health/detailed: "behavioral_extension_loaded": true confirms it loaded successfully.

Q: What happens to in-flight events if the machine is auto-stopped?

Mallard handles SIGTERM with a graceful shutdown — it flushes the in-memory event buffer to Parquet before the machine stops. As long as the shutdown completes within MALLARD_SHUTDOWN_TIMEOUT_SECS (default 30s), no events are lost. Events buffered after the flush starts may be lost. Set min_machines_running = 1 to avoid auto-stop entirely for high-reliability deployments.

Monitoring

Health Checks

Three health endpoints are available without authentication:

GET /health

Returns ok with HTTP 200 when the server is running. Use this for:

  • Load balancer health checks.
  • Container orchestrator liveness probes.
# Kubernetes liveness probe
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10

GET /health/ready

Executes a lightweight DuckDB query (SELECT 1 FROM events_all LIMIT 0) to verify the database is operational. Returns:

  • 200 OK — database is ready and accepting queries.
  • 503 Service Unavailable — database is not ready (use this as a readiness probe, not liveness).
# Kubernetes readiness probe
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3

GET /health/detailed

Returns a JSON object with component-level status. See Health & Metrics API for the full schema.


Prometheus Metrics

GET /metrics returns Prometheus text format metrics (text/plain; version=0.0.4).

If MALLARD_METRICS_TOKEN is set, this endpoint requires Authorization: Bearer <token>.

Gauges

| Metric | Type | Description |
|---|---|---|
| mallard_buffered_events | gauge | Events in memory, not yet flushed to Parquet |
| mallard_cache_entries | gauge | Cached query results in memory |
| mallard_auth_configured | gauge | 1 if admin password is set, 0 otherwise |
| mallard_geoip_loaded | gauge | 1 if GeoIP database loaded successfully |
| mallard_filter_bots | gauge | 1 if bot filtering is active |
| mallard_behavioral_extension | gauge | 1 if behavioral extension loaded, 0 otherwise |

Counters

| Metric | Type | Description |
|---|---|---|
| mallard_events_ingested_total | counter | Total events accepted through POST /api/event |
| mallard_flush_failures_total | counter | Total buffer flush failures |
| mallard_rate_limit_rejections_total | counter | Total requests rejected by the per-site rate limiter |
| mallard_login_failures_total | counter | Total failed login attempts |
| mallard_cache_hits_total | counter | Total query cache hits |
| mallard_cache_misses_total | counter | Total query cache misses |

Prometheus Scrape Configuration

scrape_configs:
  - job_name: mallard_metrics
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: /metrics
    scrape_interval: 30s
    # If MALLARD_METRICS_TOKEN is set:
    authorization:
      credentials: your-metrics-token

Example Output

# HELP mallard_buffered_events Number of events in the in-memory buffer
# TYPE mallard_buffered_events gauge
mallard_buffered_events 42

# HELP mallard_cache_entries Number of cached query results
# TYPE mallard_cache_entries gauge
mallard_cache_entries 3

# HELP mallard_behavioral_extension Whether behavioral extension is loaded
# TYPE mallard_behavioral_extension gauge
mallard_behavioral_extension 1

# HELP mallard_events_ingested_total Total events ingested
# TYPE mallard_events_ingested_total counter
mallard_events_ingested_total 158432

# HELP mallard_cache_hits_total Total query cache hits
# TYPE mallard_cache_hits_total counter
mallard_cache_hits_total 9871

# HELP mallard_cache_misses_total Total query cache misses
# TYPE mallard_cache_misses_total counter
mallard_cache_misses_total 1204
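Using the example counter values above, the cache hit rate (the same ratio the Grafana panel below computes with rate()) can be sanity-checked by hand:

```shell
# Cache hit rate = hits / (hits + misses), using the example scrape values
HITS=9871
MISSES=1204
awk -v h="$HITS" -v m="$MISSES" \
    'BEGIN { printf "hit rate: %.3f\n", h / (h + m) }'
# → hit rate: 0.891
```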

Grafana Dashboard

A minimal Grafana panel configuration for key metrics:

{
  "panels": [
    {
      "title": "Ingestion Rate",
      "targets": [{"expr": "rate(mallard_events_ingested_total[5m])"}]
    },
    {
      "title": "Buffered Events",
      "targets": [{"expr": "mallard_buffered_events"}]
    },
    {
      "title": "Cache Hit Rate",
      "targets": [{"expr": "rate(mallard_cache_hits_total[5m]) / (rate(mallard_cache_hits_total[5m]) + rate(mallard_cache_misses_total[5m]))"}]
    },
    {
      "title": "Rate Limit Rejections",
      "targets": [{"expr": "rate(mallard_rate_limit_rejections_total[5m])"}]
    }
  ]
}

Structured Logging

Mallard Metrics uses tracing for structured logging. Two formats are supported:

Text (default)

Human-readable output with timestamps, log levels, and structured fields:

2024-01-15T10:00:00.123Z  INFO mallard_metrics: Starting Mallard Metrics host="0.0.0.0" port=8000
2024-01-15T10:00:00.456Z  INFO mallard_metrics: Behavioral extension loaded
2024-01-15T10:00:00.457Z  INFO mallard_metrics: Listening addr="0.0.0.0:8000"

JSON

Set MALLARD_LOG_FORMAT=json for machine-parseable output compatible with log aggregators (Loki, Elasticsearch, Splunk):

{"timestamp":"2024-01-15T10:00:00.123Z","level":"INFO","fields":{"message":"Flushed events to Parquet","count":42},"target":"mallard_metrics::ingest::buffer","request_id":"a3f2c1d8-..."}

Every log line emitted during a request carries a request_id field matching the X-Request-ID response header, enabling end-to-end log correlation.
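A minimal sketch of that correlation, filtering a JSON log stream down to one request_id with grep (the sample lines and IDs below are made up; jq works equally well):

```shell
# Two sample JSON log lines; keep only those matching one request_id
RID="a3f2c1d8"
LOGS='{"level":"INFO","request_id":"a3f2c1d8","fields":{"message":"Flushed events to Parquet"}}
{"level":"INFO","request_id":"7b9e2f10","fields":{"message":"unrelated request"}}'
printf '%s\n' "$LOGS" | grep "\"request_id\":\"$RID\""
```

In practice the stream would come from fly logs, journalctl, or your log aggregator rather than a variable.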

Log Level Control

Use the RUST_LOG environment variable (standard tracing-subscriber env-filter syntax):

RUST_LOG=mallard_metrics=debug,tower_http=info

Default: mallard_metrics=info,tower_http=info


Alerting Recommendations

| Alert | Condition | Severity |
|---|---|---|
| Server down | up{job="mallard_metrics"} == 0 | Critical |
| Large event buffer | mallard_buffered_events > 5000 | Warning |
| High flush failures | increase(mallard_flush_failures_total[5m]) > 0 | Warning |
| Auth not configured | mallard_auth_configured == 0 | Warning |
| High rate limit rejections | rate(mallard_rate_limit_rejections_total[5m]) > 10 | Info |
| Low cache hit rate | (cache_hits / (cache_hits + cache_misses)) < 0.5 | Info |
| GeoIP not loaded | mallard_geoip_loaded == 0 | Info |
| Behavioral extension missing | mallard_behavioral_extension == 0 | Info |

Data Management

Storage Layout

Events are stored as date-partitioned, ZSTD-compressed Parquet files under data_dir/events/:

data/events/
├── site_id=example.com/
│   ├── date=2024-01-15/
│   │   ├── 0001.parquet   ← first flush for this day
│   │   └── 0002.parquet   ← second flush for this day
│   └── date=2024-01-16/
│       └── 0001.parquet
└── site_id=other.org/
    └── date=2024-01-15/
        └── 0001.parquet

Each Parquet file contains one batch of flushed events for a specific site and date. Files are numbered sequentially within each partition. Parquet files are self-describing and can be read by any Parquet-compatible tool.


Buffer and Flush Lifecycle

flowchart TD
    EVENT["Incoming Event\nPOST /api/event"]
    EVENT --> BUF["In-Memory Buffer\nVec&lt;Event&gt;"]

    BUF --> T1{"Count reached\nflush_event_count?"}
    BUF --> T2{"Periodic timer\nevery flush_interval_secs?"}
    BUF --> T3{"SIGINT or SIGTERM\ngraceful shutdown?"}

    T1 -->|"Yes"| FLUSH
    T2 -->|"Yes"| FLUSH
    T3 -->|"Yes"| FLUSH

    FLUSH["Flush — spawn_blocking\nDuckDB Appender API\nbatch column insert"]
    FLUSH --> PARQUET["COPY TO Parquet\nZSTD compression\ndate-partitioned file"]
    PARQUET --> DELETE["DELETE FROM events\nhot table cleared"]
    DELETE --> VIEW["Refresh events_all VIEW\nglobbed over new Parquet files"]
    VIEW --> READY(["All data queryable\nevents_all VIEW\nhot union cold"])

Failure safety: If the Appender insertion fails, drained events are restored to the front of the buffer and the flush returns an error. No events are lost due to a failed flush attempt.

Flush triggers:

  1. Event count reaches flush_event_count (default 1000).
  2. Periodic timer fires every flush_interval_secs (default 60 seconds). Runs in spawn_blocking to avoid blocking the async runtime.
  3. Graceful shutdown — bounded by shutdown_timeout_secs (default 30 seconds).
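The three triggers map to three config keys; a sketch of the corresponding config-file fragment, using the key names and defaults listed above (the exact file layout is an assumption):

```toml
# Flush tuning — defaults shown; adjust per workload
flush_event_count     = 1000   # flush when the buffer reaches this many events
flush_interval_secs   = 60     # periodic flush timer
shutdown_timeout_secs = 30     # graceful-shutdown flush deadline
```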

Data Retention

When retention_days is set to a non-zero value, a background task runs daily and removes Parquet partition directories older than the configured threshold.

# Delete partitions older than 90 days
retention_days = 90

What is deleted: the entire date=YYYY-MM-DD/ directory and all Parquet files within it.

What is not deleted: the site_id=*/ parent directory (it remains even if all date partitions have been removed).

To keep data indefinitely, set retention_days = 0 (the default).
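The pruning rule can be sketched on a throwaway partition tree (directory names follow the storage layout above; this illustrates the date comparison only, not the actual implementation of the background task):

```shell
# Create two date partitions, then prune those lexicographically older than CUTOFF
# (ISO dates compare correctly as plain strings)
ROOT=$(mktemp -d)
mkdir -p "$ROOT/site_id=example.com/date=2023-01-01" \
         "$ROOT/site_id=example.com/date=2099-01-01"
CUTOFF="2024-01-01"            # the real task derives this from retention_days
for dir in "$ROOT"/site_id=*/date=*; do
  d=${dir##*date=}             # partition date, e.g. 2023-01-01
  if expr "$d" \< "$CUTOFF" >/dev/null; then
    echo "prune: $dir"
    rm -rf "$dir"
  fi
done
```

Note that the site_id=example.com/ parent directory survives, matching the behavior described above.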

GDPR Right to Erasure

Mallard Metrics provides an admin-authenticated DELETE /api/gdpr/erase endpoint to permanently delete analytics data for a given site_id within a date range. Because visitor IDs are pseudonymous daily-rotating HMAC hashes that cannot be reverse-mapped to individuals, erasure operates at site + date-range granularity, the finest available without the original IP address and User-Agent. See PRIVACY.md for the full analysis and operator obligations.


Backup and Restore

Parquet files are self-describing and portable. To back up:

# Sync data directory to a backup location
rsync -a --checksum /data/events/ /backup/mallard-events/

# Or with rclone to S3
rclone sync /data/events s3:my-bucket/mallard-events

To restore:

rsync -a /backup/mallard-events/ /data/events/

After restore, restart Mallard Metrics. The events_all VIEW automatically picks up all Parquet files on startup.

Tip: Include data/mallard.duckdb and data/mallard.duckdb.wal in your backups to preserve any hot (not yet flushed) events.


Inspecting Data with DuckDB CLI

You can query Parquet files directly with the DuckDB CLI, independent of the Mallard Metrics server:

duckdb

-- Daily visitor and pageview counts for a site
SELECT
    CAST(timestamp AS DATE) AS date,
    COUNT(DISTINCT visitor_id)                            AS visitors,
    COUNT(*) FILTER (WHERE event_name = 'pageview')       AS pageviews
FROM read_parquet('data/events/site_id=example.com/**/*.parquet')
GROUP BY date
ORDER BY date DESC;

-- Top pages last 30 days
SELECT pathname, COUNT(*) AS views
FROM read_parquet('data/events/site_id=example.com/**/*.parquet')
WHERE event_name = 'pageview'
  AND CAST(timestamp AS DATE) >= CURRENT_DATE - INTERVAL 30 DAYS
GROUP BY pathname
ORDER BY views DESC
LIMIT 20;

-- Revenue by product
SELECT
    json_extract_string(props, '$.product') AS product,
    SUM(revenue_amount)                      AS total_revenue,
    COUNT(*)                                 AS transactions
FROM read_parquet('data/events/site_id=example.com/**/*.parquet')
WHERE event_name = 'purchase'
GROUP BY product
ORDER BY total_revenue DESC;

Schema

The events table schema (also the Parquet file schema):

| Column | Type | Nullable | Description |
|---|---|---|---|
| site_id | VARCHAR | No | Site identifier |
| visitor_id | VARCHAR | No | HMAC-SHA256 privacy-safe visitor ID |
| timestamp | TIMESTAMP | No | UTC event timestamp |
| event_name | VARCHAR | No | Event type (e.g. pageview, signup) |
| pathname | VARCHAR | No | URL path |
| hostname | VARCHAR | Yes | URL hostname |
| referrer | VARCHAR | Yes | Referrer URL |
| referrer_source | VARCHAR | Yes | Parsed referrer source name |
| utm_source | VARCHAR | Yes | UTM source parameter |
| utm_medium | VARCHAR | Yes | UTM medium parameter |
| utm_campaign | VARCHAR | Yes | UTM campaign parameter |
| utm_content | VARCHAR | Yes | UTM content parameter |
| utm_term | VARCHAR | Yes | UTM term parameter |
| browser | VARCHAR | Yes | Browser name |
| browser_version | VARCHAR | Yes | Browser version string |
| os | VARCHAR | Yes | Operating system name |
| os_version | VARCHAR | Yes | OS version string |
| device_type | VARCHAR | Yes | desktop, mobile, or tablet |
| screen_size | VARCHAR | Yes | Screen viewport width in pixels (e.g. 1920) |
| country_code | VARCHAR(2) | Yes | ISO 3166-1 alpha-2 country code |
| region | VARCHAR | Yes | Region/state name |
| city | VARCHAR | Yes | City name |
| props | VARCHAR | Yes | Custom properties (JSON string, queryable via json_extract) |
| revenue_amount | DECIMAL(12,2) | Yes | Revenue amount |
| revenue_currency | VARCHAR(3) | Yes | ISO 4217 currency code |