Mallard Metrics

Self-hosted, privacy-focused web analytics powered by DuckDB and the behavioral extension.

Single binary. Single process. Zero external dependencies.


What is Mallard Metrics?

Mallard Metrics is a lightweight, GDPR/CCPA-compliant alternative to cloud analytics platforms. It runs entirely on your infrastructure, stores no personally identifiable information, and requires no cookies or consent banners.

Built in Rust for predictable, low resource usage. The embedded DuckDB database — combined with the behavioral extension — provides SQL-native behavioral analytics: funnels, retention cohorts, session analysis, sequence matching, and flow analysis. No third-party services involved.

[Screenshot: Mallard Metrics dashboard]

Core Properties

| Property | Value |
| --- | --- |
| Language | Rust (MSRV 1.94.0) |
| Web framework | Axum 0.8.x |
| Database | DuckDB (disk-based, embedded, in-process) |
| Analytics | behavioral extension (loaded at runtime) |
| Storage | Date-partitioned Parquet files (ZSTD-compressed) |
| Frontend | Preact + HTM (no build step, embedded in binary) |
| Deployment | Static musl binary, FROM scratch Docker image |
| Tests | 333 passing (262 unit + 71 integration) |

Key Features

Privacy by Design

  • No cookies — Visitor identification uses a daily-rotating HMAC-SHA256 hash of IP + User-Agent + daily salt.
  • No PII storage — IP addresses are hashed and discarded; they are never written to disk.
  • Daily salt rotation — Visitor IDs change every 24 hours, preventing long-term tracking.
  • Privacy-preserving — pseudonymous visitor IDs that cannot be linked across days. See Security & Privacy for details.

Single Binary Deployment

  • One process handles ingestion, storage, querying, authentication, and the dashboard.
  • DuckDB is embedded — no separate database to install or operate.
  • FROM scratch Docker image: the binary is the only file in the container.
  • WAL-based durability: disk-backed DuckDB preserves hot events through crashes.

Analytical Power

| Category | Capabilities |
| --- | --- |
| Core metrics | Unique visitors, pageviews, bounce rate, pages/session |
| Breakdowns | Pages, referrers, browsers, OS, devices, countries |
| Time-series | Hourly and daily aggregations |
| Funnel analysis | Multi-step conversion funnels via window_funnel() |
| Retention cohorts | Weekly retention grids via retention() |
| Session analytics | Duration, depth via sessionize() |
| Sequence matching | Behavioral patterns via sequence_match() |
| Flow analysis | Next-page navigation via sequence_next_node() |

Production Ready

  • Argon2id authentication — Password-protected dashboard with cryptographic session tokens.
  • API key management — Programmatic access with SHA-256 hashed keys (mm_ prefix), disk-persisted.
  • Rate limiting — Per-site token-bucket rate limiter on the ingestion endpoint.
  • Query caching — TTL-based in-memory cache for analytics queries.
  • Bot filtering — Automatic filtering of known bot User-Agents.
  • GeoIP — MaxMind GeoLite2 integration with graceful fallback.
  • Data retention — Configurable automatic cleanup of old Parquet partitions.
  • Graceful shutdown — Buffered events are flushed before process exit.
  • Prometheus metrics — GET /metrics for scraping, with counters for ingestion, auth, cache, and rate limiting.
  • OWASP security headers — Including HSTS, CSP, Permissions-Policy, and X-Request-ID.
  • CSRF protection — Origin/Referer validation on all state-mutating endpoints.
  • Brute-force protection — Per-IP login lockout with configurable threshold and lockout duration.
  • GDPR-friendly mode — Single MALLARD_GDPR_MODE=true toggle strips referrers, rounds timestamps, reduces GeoIP precision, and enables the Art. 17 data-erasure API.

When Should You Use Mallard Metrics?

Mallard Metrics is a good fit when you:

  • Want full control over your analytics data on your own server.
  • Need GDPR/CCPA compliance without third-party data processors.
  • Are running a small-to-medium website and want low operational overhead.
  • Need advanced behavioral analytics (funnels, retention, sequences) without a SaaS subscription.
  • Want to demonstrate the power of DuckDB's behavioral extension in a real-world production context.

It is not designed for:

  • Multi-region distributed analytics at very high volume (millions of events/minute).
  • Real-time dashboards with sub-second latency requirements.
  • Replacing a full data warehouse.

Project Status

Mallard Metrics is actively developed and production-ready. See GitHub for the latest releases and issue tracker.

The behavioral extension powering advanced analytics is developed at github.com/tomtom215/duckdb-behavioral.

Quick Start

This guide gets Mallard Metrics running and collecting events in a few minutes.

Prerequisites

  • Docker (recommended), or a Linux/macOS host with Rust 1.94+ for building from source.
  • A web property you want to track.

Step 1: Run the Server

Option 1: Docker

docker run -p 8000:8000 \
  -v mallard-data:/data \
  -e MALLARD_SECRET=your-random-32-char-secret \
  -e MALLARD_ADMIN_PASSWORD=your-dashboard-password \
  ghcr.io/tomtom215/mallard-metrics

Open http://localhost:8000 to access the dashboard.

Option 2: Docker Compose

Download docker-compose.yml from the repository root and run:

docker compose up -d

The compose file includes persistent storage, restart policy, and environment variable configuration. Set MALLARD_SECRET and MALLARD_ADMIN_PASSWORD in your shell or a .env file before running.

Option 3: Build from Source

git clone https://github.com/tomtom215/mallardmetrics
cd mallardmetrics
cargo build --release
./target/release/mallard-metrics mallard-metrics.toml.example

Note: DuckDB is built with the bundled crate feature, so no external libduckdb is required. The first build takes a few minutes while DuckDB is compiled from source.


Step 2: Embed the Tracking Script

Add the tracking script to every page you want to track. Place it in the <head> or at the end of <body>:

<script
  async
  defer
  src="https://your-mallard-instance.com/mallard.js"
  data-domain="your-site.com">
</script>

Replace:

  • https://your-mallard-instance.com with the URL of your Mallard Metrics instance.
  • your-site.com with the domain you configured in site_ids (or any domain if site_ids is empty).

The script is under 1 KB, loads asynchronously, sets no cookies, and automatically tracks pageview events including URL, referrer, UTM parameters, screen size, and User-Agent.

See Tracking Script for the full API including custom events and revenue tracking.


Step 3: Verify Events Are Arriving

Check the health endpoint:

curl http://localhost:8000/health
# ok

curl http://localhost:8000/health/detailed
# {"status":"ok","version":"0.1.0","buffered_events":3,...}

Events are held in a memory buffer before being flushed to disk. You can query the dashboard immediately — the events_all view unions the hot buffer and all persisted Parquet data automatically.


Step 4: Dashboard

Navigate to http://localhost:8000 in your browser.

If you set MALLARD_ADMIN_PASSWORD, you will be prompted to log in. The dashboard shows:

  • Overview — Unique visitors, pageviews, bounce rate, session metrics.
  • Timeseries — Visitors and pageviews charted over your selected period.
  • Breakdowns — Top pages, referrer sources, browsers, OS, devices, countries.
  • Funnel — Define a conversion funnel with up to N steps.
  • Retention — Weekly cohort retention grid.
  • Sequences — Behavioral pattern matching and conversion rates.
  • Flow — Next-page navigation from any starting page.

What's Next?

Tracking Script

The Mallard Metrics tracking script (mallard.js) is served by the server at GET /mallard.js. It is under 1 KB, sets no cookies, and loads asynchronously.

Basic Embed

<script
  async
  defer
  src="https://your-instance.com/mallard.js"
  data-domain="your-site.com">
</script>

Attributes:

| Attribute | Required | Description |
| --- | --- | --- |
| data-domain | Yes | The site ID to record events under. Must match an entry in site_ids if that config option is set. |

Automatic Tracking

Once embedded, the script automatically fires a pageview event on every page load with the following data:

| Field | Source |
| --- | --- |
| pathname | window.location.pathname + search + hash |
| referrer | document.referrer |
| screen_width | window.innerWidth (viewport width in pixels) |
| User-Agent | Sent in request header, parsed server-side |
| UTM parameters | Extracted from URL query string |

Custom Events

Use window.mallard(eventName, options) to track custom actions:

// Simple event
window.mallard('signup');

// Event with custom properties
window.mallard('purchase', {
  props: { plan: 'pro', coupon: 'SAVE20' }
});

// Revenue event
window.mallard('checkout', {
  revenue: 99.00,
  currency: 'USD'
});

// Event with callback
window.mallard('form_submit', {
  props: { form: 'contact' },
  callback: function() {
    console.log('Event recorded');
  }
});

Options

| Option | Type | Description |
| --- | --- | --- |
| props | object | Custom properties stored as JSON in the props column. Queryable via json_extract. |
| revenue | number | Revenue amount (stored as DECIMAL(12,2)). |
| currency | string | ISO 4217 currency code (3 characters, e.g. "USD"). |
| callback | function | Called after the event is successfully recorded. |

To track outbound link clicks, call window.mallard before navigating:

document.querySelectorAll('a[href^="http"]').forEach(function(link) {
  link.addEventListener('click', function(e) {
    window.mallard('outbound_link', {
      props: { url: link.href },
      callback: function() { window.location = link.href; }
    });
    e.preventDefault();
  });
});

Single-Page App Support

For SPAs, call window.mallard('pageview') manually after each route change:

// Example with a router
router.afterEach(function(to) {
  window.mallard('pageview');
});

Server-Side Events (No Script)

You can also send events directly to the API without the browser script. This is useful for server-rendered pages or background jobs:

curl -X POST https://your-instance.com/api/event \
  -H 'Content-Type: application/json' \
  -d '{
    "d": "your-site.com",
    "n": "signup",
    "u": "https://your-site.com/signup"
  }'

See Event Ingestion API for the full request schema.

Configuration

Mallard Metrics is configured through a TOML file and environment variables. All settings have sensible defaults; you can start without any configuration file.

Loading Configuration

Pass the path to a TOML file as the first command-line argument:

mallard-metrics /etc/mallard-metrics/config.toml

If no argument is provided, defaults are used.

Environment Variables

The first two values (MALLARD_SECRET and MALLARD_ADMIN_PASSWORD) are secrets and must not be stored in files committed to source control. Set them in your shell or a .env file:

| Variable | Required | Description |
| --- | --- | --- |
| MALLARD_SECRET | Recommended | HMAC key for visitor ID hashing. If unset, a UUID is auto-generated on first start and persisted to data_dir/.secret (survives restarts). Set explicitly in production for portability across hosts. |
| MALLARD_ADMIN_PASSWORD | Recommended | Dashboard password. If unset, the dashboard is unauthenticated. |
| MALLARD_MAX_LOGIN_ATTEMPTS | Optional | Override max_login_attempts at runtime. |
| MALLARD_LOGIN_LOCKOUT | Optional | Override login_lockout_secs at runtime. |
| MALLARD_LOG_FORMAT | Optional | Set to json for structured JSON log output. Omit or set to any other value for human-readable text logs. |
| MALLARD_SECURE_COOKIES | Optional | Set to true to add the Secure flag to session cookies (required behind TLS). |
| MALLARD_METRICS_TOKEN | Optional | Bearer token protecting the /metrics endpoint. |
| MALLARD_GEOIP_DB | Optional | Path to MaxMind GeoLite2-City .mmdb file. |
| MALLARD_DASHBOARD_ORIGIN | Optional | Restrict dashboard CORS and enable CSRF protection. |
| MALLARD_MAX_CONCURRENT_QUERIES | Optional | Max concurrent analytical queries (default 10). Returns 429 when exhausted. |
| MALLARD_CACHE_MAX_ENTRIES | Optional | Max query cache entries (default 10000). |
| MALLARD_GDPR_MODE | Optional | Enable GDPR-friendly preset (see PRIVACY.md). |
| MALLARD_GEOIP_PRECISION | Optional | GeoIP precision: city, region, country, or none. |
| MALLARD_HOST | Optional | Server bind address (default 0.0.0.0). |
| MALLARD_PORT | Optional | Server listen port (default 8000). |
| MALLARD_DATA_DIR | Optional | Data directory for Parquet files and DuckDB (default data). |
| MALLARD_FLUSH_COUNT | Optional | Events buffered before flushing to disk (default 1000). |
| MALLARD_FLUSH_INTERVAL | Optional | Seconds between periodic buffer flushes (default 60). |
| MALLARD_FILTER_BOTS | Optional | Filter known bot User-Agents (default true). |
| MALLARD_RETENTION_DAYS | Optional | Auto-delete data older than N days; 0 = unlimited (default 0). |
| MALLARD_SESSION_TTL | Optional | Dashboard session TTL in seconds (default 86400). |
| MALLARD_SHUTDOWN_TIMEOUT | Optional | Graceful shutdown timeout in seconds (default 30). |
| MALLARD_RATE_LIMIT | Optional | Max events/sec per site; 0 = unlimited (default 0). |
| MALLARD_CACHE_TTL | Optional | Query cache TTL in seconds (default 60). |
| MALLARD_STRIP_REFERRER_QUERY | Optional | Strip query/fragment from stored referrers (default false). |
| MALLARD_ROUND_TIMESTAMPS | Optional | Round timestamps to the nearest hour (default false). |
| MALLARD_SUPPRESS_VISITOR_ID | Optional | Replace HMAC hash with per-request UUID (default false). |
| MALLARD_SUPPRESS_BROWSER_VERSION | Optional | Store browser name only (default false). |
| MALLARD_SUPPRESS_OS_VERSION | Optional | Store OS name only (default false). |
| MALLARD_SUPPRESS_SCREEN_SIZE | Optional | Omit screen width and device type (default false). |

TOML Configuration Reference

A complete example is shipped as mallard-metrics.toml.example. Every field has a default and is optional.

# Network binding
host = "0.0.0.0"   # default
port = 8000         # default

# Storage
data_dir = "data"   # relative or absolute path; events and Parquet files are stored here

# Event buffer
flush_event_count = 1000   # flush buffer to Parquet when this many events accumulate
flush_interval_secs = 60   # also flush on this interval (seconds)

# Site allowlist — leave empty to accept events from any origin
# site_ids = ["example.com", "other-site.org"]
site_ids = []

# GeoIP database (optional — gracefully skipped if missing)
# geoip_db_path = "/path/to/GeoLite2-City.mmdb"

# Dashboard CORS origin (optional — set when dashboard is on a different origin)
# dashboard_origin = "https://analytics.example.com"

# Bot filtering (default: true — filters known bot User-Agents from event ingestion)
filter_bots = true

# Data retention: delete Parquet partitions older than this many days
# Set to 0 for unlimited retention (default)
retention_days = 0

# Session authentication TTL in seconds (default: 86400 = 24 hours)
session_ttl_secs = 86400

# Brute-force protection: lock out an IP after this many failed login attempts (0 = disabled)
max_login_attempts = 5

# Duration in seconds to lock out an IP after exceeding max_login_attempts
login_lockout_secs = 300

# Graceful shutdown timeout in seconds (default: 30)
shutdown_timeout_secs = 30

# Ingestion rate limit per site_id (events/second, 0 = unlimited)
rate_limit_per_site = 0

# Query cache TTL in seconds (0 = no caching, default: 60)
cache_ttl_secs = 60

# Log format: "text" (default) or "json"
log_format = "text"

# Query cache max entries (0 = unlimited, default: 10000)
cache_max_entries = 10000

# Max concurrent analytics queries (0 = unlimited, default: 10)
# Excess requests receive HTTP 429
max_concurrent_queries = 10

# Cookie Secure flag (set to true when behind TLS)
secure_cookies = false

# ── GDPR / Privacy Flags ──────────────────────────────────────────────
# gdpr_mode = false            # convenience preset — enables all flags below
# strip_referrer_query = false  # strip ?query and #fragment from referrers
# round_timestamps = false      # round timestamps to the nearest hour
# suppress_visitor_id = false   # replace HMAC hash with per-request UUID
# suppress_browser_version = false
# suppress_os_version = false
# suppress_screen_size = false
# geoip_precision = "city"      # city | region | country | none

Configuration Field Details

host / port

The address and port the HTTP server listens on.

  • Default: 0.0.0.0:8000
  • To restrict to localhost: host = "127.0.0.1"

data_dir

Root directory for all persistent data. Mallard Metrics creates subdirectories:

data/
└── events/
    └── site_id=example.com/
        └── date=2024-01-15/
            ├── 0001.parquet
            └── 0002.parquet

Parquet files are ZSTD-compressed. The directory is created automatically.

flush_event_count / flush_interval_secs

Events arrive into a memory buffer before being flushed to Parquet. Flushing happens when either threshold is reached. The buffer is also flushed on graceful shutdown.

  • Lower values reduce data loss on crash; higher values reduce I/O.
  • Queries always see both buffered (hot) and persisted (cold) data via the events_all view.
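The dual-trigger flush policy can be sketched in a few lines. This is a simplified Python model of the behavior described above (the server itself is Rust, and EventBuffer is a hypothetical name for illustration):

```python
import time

class EventBuffer:
    """Simplified model of the hot-event buffer: flush on count OR interval."""

    def __init__(self, flush_event_count=1000, flush_interval_secs=60, now=time.monotonic):
        self.flush_event_count = flush_event_count
        self.flush_interval_secs = flush_interval_secs
        self.now = now
        self.events = []
        self.last_flush = now()

    def push(self, event):
        """Accept an event; flush immediately if the count threshold is reached."""
        self.events.append(event)
        if len(self.events) >= self.flush_event_count:
            return self.flush()
        return None

    def tick(self):
        """Called periodically; flushes if the interval has elapsed."""
        if self.events and self.now() - self.last_flush >= self.flush_interval_secs:
            return self.flush()
        return None

    def flush(self):
        batch, self.events = self.events, []
        self.last_flush = self.now()
        return batch  # in the real server this batch is written to a Parquet file
```

Either trigger returns the drained batch; in the real server that batch becomes a new Parquet file under the date partition.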

site_ids

An allowlist of site identifiers. If non-empty, the Origin header of each ingestion request must exactly match one of the listed values. Requests from unlisted origins receive a 403 Forbidden response.

The comparison is against the Origin header's host: example.com matches https://example.com and http://example.com:8080 (scheme and port are ignored), but not example.com.other.io.
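The matching rule can be sketched as follows (an illustrative Python model of the behavior described above, not the server's code):

```python
from urllib.parse import urlparse

def origin_allowed(origin_header: str, site_ids: list) -> bool:
    """Allow a request when site_ids is empty or the Origin's host is listed."""
    if not site_ids:
        return True  # empty allowlist accepts events from any origin
    host = urlparse(origin_header).hostname or ""
    return host in site_ids  # exact host match; subdomain lookalikes fail
```

Requests failing this check receive 403 Forbidden.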

geoip_db_path

Path to a MaxMind GeoLite2-City .mmdb file. GeoLite2 databases are free for non-commercial use and available at maxmind.com.

If the file is not specified or does not exist, country/region/city fields are stored as NULL. This is the default behavior and does not cause any errors.

rate_limit_per_site

Maximum events per second accepted per site_id. Uses a token-bucket algorithm. Set to 0 (default) for no limit.
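A token bucket of this kind works roughly as follows (an illustrative sketch; the server's Rust implementation may differ in detail):

```python
import time

class TokenBucket:
    """Per-site token bucket: burst up to capacity, refilled at rate tokens/sec."""

    def __init__(self, rate, capacity=None, now=time.monotonic):
        self.rate = rate                   # tokens added per second
        self.capacity = capacity or rate   # burst size defaults to one second of traffic
        self.tokens = float(self.capacity)
        self.now = now
        self.last = now()

    def allow(self):
        t = self.now()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller responds 429 Too Many Requests
```

The server keeps one bucket per site_id, so a noisy site cannot starve the others.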

cache_ttl_secs

Query results for /api/stats/main and /api/stats/timeseries are cached in memory for this duration. Setting to 0 disables caching (useful for development). Default is 60 seconds.

retention_days

Parquet partition directories older than retention_days days are deleted automatically by a background task that runs daily. Set to 0 (default) for unlimited retention.

max_login_attempts / login_lockout_secs

Brute-force protection for the dashboard login endpoint. After max_login_attempts consecutive failures from the same IP, that IP is blocked for login_lockout_secs seconds. The server responds with 429 Too Many Requests and a Retry-After header during the lockout period.

  • max_login_attempts: Default 5. Set to 0 to disable brute-force protection entirely.
  • login_lockout_secs: Default 300 (5 minutes).

These can also be set via MALLARD_MAX_LOGIN_ATTEMPTS and MALLARD_LOGIN_LOCKOUT environment variables.

API Reference

Mallard Metrics exposes a JSON HTTP API. All endpoints are served by the same process as the dashboard.

Base URL

http://your-instance.com

Authentication

Most /api/stats/* and /api/keys/* endpoints require authentication. Provide one of:

  1. Session cookie — Set after POST /api/auth/login. Sent automatically by browsers.
  2. Bearer token — An API key in the Authorization: Bearer mm_... header.
  3. X-API-Key header — An API key in the X-API-Key: mm_... header.

Endpoints that do not require authentication:

  • POST /api/event — Event ingestion (uses Origin allowlist instead).
  • GET /api/event — Pixel tracking (same parameters as POST via query string; returns 1×1 GIF).
  • POST /api/auth/login, POST /api/auth/setup, GET /api/auth/status, POST /api/auth/logout
  • GET /health, GET /health/ready, GET /health/detailed
  • GET /metrics — optionally protected by MALLARD_METRICS_TOKEN bearer token.
  • GET /robots.txt, GET /.well-known/security.txt
  • GET / (dashboard)

Content Type

All request bodies are application/json. All responses are application/json unless otherwise noted.

Error Responses

Errors are returned as JSON objects:

{
  "error": "human-readable description"
}

HTTP Status Codes

| Code | Meaning |
| --- | --- |
| 200 | Success |
| 202 | Event accepted (ingestion only) |
| 400 | Bad request — missing or invalid parameters |
| 401 | Unauthenticated — no valid session or API key |
| 403 | Forbidden — origin not in allowlist, or CSRF check failed |
| 404 | Not found |
| 408 | Request timeout (30-second server-side limit) |
| 409 | Conflict — resource already exists (e.g. password already set) |
| 413 | Request body too large (limit: 64 KB on ingestion routes) |
| 422 | Unprocessable — JSON validation failed |
| 429 | Rate limited or concurrent query limit — includes Retry-After header |
| 500 | Internal server error |
| 503 | Service unavailable — database not ready |

Sections

Event Ingestion

POST /api/event

Records a single analytics event. This endpoint is called by the tracking script automatically and can also be called directly for server-side event recording.

Authentication: None required. The Origin header is validated against site_ids if that config option is set.

CORS: Fully permissive (Access-Control-Allow-Origin: *) to allow cross-origin calls from the tracking script.

Request Body

{
  "d": "example.com",
  "n": "pageview",
  "u": "https://example.com/pricing",
  "r": "https://google.com/",
  "w": 1920,
  "p": "{\"plan\": \"pro\"}",
  "ra": 99.00,
  "rc": "USD"
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| d | string | Yes | Domain / site identifier. Max 256 chars; alphanumeric plus ., -, _, : only. |
| n | string | Yes | Event name (e.g. "pageview", "signup", "purchase"). |
| u | string | Yes | Full URL of the page where the event occurred. |
| r | string | No | Referrer URL. |
| w | number | No | Screen width in pixels (for device-type detection). |
| p | string | No | Custom properties as a JSON-encoded string. Stored in the props column and queryable via json_extract. |
| ra | number | No | Revenue amount (stored as DECIMAL(12,2)). |
| rc | string | No | ISO 4217 currency code (e.g. "USD", "EUR"). Maximum 3 characters. |

Response

HTTP/1.1 202 Accepted

The response body is empty. 202 means the event was accepted into the buffer. It will be flushed to Parquet on the next flush cycle or when the buffer threshold is reached.

Validation Errors

| Condition | Status |
| --- | --- |
| Missing required field (d, n, or u) | 422 Unprocessable Entity |
| Empty d, n, or u | 400 Bad Request |
| d contains invalid characters or exceeds 256 chars | 400 Bad Request |
| Field exceeds length limit (n > 256, u > 2048, r > 2048, p > 4096) | 400 Bad Request |
| Request body exceeds 64 KB | 413 Payload Too Large |
| Origin header does not match site_ids | 403 Forbidden |
| Rate limit exceeded for this site_id | 429 Too Many Requests |
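The validation rules above can be modeled as follows. This is a sketch returning the status codes from the table; validate_event is a hypothetical helper, not the server's code:

```python
import re

SITE_ID_RE = re.compile(r'^[A-Za-z0-9._:-]+$')
LENGTH_LIMITS = {"n": 256, "u": 2048, "r": 2048, "p": 4096}

def validate_event(body: dict) -> int:
    """Return the HTTP status code for an ingestion payload per the table above."""
    for field in ("d", "n", "u"):
        if field not in body:
            return 422  # missing required field
        if body[field] == "":
            return 400  # required field present but empty
    d = body["d"]
    if len(d) > 256 or not SITE_ID_RE.match(d):
        return 400      # invalid site identifier
    for field, limit in LENGTH_LIMITS.items():
        if field in body and len(str(body[field])) > limit:
            return 400  # field exceeds its length limit
    return 202          # accepted into the buffer
```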

GET /api/event

Pixel-tracking endpoint for environments where JavaScript is unavailable (email, AMP pages, RSS readers). Returns a 1×1 transparent GIF (43 bytes, Content-Type: image/gif).

Authentication: None required.

Query Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| d | string | Yes | Domain / site identifier. |
| n | string | No | Event name (defaults to "pageview"). |
| u | string | Yes | Full URL of the page. |
| r | string | No | Referrer URL. |
| w | number | No | Screen width in pixels. |

Revenue (ra, rc) and custom properties (p) are not supported on the GET endpoint.

Response

HTTP/1.1 200 OK
Content-Type: image/gif
Content-Length: 43

Usage

<img src="https://analytics.example.com/api/event?d=example.com&u=https://example.com/page" width="1" height="1" alt="">

Bot Filtering

When filter_bots = true (default), the server inspects the User-Agent header and discards the event if it matches known bot patterns. A 202 is still returned — the event is silently dropped rather than returning an error.
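A minimal model of this check looks like the following (the pattern list is an illustrative subset, not the server's actual list):

```python
# Illustrative subset of bot markers; the real list is larger.
BOT_PATTERNS = ("bot", "crawler", "spider", "curl", "headless")

def is_bot(user_agent: str) -> bool:
    """Case-insensitive substring match against known bot markers."""
    ua = user_agent.lower()
    return any(pattern in ua for pattern in BOT_PATTERNS)
```

Matched events are dropped silently; the client still receives 202 so crawlers cannot detect the filtering.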

Privacy Processing

Before the event is stored:

  1. The client IP address is extracted from the request.
  2. A daily-rotating HMAC-SHA256 visitor_id is computed from IP + User-Agent + today's UTC date + MALLARD_SECRET.
  3. The IP address is discarded. It is never written to disk or the database.
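The steps above can be sketched as follows (the exact message layout and secret handling are assumptions for illustration):

```python
import hmac
import hashlib
from datetime import datetime, timezone

def visitor_id(ip: str, user_agent: str, secret: bytes) -> str:
    """Daily-rotating pseudonymous visitor ID; the raw IP never leaves this function."""
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")  # daily salt component
    message = f"{ip}|{user_agent}|{day}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()
```

Because the UTC date is part of the keyed message, the same visitor produces a different ID every day, and without MALLARD_SECRET the hash cannot be linked back to an IP.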

Server-Side Example

curl -X POST https://your-instance.com/api/event \
  -H 'Content-Type: application/json' \
  -d '{
    "d": "example.com",
    "n": "server_signup",
    "u": "https://example.com/signup"
  }'

Analytics Stats API

All stats endpoints require authentication (session cookie, Authorization: Bearer API key, or X-API-Key header).

Query results for /api/stats/main and /api/stats/timeseries are cached per (site_id, period) for cache_ttl_secs seconds (default 60).
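The cache behavior can be modeled as follows (a simplified sketch; QueryCache is a hypothetical name):

```python
import time

class QueryCache:
    """In-memory TTL cache keyed by (site_id, period)."""

    def __init__(self, ttl_secs=60, now=time.monotonic):
        self.ttl = ttl_secs
        self.now = now
        self.entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, site_id, period, compute):
        key = (site_id, period)
        hit = self.entries.get(key)
        if hit and hit[0] > self.now():
            return hit[1]     # fresh cache hit
        value = compute()     # run the analytical query
        if self.ttl > 0:      # ttl 0 disables caching entirely
            self.entries[key] = (self.now() + self.ttl, value)
        return value
```

Expired entries are simply recomputed on the next request, so a short TTL bounds staleness without any invalidation protocol.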


Common Query Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| site_id | string | Required. The site to query. |
| period | string | Optional. One of day, today, 7d, 30d, 90d. Defaults to 30d. |
| start_date | string | Optional. Explicit start date (YYYY-MM-DD). Both start_date and end_date must be provided together; a lone date is ignored. Overrides period. |
| end_date | string | Optional. Explicit end date (YYYY-MM-DD, exclusive). Maximum range: 366 days. |
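The parameter precedence can be sketched as follows. Only the precedence, defaulting, and range-limit rules come from the table above; the exact boundary handling here is an assumption:

```python
from datetime import date, timedelta

PERIOD_DAYS = {"day": 1, "today": 1, "7d": 7, "30d": 30, "90d": 90}

def resolve_range(period=None, start_date=None, end_date=None, today=None):
    """Resolve query params to a [start, end) date range per the rules above."""
    today = today or date.today()
    if start_date and end_date:  # explicit range overrides period
        start = date.fromisoformat(start_date)
        end = date.fromisoformat(end_date)
        if (end - start).days > 366:
            raise ValueError("range exceeds 366 days")
        return start, end
    # A lone start_date or end_date is ignored; period defaults to 30d.
    days = PERIOD_DAYS.get(period or "30d", 30)
    return today - timedelta(days=days), today + timedelta(days=1)
```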

site_id Validation

All endpoints validate site_id and return 400 Bad Request if any of the following conditions are not met:

  • Non-empty string.
  • At most 256 characters.
  • ASCII alphanumeric characters plus ., -, _, and : only.

// 400 response for invalid site_id
{"error": "Invalid site_id"}

GET /api/stats/main

Returns core aggregate metrics.

Response

{
  "unique_visitors": 1423,
  "total_pageviews": 5812,
  "bounce_rate": 0.42,
  "avg_visit_duration_secs": 0.0,
  "pages_per_visit": 4.08
}

| Field | Type | Notes |
| --- | --- | --- |
| unique_visitors | integer | Distinct visitor_id values in the period. |
| total_pageviews | integer | Events where event_name = 'pageview'. |
| bounce_rate | float | Sessions with exactly one pageview / total sessions. Requires behavioral extension; returns 0.0 if unavailable. |
| avg_visit_duration_secs | float | Always 0.0 in this version (requires behavioral extension integration; computed separately via /api/stats/sessions). |
| pages_per_visit | float | total_pageviews / unique_visitors. |

GET /api/stats/timeseries

Returns visitors and pageviews bucketed by time.

Granularity is determined automatically from the period: day/today returns hourly buckets; all other periods return daily buckets.

Response

[
  {"date": "2024-01-15", "visitors": 142, "pageviews": 518},
  {"date": "2024-01-16", "visitors": 167, "pageviews": 603}
]

For period=day the date field includes the hour (e.g. "2024-01-15 10:00").


GET /api/stats/breakdown/{dimension}

Returns visitor and pageview counts grouped by a single dimension.

Dimensions

| Path | Grouped by |
| --- | --- |
| /breakdown/pages | pathname |
| /breakdown/sources | referrer_source |
| /breakdown/browsers | browser |
| /breakdown/os | os |
| /breakdown/devices | device_type |
| /breakdown/countries | country_code |

Additional Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| limit | integer | Maximum rows to return. Default 10, maximum 1000. Returns 400 if exceeded. |

Response

[
  {"value": "/pricing", "visitors": 312, "pageviews": 489},
  {"value": "/about",   "visitors": 201, "pageviews": 247}
]

Unknown/null dimension values are represented as "(unknown)".


GET /api/stats/sessions

Returns session-level aggregates using the sessionize behavioral function.

Requires the behavioral extension. Returns zeroes if the extension is not loaded.

Response

{
  "total_sessions": 892,
  "avg_session_duration_secs": 124.7,
  "avg_pages_per_session": 3.2
}

GET /api/stats/funnel

Returns a conversion funnel where each step is a filter condition.

Additional Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| steps | string | Comma-separated list of steps. Format: page:/path or event:name. |
| window | string | Session window duration. Default "1 day". Must be of the form N unit (e.g. "30 minutes", "2 hours"). |

Step Format

| Format | Meaning |
| --- | --- |
| page:/pricing | pathname = '/pricing' |
| event:signup | event_name = 'signup' |
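Parsing this step format takes only a few lines (an illustrative sketch, not the server's Rust code):

```python
def parse_steps(steps_param: str):
    """Parse 'page:/pricing,event:signup' into (column, value) filter pairs."""
    conditions = []
    for raw in steps_param.split(","):
        kind, _, value = raw.partition(":")
        if kind == "page":
            conditions.append(("pathname", value))
        elif kind == "event":
            conditions.append(("event_name", value))
        else:
            raise ValueError(f"unknown step format: {raw!r}")
    return conditions
```

Each pair becomes one filter condition of the funnel, in order.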

Example Request

GET /api/stats/funnel?site_id=example.com&steps=page:/pricing,event:signup&window=1+hour

Response

[
  {"step": 1, "visitors": 500},
  {"step": 2, "visitors": 120}
]

Requires behavioral extension. Returns empty array if unavailable.


GET /api/stats/retention

Returns weekly retention cohorts using the retention behavioral function.

Additional Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| weeks | integer | Number of cohort weeks to compute. Range: 1–52. Default 4. |

Response

[
  {
    "cohort_date": "2024-01-08",
    "retained": [true, true, false, true]
  }
]

Each retained boolean corresponds to one cohort week.

Requires behavioral extension. Returns empty array if unavailable.


GET /api/stats/sequences

Returns conversion metrics for a sequence of behavioral steps using sequence_match.

Additional Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| steps | string | Comma-separated steps in page:/path or event:name format. Minimum 2 steps required. |

Response

{
  "converting_visitors": 89,
  "total_visitors": 500,
  "conversion_rate": 0.178
}

Requires behavioral extension. Returns zeroes if unavailable.


GET /api/stats/flow

Returns the most common next pages after a given starting page using sequence_next_node.

Additional Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| page | string | The target page path to start from (e.g. /pricing). |

Response

[
  {"next_page": "/signup",  "visitors": 234},
  {"next_page": "/contact", "visitors": 89}
]

Returns up to 10 results. Requires behavioral extension.


GET /api/stats/export

Exports daily aggregated stats as CSV or JSON.

Additional Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| format | string | csv (default) or json. Any other value returns 400. |

CSV Response

date,visitors,pageviews,top_page,top_source
2024-01-15,142,518,/pricing,(direct)
2024-01-16,167,603,/pricing,google

CSV fields that might trigger formula injection (start with =, +, -, @) are prefixed with a single quote.
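The escaping rule can be expressed as (a sketch of the rule described above):

```python
def escape_csv_field(value: str) -> str:
    """Neutralize spreadsheet formula injection by prefixing a single quote."""
    if value and value[0] in "=+-@":
        return "'" + value
    return value
```

The leading quote makes spreadsheet applications treat the cell as text instead of evaluating it as a formula.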

Content-Disposition: attachment; filename="export.csv" is set so browsers prompt a download.

JSON Response

[
  {
    "date": "2024-01-15",
    "visitors": 142,
    "pageviews": 518,
    "top_page": "/pricing",
    "top_source": "(direct)"
  }
]

top_page and top_source reflect the single highest-traffic page and referrer source for the entire queried period, not per-day.

Authentication API

Mallard Metrics supports two forms of authentication:

  1. Session cookies — For human dashboard users.
  2. API keys — For programmatic access (CI/CD, integrations, monitoring).

Dashboard Authentication

POST /api/auth/setup

Sets the admin password for the first time. Returns 409 Conflict if a password is already configured.

No authentication required.

// Request — password must be at least 8 characters
{"password": "your-secure-password"}

// Response 200 — also sets HttpOnly, SameSite=Strict cookie mm_session
{"token": "<session-token>"}

// Response 400 — password too short
{"error": "Password must be at least 8 characters"}

// Response 409 — password already configured
{"error": "Admin password already configured"}

Passwords are hashed with Argon2id before storage. The plaintext password is never persisted.


POST /api/auth/login

Authenticates with the admin password and creates a session.

No authentication required.

// Request
{"password": "your-secure-password"}

// Response 200 — sets HttpOnly, SameSite=Strict cookie mm_session
{"token": "<session-token>"}

// Response 400 — no password configured yet
{"error": "No admin password configured. Use /api/auth/setup first."}

// Response 401 — wrong password
{"error": "Invalid password"}

// Response 429 — Too Many Requests (IP locked out after max failed attempts)
// Retry-After header contains the remaining lockout seconds
{"error": "Too many failed login attempts. Try again later."}

Sessions are stored in memory and expire after session_ttl_secs (default 24 hours). Sessions are cleared on server restart.

Brute-force protection: After max_login_attempts (default 5) consecutive failures from the same IP, the IP is locked out for login_lockout_secs (default 300 seconds). A successful login clears the failure count. Configure via MALLARD_MAX_LOGIN_ATTEMPTS and MALLARD_LOGIN_LOCKOUT environment variables, or the corresponding TOML fields. Set max_login_attempts = 0 to disable.


POST /api/auth/logout

Invalidates the current session.

No authentication required. If a valid session cookie is present, it is invalidated. Otherwise the endpoint is a no-op (always returns 200).

// Response 200 — clears mm_session cookie
{"status": "logged_out"}

GET /api/auth/status

Returns the current authentication state.

// No password configured (open access mode)
{"setup_required": true, "authenticated": true}

// Password configured, not logged in
{"setup_required": false, "authenticated": false}

// Password configured, logged in
{"setup_required": false, "authenticated": true}

| Field | Type | Notes |
| --- | --- | --- |
| setup_required | boolean | true when no admin password has been set. System is in open-access mode. |
| authenticated | boolean | true when the request carries a valid session or API key, or when setup_required is true. |

API Key Management

API keys are prefixed with mm_ and are SHA-256 hashed before storage. The plaintext key is only returned once at creation time.
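The key lifecycle can be sketched as follows (illustrative only; the token length and encoding here are assumptions):

```python
import hashlib
import secrets

def generate_api_key():
    """Create an mm_-prefixed key; only the SHA-256 hash is stored server-side."""
    plaintext = "mm_" + secrets.token_urlsafe(32)
    key_hash = hashlib.sha256(plaintext.encode()).hexdigest()
    return plaintext, key_hash  # plaintext is shown once, hash is persisted

def verify_api_key(presented: str, stored_hashes: set) -> bool:
    """Authenticate a presented key by hashing it and looking up the hash."""
    return hashlib.sha256(presented.encode()).hexdigest() in stored_hashes
```

Because only the hash is persisted, a leaked key store cannot be replayed against the API.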

All key management endpoints require authentication.

POST /api/keys

Creates a new API key.

// Request
{"name": "ci-pipeline", "scope": "ReadOnly"}

// Response 201
{
  "key": "mm_abc123...",
  "key_hash": "a1b2c3...",
  "name": "ci-pipeline",
  "scope": "ReadOnly"
}

The key field is the only time the plaintext key is returned. Store it securely.

Scopes:

| Value | Access |
|---|---|
| ReadOnly | Read-only access to stats queries. |
| Admin | Full admin access (key management, config). |

GET /api/keys

Lists all API keys (without plaintext values).

[
  {
    "key_hash": "a1b2c3...",
    "name": "ci-pipeline",
    "scope": "ReadOnly",
    "created_at": "2024-01-15T10:00:00Z",
    "revoked": false
  }
]

DELETE /api/keys/{key_hash}

Revokes an API key by its SHA-256 hex hash.

// Response 200
{"status": "revoked"}

// Response 404 if hash not found
{"error": "Key not found"}

Using API Keys

API keys can be passed in two ways:

Authorization header (Bearer token):

curl "https://your-instance.com/api/stats/main?site_id=example.com&period=30d" \
  -H "Authorization: Bearer mm_abc123..."

X-API-Key header:

curl "https://your-instance.com/api/stats/main?site_id=example.com&period=30d" \
  -H "X-API-Key: mm_abc123..."

Both headers are accepted on all stats and admin endpoints. ReadOnly keys can access stats endpoints; all key management endpoints (GET /api/keys, POST /api/keys, DELETE /api/keys/{hash}) require an Admin-scoped key.

Health & Metrics Endpoints

These endpoints are publicly accessible (no authentication required) and are designed for monitoring and orchestration systems.


GET /health

Simple liveness check. Returns HTTP 200 when the server process is running.

HTTP/1.1 200 OK
Content-Type: text/plain

ok

Use this with your load balancer or container orchestrator liveness probe.


GET /health/ready

Readiness probe. Executes a lightweight DuckDB query to verify the database is operational.

Success (200):

HTTP/1.1 200 OK
Content-Type: text/plain

ready

Not ready (503):

HTTP/1.1 503 Service Unavailable
Content-Type: text/plain

database not ready

Use this as your Kubernetes readiness probe or Docker health check. Do not use it as a liveness probe — a 503 here means the database is temporarily unavailable, not that the process is dead.


GET /health/detailed

Detailed system status in JSON. Returns component-level health information.

{
  "status": "ok",
  "version": "0.1.0",
  "buffered_events": 42,
  "auth_configured": true,
  "geoip_loaded": false,
  "behavioral_extension_loaded": true,
  "filter_bots": true,
  "cache_entries": 3,
  "cache_empty": false
}

| Field | Type | Description |
|---|---|---|
| status | string | Always "ok" when the server is running. |
| version | string | Binary version from Cargo.toml. |
| buffered_events | integer | Events in the in-memory buffer, not yet flushed to Parquet. |
| auth_configured | boolean | Whether an admin password has been set. |
| geoip_loaded | boolean | Whether a MaxMind GeoLite2 database was successfully loaded. |
| behavioral_extension_loaded | boolean | Whether the DuckDB behavioral extension loaded successfully at startup. |
| filter_bots | boolean | Whether bot filtering is active. |
| cache_entries | integer | Number of cached query results currently in memory. |
| cache_empty | boolean | true if the query cache is empty. |

GET /metrics

Prometheus-compatible metrics in text exposition format (text/plain; version=0.0.4).

If MALLARD_METRICS_TOKEN is set, this endpoint requires Authorization: Bearer <token>. Returns 401 Unauthorized without a valid token.

Gauges

# HELP mallard_buffered_events Number of events in the in-memory buffer
# TYPE mallard_buffered_events gauge
mallard_buffered_events 42

# HELP mallard_cache_entries Number of cached query results
# TYPE mallard_cache_entries gauge
mallard_cache_entries 3

# HELP mallard_auth_configured Whether admin password is set
# TYPE mallard_auth_configured gauge
mallard_auth_configured 1

# HELP mallard_geoip_loaded Whether GeoIP database is loaded
# TYPE mallard_geoip_loaded gauge
mallard_geoip_loaded 0

# HELP mallard_filter_bots Whether bot filtering is enabled
# TYPE mallard_filter_bots gauge
mallard_filter_bots 1

# HELP mallard_behavioral_extension Whether behavioral extension is loaded
# TYPE mallard_behavioral_extension gauge
mallard_behavioral_extension 1

Counters

# HELP mallard_events_ingested_total Total events ingested via POST /api/event
# TYPE mallard_events_ingested_total counter
mallard_events_ingested_total 158432

# HELP mallard_flush_failures_total Total buffer flush failures
# TYPE mallard_flush_failures_total counter
mallard_flush_failures_total 0

# HELP mallard_rate_limit_rejections_total Total requests rejected by per-site rate limiter
# TYPE mallard_rate_limit_rejections_total counter
mallard_rate_limit_rejections_total 17

# HELP mallard_login_failures_total Total failed login attempts
# TYPE mallard_login_failures_total counter
mallard_login_failures_total 3

# HELP mallard_cache_hits_total Total query cache hits
# TYPE mallard_cache_hits_total counter
mallard_cache_hits_total 9871

# HELP mallard_cache_misses_total Total query cache misses
# TYPE mallard_cache_misses_total counter
mallard_cache_misses_total 1204

Prometheus Scrape Configuration

scrape_configs:
  - job_name: mallard_metrics
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: /metrics
    scrape_interval: 30s
    # If MALLARD_METRICS_TOKEN is set:
    authorization:
      credentials: your-metrics-bearer-token

Architecture

Overview

Mallard Metrics is a single Rust binary that handles the complete analytics lifecycle: event ingestion, storage, querying, authentication, and dashboard serving. There are no external services, no message queues, and no separate database process.

flowchart TD
    TS["Tracking Script\nmallard.js &lt;1KB"]
    DASH["Dashboard SPA\nPreact + HTM"]

    TS -->|"POST /api/event"| AXUM
    DASH <-->|"GET /api/stats/*\nGET /api/keys/*"| AXUM

    subgraph BINARY["Single Binary — Single Process"]
        AXUM["Axum HTTP Server\nport 8000"]

        subgraph INGEST["Ingestion Pipeline"]
            direction LR
            OC["Origin Check\nRate Limiter"] --> BF["Bot Filter\nUA Parser"]
            BF --> GEO["GeoIP Lookup\nVisitor ID Hash"]
            GEO --> BUF["In-Memory\nEvent Buffer"]
        end

        subgraph STORE["Two-Tier Storage"]
            direction LR
            DB["DuckDB disk-based\nmallard.duckdb\nWAL durability"]
            PQ["Parquet Files\nsite_id=*/date=*/*.parquet\nZSTD-compressed"]
            VIEW["events_all VIEW\nhot union cold"]
            DB -->|"COPY TO"| PQ
            DB --> VIEW
            PQ -->|"read_parquet()"| VIEW
        end

        subgraph QUERY["Query Engine"]
            direction LR
            CACHE["TTL Query Cache"] --> QH["Stats\nSessions\nFunnels\nRetention\nSequences\nFlow"]
            EXT["behavioral extension\nsessionize\nwindow_funnel\nretention\nsequence_match"] -.->|"optional"| CACHE
        end

        AUTH["Auth Layer\nArgon2id passwords\n256-bit session tokens\nAPI keys SHA-256"] -.->|"guards"| AXUM

        AXUM --> OC
        BUF -->|"flush"| DB
        VIEW --> CACHE
        QH --> AXUM
    end

Event Ingestion Pipeline

Every POST /api/event request passes through a sequential pipeline of validation and enrichment steps before being buffered.

flowchart TD
    START(["POST /api/event\nJSON body"])

    START --> SZ{"Body size\n&le; 64 KB?"}
    SZ -->|"No"| R413["413 Request\nEntity Too Large"]
    SZ -->|"Yes"| OC

    OC{"Origin in\nallowlist?"}
    OC -->|"No (if configured)"| R403["403 Forbidden"]
    OC -->|"Yes"| RL

    RL{"Rate limit\nexceeded?"}
    RL -->|"Yes"| R429["429 Too Many Requests\nRetry-After header"]
    RL -->|"No"| SITEID

    SITEID{"site_id valid?\na-z A-Z 0-9 .-: max 256 chars"}
    SITEID -->|"No"| R400["400 Bad Request"]
    SITEID -->|"Yes"| BOT

    BOT{"Bot\nUser-Agent?"}
    BOT -->|"Yes"| DISCARD["Silently discarded\n202 Accepted"]
    BOT -->|"No"| UA

    UA["Parse User-Agent\nbrowser, OS, device type"]
    UA --> GEO

    GEO["GeoIP Lookup\ncountry, region, city\nGraceful fallback if no DB"]
    GEO --> VID

    VID["Compute Visitor ID\nHMAC-SHA256\nIP plus UA plus daily-salt\nDiscard IP immediately"]
    VID --> URL

    URL["Parse URL\npathname, hostname\nUTM parameters"]
    URL --> BUF

    BUF["Push to In-Memory Buffer"]
    BUF --> THR{"Buffer count\n>= flush_event_count?"}
    THR -->|"Yes"| FLUSH["Flush to DuckDB\nAppender API batch insert"]
    THR -->|"No"| R202
    FLUSH --> R202

    R202(["202 Accepted"])

Two-Tier Storage Model

Mallard Metrics stores events in two complementary tiers, always queried together via the events_all VIEW.

flowchart LR
    INGEST["Ingestion\nEvent Buffer"]

    subgraph HOT["Hot Tier — DuckDB (mallard.duckdb)"]
        EVENTS["events table\nrecently arrived events\nWAL-backed, survives SIGKILL"]
    end

    subgraph COLD["Cold Tier — Parquet on Disk"]
        P1["site_id=example.com/\ndate=2024-01-15/\n0001.parquet"]
        P2["site_id=example.com/\ndate=2024-01-16/\n0001.parquet"]
        P3["site_id=other.org/\ndate=2024-01-15/\n0001.parquet"]
    end

    subgraph UNIFIED["Unified Query Layer"]
        VIEW["events_all VIEW\nSELECT * FROM events\nUNION ALL\nSELECT * FROM read_parquet(...)"]
    end

    INGEST -->|"flush"| EVENTS
    EVENTS -->|"COPY TO ZSTD"| P1
    EVENTS -->|"COPY TO ZSTD"| P2
    EVENTS -->|"COPY TO ZSTD"| P3
    EVENTS -->|"hot events"| VIEW
    P1 -->|"read_parquet()"| VIEW
    P2 -->|"read_parquet()"| VIEW
    P3 -->|"read_parquet()"| VIEW
    VIEW --> ANALYTICS["Analytics Queries\nGET /api/stats/*"]

Hot tier (data/mallard.duckdb): Stores events that have been buffered but not yet flushed. Events here are immediately queryable. The DuckDB WAL provides durability — hot events survive a SIGKILL (crash), not just a graceful SIGTERM.

Cold tier (.parquet files): After flushing, events are written as ZSTD-compressed Parquet files partitioned by site and date. These files are the primary durability layer for historical data and can be queried independently with any Parquet-compatible tool (DuckDB CLI, pandas, Apache Spark).

The events_all VIEW is created at startup and refreshed after each flush. It transparently unions the hot and cold tiers so all analytics queries work correctly regardless of which tier the data resides in.

The cold-tier directory layout:

data/events/
├── site_id=example.com/
│   ├── date=2024-01-15/
│   │   ├── 0001.parquet
│   │   └── 0002.parquet
│   └── date=2024-01-16/
│       └── 0001.parquet
└── site_id=other-site.org/
    └── date=2024-01-15/
        └── 0001.parquet
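The hive-style partition path shown above can be assembled as follows (a sketch with a hypothetical helper name; the real logic lives in storage/parquet.rs):

```python
from datetime import date
from pathlib import PurePosixPath

def partition_path(data_dir, site_id, day, seq):
    """Build site_id=<id>/date=<YYYY-MM-DD>/<NNNN>.parquet under data_dir/events."""
    return str(
        PurePosixPath(data_dir) / "events"
        / f"site_id={site_id}"
        / f"date={day.isoformat()}"
        / f"{seq:04}.parquet"
    )
```

Because the layout follows hive partitioning conventions, any Parquet-aware tool can prune partitions by site_id and date when scanning the directory.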

Authentication Architecture

flowchart TD
    subgraph CREDS["Credentials at Rest"]
        HASH["Admin Password\nArgon2id hash PHC defaults\nmemory-only at runtime"]
        KEYS["API Keys\nmm_ prefix plus 256-bit random\nSHA-256 hash on disk\nJSON file in data_dir"]
        SESS["Session Tokens\n256-bit OS CSPRNG\nHashMap with TTL expiry\nHttpOnly SameSite=Strict"]
    end

    BROWSER["Browser"] -->|"POST /api/auth/login\npassword"| ARGON
    ARGON["Argon2id verify"] -->|"match"| SESS
    SESS -->|"session cookie\nHttpOnly Secure SameSite=Strict"| BROWSER

    APICLIENT["API Client"] -->|"Authorization: Bearer mm_xxx\nor X-API-Key: mm_xxx"| KEYCHECK
    KEYCHECK["SHA-256 hash\nconstant-time compare"] -->|"valid"| SCOPE

    SCOPE{"Scope check"}
    SCOPE -->|"ReadOnly key"| READONLY["GET /api/stats/*\nGET /api/keys/*"]
    SCOPE -->|"Admin key"| ADMIN["All routes\nincluding POST /api/keys\nDELETE /api/keys/*"]

    BROWSER -->|"GET /api/stats/*\nauto-sent cookie"| SESSMW
    SESSMW["Session middleware\nTTL check"] -->|"valid"| ROUTE

    ROUTE["Route Handler"]

    CSRF["CSRF check\nOrigin vs dashboard_origin"] -.->|"state-mutating\nroutes only"| ROUTE
    BF["Brute-force check\nper-IP attempt counting\nconfigurable lockout"] -.->|"login endpoint"| ARGON

Key Security Properties

| Property | Implementation |
|---|---|
| Password storage | Argon2id hash (PHC defaults), never stored in plaintext |
| Session tokens | 256-bit OS CSPRNG; HashMap with TTL; cleared on restart |
| API key storage | SHA-256 hash on disk; plaintext returned only at creation |
| Timing attacks | Constant-time comparison for API key validation |
| Session cookies | HttpOnly; Secure; SameSite=Strict |
| CSRF | Origin/Referer validation on all state-mutating session-auth routes |
| Brute force | Per-IP attempt counting; configurable lockout and Retry-After |

Behavioral Extension

Advanced analytics rely on the DuckDB behavioral extension, which provides window aggregate functions purpose-built for clickstream analysis.

flowchart LR
    subgraph EXT["behavioral extension"]
        SESS_F["sessionize()\nGroup events into sessions\nby visitor and time gap"]
        FUNNEL_F["window_funnel()\nMulti-step ordered\nconversion funnel"]
        RET_F["retention()\nWeekly cohort\nretention grid"]
        SEQ_F["sequence_match()\nBehavioral pattern\ndetection"]
        FLOW_F["sequence_next_node()\nNext-page\nflow analysis"]
    end

    subgraph API["Behavioral Endpoints"]
        direction TB
        S["/api/stats/sessions"]
        FU["/api/stats/funnel"]
        R["/api/stats/retention"]
        SQ["/api/stats/sequences"]
        FL["/api/stats/flow"]
    end

    SESS_F --> S
    FUNNEL_F --> FU
    RET_F --> R
    SEQ_F --> SQ
    FLOW_F --> FL

    CORE["Core analytics\n/api/stats/main\n/api/stats/timeseries\n/api/stats/breakdown/*"] -.->|"no extension\nrequired"| ALWAYS["Always available"]

The extension is loaded at startup:

INSTALL behavioral FROM community;
LOAD behavioral;

If loading fails (network unavailable, air-gapped environment), all extension-dependent endpoints return graceful defaults (zeroes or empty arrays). Core analytics continue working normally. The GET /health/detailed JSON response and GET /metrics Prometheus output both report whether the extension loaded successfully.


Module Map

| Module | Purpose |
|---|---|
| config.rs | TOML + environment variable configuration |
| server.rs | Axum router with CORS configuration and middleware stack |
| ingest/handler.rs | POST /api/event ingestion handler |
| ingest/buffer.rs | In-memory event buffer with periodic flush |
| ingest/visitor_id.rs | HMAC-SHA256 privacy-safe visitor ID |
| ingest/useragent.rs | User-Agent parsing |
| ingest/geoip.rs | MaxMind GeoIP reader with graceful fallback |
| ingest/ratelimit.rs | Per-site token-bucket rate limiter |
| storage/schema.rs | DuckDB table definitions and events_all view |
| storage/parquet.rs | Parquet write/read/partitioning |
| storage/migrations.rs | Schema versioning |
| query/metrics.rs | Core metric calculations |
| query/breakdowns.rs | Dimension breakdown queries |
| query/timeseries.rs | Time-bucketed aggregations |
| query/sessions.rs | sessionize-based session queries |
| query/funnel.rs | window_funnel query builder |
| query/retention.rs | Retention cohort query execution |
| query/sequences.rs | sequence_match query execution |
| query/flow.rs | sequence_next_node flow analysis |
| query/cache.rs | TTL-based query result cache |
| api/stats.rs | All analytics API handlers |
| api/errors.rs | API error types |
| api/auth.rs | Origin validation, session auth, API key management |
| dashboard/ | Embedded SPA (Preact + HTM) |

Security & Privacy

Privacy Model

Mallard Metrics is built with privacy as a hard constraint, not an afterthought.

No Cookies

The tracking script sets no cookies. There is no cookie-based session tracking of any kind.

No PII Storage

The client IP address is the only potentially identifying value that reaches the server. It is:

  1. Used to compute the visitor ID (see below).
  2. Used for a GeoIP lookup (if configured).
  3. Discarded immediately. It is never written to the database, log files, or Parquet files.

No names, email addresses, or device fingerprints are collected or stored.

Privacy-Safe Visitor ID

To count unique visitors without storing PII, Mallard Metrics uses a two-step HMAC-SHA256 derivation:

flowchart LR
    SECRET["MALLARD_SECRET\nenvironment variable"]
    DATE["Today UTC date\n2024-01-15"]

    SECRET --> H1["HMAC-SHA256\nkey = 'mallard-metrics-salt'\nmsg = SECRET + ':' + DATE"]
    DATE --> H1
    H1 --> SALT["daily_salt\nrotates every 24 h"]

    IP["Client IP address"]
    UA["User-Agent header"]

    IP --> H2["HMAC-SHA256\nkey = daily_salt\nmsg = IP + '|' + UA"]
    UA --> H2
    SALT --> H2
    H2 --> VID["visitor_id\nstored in database"]

    IP -->|"discarded\nimmediately after"| TRASH["not stored"]

Properties of this approach:

  • Deterministic within a day — The same visitor from the same browser produces the same ID throughout the day, enabling accurate unique-visitor counts.
  • Rotates daily — The UTC date rotates the effective key every 24 hours, so IDs cannot be correlated across days.
  • Not reversible — Without MALLARD_SECRET, the IP address cannot be recovered from the stored hash.
  • No IP storage — The IP address is discarded immediately after hashing.
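The two-step derivation in the diagram can be reproduced with standard-library Python (a sketch mirroring the labeled key and separator choices, ':' and '|'; the production code is Rust in ingest/visitor_id.rs):

```python
import hashlib
import hmac
from datetime import datetime, timezone
from typing import Optional

def daily_salt(secret, today):
    # Step 1: derive a salt that rotates with the UTC date.
    return hmac.new(b"mallard-metrics-salt",
                    f"{secret}:{today}".encode(), hashlib.sha256).digest()

def visitor_id(secret, ip, user_agent, today: Optional[str] = None):
    # Step 2: hash IP + '|' + User-Agent under the daily salt.
    # The IP is discarded immediately after this call; only the digest is stored.
    if today is None:
        today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    return hmac.new(daily_salt(secret, today),
                    f"{ip}|{user_agent}".encode(), hashlib.sha256).hexdigest()
```

The same visitor yields the same ID all day, a different ID the next day, and the ID cannot be inverted without MALLARD_SECRET.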

GDPR/CCPA Compliance

Mallard Metrics stores pseudonymous visitor IDs (daily-rotating HMAC-SHA256 hashes), which are personal data under GDPR Recital 26. Operators must establish a lawful basis for processing — typically Art. 6(1)(f) legitimate interests for aggregate analytics, especially when combined with GDPR mode. See PRIVACY.md for the full legal analysis, DPIA guidance, and operator obligations.

Key points:

  • No cookies are set for tracking — no ePrivacy consent banner is needed for the tracking script itself.
  • Data subject erasure is supported via DELETE /api/gdpr/erase (Admin API key required).
  • No third-party data sharing — all processing is first-party, no data processor agreements needed.

Authentication Security

Dashboard Password

Passwords are hashed with Argon2id using PHC default parameters before any comparison. The plaintext password is never stored. The hash is held in memory and loaded from the MALLARD_ADMIN_PASSWORD environment variable at startup.

Session Tokens

Dashboard sessions use 256-bit cryptographically random tokens generated with the OS CSPRNG. Tokens are delivered as HttpOnly; SameSite=Strict cookies to prevent JavaScript access and CSRF.

Sessions are stored in an in-memory HashMap with TTL expiry (default 24 hours, configurable via session_ttl_secs). Sessions are cleared on server restart.

When MALLARD_SECURE_COOKIES=true is set (required when behind a TLS reverse proxy), the Secure flag is added to the cookie, preventing transmission over plain HTTP.

API Keys

| Property | Value |
|---|---|
| Entropy | 256 bits of randomness |
| Prefix | mm_ — easy to identify in logs and secret scanners |
| Storage | SHA-256 hash stored in a JSON file in data_dir/. Plaintext returned only at creation. |
| Comparison | Constant-time equality to prevent timing side-channel attacks |
| Scopes | ReadOnly (GET stats only) or Admin (full access including key management) |
| Persistence | Disk-persisted; survive server restarts |

Input Validation and SQL Injection Prevention

Parameterized Queries

All user-supplied values (site IDs, date ranges, event names) are bound to SQL statements as parameters using DuckDB's prepared statement API. Raw string interpolation is used only where DuckDB's API does not support parameters (e.g., COPY TO file paths), and those values are explicitly validated and escaped before use.

Path Traversal Prevention

The site_id value is validated by is_safe_path_component() before being used in any filesystem path. The following are rejected:

  • Empty strings
  • Strings containing .. (directory traversal)
  • Strings containing / or \ (path separators)
  • Strings containing null bytes (\0)
  • Strings longer than 256 characters
  • Characters outside [a-zA-Z0-9._\-:]
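A Python sketch of these checks (illustrative; the actual is_safe_path_component() is Rust):

```python
import re

_SAFE = re.compile(r"^[a-zA-Z0-9._\-:]+$")

def is_safe_path_component(s):
    """Reject anything that could escape the data directory when used in a path."""
    if not s or len(s) > 256:
        return False
    if ".." in s or "/" in s or "\\" in s or "\0" in s:
        return False
    return bool(_SAFE.fullmatch(s))
```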

Funnel and Sequence Step Validation

User-supplied funnel and sequence steps (from ?steps= query parameters) are parsed from a safe page:/path or event:name format. Raw SQL expressions are never accepted from the API. Single quotes in path values are escaped by doubling.
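The translation from step syntax to SQL condition can be sketched as (hypothetical helper name; quote-doubling as described above):

```python
def step_to_sql(step):
    """Translate a page:/path or event:name step into a SQL condition.

    Single quotes are escaped by doubling; raw SQL is never accepted.
    """
    if step.startswith("page:"):
        value = step[len("page:"):].replace("'", "''")
        return f"pathname = '{value}'"
    if step.startswith("event:"):
        value = step[len("event:"):].replace("'", "''")
        return f"event_name = '{value}'"
    raise ValueError(f"invalid step: {step!r}")
```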

Date Range Validation

The start_date and end_date parameters are validated as YYYY-MM-DD format, checked for logical consistency (end >= start), and capped at a maximum 366-day span.
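A sketch of the validation rules (shown here as rejections; whether the server rejects or clamps an over-long span is an implementation detail not stated above):

```python
from datetime import date

def validate_range(start_date, end_date):
    """Parse YYYY-MM-DD bounds, require end >= start, enforce a 366-day maximum span."""
    start = date.fromisoformat(start_date)  # raises ValueError on bad format
    end = date.fromisoformat(end_date)
    if end < start:
        raise ValueError("end_date must not precede start_date")
    if (end - start).days > 366:
        raise ValueError("date range exceeds 366 days")
    return start, end
```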

Breakdown Limit

The limit parameter for breakdown queries is capped at 1000 to prevent unbounded result sets.

Origin Validation

When site_ids is configured, the Origin header is validated with exact host matching:

  • https://example.com → passes (if "example.com" is in site_ids).
  • http://example.com:8080 → passes (explicit port suffix allowed).
  • https://example.com.evil.com → rejected (prefix match is explicitly disallowed).
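Exact host matching can be sketched like this (hypothetical helper; the server's Rust code in api/auth.rs is authoritative):

```python
from urllib.parse import urlparse

def origin_allowed(origin, site_ids):
    """Exact-host match: the Origin hostname must equal a configured site_id."""
    parsed = urlparse(origin)
    if parsed.scheme not in ("http", "https") or parsed.hostname is None:
        return False
    return parsed.hostname in site_ids  # ports are ignored; prefixes never match
```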

CSV Injection Prevention

The CSV export endpoint escapes fields starting with formula-triggering characters (=, +, -, @) by prefixing them with a single quote, preventing formula injection when the CSV is opened in spreadsheet software.
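The escaping rule is simple enough to show inline (a sketch of the documented behavior):

```python
def escape_csv_field(field):
    """Neutralize spreadsheet formula triggers by prefixing a single quote."""
    if field and field[0] in ("=", "+", "-", "@"):
        return "'" + field
    return field
```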


Brute-Force Protection

Login attempts are tracked per client IP address. After max_login_attempts consecutive failures (default 5), the IP is locked out for login_lockout_secs seconds (default 300). The server returns 429 Too Many Requests with a Retry-After header containing the remaining lockout duration.

A successful login clears the failure count for that IP. Failure counts are stored in memory and reset on server restart.

Configure via TOML fields max_login_attempts and login_lockout_secs, or the environment variables MALLARD_MAX_LOGIN_ATTEMPTS and MALLARD_LOGIN_LOCKOUT. Set max_login_attempts = 0 to disable.
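The lockout logic described above can be sketched in Python (illustrative only; function and state names are hypothetical, and the real implementation is Rust):

```python
import time

MAX_ATTEMPTS = 5     # max_login_attempts
LOCKOUT_SECS = 300   # login_lockout_secs

# Per-IP state: (consecutive_failures, locked_until_epoch_secs)
attempts = {}

def check_login(ip, password_ok, now=None):
    """Return (allowed, retry_after_secs), mirroring the documented lockout rules."""
    now = time.time() if now is None else now
    failures, locked_until = attempts.get(ip, (0, 0.0))
    if now < locked_until:
        return False, int(locked_until - now)   # 429 with Retry-After
    if password_ok:
        attempts.pop(ip, None)                  # success clears the counter
        return True, 0
    failures += 1
    if failures >= MAX_ATTEMPTS:
        attempts[ip] = (0, now + LOCKOUT_SECS)  # lock the IP
        return False, LOCKOUT_SECS
    attempts[ip] = (failures, 0.0)
    return False, 0
```

Because the state is an in-memory map, a restart resets all counters, matching the documented behavior.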


Security Headers

All HTTP responses include these OWASP-recommended security headers:

| Header | Value | Purpose |
|---|---|---|
| X-Content-Type-Options | nosniff | Prevents MIME-type sniffing |
| X-Frame-Options | DENY | Prevents clickjacking via iframe embedding |
| Referrer-Policy | strict-origin-when-cross-origin | Limits referrer leakage |
| Content-Security-Policy | (HTML responses only) | Restricts scripts and resources to same origin |
| Permissions-Policy | geolocation=(), microphone=(), camera=() | Disables browser feature APIs |
| Strict-Transport-Security | max-age=31536000; includeSubDomains; preload | Instructs browsers to enforce HTTPS for 1 year; eligible for preload lists |
| Cache-Control | no-store, no-cache | JSON API responses only; prevents analytics data caching |
| X-Request-ID | UUID per request | Injected by the server, propagated through tracing spans for log correlation |

HTTP Timeout

All requests have a 30-second server-side timeout. Connections that do not complete within this window are closed with 408 Request Timeout. This prevents Slowloris-style attacks that hold connections open indefinitely.


CSRF Protection

State-mutating endpoints authenticated via session cookie (login, logout, setup, key creation, key revocation) validate the Origin or Referer header against the configured dashboard_origin. Requests with a mismatched or missing origin receive 403 Forbidden.

When dashboard_origin is not set, CSRF checks are bypassed (all origins allowed). Set dashboard_origin in production to enable CSRF protection.


Network Security

CORS Policy

Mallard Metrics uses separate CORS policies for ingestion and dashboard routes:

Ingestion (POST /api/event):

Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: POST

Dashboard / Stats / Admin (when dashboard_origin is set):

Access-Control-Allow-Origin: <configured origin>
Access-Control-Allow-Methods: GET, POST, DELETE
Access-Control-Allow-Credentials: true

If dashboard_origin is not configured, the dashboard routes use a permissive policy that allows any origin (explicitly, not same-origin-only). Set dashboard_origin in production to restrict cross-origin access.

TLS

Mallard Metrics does not handle TLS directly. In production, place it behind a TLS-terminating reverse proxy (nginx, Caddy, Traefik, etc.). Set MALLARD_SECURE_COOKIES=true once the proxy is in place.

Request Concurrency

The four heavy behavioral analytics endpoints (/api/stats/funnel, /api/stats/retention, /api/stats/sequences, /api/stats/flow) are protected by a semaphore. The maximum number of concurrent heavy queries is configurable via MALLARD_MAX_CONCURRENT_QUERIES (default 10). Requests that exceed this limit receive 429 Too Many Requests with a Retry-After header.


Supply Chain

  • All Rust dependencies are audited with cargo-deny in CI.
  • GitHub Actions steps are pinned to exact commit SHAs (no floating version tags).
  • The bundled DuckDB feature compiles DuckDB from source as part of the build; no pre-built DuckDB binaries are downloaded at runtime.
  • cargo build --locked is used in CI to ensure reproducible builds from Cargo.lock.

Threat Model Summary

| Threat | Mitigation |
|---|---|
| SQL injection | Parameterized queries throughout; site_id character validation |
| Path traversal | is_safe_path_component() on all filesystem paths |
| CSRF | Origin/Referer validation on state-mutating session-auth routes |
| Brute force (login) | Per-IP lockout, Argon2id hashing |
| Brute force (API) | Per-site rate limiting |
| Session hijacking | HttpOnly; Secure; SameSite=Strict cookies |
| Timing attacks | Constant-time comparison for API keys |
| Clickjacking | X-Frame-Options: DENY |
| Protocol downgrade | Strict-Transport-Security (HSTS, 1 year) |
| MIME sniffing | X-Content-Type-Options: nosniff |
| Data exfiltration | No outbound network calls; embedded DB; IP discarded after hash |
| PII leakage | IPs hashed then discarded; daily ID rotation; no cookies |
| CSV injection | Formula character escaping in export output |
| Dependency vulnerabilities | cargo-deny in CI; Cargo.lock committed and enforced |

Behavioral Analytics

Mallard Metrics integrates the DuckDB behavioral extension to provide advanced analytics that go beyond simple counts. The integration demonstrates that DuckDB behavioral analytics is not just an academic exercise — it can power real-world, production analytics with a homelab-friendly footprint.

Prerequisites

The behavioral extension is loaded at startup:

INSTALL behavioral FROM community;
LOAD behavioral;

If the extension cannot be loaded (e.g., network unavailable or air-gapped environment), all behavioral endpoints return graceful defaults (zeroes or empty arrays). Core analytics (visitors, pageviews, breakdowns, timeseries) are unaffected.

The GET /health/detailed JSON response includes "behavioral_extension_loaded": true/false, and GET /metrics exposes the mallard_behavioral_extension gauge (1 = loaded, 0 = unavailable).


Session Analytics

Endpoint: GET /api/stats/sessions

Uses sessionize(timestamp, INTERVAL '30 minutes') to group events into sessions per visitor. A new session begins when there is a gap of more than 30 minutes between events from the same visitor.

Metrics returned:

| Field | Description |
|---|---|
| total_sessions | Total number of distinct sessions |
| avg_session_duration_secs | Mean session duration in seconds |
| avg_pages_per_session | Mean pageviews per session |
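The 30-minute gap rule can be illustrated with a small Python sketch (not the extension's implementation — sessionize() runs inside DuckDB):

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=30)

def sessionize(timestamps):
    """Split one visitor's event timestamps into sessions.

    A new session starts when the gap since the previous event exceeds 30 minutes.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= GAP:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions
```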

Funnel Analysis

Endpoint: GET /api/stats/funnel

Uses window_funnel(interval, timestamp, step1, step2, ...) to find visitors who completed a sequence of steps within a time window.

Example — Pricing to Signup funnel:

GET /api/stats/funnel?site_id=example.com&steps=page:/pricing,event:signup&window=1+day

Step format:

| Input | SQL condition |
|---|---|
| page:/pricing | pathname = '/pricing' |
| event:signup | event_name = 'signup' |

Response: Array of {step, visitors} showing how many visitors reached each step.

Notes:

  • Steps must be ordered (each step must follow the previous in time).
  • The window parameter controls the maximum elapsed time between the first and last step (e.g., 1 day, 2 hours).
  • At least 1 step is required; 2+ steps produce a meaningful funnel chart.

Retention Cohorts

Endpoint: GET /api/stats/retention?weeks=N

Uses retention(condition1, condition2, ...) to compute weekly cohort retention. Each cohort is defined by a visitor's first-seen week. Subsequent weeks show whether they returned.

Example response (4-week retention):

[
  {"cohort_date": "2024-01-08", "retained": [true, true, false, true]},
  {"cohort_date": "2024-01-15", "retained": [true, false, true, false]}
]

Each boolean in retained corresponds to one week: retained[0] is always true (the cohort week itself), and subsequent values indicate whether the visitor was seen in weeks +1, +2, +3, etc.

| Parameter | Default | Range | Description |
|---|---|---|---|
| weeks | 4 | 1–52 | Number of weeks to include in the cohort grid |
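One way to read the retained array: for a single visitor, each week slot is true if any visit fell in that week relative to the cohort start (an illustrative sketch only; the server computes this with the extension's retention() aggregate, and how per-visitor results roll up into the cohort row is not specified here):

```python
from datetime import date

def retained_grid(first_seen, visit_dates, weeks):
    """Per-visitor weekly grid: True if any visit fell in that week of the cohort."""
    grid = [False] * weeks
    for d in visit_dates + [first_seen]:
        idx = (d - first_seen).days // 7
        if 0 <= idx < weeks:
            grid[idx] = True
    return grid
```

Note that grid[0] is always true, matching the statement above that the cohort week itself is always retained.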

Sequence Matching

Endpoint: GET /api/stats/sequences

Uses sequence_match(pattern, timestamp, cond1, cond2, ...) to find visitors who performed a specific behavioral pattern. Returns overall conversion metrics.

Example — Pricing → Signup conversion:

GET /api/stats/sequences?site_id=example.com&steps=page:/pricing,event:signup

Response:

{
  "converting_visitors": 89,
  "total_visitors": 500,
  "conversion_rate": 0.178
}

Minimum 2 steps required. Steps use the same page:/path and event:name format as the funnel endpoint.


Flow Analysis

Endpoint: GET /api/stats/flow?page=/pricing

Uses sequence_next_node('forward', 'first_match', ...) to find the most common pages visitors navigate to after a given page.

Response:

[
  {"next_page": "/signup",  "visitors": 234},
  {"next_page": "/contact", "visitors": 89},
  {"next_page": "/",        "visitors": 67}
]

Returns up to 10 next-page destinations ordered by visitor count. Useful for understanding user navigation patterns and identifying high-exit pages.


Dashboard Views

The dashboard includes interactive views for all behavioral analytics:

  • Sessions — Cards showing total sessions, average duration, and pages per session.
  • Funnel — Horizontal bar chart with configurable steps and conversion percentages.
  • Retention — Cohort grid table showing Y (returned) / - (not returned) per week.
  • Sequences — Conversion metrics cards with converting visitors, total visitors, and rate.
  • Flow — Next-page table with visitor counts.

Graceful Degradation

All behavioral endpoints degrade gracefully when the extension is not available:

| Endpoint | Without extension |
|---|---|
| GET /api/stats/sessions | Returns zeros for all fields |
| GET /api/stats/funnel | Returns empty array |
| GET /api/stats/retention | Returns empty array |
| GET /api/stats/sequences | Returns zeros for all fields |
| GET /api/stats/flow | Returns empty array |

Core analytics (/api/stats/main, /api/stats/timeseries, /api/stats/breakdown/*) do not use the extension and are always available.

Deployment

Production Checklist

Before going to production:

  • Set MALLARD_SECRET to a random 32+ character string and keep it constant across restarts.
  • Set MALLARD_ADMIN_PASSWORD to a strong password.
  • Set MALLARD_SECURE_COOKIES=true when behind a TLS-terminating reverse proxy so session cookies carry the Secure flag.
  • Set MALLARD_METRICS_TOKEN to a secret token if the /metrics endpoint is publicly reachable.
  • Configure a TLS-terminating reverse proxy (nginx, Caddy, Traefik).
  • Mount a persistent volume for data_dir (contains mallard.duckdb and Parquet files).
  • Set site_ids to restrict event ingestion to your domains.
  • Configure retention_days to match your data retention policy.
  • Set dashboard_origin to your dashboard URL to enable CSRF protection.
  • Use /health/ready as your container or load-balancer readiness probe.

EU / GDPR deployments — additional steps:

  • Set MALLARD_GDPR_MODE=true (or enable individual flags) to reduce data collection surface.
  • Set MALLARD_RETENTION_DAYS=30 (or your DPA-approved retention period) for Art. 5(1)(e) storage limitation compliance.
  • Set MALLARD_GEOIP_PRECISION=country (already forced by gdpr_mode; document it explicitly in your DPIA).
  • Document your legal basis for processing in a DPIA or privacy notice. See PRIVACY.md for the full analysis.
  • Use DELETE /api/gdpr/erase?site_id=...&start_date=...&end_date=... (Admin API key required) to honour Art. 17 erasure requests.

Pull and Run

docker run -d \
  --name mallard-metrics \
  --restart unless-stopped \
  -p 127.0.0.1:8000:8000 \
  -v mallard-data:/data \
  -e MALLARD_SECRET=your-random-32-char-secret \
  -e MALLARD_ADMIN_PASSWORD=your-dashboard-password \
  -e MALLARD_SECURE_COOKIES=true \
  -e MALLARD_METRICS_TOKEN=your-prometheus-token \
  ghcr.io/tomtom215/mallard-metrics

The image is built FROM scratch with a static musl binary. It has no shell, no package manager, and no runtime dependencies.

With a Config File

docker run -d \
  --name mallard-metrics \
  -v mallard-data:/data \
  -v /etc/mallard-metrics/config.toml:/config.toml:ro \
  -e MALLARD_SECRET=... \
  -e MALLARD_ADMIN_PASSWORD=... \
  ghcr.io/tomtom215/mallard-metrics /config.toml

Docker Compose

Save the following as docker-compose.yml:

services:
  mallard-metrics:
    image: ghcr.io/tomtom215/mallard-metrics:latest
    restart: unless-stopped
    ports:
      - "127.0.0.1:8000:8000"
    volumes:
      - mallard-data:/data
    environment:
      MALLARD_SECRET: "${MALLARD_SECRET}"
      MALLARD_ADMIN_PASSWORD: "${MALLARD_ADMIN_PASSWORD}"
      MALLARD_SECURE_COOKIES: "true"
      MALLARD_METRICS_TOKEN: "${MALLARD_METRICS_TOKEN}"
      MALLARD_LOG_FORMAT: "json"

volumes:
  mallard-data:

Create a .env file (do not commit to source control):

MALLARD_SECRET=your-random-32-char-secret
MALLARD_ADMIN_PASSWORD=your-dashboard-password
MALLARD_METRICS_TOKEN=your-prometheus-bearer-token

Start:

docker compose up -d
docker compose logs -f

Behind a Reverse Proxy

Mallard Metrics binds to 0.0.0.0:8000 by default (all interfaces). Set MALLARD_HOST=127.0.0.1 to restrict to localhost when behind a reverse proxy.

nginx

server {
    listen 443 ssl;
    server_name analytics.example.com;

    ssl_certificate     /etc/ssl/certs/analytics.example.com.crt;
    ssl_certificate_key /etc/ssl/private/analytics.example.com.key;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Important: Mallard Metrics reads the client IP for visitor ID hashing. If behind a proxy, the X-Forwarded-For or X-Real-IP header must be set correctly. Configure your proxy to send the real client IP.

Caddy

analytics.example.com {
    reverse_proxy 127.0.0.1:8000
}

Caddy sets X-Forwarded-For automatically.

After-Proxy Configuration

Once behind a TLS reverse proxy, set these environment variables:

# Enables Secure flag on session cookies
MALLARD_SECURE_COOKIES=true

# Restricts dashboard CORS and enables CSRF protection
MALLARD_DASHBOARD_ORIGIN=https://analytics.example.com

Health and Readiness Probes

Endpoint              Purpose
GET /health           Liveness probe — returns ok if the process is alive
GET /health/ready     Readiness probe — queries DuckDB; returns 503 if the database is not ready
GET /health/detailed  JSON health report — version, buffer, auth, GeoIP, behavioral extension, cache status

Kubernetes Example

livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3

Docker Compose Health Check

The FROM scratch image has no shell or utilities (wget, curl), so Docker's HEALTHCHECK cannot run a command inside the container. Check from the host instead, or rely on your reverse proxy or orchestrator's health probes:

# External health check from the host
curl -sf http://localhost:8000/health/ready || exit 1

Build from Source (Static musl Binary)

To build a FROM scratch-compatible static binary:

# Install the musl target
rustup target add x86_64-unknown-linux-musl

# Build
cargo build --release --target x86_64-unknown-linux-musl

# The binary
ls -lh target/x86_64-unknown-linux-musl/release/mallard-metrics

The resulting binary has no dynamic library dependencies:

ldd target/x86_64-unknown-linux-musl/release/mallard-metrics
# not a dynamic executable

GeoIP Setup

Mallard Metrics supports optional IP geolocation via MaxMind GeoLite2.

  1. Create a free account at maxmind.com.
  2. Download the GeoLite2-City.mmdb database.
  3. Configure the path:
# config.toml
geoip_db_path = "/data/GeoLite2-City.mmdb"

Or with Docker:

docker run ... \
  -v /path/to/GeoLite2-City.mmdb:/data/GeoLite2-City.mmdb:ro \
  -e ... \
  ghcr.io/tomtom215/mallard-metrics

If the file is missing or unreadable, country/region/city fields are stored as NULL. No error is raised.

Note: MaxMind updates the GeoLite2 databases twice weekly, on Tuesdays and Fridays. Automate downloads with geoipupdate.


GDPR-Friendly Deployment

Mallard Metrics provides a configurable privacy mode designed to reduce the data-collection surface to a level that makes aggregate analytics possible under GDPR Art. 6(1)(f) legitimate interests (no consent required) for many EU operators. Consult your legal team; requirements vary by context and member-state law.

Activate GDPR Mode

The quickest path is the MALLARD_GDPR_MODE=true preset, which bundles the recommended privacy settings:

docker run -d \
  --name mallard-metrics \
  --restart unless-stopped \
  -p 127.0.0.1:8000:8000 \
  -v mallard-data:/data \
  -e MALLARD_SECRET=your-random-32-char-secret \
  -e MALLARD_ADMIN_PASSWORD=your-dashboard-password \
  -e MALLARD_SECURE_COOKIES=true \
  -e MALLARD_GDPR_MODE=true \
  -e MALLARD_RETENTION_DAYS=30 \
  ghcr.io/tomtom215/mallard-metrics

Or via TOML config:

gdpr_mode      = true
retention_days = 30

What GDPR Mode Does

Field               Standard                        GDPR Mode
Referrer stored as  Full URL (with query/fragment)  Path only — ?q=... and #... stripped
Timestamps          Millisecond precision           Rounded to nearest hour
Browser info        Name + version                  Name only (e.g. "Chrome")
OS info             Name + version                  Name only (e.g. "Windows")
Screen / device     Stored                          Omitted
GeoIP               City-level                      Country-level only
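
For illustration only (this is not Mallard's implementation), the referrer and timestamp transforms amount to:

```shell
# Strip ?query and #fragment from a referrer URL
ref='https://example.com/search?q=secret#results'
stripped=$(printf '%s' "$ref" | sed 's/[?#].*$//')
echo "$stripped"      # https://example.com/search

# Round an epoch-seconds timestamp to the nearest hour
ts=1700003456
rounded=$(( (ts + 1800) / 3600 * 3600 ))
echo "$rounded"       # 1700002800
```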

Fine-Grained Privacy Flags

Each setting can be controlled independently via environment variable or TOML key:

Env var                           TOML key                  Default  Effect
MALLARD_GDPR_MODE                 gdpr_mode                 false    Enable all flags below (except suppress_visitor_id)
MALLARD_STRIP_REFERRER_QUERY      strip_referrer_query      false    Strip ?query and #fragment from referrers
MALLARD_ROUND_TIMESTAMPS          round_timestamps          false    Round timestamps to nearest hour
MALLARD_SUPPRESS_BROWSER_VERSION  suppress_browser_version  false    Store browser name only
MALLARD_SUPPRESS_OS_VERSION       suppress_os_version       false    Store OS name only
MALLARD_SUPPRESS_SCREEN_SIZE      suppress_screen_size      false    Omit screen size and device type
MALLARD_GEOIP_PRECISION           geoip_precision           "city"   "city" / "region" / "country" / "none"
MALLARD_SUPPRESS_VISITOR_ID       suppress_visitor_id       false    Replace HMAC hash with random UUID per request (breaks unique-visitor counting)

Note on suppress_visitor_id: This flag is intentionally not activated by gdpr_mode because it eliminates unique-visitor metrics entirely. The default HMAC-SHA256 visitor ID is pseudonymous personal data under GDPR Recital 26. Most operators can rely on Art. 6(1)(f) legitimate interests for aggregate analytics without suppressing visitor IDs.
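
For illustration only (not Mallard's actual code), a daily-rotating pseudonymous ID in this spirit can be sketched in shell:

```shell
# HMAC-SHA256 over IP + User-Agent, keyed with a salt that changes daily.
# Input values here are made up for the example.
salt='2024-06-01:server-secret'
visitor_id=$(printf '%s' '203.0.113.7|Mozilla/5.0' \
  | openssl dgst -sha256 -hmac "$salt" -r | cut -d' ' -f1)
echo "${#visitor_id}"    # 64 hex characters; the raw IP is never stored
```

The same inputs produce the same ID within a day, which is what makes unique-visitor counting possible; once the salt rotates, the ID changes and long-term linkage breaks.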

Right to Erasure (Art. 17)

Mallard Metrics supports data erasure requests via an authenticated API endpoint:

# Requires an Admin API key
curl -X DELETE \
  "https://analytics.example.com/api/gdpr/erase?site_id=mysite.com&start_date=2024-01-01&end_date=2024-12-31" \
  -H "X-API-Key: mm_your_admin_key"

Response:

{
  "site_id": "mysite.com",
  "start_date": "2024-01-01",
  "end_date": "2024-12-31",
  "db_records_deleted": 1423,
  "parquet_partitions_deleted": 8
}

Important limitations:

  • Erasure is by site and date range, not by individual visitor ID (visitor IDs are pseudonymous hashes and cannot be reverse-mapped to individuals).
  • After erasure, the events_all VIEW is refreshed automatically.
  • Consider setting MALLARD_RETENTION_DAYS=30 for automated data minimisation under Art. 5(1)(e) in place of manual erasure requests.

Graceful Shutdown

Mallard Metrics handles SIGINT (Ctrl+C) and SIGTERM (Docker stop, systemd stop). On receiving either signal:

  1. The server stops accepting new connections.
  2. In-flight requests are completed.
  3. Buffered events are flushed to DuckDB (persisted via WAL).

The flush is bounded by shutdown_timeout_secs (default 30). If flushing takes longer, a warning is logged and the process exits.
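
When stopping via Docker, note that docker stop sends SIGTERM and then SIGKILL after a grace period that defaults to 10 seconds, which is shorter than the default 30-second flush bound. A compose-level override (a sketch, not part of the shipped compose file):

```yaml
# docker-compose.yml: give Mallard headroom beyond shutdown_timeout_secs
# so the final flush is never cut short by SIGKILL
services:
  mallard-metrics:
    stop_grace_period: 35s
```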


Systemd Service

For non-Docker deployments:

[Unit]
Description=Mallard Metrics
After=network.target

[Service]
Type=simple
User=mallard
ExecStart=/usr/local/bin/mallard-metrics /etc/mallard-metrics/config.toml
Restart=on-failure
RestartSec=5s
Environment=MALLARD_SECRET=...
Environment=MALLARD_ADMIN_PASSWORD=...

[Install]
WantedBy=multi-user.target

Save the unit to /etc/systemd/system/mallard-metrics.service, then reload and enable:

systemctl daemon-reload
systemctl enable --now mallard-metrics

VPS Deployment Guide

Zero to production in one command — deploy Mallard Metrics on any generic Linux VPS with full TLS, LUKS-encrypted data at rest, Cloudflare DNS, and an automated security audit.


Overview

This guide deploys Mallard Metrics on a bare VPS using:

Component             Role
Caddy (custom build)  TLS termination, reverse proxy, HTTP/3, ACME DNS-01
Cloudflare DNS        DNS-01 ACME challenge — no port 80 required
LUKS                  Full encryption of the analytics data volume at rest
Docker Compose        Container orchestration
vps-audit             Automated security assessment and weekly re-audit
UFW + fail2ban        Host-level firewall and brute-force protection

The FROM scratch Mallard binary runs with no shell, no OS utilities, read-only root filesystem, all Linux capabilities dropped, and no network port exposed to the host — all traffic flows through Caddy on the internal Docker network.


Architecture

Internet
    │
    ▼
┌──────────────────────────────────────────────────┐
│  VPS Host (Ubuntu/Debian)                        │
│                                                  │
│  UFW Firewall: 22, 80, 443 (tcp+udp/QUIC)        │
│                                                  │
│  ┌────────────────────────────────────────────┐  │
│  │  Docker network: mallard-production_proxy  │  │
│  │                                            │  │
│  │  ┌─────────────┐    ┌────────────────┐     │  │
│  │  │   Caddy     │───▶│ mallard:8000   │     │  │
│  │  │ :80/:443    │    │ (FROM scratch) │     │  │
│  │  │ TLS + proxy │    │                │     │  │
│  │  └─────────────┘    └───────┬────────┘     │  │
│  └─────────────────────────────┬──────────────┘  │
│                                │                 │
│  ┌─────────────────────────────▼──────────────┐  │
│  │  LUKS encrypted volume (/srv/mallard/data) │  │
│  │  mallard.duckdb  data/YYYY/MM/DD/*.parquet │  │
│  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘

Prerequisites

VPS requirements

Resource      Minimum       Recommended
CPU           1 vCPU        2 vCPU
RAM           512 MB        1 GB
Disk          10 GB         40 GB
OS            Ubuntu 22.04  Ubuntu 24.04 LTS
Architecture  x86-64        x86-64

Mallard Metrics is a single static binary. Under light to medium traffic (< 50k daily events) the minimum spec is adequate. The disk budget is dominated by Parquet data growth and the LUKS image pre-allocation.

Domain and DNS

You need a domain whose DNS is managed in Cloudflare. The domain can be:

  • A subdomain: analytics.example.com (recommended — keeps the apex clean)
  • An apex domain: example.com

Create an A record pointing to your VPS IP before running setup. Caddy validates DNS during certificate issuance.

analytics.example.com.  A  203.0.113.42

If you're using Cloudflare's proxy (orange cloud), set it to DNS only (grey cloud) for the analytics subdomain. Caddy manages TLS itself and Cloudflare's proxy can interfere with HTTP/3 and certificate validation.

Cloudflare API token

Caddy uses the Cloudflare API to create DNS TXT records for ACME DNS-01 challenges. Create a scoped token:

  1. Log in to dash.cloudflare.com → My Profile → API Tokens
  2. Click Create Token → Custom Token
  3. Set permissions:
    • Zone → Zone → Read (for all zones or just the specific zone)
    • Zone → DNS → Edit (for the specific zone containing your domain)
  4. Restrict to Zone Resources → Specific zone → your zone
  5. Copy the generated token — you will not see it again

SSH key access

setup.sh disables SSH password authentication as part of hardening. You must have an SSH public key installed on the server before running the script, or you will be locked out.

# On your local machine — copy your public key to the server
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@your-vps-ip

# Verify it works before running setup
ssh -i ~/.ssh/id_ed25519 user@your-vps-ip echo "Key access confirmed"

One-Command Deployment

Review the script before running it. Once you trust it, this does everything:

# 1. SSH into the VPS
ssh user@your-vps-ip

# 2. Clone the repository
git clone https://github.com/tomtom215/mallardmetrics.git
cd mallardmetrics

# 3. Run the setup script
sudo bash deploy/setup.sh

The script is interactive — it will prompt for your domain, email, and Cloudflare API token, then generate and display the admin password.

Pre-set values to run non-interactively (e.g., for CI/cloud-init):

export MM_DOMAIN=analytics.example.com
export MM_EMAIL=admin@example.com
export MM_CF_TOKEN=your-cloudflare-token
sudo -E bash deploy/setup.sh

Step-by-Step Manual Deployment

Step 1 — Provision the VPS

Choose a provider (any KVM/XEN VPS works):

Use Ubuntu 22.04 LTS or 24.04 LTS as the OS image. Enable backups at the provider level for an additional safety net.

After provisioning:

# Note your VPS IP address, then SSH in
ssh root@<VPS-IP>

# Immediately create a non-root user with sudo
adduser deploy
usermod -aG sudo deploy

# Add your SSH key to the new user
mkdir -p /home/deploy/.ssh
cp /root/.ssh/authorized_keys /home/deploy/.ssh/
chown -R deploy:deploy /home/deploy/.ssh
chmod 700 /home/deploy/.ssh
chmod 600 /home/deploy/.ssh/authorized_keys

# Switch to the non-root user for the rest
su - deploy

Step 2 — Clone the repository

git clone https://github.com/tomtom215/mallardmetrics.git
cd mallardmetrics

Step 3 — Run setup.sh

sudo bash deploy/setup.sh

The script will:

  1. Detect your OS and verify prerequisites
  2. Ask you to confirm SSH key access before hardening SSH
  3. Update packages and install tooling
  4. Harden SSH, enable UFW firewall, configure fail2ban
  5. Apply kernel hardening sysctl settings
  6. Install Docker CE and the Compose plugin
  7. Create a 20 GB LUKS-encrypted image at /srv/mallard/data.img and mount it
  8. Download and run vps-audit — saving the report
  9. Prompt you to configure deploy/.env (or auto-generate secrets)
  10. Build the Docker images and start the stack
  11. Install weekly vps-audit and daily backup cron jobs
  12. Print your admin password and a post-setup checklist

Step 4 — Verify deployment

# Check container status
docker compose -f deploy/docker-compose.production.yml ps

# Check Caddy got a certificate (look for "TLS certificate obtained")
docker compose -f deploy/docker-compose.production.yml logs caddy | grep -i cert

# Test the health endpoint (replace with your domain)
curl -s https://analytics.example.com/health/ready

# Expected: ready

Open https://<your-domain> in a browser. You should see the Mallard Metrics dashboard login page.


What setup.sh Does

Here is the complete sequence of operations setup.sh performs, with the rationale for each:

Step  Operation                               Why
1     OS detection and SSH key check          Prevents lockout before hardening
2     apt upgrade + unattended-upgrades       Patches known CVEs immediately
3     SSH drop-in config in sshd_config.d/    Non-destructive; preserves original config
4     UFW: deny-all ingress, allow 22/80/443  Minimal attack surface
5     fail2ban for SSH                        Blocks brute-force login attempts
6     Kernel sysctl hardening                 Disables TCP redirects, restricts dmesg/BPF
7     Docker CE from official repo            Ensures a current, vendor-supported version
8     LUKS encrypted image + keyfile          Analytics data encrypted at rest
9     vps-audit + weekly cron                 Ongoing visibility into security posture
10    Secret generation + deploy/.env         Strong random credentials without manual work
11    docker compose build && up -d           Brings the stack live
12    Backup cron (rsync)                     Daily snapshot of DuckDB + Parquet

LUKS Encrypted Volume

How it works

setup.sh creates a file-backed LUKS2 container at /srv/mallard/data.img using AES-XTS-PLAIN64 with a 512-bit key. A random keyfile is stored at /etc/mallard-data.key (read-only by root) so the volume auto-unlocks on boot without a passphrase prompt.

The decrypted volume is formatted ext4 and mounted at /srv/mallard/data. The Mallard Metrics container bind-mounts this path as /data.

/srv/mallard/data.img   ← LUKS2 container (AES-256 XTS, file on host disk)
        ↓ cryptsetup luksOpen
/dev/mapper/mallard-data  ← Decrypted block device
        ↓ ext4 mount
/srv/mallard/data/        ← Plaintext filesystem (only visible to root while mounted)
        ↓ Docker bind mount
/data/ (inside container) ← mallard.duckdb, data/YYYY/MM/DD/*.parquet

If an attacker gains access to the raw disk image (e.g., by stealing a disk or snapshot), the data is unreadable without the keyfile.

After reboot

The LUKS volume is configured in /etc/crypttab and /etc/fstab to auto-mount on boot using the keyfile. No manual intervention is required after a planned reboot.
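
The entries setup.sh writes look roughly like this (a sketch; field values are inferred from the paths above and exact options may differ):

```
# /etc/crypttab — mapping name, backing image, keyfile, options
mallard-data  /srv/mallard/data.img  /etc/mallard-data.key  luks

# /etc/fstab — mount the decrypted mapper device
/dev/mapper/mallard-data  /srv/mallard/data  ext4  defaults,nofail  0  2
```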

# To verify the volume mounted after a reboot:
mountpoint /srv/mallard/data && echo "mounted" || echo "NOT mounted"

# If it did not mount (e.g., keyfile missing), mount manually:
sudo cryptsetup luksOpen --key-file /etc/mallard-data.key \
    /srv/mallard/data.img mallard-data
sudo mount /dev/mapper/mallard-data /srv/mallard/data

# Then restart the stack
sudo docker compose -f /path/to/mallardmetrics/deploy/docker-compose.production.yml up -d

Resizing the volume

# 1. Stop the stack
docker compose -f deploy/docker-compose.production.yml down

# 2. Unmount and close
sudo umount /srv/mallard/data
sudo cryptsetup luksClose mallard-data

# 3. Grow the image file (+10 GB example)
sudo fallocate -l 30G /srv/mallard/data.img          # change to new total size

# 4. Grow the LUKS container
sudo cryptsetup luksOpen --key-file /etc/mallard-data.key \
    /srv/mallard/data.img mallard-data
sudo cryptsetup resize mallard-data

# 5. Grow the filesystem
sudo e2fsck -f /dev/mapper/mallard-data
sudo resize2fs /dev/mapper/mallard-data

# 6. Re-mount and restart
sudo mount /dev/mapper/mallard-data /srv/mallard/data
docker compose -f deploy/docker-compose.production.yml up -d

Caddy and TLS

Cloudflare DNS challenge

The Caddyfile is configured for the ACME DNS-01 challenge using the Cloudflare provider. This means:

  • Port 80 does not need to be accessible — challenge is completed via DNS API
  • Wildcard certificates (*.example.com) are supported
  • Certificates are obtained before the first request arrives

Caddy stores its ACME account and certificates in the caddy-data Docker volume. Certificates are renewed automatically, typically 30 days before expiry.

Certificate renewal

No action is required — Caddy handles renewal entirely. To check certificate status:

# Confirm the Cloudflare DNS module is compiled in, and inspect Caddy's runtime environment
docker exec mallard-caddy caddy list-modules | grep dns
docker exec mallard-caddy caddy environ

# Check cert expiry
echo | openssl s_client -connect analytics.example.com:443 -servername analytics.example.com 2>/dev/null \
    | openssl x509 -noout -dates

Custom domain configurations

Subdomain (most common):

# In .env
DOMAIN=analytics.example.com

Apex domain:

# In .env
DOMAIN=example.com

Multiple domains (edit deploy/Caddyfile directly):

analytics.example.com, stats.myothersite.io {
    import security_headers
    reverse_proxy mallard:8000 { ... }
}

Security Hardening

vps-audit integration

vps-audit performs 40+ security checks across SSH, firewall, kernel, authentication, file permissions, and services.

# Run a fresh audit at any time
sudo vps-audit

# Run with JSON output for automation
sudo vps-audit --format json > /tmp/audit.json

# View the initial audit report
cat /srv/mallard/vps-audit-initial-$(date +%Y%m%d).log

# View weekly audit logs
tail -100 /var/log/vps-audit.log

The weekly cron runs every Sunday at 03:00 UTC. Review WARN and FAIL items and address them using the audit's built-in guidance (vps-audit --guide).
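
The installed cron entry looks roughly like this (a sketch; the binary path and log destination are assumptions based on the text above):

```
# /etc/cron.d/vps-audit — weekly audit, Sunday 03:00 UTC
0 3 * * 0  root  /usr/local/bin/vps-audit >> /var/log/vps-audit.log 2>&1
```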

SSH hardening

setup.sh installs a hardening drop-in at /etc/ssh/sshd_config.d/99-mallard-hardening.conf:

PermitRootLogin no           # Root cannot SSH in at all
PasswordAuthentication no    # Only public key authentication
MaxAuthTries 3               # Lock after 3 failed attempts
LoginGraceTime 30            # 30s window to authenticate
ClientAliveInterval 300      # 5-minute keepalive
AllowAgentForwarding no      # No agent forwarding
AllowTcpForwarding no        # No tunnel forwarding
X11Forwarding no             # No graphical forwarding

fail2ban bans IPs after 5 failed SSH attempts for 1 hour.
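
That policy corresponds to a jail configuration along these lines (a sketch, not necessarily the exact file setup.sh installs):

```
# /etc/fail2ban/jail.local
[sshd]
enabled  = true
maxretry = 5
findtime = 10m
bantime  = 1h
```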

Firewall (UFW)

# View current rules
sudo ufw status numbered

# Default policy after setup.sh
# Default incoming: deny
# Default outgoing: allow
# 22/tcp  — SSH
# 80/tcp  — HTTP (Caddy redirects to HTTPS)
# 443/tcp — HTTPS
# 443/udp — HTTP/3 QUIC

Kernel parameters

Applied via /etc/sysctl.d/99-mallard-hardening.conf:

Setting                    Value  Effect
tcp_syncookies             1      SYN flood protection
rp_filter                  1      Spoofed packet rejection
accept_redirects           0      ICMP redirect attacks blocked
dmesg_restrict             1      Kernel log visible only to root
unprivileged_bpf_disabled  1      BPF restricted to privileged users
bpf_jit_harden             2      JIT hardening against side-channel attacks
suid_dumpable              0      No core dumps from setuid programs

Configuration Reference

All configuration is in deploy/.env. The file is created by setup.sh from deploy/.env.example. Here are the settings most commonly adjusted post-deployment:

Variable                        Default        Description
DOMAIN                          (required)     Hostname Caddy serves
MALLARD_RETENTION_DAYS          365            Delete Parquet partitions older than N days
MALLARD_RATE_LIMIT              0 (unlimited)  Max events/sec per site_id
MALLARD_CACHE_TTL               60             Query result cache TTL (seconds)
MALLARD_MAX_CONCURRENT_QUERIES  10             DuckDB concurrency cap
MALLARD_MAX_LOGIN_ATTEMPTS      5              Failed logins before IP lockout
MALLARD_LOGIN_LOCKOUT           300            Lockout duration (seconds)
MALLARD_GEOIP_DB                (blank)        Path to MaxMind GeoLite2-City.mmdb (inside container)

After editing .env, restart the stack:

docker compose -f deploy/docker-compose.production.yml up -d

Adding the Tracking Script

Add this to every page you want to track:

<script
  defer
  src="https://analytics.example.com/mallard.js"
  data-domain="example.com">
</script>

Replace analytics.example.com with your deployment domain and example.com with the site_id you want to use for this site.

Custom events:

window.mallard('Purchase', {
  revenue: 49.99,
  currency: 'USD',
  props: { plan: 'pro' }
});

Embed on GitHub Pages docs (static site):

Simply paste the <script> tag into your mdBook layout template or into individual markdown pages using HTML passthrough. The script is < 1 KB and has zero external dependencies.
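
For mdBook specifically, the snippet can go in theme/head.hbs, which mdBook injects into the <head> of every page (domain values below are the placeholders from this guide):

```html
<!-- theme/head.hbs -->
<script
  defer
  src="https://analytics.example.com/mallard.js"
  data-domain="example.com">
</script>
```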


Accessing the Dashboard Remotely

The dashboard is served at the root URL of your Mallard Metrics instance (e.g. https://analytics.example.com). It requires authentication when MALLARD_ADMIN_PASSWORD is set.

Note: The server sets X-Frame-Options: DENY to prevent clickjacking, so the dashboard cannot be embedded in an iframe. Access it directly in a browser tab instead.


Post-Deployment Operations

View logs

# All services (follow)
docker compose -f deploy/docker-compose.production.yml logs -f

# Mallard only (JSON structured logs)
docker compose -f deploy/docker-compose.production.yml logs mallard | jq .

# Caddy access log (on the LUKS volume)
tail -f /srv/mallard/data/logs/caddy-access.log | jq .

Update Mallard Metrics

cd ~/mallardmetrics

# Pull latest changes
git pull origin main

# Rebuild and restart (zero downtime if only Mallard changes)
docker compose -f deploy/docker-compose.production.yml build mallard
docker compose -f deploy/docker-compose.production.yml up -d mallard

# Or rebuild everything
docker compose -f deploy/docker-compose.production.yml build --no-cache
docker compose -f deploy/docker-compose.production.yml up -d

Caddy only needs to be rebuilt if you change deploy/Dockerfile.caddy or deploy/Caddyfile.

Backup and restore

Backup (done automatically daily by the cron job):

# Manual backup
rsync -a --delete /srv/mallard/data/ /srv/mallard/backup/

# Copy off-server (replace with your backup destination)
rsync -az /srv/mallard/data/ backup-server:/backups/mallard/$(date +%Y%m%d)/

Restore:

# Stop the stack
docker compose -f deploy/docker-compose.production.yml down

# Restore data files
rsync -a /srv/mallard/backup/ /srv/mallard/data/

# Restart
docker compose -f deploy/docker-compose.production.yml up -d

GeoIP setup

Mallard supports MaxMind GeoLite2-City for country/region/city resolution.

  1. Create a free MaxMind account at maxmind.com
  2. Download GeoLite2-City.mmdb
  3. Copy it to the data volume:
    cp GeoLite2-City.mmdb /srv/mallard/data/GeoLite2-City.mmdb
    
  4. Update deploy/.env:
    MALLARD_GEOIP_DB=/data/GeoLite2-City.mmdb
    
  5. Restart Mallard:
    docker compose -f deploy/docker-compose.production.yml restart mallard
    

Set up weekly automatic updates (MaxMind databases are updated Tuesdays and Fridays):

# Install geoipupdate
apt-get install -y geoipupdate

# Configure with your MaxMind account ID and licence key
# /etc/GeoIP.conf:
# AccountID YOUR_ACCOUNT_ID
# LicenseKey YOUR_LICENSE_KEY
# EditionIDs GeoLite2-City

# Run update
geoipupdate

# Link to the data volume
ln -sf /usr/share/GeoIP/GeoLite2-City.mmdb /srv/mallard/data/GeoLite2-City.mmdb
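
A cron entry to automate this (a sketch; the repository path is an assumption, adjust it to wherever you cloned the repo):

```
# /etc/cron.d/geoipupdate — refresh after MaxMind's Tuesday/Friday releases,
# then restart Mallard so it reopens the database
0 4 * * 3,6  root  geoipupdate && docker compose -f /home/deploy/mallardmetrics/deploy/docker-compose.production.yml restart mallard
```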

Monitoring

The detailed health endpoint returns rich status JSON:

curl -s https://analytics.example.com/health/detailed | jq .

Example response:

{
  "status": "ok",
  "version": "0.1.0",
  "buffered_events": 0,
  "auth_configured": true,
  "geoip_loaded": false,
  "behavioral_extension_loaded": true,
  "filter_bots": true,
  "cache_entries": 0,
  "cache_empty": true
}

Prometheus metrics (requires MALLARD_METRICS_TOKEN):

curl -H "Authorization: Bearer $MALLARD_METRICS_TOKEN" \
  https://analytics.example.com/metrics

Available metrics:

  • mallard_events_ingested_total — cumulative event count
  • mallard_flush_failures_total — Parquet flush failures
  • mallard_rate_limit_rejections_total — rate-limited requests
  • mallard_login_failures_total — failed dashboard logins
  • mallard_cache_hits_total / mallard_cache_misses_total — query cache
  • mallard_behavioral_extension — 1 if the behavioral extension is loaded

UptimeRobot / Better Uptime:

Monitor https://<domain>/health/ready with a 1-minute interval. It returns HTTP 200 when the database is reachable, 503 otherwise.


Troubleshooting

Caddy shows "certificate error" or HTTP instead of HTTPS

# Check Caddy logs for ACME errors
docker compose -f deploy/docker-compose.production.yml logs caddy | grep -i "acme\|cert\|error"

# Common causes:
# 1. CLOUDFLARE_API_TOKEN is wrong or lacks Zone:DNS:Edit permission
# 2. DNS A record not yet propagated (allow up to 10 minutes)
# 3. You hit Let's Encrypt rate limits — wait 1 hour or switch to staging
#    (uncomment the acme_ca staging line in deploy/Caddyfile)

Mallard container exits immediately

docker compose -f deploy/docker-compose.production.yml logs mallard

# Common cause: MALLARD_SECRET is blank (required at startup)
# Check deploy/.env has MALLARD_SECRET set to a non-empty value

Data volume not mounted after reboot

# Check if LUKS device is open
ls -la /dev/mapper/mallard-data || echo "LUKS device not open"

# Check mount
mountpoint /srv/mallard/data || echo "Not mounted"

# Manually open and mount
sudo cryptsetup luksOpen --key-file /etc/mallard-data.key \
    /srv/mallard/data.img mallard-data
sudo mount /dev/mapper/mallard-data /srv/mallard/data

# Restart stack
docker compose -f deploy/docker-compose.production.yml up -d

Health check returns 503

# Mallard is running but the DuckDB VIEW rebuild failed
docker compose -f deploy/docker-compose.production.yml logs mallard | tail -50

# Try restarting Mallard only (Caddy stays up, no TLS interruption)
docker compose -f deploy/docker-compose.production.yml restart mallard

Port 443 already in use

sudo ss -tlnp | grep :443
# If another process (nginx, apache) is listening:
sudo systemctl stop nginx apache2 2>/dev/null || true
docker compose -f deploy/docker-compose.production.yml up -d caddy

Out of disk space

df -h /srv/mallard/data   # Check LUKS volume usage
df -h /var/lib/docker     # Check Docker overlay usage

# Trim old Docker layers
docker system prune -f

# Enable data retention if not already set
# In deploy/.env: MALLARD_RETENTION_DAYS=365
# Then restart mallard

Frequently Asked Questions

Q: Can I deploy without Cloudflare?

Yes — use any DNS provider that Caddy supports. The DNS-01 plugin ecosystem includes Route53, GoDaddy, Namecheap, Gandi, and many others. See caddyserver.com/docs/modules/dns for the full list. Alternatively, if port 80 is accessible from the internet, change the Caddyfile global block to remove acme_dns and Caddy will use the HTTP-01 challenge automatically.

Q: Can I run Mallard Metrics on a Raspberry Pi or ARM server?

The current Dockerfile targets x86_64-unknown-linux-musl. To build for ARM64, change the target to aarch64-unknown-linux-musl in the Dockerfile and add platform: linux/arm64 to the compose service. The rest of the stack (Caddy, LUKS) is architecture-agnostic.

Q: How do I add multiple sites?

Mallard Metrics handles multiple sites with a single deployment. Each site uses a different data-domain in the tracking script. All data is partitioned by site_id at the Parquet layer. Dashboard queries are filtered per site.

Q: Is the LUKS keyfile approach secure?

The keyfile provides encryption at rest — protection against an attacker who obtains the raw disk image (e.g., a stolen drive or a cloud snapshot). It does not protect against an attacker who has live root access to a running server, because the decrypted volume is mounted and readable. For higher threat models, use a passphrase-protected LUKS setup with manual unlock after reboot, or consider a dedicated HSM.

Q: How do I change the admin password?

# Set the new password in deploy/.env
sed -i 's/^MALLARD_ADMIN_PASSWORD=.*/MALLARD_ADMIN_PASSWORD=new-password-here/' deploy/.env

# Restart mallard to pick it up
docker compose -f deploy/docker-compose.production.yml restart mallard

Q: Can I use a wildcard certificate?

Yes. DNS-01 challenge (which this setup uses) supports wildcards. Change your domain to *.example.com in the Caddyfile and the certificate will cover all subdomains.

Q: How do I run Mallard Metrics on a private/internal network with no public IP?

Since we use the DNS-01 challenge, the server does not need to be reachable on port 80 from the internet. Any server that can make outbound HTTPS requests to Cloudflare's API can get a certificate — including servers on private VPNs, home labs, and internal networks.

Q: What happens to data if the LUKS container runs out of space?

Mallard will return errors on write (DuckDB INSERT and Parquet COPY TO will fail). Flush failures are counted in the mallard_flush_failures_total Prometheus metric. In-memory buffered events are preserved and retried. To prevent this, monitor disk usage and enable MALLARD_RETENTION_DAYS to automatically delete old partitions.

Q: Can I enable Let's Encrypt staging to test without hitting rate limits?

Yes. In deploy/Caddyfile, uncomment:

acme_ca https://acme-staging-v02.api.letsencrypt.org/directory

Your browser will show a certificate warning (staging certs aren't trusted), but you can verify the issuance flow. Remove the line and docker compose restart caddy to switch back to production.

Q: How do I integrate this with Grafana or another dashboard?

Use the Prometheus /metrics endpoint as a data source. For detailed analytics data, the JSON export endpoint (GET /api/stats/export?format=json) produces daily rollups that can be ingested into any TSDB.
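
A matching Prometheus scrape job might look like this (a sketch; the hostname and token are the placeholders used throughout this guide):

```yaml
# prometheus.yml
scrape_configs:
  - job_name: mallard
    scheme: https
    metrics_path: /metrics
    authorization:
      credentials: your-prometheus-token   # MALLARD_METRICS_TOKEN
    static_configs:
      - targets: ['analytics.example.com']
```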


Index

Term                  Section
A record              Domain and DNS
ACME                  Caddy and TLS
Admin password        FAQ — change password
Backup                Backup and restore
Caddy                 Architecture, Caddy and TLS
Certificate renewal   Certificate renewal
Cloudflare API token  Cloudflare API token
Configuration         Configuration Reference
crypttab              After reboot
DNS-01 challenge      Cloudflare DNS challenge
Docker Compose        One-Command Deployment
fail2ban              SSH hardening
Firewall              Firewall (UFW)
GeoIP                 GeoIP setup
Health check          Monitoring
HTTP/3 QUIC           Architecture
Kernel hardening      Kernel parameters
LUKS encryption       LUKS Encrypted Volume
Logging               View logs
Metrics (Prometheus)  Monitoring
Multi-site            FAQ — multiple sites
Resize volume         Resizing the volume
setup.sh              What setup.sh Does
SSH key               SSH key access
TLS                   Caddy and TLS
Tracking script       Adding the Tracking Script
UFW                   Firewall (UFW)
Updates               Update Mallard Metrics
vps-audit             vps-audit integration
Wildcard certificate  FAQ — wildcard certificate

Fly.io Deployment

Fly.io is a managed application platform that runs Docker containers in hardware-isolated micro-VMs (Firecracker) across a global network. It is not a "free tier" service — it requires a credit card. However, its Hobby plan includes enough free allowances to run Mallard Metrics at low-to-medium traffic volumes at little or no monthly cost.


Overview

Fly.io runs your Docker image as a Firecracker micro-VM. Mallard Metrics deploys well because:

  • The FROM scratch musl-static binary has no OS dependencies
  • Fly.io provides persistent volumes for DuckDB and Parquet data
  • Fly.io terminates TLS automatically — no Caddy or certbot needed
  • The Fly.io edge network handles HTTP/2 and HTTPS globally
  • Machines auto-start on traffic and can auto-stop when idle

Limitations compared to a dedicated VPS:

  • No LUKS encryption (volume encryption is managed by Fly.io's infrastructure)
  • Auto-stop means cold-start latency if traffic is infrequent
  • Volume size and I/O throughput are lower than a dedicated disk
  • Scaling beyond a single machine requires paid plan upgrades

Fly.io vs VPS: When to Choose Each

| Criterion | Fly.io | Dedicated VPS |
|---|---|---|
| Setup time | < 15 minutes | 30–60 minutes |
| Monthly cost (light traffic) | ~$0–$5 | $4–$10 |
| TLS management | Automatic | Caddy (setup.sh handles) |
| Data encryption at rest | Platform-managed | LUKS (user-managed) |
| Cold-start latency | Yes (if auto-stop) | No |
| Custom kernel tuning | No | Yes |
| Multi-region | Yes | Manual |
| Persistent storage | Volumes (3 GB included) | LUKS image (you size it) |
| SSH access | fly ssh console | Direct SSH |

Choose Fly.io if you want zero infrastructure maintenance and are comfortable with platform-managed data storage.

Choose a VPS if you need full control, LUKS encryption, or higher data volumes.


Pricing and Allowances

Fly.io's Hobby plan (requires a payment method) includes monthly allowances:

| Resource | Included free |
|---|---|
| Shared-CPU-1x 256 MB VMs | 3 VMs |
| Persistent volume storage | 3 GB |
| Outbound data transfer | 160 GB |
| TLS certificates | Unlimited |

Mallard Metrics needs:

  • 1 VM — shared-cpu-1x with 256 MB RAM is sufficient for up to ~10k daily events. Scale to 512 MB or a performance-1x CPU for higher loads.
  • 1 Volume — minimum 1 GB (DuckDB grows with data). 3 GB is comfortable for a year of moderate traffic.

At low traffic, your deployment may fit entirely within the free allowances. At higher traffic or with a large data volume, expect $1–5/month.
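As a back-of-envelope check on those sizing claims (the ~200 bytes per stored event after ZSTD compression is an illustrative assumption, not a measured figure):

```shell
# Rough volume sizing: events/day × bytes/event × 365 days, reported in GB
# BYTES_PER_EVENT=200 is an assumed average for ZSTD-compressed Parquet rows
EVENTS_PER_DAY=10000
BYTES_PER_EVENT=200
awk -v e="$EVENTS_PER_DAY" -v b="$BYTES_PER_EVENT" \
    'BEGIN { printf "%.2f GB/year\n", e * b * 365 / 1e9 }'
# → 0.73 GB/year — comfortably within the 3 GB volume allowance
```

Re-run with your own event volume and a measured bytes-per-event figure (total Parquet size divided by event count) before committing to a volume size.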


Prerequisites

  • A Fly.io account — sign up at fly.io (credit card required)
  • flyctl installed on your local machine
  • The mallardmetrics repository cloned locally
  • A domain name (optional — Fly.io provides a .fly.dev subdomain for free)

Initial Setup

Install flyctl

macOS:

brew install flyctl

Linux:

curl -L https://fly.io/install.sh | sh
# Add to PATH (add this to ~/.bashrc or ~/.zshrc)
export PATH="$HOME/.fly/bin:$PATH"

Windows:

iwr https://fly.io/install.ps1 -useb | iex

Verify:

fly version

Authenticate

fly auth login
# Opens a browser — log in to your Fly.io account

Configure the Application

fly.toml

Create fly.toml in the repository root:

# Mallard Metrics — Fly.io configuration
# Replace "mallard-metrics-YOURNAME" with a globally unique app name.

app            = "mallard-metrics-YOURNAME"
primary_region = "ord"    # Chicago. See: fly platform regions

[build]
  # Use the existing Dockerfile (FROM scratch, musl binary)
  dockerfile = "Dockerfile"

[env]
  # Non-secret configuration — secrets go in fly secrets (see below)
  # IMPORTANT: env var names must match config.rs exactly:
  #   MALLARD_RATE_LIMIT  → config.rate_limit_per_site  (not MALLARD_RATE_LIMIT_PER_SITE)
  #   MALLARD_CACHE_TTL   → config.cache_ttl_secs       (not MALLARD_CACHE_TTL_SECS)
  #   MALLARD_GEOIP_DB    → config.geoip_db_path        (not MALLARD_GEOIP_DB_PATH)
  MALLARD_DATA_DIR           = "/data"
  MALLARD_HOST               = "0.0.0.0"
  MALLARD_PORT               = "8080"
  MALLARD_LOG_FORMAT         = "json"
  MALLARD_FILTER_BOTS        = "true"
  MALLARD_SECURE_COOKIES     = "true"
  MALLARD_RETENTION_DAYS     = "365"
  MALLARD_RATE_LIMIT         = "200"
  MALLARD_CACHE_TTL          = "300"
  MALLARD_MAX_LOGIN_ATTEMPTS = "5"
  MALLARD_LOGIN_LOCKOUT      = "300"
  RUST_LOG                   = "mallard_metrics=info,tower_http=warn"

[http_service]
  internal_port       = 8080
  force_https         = true       # Fly.io handles TLS; redirect HTTP → HTTPS
  auto_stop_machines  = "stop"     # Stop idle machines to save cost
  auto_start_machines = true       # Auto-start on new traffic
  min_machines_running = 1         # Keep at least 1 machine alive (prevents cold starts)
  processes            = ["app"]

  [http_service.concurrency]
    type       = "requests"
    soft_limit = 200
    hard_limit = 250

[[vm]]
  cpu_kind = "shared"
  cpus     = 1
  memory   = "256mb"     # Increase to "512mb" for >50k daily events

[mounts]
  source      = "mallard_data"    # Volume name (created below)
  destination = "/data"
  initial_size = "3gb"

[checks]
  [checks.health]
    grace_period = "10s"
    interval     = "30s"
    method       = "GET"
    path         = "/health/ready"
    port         = 8080
    timeout      = "5s"
    type         = "http"

Choose your region (primary_region):

fly platform regions
# Pick the region closest to your users or your DNS provider
# Common choices: ord (Chicago), iad (Virginia), lax (Los Angeles),
#                 lhr (London), fra (Frankfurt), nrt (Tokyo), sin (Singapore)

Dockerfile note

The existing Dockerfile targets x86_64-unknown-linux-musl. Fly.io runs on x86-64 by default — no changes to the Dockerfile are needed.

If you want to build for Fly.io's ARM machines (--vm-cpu-kind performance), change the target to aarch64-unknown-linux-musl and update the rust-toolchain.toml accordingly.


Create a Persistent Volume

The Fly.io volume stores DuckDB and Parquet data between deployments and machine restarts.

# Create a 3 GB volume in your primary region (included in Hobby allowances)
fly volumes create mallard_data \
  --size 3 \
  --region ord \
  --app mallard-metrics-YOURNAME

# Verify
fly volumes list --app mallard-metrics-YOURNAME

Important: Volumes are single-region and single-machine by default. If you scale to multiple machines, each machine needs its own volume — but Mallard Metrics is a single-instance application (DuckDB is embedded). Do not scale to more than 1 machine without understanding the data consistency implications.


Set Secrets

Fly.io secrets are encrypted at rest and injected as environment variables at runtime. Never put secrets in fly.toml.

APP=mallard-metrics-YOURNAME

# Required secrets — generate strong values:
fly secrets set \
  MALLARD_SECRET="$(openssl rand -base64 48)" \
  MALLARD_ADMIN_PASSWORD="$(openssl rand -base64 24 | tr -d '=+/' | head -c 32)" \
  MALLARD_METRICS_TOKEN="$(openssl rand -hex 32)" \
  --app "$APP"

Save the admin password before running the above — it is not retrievable after setting:

# Generate and save before setting:
ADMIN_PASS="$(openssl rand -base64 24 | tr -d '=+/' | head -c 32)"
echo "Admin password: $ADMIN_PASS"  # Save this!
fly secrets set MALLARD_ADMIN_PASSWORD="$ADMIN_PASS" --app "$APP"

To update a secret later:

fly secrets set MALLARD_ADMIN_PASSWORD="new-password" --app "$APP"
# Fly.io triggers a rolling restart automatically

To view which secrets are set (names only — values are never shown):

fly secrets list --app "$APP"

Deploy

# From the repository root directory
fly deploy --app mallard-metrics-YOURNAME

# Or launch for the first time (creates app + prompts for config):
fly launch
# Answer the prompts; Fly.io will detect the Dockerfile and suggest settings.
# Review the generated fly.toml and adjust as described above.

Fly.io will:

  1. Build the Docker image remotely (using Fly's build infrastructure)
  2. Push it to Fly.io's container registry
  3. Create a Firecracker micro-VM from the image
  4. Mount the mallard_data volume at /data
  5. Inject secrets as environment variables
  6. Start the machine and run health checks

Deployment typically takes 2–4 minutes. Watch progress:

fly deploy --app mallard-metrics-YOURNAME 2>&1 | tee deploy.log

Configure a Custom Domain

By default your app is available at https://mallard-metrics-YOURNAME.fly.dev.

To use a custom domain:

# 1. Add the domain to your Fly.io app
fly certs add analytics.example.com --app mallard-metrics-YOURNAME

# 2. Fly.io will show you the DNS records to create:
fly certs show analytics.example.com --app mallard-metrics-YOURNAME

Create the DNS records shown (usually a CNAME to <app>.fly.dev or an A/AAAA to Fly's IPs). Fly.io obtains a Let's Encrypt certificate automatically via the HTTP-01 or DNS-01 challenge.

Update the app to know its domain:

fly secrets set \
  MALLARD_DASHBOARD_ORIGIN="https://analytics.example.com" \
  --app mallard-metrics-YOURNAME

Verify the Deployment

# Check machine status
fly status --app mallard-metrics-YOURNAME

# View machine health
fly checks list --app mallard-metrics-YOURNAME

# Quick smoke test
curl -s https://mallard-metrics-YOURNAME.fly.dev/health/ready
# Expected: ready

# View all available endpoints
curl -s https://mallard-metrics-YOURNAME.fly.dev/health/detailed | jq .

Open https://mallard-metrics-YOURNAME.fly.dev (or your custom domain) in a browser. Log in with the admin password you set.


Logs and Monitoring

# Stream live logs
fly logs --app mallard-metrics-YOURNAME

# Historical logs (last N lines)
fly logs --app mallard-metrics-YOURNAME -n 200

# Parse JSON structured logs
fly logs --app mallard-metrics-YOURNAME | jq 'select(.fields.uri != "/health/ready")'

# Machine console (SSH equivalent — note: FROM scratch has no shell)
# Use this to inspect the volume contents:
fly ssh console --app mallard-metrics-YOURNAME
# > ls /data/

Prometheus metrics:

# Secret values are not retrievable via fly secrets list (it shows names only);
# use the MALLARD_METRICS_TOKEN value you saved when you set it:
curl -H "Authorization: Bearer $MALLARD_METRICS_TOKEN" \
  https://mallard-metrics-YOURNAME.fly.dev/metrics

Fly.io built-in monitoring:

The Fly.io dashboard at fly.io/apps/YOUR-APP shows:

  • Machine CPU and memory graphs
  • HTTP request rate and latency
  • Health check pass/fail history

Scaling and Regions

Increase VM memory (if DuckDB queries are slow or OOMing):

# Edit fly.toml:
# [[vm]]
#   memory = "512mb"   # or "1gb"

fly deploy  # Apply the change

Prevent cold starts (machine auto-stops when idle):

# In fly.toml, ensure:
# [http_service]
#   min_machines_running = 1

This keeps 1 machine always running, eliminating cold-start latency at the cost of ~1 machine's worth of compute (within Hobby allowances).

Multi-region (advanced):

Fly.io supports deploying machines in multiple regions for lower global latency. However, Mallard Metrics uses an embedded single-file DuckDB database — volumes cannot be shared across regions. Multi-region deployment is not recommended without a replication strategy.


Updating Mallard Metrics

# Pull latest changes
git pull origin main

# Deploy (Fly.io builds the new image and does a rolling restart)
fly deploy --app mallard-metrics-YOURNAME

# Monitor the deploy
fly status --app mallard-metrics-YOURNAME
fly logs --app mallard-metrics-YOURNAME

Fly.io performs a blue/green-style deploy — it starts the new machine, runs health checks, and only terminates the old machine once the new one is healthy. Downtime is typically < 5 seconds.


Backup and Restore

Fly.io volumes are not automatically backed up. Back up the DuckDB file and Parquet data regularly.

Export via API (for structured backup):

# JSON export of all data
curl -H "Authorization: Bearer $API_KEY" \
  "https://mallard-metrics-YOURNAME.fly.dev/api/stats/export?site_id=example.com&format=json" \
  > backup-$(date +%Y%m%d).json

Volume snapshot (Fly.io feature):

# List volumes
fly volumes list --app mallard-metrics-YOURNAME

# Create a snapshot (may cause brief I/O pause)
fly volumes snapshots create <VOLUME_ID> --app mallard-metrics-YOURNAME

# List snapshots
fly volumes snapshots list <VOLUME_ID> --app mallard-metrics-YOURNAME

Restore from snapshot:

# Create a new volume from snapshot
fly volumes create mallard_data_restore \
  --snapshot-id <SNAPSHOT_ID> \
  --size 3 \
  --region ord \
  --app mallard-metrics-YOURNAME

Troubleshooting

Machine fails to start

fly logs --app mallard-metrics-YOURNAME | tail -50

# Common causes:
# 1. MALLARD_SECRET not set — run: fly secrets list
# 2. Volume not found — run: fly volumes list
# 3. Port mismatch — ensure MALLARD_PORT=8080 matches fly.toml internal_port=8080

Health checks failing

fly checks list --app mallard-metrics-YOURNAME

# Test the endpoint manually
fly ssh console --app mallard-metrics-YOURNAME
# Inside the console (if you have a shell):
wget -qO- http://localhost:8080/health/ready
# Note: FROM scratch has no shell — use fly proxy instead:
fly proxy 8080 --app mallard-metrics-YOURNAME
# Then in another terminal: curl http://localhost:8080/health/ready

Volume not mounted / data missing after update

# Check the mount
fly ssh console --app mallard-metrics-YOURNAME
ls /data/

# If /data is empty, the volume may have been detached
# Verify volume attachment in fly.toml [mounts] section matches the volume name
fly volumes list --app mallard-metrics-YOURNAME

Out of disk space on volume

# Extend the volume (Fly.io allows online resize)
fly volumes extend <VOLUME_ID> --size 10 --app mallard-metrics-YOURNAME

# Enable retention to prune old data
fly secrets set MALLARD_RETENTION_DAYS=180 --app mallard-metrics-YOURNAME

Machine auto-stopped unexpectedly

# Check if auto_stop_machines is enabled in fly.toml
# Ensure min_machines_running = 1 to prevent full auto-stop

# Or disable auto-stop entirely:
# [http_service]
#   auto_stop_machines = "off"

Frequently Asked Questions

Q: Does Fly.io encrypt volume data at rest?

Yes — Fly.io encrypts all volume data at rest using AES-256. You do not need to manage LUKS yourself. For compliance requirements, consult Fly.io's security documentation.

Q: Do I need a credit card?

Yes. Fly.io requires a payment method for all accounts, including those that stay within the free allowances. There is no truly card-free free tier.

Q: What is the cold-start latency?

When auto_stop_machines = "stop" and min_machines_running = 0, an idle machine is stopped after ~5 minutes. The first request after that triggers a cold start — typically 2–5 seconds for the Firecracker VM to boot. For an analytics ingestion endpoint, this means some requests may be delayed or dropped during cold start. Set min_machines_running = 1 to keep the machine always warm.

Q: Can I use Fly.io without a custom domain?

Yes. Fly.io provides a free *.fly.dev subdomain with a valid TLS certificate. Use it in your tracking script and dashboard URL.

Q: How do I SSH into the machine?

fly ssh console --app mallard-metrics-YOURNAME

Note that the Mallard container is FROM scratch and has no shell. The fly ssh console command connects to the VM's outer shell (not the container), so you can run ls / but not exec into the container.

To inspect the data volume:

fly ssh console --app mallard-metrics-YOURNAME
ls /data/         # See DuckDB and Parquet files
du -sh /data/     # Check usage

Q: Can I run Mallard Metrics alongside other services?

Fly.io apps are isolated. You can deploy other services as separate Fly apps in the same organization; they share the same billing account, and each service gets its own machine(s) and volume(s).

Q: How do I migrate from Fly.io to a VPS?

  1. Export your data via the API (/api/stats/export)
  2. Or copy the volume contents: create a volume snapshot, restore it locally
  3. Copy mallard.duckdb and the Parquet data directory to your VPS LUKS volume
  4. Follow the VPS Deployment Guide

Q: Does the behavioral extension work on Fly.io?

Yes, if the behavioral extension binary is included in the build. Check GET /health/detailed: "behavioral_extension_loaded": true confirms it loaded successfully.

Q: What happens to in-flight events if the machine is auto-stopped?

Mallard handles SIGTERM with a graceful shutdown — it flushes the in-memory event buffer to Parquet before the machine stops. As long as the shutdown completes within MALLARD_SHUTDOWN_TIMEOUT_SECS (default 30s), no events are lost. Events buffered after the flush starts may be lost. Set min_machines_running = 1 to avoid auto-stop entirely for high-reliability deployments.

Monitoring

Health Checks

Three health endpoints are available without authentication:

GET /health

Returns ok with HTTP 200 when the server is running. Use this for:

  • Load balancer health checks.
  • Container orchestrator liveness probes.
# Kubernetes liveness probe
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10

GET /health/ready

Executes a lightweight DuckDB query (SELECT 1 FROM events_all LIMIT 0) to verify the database is operational. Returns:

  • 200 OK — database is ready and accepting queries.
  • 503 Service Unavailable — database is not ready (use this as a readiness probe, not liveness).
# Kubernetes readiness probe
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3

GET /health/detailed

Returns a JSON object with component-level status. See Health & Metrics API for the full schema.


Prometheus Metrics

GET /metrics returns Prometheus text format metrics (text/plain; version=0.0.4).

If MALLARD_METRICS_TOKEN is set, this endpoint requires Authorization: Bearer <token>.

Gauges

| Metric | Type | Description |
|---|---|---|
| mallard_buffered_events | gauge | Events in memory, not yet flushed to Parquet |
| mallard_cache_entries | gauge | Cached query results in memory |
| mallard_auth_configured | gauge | 1 if admin password is set, 0 otherwise |
| mallard_geoip_loaded | gauge | 1 if GeoIP database loaded successfully |
| mallard_filter_bots | gauge | 1 if bot filtering is active |
| mallard_behavioral_extension | gauge | 1 if behavioral extension loaded, 0 otherwise |

Counters

| Metric | Type | Description |
|---|---|---|
| mallard_events_ingested_total | counter | Total events accepted through POST /api/event |
| mallard_flush_failures_total | counter | Total buffer flush failures |
| mallard_rate_limit_rejections_total | counter | Total requests rejected by the per-site rate limiter |
| mallard_login_failures_total | counter | Total failed login attempts |
| mallard_cache_hits_total | counter | Total query cache hits |
| mallard_cache_misses_total | counter | Total query cache misses |

Prometheus Scrape Configuration

scrape_configs:
  - job_name: mallard_metrics
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: /metrics
    scrape_interval: 30s
    # If MALLARD_METRICS_TOKEN is set:
    authorization:
      credentials: your-metrics-token

Example Output

# HELP mallard_buffered_events Number of events in the in-memory buffer
# TYPE mallard_buffered_events gauge
mallard_buffered_events 42

# HELP mallard_cache_entries Number of cached query results
# TYPE mallard_cache_entries gauge
mallard_cache_entries 3

# HELP mallard_behavioral_extension Whether behavioral extension is loaded
# TYPE mallard_behavioral_extension gauge
mallard_behavioral_extension 1

# HELP mallard_events_ingested_total Total events ingested
# TYPE mallard_events_ingested_total counter
mallard_events_ingested_total 158432

# HELP mallard_cache_hits_total Total query cache hits
# TYPE mallard_cache_hits_total counter
mallard_cache_hits_total 9871

# HELP mallard_cache_misses_total Total query cache misses
# TYPE mallard_cache_misses_total counter
mallard_cache_misses_total 1204
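Using the example counter values above, the cache hit rate (the same ratio the Grafana panel below computes with rate()) can be sanity-checked by hand:

```shell
# Cache hit rate = hits / (hits + misses), using the example scrape values
HITS=9871
MISSES=1204
awk -v h="$HITS" -v m="$MISSES" \
    'BEGIN { printf "hit rate: %.3f\n", h / (h + m) }'
# → hit rate: 0.891
```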

Grafana Dashboard

A minimal Grafana panel configuration for key metrics:

{
  "panels": [
    {
      "title": "Ingestion Rate",
      "targets": [{"expr": "rate(mallard_events_ingested_total[5m])"}]
    },
    {
      "title": "Buffered Events",
      "targets": [{"expr": "mallard_buffered_events"}]
    },
    {
      "title": "Cache Hit Rate",
      "targets": [{"expr": "rate(mallard_cache_hits_total[5m]) / (rate(mallard_cache_hits_total[5m]) + rate(mallard_cache_misses_total[5m]))"}]
    },
    {
      "title": "Rate Limit Rejections",
      "targets": [{"expr": "rate(mallard_rate_limit_rejections_total[5m])"}]
    }
  ]
}

Structured Logging

Mallard Metrics uses tracing for structured logging. Two formats are supported:

Text (default)

Human-readable output with timestamps, log levels, and structured fields:

2024-01-15T10:00:00.123Z  INFO mallard_metrics: Starting Mallard Metrics host="0.0.0.0" port=8000
2024-01-15T10:00:00.456Z  INFO mallard_metrics: Behavioral extension loaded
2024-01-15T10:00:00.457Z  INFO mallard_metrics: Listening addr="0.0.0.0:8000"

JSON

Set MALLARD_LOG_FORMAT=json for machine-parseable output compatible with log aggregators (Loki, Elasticsearch, Splunk):

{"timestamp":"2024-01-15T10:00:00.123Z","level":"INFO","fields":{"message":"Flushed events to Parquet","count":42},"target":"mallard_metrics::ingest::buffer","request_id":"a3f2c1d8-..."}

Every log line emitted during a request carries a request_id field matching the X-Request-ID response header, enabling end-to-end log correlation.
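A minimal sketch of that correlation, filtering a JSON log stream down to one request_id with grep (the sample lines and IDs below are made up; jq works equally well):

```shell
# Two sample JSON log lines; keep only those matching one request_id
RID="a3f2c1d8"
LOGS='{"level":"INFO","request_id":"a3f2c1d8","fields":{"message":"Flushed events to Parquet"}}
{"level":"INFO","request_id":"7b9e2f10","fields":{"message":"unrelated request"}}'
printf '%s\n' "$LOGS" | grep "\"request_id\":\"$RID\""
```

In practice the stream would come from fly logs, journalctl, or your log aggregator rather than a variable.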

Log Level Control

Use the RUST_LOG environment variable (standard tracing-subscriber env-filter syntax):

RUST_LOG=mallard_metrics=debug,tower_http=info

Default: mallard_metrics=info,tower_http=info


Alerting Recommendations

| Alert | Condition | Severity |
|---|---|---|
| Server down | up{job="mallard_metrics"} == 0 | Critical |
| Large event buffer | mallard_buffered_events > 5000 | Warning |
| High flush failures | increase(mallard_flush_failures_total[5m]) > 0 | Warning |
| Auth not configured | mallard_auth_configured == 0 | Warning |
| High rate limit rejections | rate(mallard_rate_limit_rejections_total[5m]) > 10 | Info |
| Low cache hit rate | (cache_hits / (cache_hits + cache_misses)) < 0.5 | Info |
| GeoIP not loaded | mallard_geoip_loaded == 0 | Info |
| Behavioral extension missing | mallard_behavioral_extension == 0 | Info |

Data Management

Storage Layout

Events are stored as date-partitioned, ZSTD-compressed Parquet files under data_dir/events/:

data/events/
├── site_id=example.com/
│   ├── date=2024-01-15/
│   │   ├── 0001.parquet   ← first flush for this day
│   │   └── 0002.parquet   ← second flush for this day
│   └── date=2024-01-16/
│       └── 0001.parquet
└── site_id=other.org/
    └── date=2024-01-15/
        └── 0001.parquet

Each Parquet file contains one batch of flushed events for a specific site and date. Files are numbered sequentially within each partition. Parquet files are self-describing and can be read by any Parquet-compatible tool.


Buffer and Flush Lifecycle

flowchart TD
    EVENT["Incoming Event\nPOST /api/event"]
    EVENT --> BUF["In-Memory Buffer\nVec&lt;Event&gt;"]

    BUF --> T1{"Count reached\nflush_event_count?"}
    BUF --> T2{"Periodic timer\nevery flush_interval_secs?"}
    BUF --> T3{"SIGINT or SIGTERM\ngraceful shutdown?"}

    T1 -->|"Yes"| FLUSH
    T2 -->|"Yes"| FLUSH
    T3 -->|"Yes"| FLUSH

    FLUSH["Flush — spawn_blocking\nDuckDB Appender API\nbatch column insert"]
    FLUSH --> PARQUET["COPY TO Parquet\nZSTD compression\ndate-partitioned file"]
    PARQUET --> DELETE["DELETE FROM events\nhot table cleared"]
    DELETE --> VIEW["Refresh events_all VIEW\nglobbed over new Parquet files"]
    VIEW --> READY(["All data queryable\nevents_all VIEW\nhot union cold"])

Failure safety: If the Appender insertion fails, drained events are restored to the front of the buffer and the flush returns an error. No events are lost due to a failed flush attempt.

Flush triggers:

  1. Event count reaches flush_event_count (default 1000).
  2. Periodic timer fires every flush_interval_secs (default 60 seconds). Runs in spawn_blocking to avoid blocking the async runtime.
  3. Graceful shutdown — bounded by shutdown_timeout_secs (default 30 seconds).
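The three triggers map to three config keys; a sketch of the corresponding config-file fragment, using the key names and defaults listed above (the exact file layout is an assumption):

```toml
# Flush tuning — defaults shown; adjust per workload
flush_event_count     = 1000   # flush when the buffer reaches this many events
flush_interval_secs   = 60     # periodic flush timer
shutdown_timeout_secs = 30     # graceful-shutdown flush deadline
```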

Data Retention

When retention_days is set to a non-zero value, a background task runs daily and removes Parquet partition directories older than the configured threshold.

# Delete partitions older than 90 days
retention_days = 90

What is deleted: the entire date=YYYY-MM-DD/ directory and all Parquet files within it.

What is not deleted: the site_id=*/ parent directory (it remains even if all date partitions have been removed).

To keep data indefinitely, set retention_days = 0 (the default).
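The pruning rule can be sketched on a throwaway partition tree (directory names follow the storage layout above; this illustrates the date comparison only, not the actual implementation of the background task):

```shell
# Create two date partitions, then prune those lexicographically older than CUTOFF
# (ISO dates compare correctly as plain strings)
ROOT=$(mktemp -d)
mkdir -p "$ROOT/site_id=example.com/date=2023-01-01" \
         "$ROOT/site_id=example.com/date=2099-01-01"
CUTOFF="2024-01-01"            # the real task derives this from retention_days
for dir in "$ROOT"/site_id=*/date=*; do
  d=${dir##*date=}             # partition date, e.g. 2023-01-01
  if expr "$d" \< "$CUTOFF" >/dev/null; then
    echo "prune: $dir"
    rm -rf "$dir"
  fi
done
```

Note that the site_id=example.com/ parent directory survives, matching the behavior described above.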

GDPR Right to Erasure

Mallard Metrics provides an admin-authenticated DELETE /api/gdpr/erase endpoint to permanently delete analytics data for a given site_id within a date range. Because visitor IDs are pseudonymous daily-rotating HMAC hashes that cannot be reverse-mapped to individuals, erasure operates at site + date-range granularity, the finest available without the original IP address and User-Agent. See PRIVACY.md for the full analysis and operator obligations.


Backup and Restore

Parquet files are self-describing and portable. To back up:

# Sync data directory to a backup location
rsync -a --checksum /data/events/ /backup/mallard-events/

# Or with rclone to S3
rclone sync /data/events s3:my-bucket/mallard-events

To restore:

rsync -a /backup/mallard-events/ /data/events/

After restore, restart Mallard Metrics. The events_all VIEW automatically picks up all Parquet files on startup.

Tip: Include data/mallard.duckdb and data/mallard.duckdb.wal in your backups to preserve any hot (not yet flushed) events.


Inspecting Data with DuckDB CLI

You can query Parquet files directly with the DuckDB CLI, independent of the Mallard Metrics server:

duckdb

-- Daily visitor and pageview counts for a site
SELECT
    CAST(timestamp AS DATE) AS date,
    COUNT(DISTINCT visitor_id)                            AS visitors,
    COUNT(*) FILTER (WHERE event_name = 'pageview')       AS pageviews
FROM read_parquet('data/events/site_id=example.com/**/*.parquet')
GROUP BY date
ORDER BY date DESC;

-- Top pages last 30 days
SELECT pathname, COUNT(*) AS views
FROM read_parquet('data/events/site_id=example.com/**/*.parquet')
WHERE event_name = 'pageview'
  AND CAST(timestamp AS DATE) >= CURRENT_DATE - INTERVAL 30 DAYS
GROUP BY pathname
ORDER BY views DESC
LIMIT 20;

-- Revenue by product
SELECT
    json_extract_string(props, '$.product') AS product,
    SUM(revenue_amount)                      AS total_revenue,
    COUNT(*)                                 AS transactions
FROM read_parquet('data/events/site_id=example.com/**/*.parquet')
WHERE event_name = 'purchase'
GROUP BY product
ORDER BY total_revenue DESC;

Schema

The events table schema (also the Parquet file schema):

| Column | Type | Nullable | Description |
|---|---|---|---|
| site_id | VARCHAR | No | Site identifier |
| visitor_id | VARCHAR | No | HMAC-SHA256 privacy-safe visitor ID |
| timestamp | TIMESTAMP | No | UTC event timestamp |
| event_name | VARCHAR | No | Event type (e.g. pageview, signup) |
| pathname | VARCHAR | No | URL path |
| hostname | VARCHAR | Yes | URL hostname |
| referrer | VARCHAR | Yes | Referrer URL |
| referrer_source | VARCHAR | Yes | Parsed referrer source name |
| utm_source | VARCHAR | Yes | UTM source parameter |
| utm_medium | VARCHAR | Yes | UTM medium parameter |
| utm_campaign | VARCHAR | Yes | UTM campaign parameter |
| utm_content | VARCHAR | Yes | UTM content parameter |
| utm_term | VARCHAR | Yes | UTM term parameter |
| browser | VARCHAR | Yes | Browser name |
| browser_version | VARCHAR | Yes | Browser version string |
| os | VARCHAR | Yes | Operating system name |
| os_version | VARCHAR | Yes | OS version string |
| device_type | VARCHAR | Yes | desktop, mobile, or tablet |
| screen_size | VARCHAR | Yes | Screen viewport width in pixels (e.g. 1920) |
| country_code | VARCHAR(2) | Yes | ISO 3166-1 alpha-2 country code |
| region | VARCHAR | Yes | Region/state name |
| city | VARCHAR | Yes | City name |
| props | VARCHAR | Yes | Custom properties (JSON string, queryable via json_extract) |
| revenue_amount | DECIMAL(12,2) | Yes | Revenue amount |
| revenue_currency | VARCHAR(3) | Yes | ISO 4217 currency code |