Data Management

Storage Layout

Events are stored as date-partitioned, ZSTD-compressed Parquet files under data_dir/events/:

data/events/
├── site_id=example.com/
│   ├── date=2024-01-15/
│   │   ├── 0001.parquet   ← first flush for this day
│   │   └── 0002.parquet   ← second flush for this day
│   └── date=2024-01-16/
│       └── 0001.parquet
└── site_id=other.org/
    └── date=2024-01-15/
        └── 0001.parquet

Each Parquet file contains one batch of flushed events for a specific site and date. Files are numbered sequentially within each partition. Parquet files are self-describing and can be read by any Parquet-compatible tool.


Buffer and Flush Lifecycle

flowchart TD
    EVENT["Incoming Event\nPOST /api/event"]
    EVENT --> BUF["In-Memory Buffer\nVec<Event>"]

    BUF --> T1{"Count reached\nflush_event_count?"}
    BUF --> T2{"Periodic timer\nevery flush_interval_secs?"}
    BUF --> T3{"SIGINT or SIGTERM\ngraceful shutdown?"}

    T1 -->|"Yes"| FLUSH
    T2 -->|"Yes"| FLUSH
    T3 -->|"Yes"| FLUSH

    FLUSH["Flush — spawn_blocking\nDuckDB Appender API\nbatch column insert"]
    FLUSH --> PARQUET["COPY TO Parquet\nZSTD compression\ndate-partitioned file"]
    PARQUET --> DELETE["DELETE FROM events\nhot table cleared"]
    DELETE --> VIEW["Refresh events_all VIEW\nglobbed over new Parquet files"]
    VIEW --> READY(["All data queryable\nevents_all VIEW\nhot union cold"])

Failure safety: If the Appender insertion fails, drained events are restored to the front of the buffer and the flush returns an error. No events are lost due to a failed flush attempt.

Flush triggers:

  1. Event count reaches flush_event_count (default 1000).
  2. Periodic timer fires every flush_interval_secs (default 60 seconds). Runs in spawn_blocking to avoid blocking the async runtime.
  3. Graceful shutdown — bounded by shutdown_timeout_secs (default 30 seconds).

Data Retention

When retention_days is set to a non-zero value, a background task runs daily and removes Parquet partition directories older than the configured threshold.

# Delete partitions older than 90 days
retention_days = 90

What is deleted: the entire date=YYYY-MM-DD/ directory and all Parquet files within it.

What is not deleted: the site_id=*/ parent directory (it remains even if all date partitions have been removed).

To keep data indefinitely, set retention_days = 0 (the default).

GDPR Right to Erasure

Mallard Metrics provides an admin-authenticated DELETE /api/gdpr/erase endpoint to permanently delete analytics data for a given site_id within a date range. Because visitor IDs are pseudonymous daily-rotating HMAC hashes that cannot be reverse-mapped to individuals, erasure operates at the site + date-range granularity — the finest granularity available without the original IP address and User-Agent. See PRIVACY.md for the full analysis and operator obligations.


Backup and Restore

Parquet files are self-describing and portable. To back up:

# Sync data directory to a backup location
rsync -a --checksum /data/events/ /backup/mallard-events/

# Or with rclone to S3
rclone sync /data/events s3:my-bucket/mallard-events

To restore:

rsync -a /backup/mallard-events/ /data/events/

After restore, restart Mallard Metrics. The events_all VIEW automatically picks up all Parquet files on startup.

Tip: Include data/mallard.duckdb and data/mallard.duckdb.wal in your backups to preserve any hot (not yet flushed) events.


Inspecting Data with DuckDB CLI

You can query Parquet files directly with the DuckDB CLI, independent of the Mallard Metrics server:

duckdb

-- Daily visitor and pageview counts for a site
SELECT
    CAST(timestamp AS DATE) AS date,
    COUNT(DISTINCT visitor_id)                            AS visitors,
    COUNT(*) FILTER (WHERE event_name = 'pageview')       AS pageviews
FROM read_parquet('data/events/site_id=example.com/**/*.parquet')
GROUP BY date
ORDER BY date DESC;

-- Top pages last 30 days
SELECT pathname, COUNT(*) AS views
FROM read_parquet('data/events/site_id=example.com/**/*.parquet')
WHERE event_name = 'pageview'
  AND CAST(timestamp AS DATE) >= CURRENT_DATE - INTERVAL 30 DAYS
GROUP BY pathname
ORDER BY views DESC
LIMIT 20;

-- Revenue by product
SELECT
    json_extract_string(props, '$.product') AS product,
    SUM(revenue_amount)                      AS total_revenue,
    COUNT(*)                                 AS transactions
FROM read_parquet('data/events/site_id=example.com/**/*.parquet')
WHERE event_name = 'purchase'
GROUP BY product
ORDER BY total_revenue DESC;

Schema

The events table schema (also the Parquet file schema):

ColumnTypeNullableDescription
site_idVARCHARNoSite identifier
visitor_idVARCHARNoHMAC-SHA256 privacy-safe visitor ID
timestampTIMESTAMPNoUTC event timestamp
event_nameVARCHARNoEvent type (e.g. pageview, signup)
pathnameVARCHARNoURL path
hostnameVARCHARYesURL hostname
referrerVARCHARYesReferrer URL
referrer_sourceVARCHARYesParsed referrer source name
utm_sourceVARCHARYesUTM source parameter
utm_mediumVARCHARYesUTM medium parameter
utm_campaignVARCHARYesUTM campaign parameter
utm_contentVARCHARYesUTM content parameter
utm_termVARCHARYesUTM term parameter
browserVARCHARYesBrowser name
browser_versionVARCHARYesBrowser version string
osVARCHARYesOperating system name
os_versionVARCHARYesOS version string
device_typeVARCHARYesdesktop, mobile, or tablet
screen_sizeVARCHARYesScreen viewport width in pixels (e.g. 1920)
country_codeVARCHAR(2)YesISO 3166-1 alpha-2 country code
regionVARCHARYesRegion/state name
cityVARCHARYesCity name
propsVARCHARYesCustom properties (JSON string, queryable via json_extract)
revenue_amountDECIMAL(12,2)YesRevenue amount
revenue_currencyVARCHAR(3)YesISO 4217 currency code