Polymarket Orderbook Archive (v2)

Hourly Parquet dumps of the Polymarket CLOB orderbook event stream, stored in Cloudflare R2.

Overview

Why v2?

v2 was built to address three structural problems with the v1 archive:

Download

curl -O https://r2v2.pmxt.dev/polymarket_orderbook_2026-04-17T12.parquet

Files are 100–400 MB each.
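For scripted downloads, the URL for a given hour can be built from the example above. This is a minimal sketch: the `polymarket_orderbook_YYYY-MM-DDTHH.parquet` naming pattern is inferred from that single example, the hour in the filename is assumed to be UTC, and `hourly_url` is a hypothetical helper name.

```python
from datetime import datetime, timezone

# Base URL taken from the curl example above.
BASE_URL = "https://r2v2.pmxt.dev"


def hourly_url(hour: datetime) -> str:
    """Build the download URL for the hourly file containing `hour`.

    Assumption: filenames follow polymarket_orderbook_YYYY-MM-DDTHH.parquet
    and the hour is expressed in UTC.
    """
    if hour.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    hour_utc = hour.astimezone(timezone.utc)
    return f"{BASE_URL}/polymarket_orderbook_{hour_utc:%Y-%m-%dT%H}.parquet"


print(hourly_url(datetime(2026, 4, 17, 12, tzinfo=timezone.utc)))
# https://r2v2.pmxt.dev/polymarket_orderbook_2026-04-17T12.parquet
```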

Data format

Bucket layout

Parquet internals

Column schema

Each file has 16 columns. Columns 1–5 are always populated. Columns 6–16 are nullable: each is NULL except on the event types that own it (see matrix below).

Row order: ascending by (market, asset_id, timestamp_received). The sort is applied at write time so that Parquet's dictionary/RLE encoders see long same-asset runs. Readers that want to exploit this ordering should avoid re-sorting on load.
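Because rows for one asset are contiguous, a reader can walk per-asset runs in a single pass without sorting or building a hash map. The sketch below uses plain tuples as stand-ins for rows already loaded from a file; the sample values and the helper names `asset_runs` and `is_archive_order` are illustrative only.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows as (market, asset_id, timestamp_received_ms),
# already in the archive's sort order.
rows = [
    ("0xaaa", "101", 1000),
    ("0xaaa", "101", 1004),
    ("0xaaa", "102", 1001),
    ("0xbbb", "201", 1002),
    ("0xbbb", "201", 1003),
]


def asset_runs(rows):
    """Yield ((market, asset_id), run) for each contiguous same-asset run."""
    for key, group in groupby(rows, key=itemgetter(0, 1)):
        yield key, list(group)


def is_archive_order(rows):
    """Return True if rows are ascending by (market, asset_id, timestamp)."""
    return all(a <= b for a, b in zip(rows, rows[1:]))


for (market, asset_id), run in asset_runs(rows):
    print(market, asset_id, len(run))
```

Note that `groupby` only merges adjacent rows, so this relies on the on-disk order being preserved; `is_archive_order` is a cheap sanity check to run after loading.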

Encoding: delta = DELTA_BINARY_PACKED (bit-packed per-row deltas, for monotonic integer columns). dict = dictionary + RLE. After encoding, every column is compressed with ZSTD at level 9.
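To see why delta encoding suits timestamp columns, here is a simplified sketch of the idea. It is not the actual DELTA_BINARY_PACKED layout (which groups values into blocks and miniblocks and stores a minimum delta per block); the function names and sample values are illustrative only.

```python
def delta_encode(values):
    """Return (first_value, deltas) for a non-empty list of integers."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values[0], deltas


def delta_decode(first, deltas):
    """Rebuild the original list from the first value and the deltas."""
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out


def bits_needed(deltas):
    """Bits per delta after subtracting the smallest delta."""
    if not deltas:
        return 0
    lo = min(deltas)
    return max(d - lo for d in deltas).bit_length()


# Hypothetical millisecond timestamps for one asset.
timestamps = [1776427200000, 1776427200003, 1776427200003, 1776427200010]
first, deltas = delta_encode(timestamps)
print(deltas, bits_needed(deltas))  # [3, 0, 7] 3
```

Each raw value here needs 41 bits, while the deltas fit in 3 bits each, which is the effect the delta columns rely on.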

| # | Column | Type | Encoding | Meaning |
|---|---|---|---|---|
| 1 | timestamp_received | timestamp[ms, UTC] | delta | When the exporter ingested the event. |
| 2 | timestamp | timestamp[ms, UTC] | delta | Source timestamp from Polymarket. |
| 3 | market | fixed_size_binary[66] | dict | Condition ID (ASCII 0x + 64 hex chars). |
| 4 | event_type | string | dict | One of the four event types. |
| 5 | asset_id | string | dict | Outcome token ID (decimal string). |
| 6 | bids | string (nullable) | dict | Raw JSON depth on book events: [["price","size"],...]. |
| 7 | asks | string (nullable) | dict | Raw JSON depth on book events, same shape as bids. |
| 8 | price | decimal(9,4) (nullable) | dict | Event price. |
| 9 | size | decimal(18,6) (nullable) | dict | Event size. |
| 10 | side | string (nullable) | dict | BUY or SELL. |
| 11 | best_bid | decimal(9,4) (nullable) | dict | Best bid at event time. |
| 12 | best_ask | decimal(9,4) (nullable) | dict | Best ask at event time. |
| 13 | fee_rate_bps | uint16 (nullable) | delta | Fee in basis points. |
| 14 | transaction_hash | string (nullable) | dict | On-chain tx hash. |
| 15 | old_tick_size | decimal(9,4) (nullable) | dict | Tick size before change. |
| 16 | new_tick_size | decimal(9,4) (nullable) | dict | Tick size after change. |
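Two columns need a small decoding step on the reader side: bids/asks hold raw JSON, and market is a 66-byte ASCII value. The sketch below shows one way to handle both with the standard library. `parse_depth` and `market_to_str` are hypothetical helper names, and the sample inputs are made up to match the shapes documented above.

```python
import json
from decimal import Decimal


def parse_depth(raw):
    """Parse a bids/asks JSON string into [(price, size), ...] as Decimals.

    NULL cells (None) are passed through unchanged.
    """
    if raw is None:
        return None
    return [(Decimal(price), Decimal(size)) for price, size in json.loads(raw)]


def market_to_str(market: bytes) -> str:
    """Decode a 66-byte condition ID (ASCII '0x' + 64 hex chars) to str."""
    text = market.decode("ascii")
    if len(text) != 66 or not text.startswith("0x"):
        raise ValueError("expected '0x' followed by 64 hex characters")
    int(text[2:], 16)  # raises ValueError if the payload is not hex
    return text


levels = parse_depth('[["0.52","100"],["0.51","250.5"]]')
print(levels[0])  # (Decimal('0.52'), Decimal('100'))
print(market_to_str(b"0x" + b"ab" * 32)[:6])  # 0xabab
```

Parsing prices and sizes as `Decimal` rather than `float` keeps them exact, in line with the decimal types used for the other price and size columns.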

Column population by event type

Column book price_change last_trade_price tick_size_change
bids
asks
price
size
side
best_bid
best_ask
fee_rate_bps
transaction_hash
old_tick_size
new_tick_size

Unmarked cells are NULL. timestamp_received, timestamp, market, event_type, asset_id are always populated.

Storage optimizations

Every choice below is backed by measurements on ~60M-row hourly partitions.

| Choice | Why |
|---|---|
| Sort by (market, asset_id, timestamp_received) | Same-asset rows cluster; best quote rarely moves tick-to-tick → long RLE runs |
| DELTA_BINARY_PACKED on timestamp, timestamp_received, fee_rate_bps | Monotonic or slowly changing ints → tiny per-row deltas, bit-packed |
| Decimals / uint16 / fixed_size_binary[66] instead of strings | Fixed-width native types; no length prefixes |
| Dictionary encoding on low-cardinality columns | asset_id ~40k distinct, market ~20k, event_type 4, side 2 |
| ZSTD level 9 | Write-once archive, so slower compression is acceptable; decompression speed is essentially unaffected by level |
| bids/asks as JSON strings (not parallel arrays) | Only 0.031% of rows are book events; per-row empty-array metadata would dominate |
| Nullable on event-specific columns | Cheaper than zero/empty placeholders; distinguishes absent from zero |
| data_page_version = 2.0 | Tighter per-page headers |
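The effect of the sort on run-length encoding can be illustrated with a toy example. The rows below are invented, and `run_lengths` is a simplified stand-in for what an RLE encoder sees, not Parquet's actual RLE/bit-packing hybrid.

```python
from itertools import groupby


def run_lengths(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(v, sum(1 for _ in g)) for v, g in groupby(values)]


# Hypothetical rows: (asset_id, timestamp, best_bid).
rows = [
    ("A", 1, "0.52"), ("B", 1, "0.31"),
    ("A", 2, "0.52"), ("B", 2, "0.31"),
    ("A", 3, "0.52"), ("B", 3, "0.32"),
]

# Arrival order interleaves assets, so best_bid flips on almost every row.
by_time = [r[2] for r in sorted(rows, key=lambda r: r[1])]
# Archive-style order groups each asset's rows together.
by_asset = [r[2] for r in sorted(rows, key=lambda r: (r[0], r[1]))]

print(len(run_lengths(by_time)))   # 6
print(run_lengths(by_asset))       # [('0.52', 3), ('0.31', 2), ('0.32', 1)]
```

The same six values collapse from six runs to three once same-asset rows are adjacent; on real partitions the runs are much longer.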

Grafana dashboard

Live archive health at grafana.pmxt.dev/monitor. Refreshes every 15s.

Market Coverage

| Panel | Description |
|---|---|
| Subscribed Markets | Live count of markets the exporter's WebSocket is subscribed to. |
| Events per Minute | Orderbook events written to ClickHouse in the last minute. |
| Market Subscribed Over Time | History of subscribed-market count; drops indicate subscription loss. |

Data Loss Events

| Panel | Description |
|---|---|
| Assets Currently Down | Assets with no recent data. Green at 0, red above. |
| Asset Down Events (current period) | Count of asset-down events in the selected time range. |
| Asset Down Events (lifetime) | Count of asset-down events over the last 30 days. |
| Asset Down Events | Bar chart of asset-down events per minute. |
| Recent Asset Data Gap Events | Table of recent gaps: asset, time, duration, recovery time. |