Polymarket Orderbook Archive (v2)
Hourly Parquet dumps of the Polymarket CLOB orderbook event stream, stored in Cloudflare R2.
Overview
- Source: Polymarket CLOB WebSocket market channel (wss://ws-subscriptions-clob.polymarket.com/ws/market). We subscribe to every live asset and persist each event with its native fields.
- Access: public HTTPS, no credentials required.
- Cadence: one Parquet per UTC hour, written ~5 min after the hour closes. Empty hours are skipped.
- Coverage: starts 2026-04-13T19 UTC. ~31 GB / 94 files as of 2026-04-18.
- Event types: book, price_change, last_trade_price, tick_size_change. price_change is ~99.7% of rows. See the market-channel docs for upstream payload shapes.
Why v2?
v2 was built to address three structural problems with the v1 archive:
- Storage efficiency. v1's schema was verbose and poorly typed, producing bloated Parquet files. v2 uses a tighter, natively-typed schema (fixed-size binary for market, decimals for prices/sizes, delta-encoded timestamps), so the resulting files are substantially more compact for the same event stream.
- Market coverage. v1 was missing roughly 50% of live markets due to incomplete subscription handling. v2 subscribes to every live asset, giving full market coverage.
- Redundancy and gap elimination. v2 runs redundant exporters so a single ingestion failure no longer creates a hole in the archive. Data gaps are now rare.
Download
- Browse: archive.pmxt.dev/Polymarket/v2
- Direct URL: https://r2v2.pmxt.dev/polymarket_orderbook_YYYY-MM-DDTHH.parquet

curl -O https://r2v2.pmxt.dev/polymarket_orderbook_2026-04-17T12.parquet

Files are 100–400 MB each.
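The same download can be scripted. The following is a minimal sketch using only the Python standard library; the helper names hourly_url and download_hour are illustrative and not part of the archive, and the host is taken from the direct URL pattern above.

```python
from datetime import datetime, timezone
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://r2v2.pmxt.dev"  # host from the direct URL pattern


def hourly_url(hour: datetime) -> str:
    """Return the archive URL for the UTC hour that contains `hour`."""
    if hour.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    utc = hour.astimezone(timezone.utc)
    return f"{BASE_URL}/polymarket_orderbook_{utc:%Y-%m-%dT%H}.parquet"


def download_hour(hour: datetime, dest_dir: str = ".") -> Path:
    """Download one hourly file into dest_dir and return the local path."""
    url = hourly_url(hour)
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    urlretrieve(url, dest)  # files are 100-400 MB, so this can take a while
    return dest


if __name__ == "__main__":
    # Prints the URL only; call download_hour(...) to actually fetch the file.
    print(hourly_url(datetime(2026, 4, 17, 12, tzinfo=timezone.utc)))
```

Requiring a timezone-aware datetime avoids silently treating local time as UTC, which would fetch the wrong hour.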
Data format
Bucket layout
- Key pattern: polymarket_orderbook_YYYY-MM-DDTHH.parquet (UTC).
- One object per hour. A missing key means zero events that hour.
- Filter by prefix: e.g. polymarket_orderbook_2026-04-17T for one day.
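Because keys follow a fixed hourly pattern, the candidate keys for any time range can be generated locally instead of listed from the bucket. A small sketch, with hourly_keys and day_prefix as illustrative helper names:

```python
from datetime import datetime, timedelta, timezone
from typing import List

KEY_TEMPLATE = "polymarket_orderbook_{:%Y-%m-%dT%H}.parquet"


def hourly_keys(start: datetime, end: datetime) -> List[str]:
    """List candidate keys for every UTC hour in [start, end).

    Both arguments are expected to be timezone-aware.
    """
    hour = start.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    end = end.astimezone(timezone.utc)
    keys = []
    while hour < end:
        keys.append(KEY_TEMPLATE.format(hour))
        hour += timedelta(hours=1)
    return keys


def day_prefix(day: datetime) -> str:
    """Prefix shared by all keys of one UTC day."""
    return f"polymarket_orderbook_{day:%Y-%m-%d}T"


if __name__ == "__main__":
    day = datetime(2026, 4, 17, tzinfo=timezone.utc)
    keys = hourly_keys(day, day + timedelta(days=1))
    print(len(keys), keys[0], keys[-1])
```

Not every generated key has to exist: per the layout above, an absent object means the hour had no events. How the HTTPS endpoint signals an absent object (for example with a 404 status) is an assumption to verify before relying on it.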
Parquet internals
- Sort order: (market, asset_id, timestamp_received).
- Row groups: 1,048,576 rows each (pyarrow default).
- Compression: ZSTD(9), dictionary-encoded on all columns except timestamp, timestamp_received, fee_rate_bps, which use DELTA_BINARY_PACKED.
- Data page version: 2.0.
- Fast predicates: exact match on market / asset_id, time ranges on timestamp_received. Array/string-content filters won't push down.
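A sketch of a read that uses the fast predicates, assuming pyarrow is installed and an hourly file has already been downloaded. The helper names are illustrative. It filters on asset_id and timestamp_received only, and assumes timezone-aware datetimes compare cleanly against the timestamp[ms, UTC] column.

```python
from datetime import datetime, timezone
from typing import List, Tuple


def asset_window_filters(asset_id: str, start: datetime, end: datetime) -> List[Tuple]:
    """Build a pyarrow filter: one asset, timestamp_received in [start, end)."""
    return [
        ("asset_id", "==", asset_id),
        ("timestamp_received", ">=", start),
        ("timestamp_received", "<", end),
    ]


def read_asset_window(path: str, asset_id: str, start: datetime, end: datetime):
    """Read matching rows from a local hourly file as a pyarrow Table."""
    import pyarrow.parquet as pq  # third-party; imported lazily

    return pq.read_table(
        path,
        # Projecting a few columns avoids decoding the rest.
        columns=["timestamp_received", "event_type", "price", "size",
                 "side", "best_bid", "best_ask"],
        filters=asset_window_filters(asset_id, start, end),
    )


if __name__ == "__main__":
    start = datetime(2026, 4, 17, 12, 0, tzinfo=timezone.utc)
    end = datetime(2026, 4, 17, 12, 10, tzinfo=timezone.utc)
    # "1234" is a placeholder, not a real asset ID.
    print(asset_window_filters("1234", start, end))
```

Filters of this shape let the reader skip row groups whose min/max statistics cannot match, so only a fraction of the file is decoded.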
Column schema
16 columns. Columns 1–5 are always populated. Columns 6–16 are nullable — NULL outside the event type that owns them (see matrix below).
Row order: (market, asset_id, timestamp_received) ascending. Preserved at write time so Parquet's dictionary/RLE encoders see long same-asset runs — readers that want to exploit it should avoid re-sorting on load.
Encoding: delta = DELTA_BINARY_PACKED (bit-packed per-row deltas, for monotonic integer columns). dict = dictionary + RLE. All columns are then compressed with ZSTD(9).
| # | Column | Type | Encoding | Meaning |
|---|---|---|---|---|
| 1 | timestamp_received | timestamp[ms, UTC] | delta | When the exporter ingested the event. |
| 2 | timestamp | timestamp[ms, UTC] | delta | Source timestamp from Polymarket. |
| 3 | market | fixed_size_binary[66] | dict | Condition ID (ASCII 0x + 64 hex chars). |
| 4 | event_type | string | dict | One of the four event types. |
| 5 | asset_id | string | dict | Outcome token ID (decimal string). |
| 6 | bids | string (nullable) | dict | Raw JSON depth on book events: [["price","size"],...]. |
| 7 | asks | string (nullable) | dict | Raw JSON depth on book events, same shape as bids. |
| 8 | price | decimal(9,4) (nullable) | dict | Event price. |
| 9 | size | decimal(18,6) (nullable) | dict | Event size. |
| 10 | side | string (nullable) | dict | BUY or SELL. |
| 11 | best_bid | decimal(9,4) (nullable) | dict | Best bid at event time. |
| 12 | best_ask | decimal(9,4) (nullable) | dict | Best ask at event time. |
| 13 | fee_rate_bps | uint16 (nullable) | delta | Fee in basis points. |
| 14 | transaction_hash | string (nullable) | dict | On-chain tx hash. |
| 15 | old_tick_size | decimal(9,4) (nullable) | dict | Tick size before change. |
| 16 | new_tick_size | decimal(9,4) (nullable) | dict | Tick size after change. |
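Two columns need decoding after load: bids/asks hold JSON text, and market is 66 raw ASCII bytes. A minimal standard-library sketch of both conversions follows; the helper names are illustrative and the sample depth string is made up to match the documented shape.

```python
import json
import string
from decimal import Decimal
from typing import List, Optional, Tuple


def parse_depth(raw: Optional[str]) -> List[Tuple[Decimal, Decimal]]:
    """Parse a bids/asks JSON string into (price, size) Decimal pairs."""
    if raw is None:  # NULL outside book events
        return []
    return [(Decimal(price), Decimal(size)) for price, size in json.loads(raw)]


def condition_id(market: bytes) -> str:
    """Decode the 66-byte market value into its '0x' + 64 hex chars string."""
    text = market.decode("ascii")
    tail = text[2:]
    if len(text) != 66 or not text.startswith("0x"):
        raise ValueError(f"unexpected market value: {text!r}")
    if not all(ch in string.hexdigits for ch in tail):
        raise ValueError(f"non-hex characters in market value: {text!r}")
    return text


if __name__ == "__main__":
    sample = '[["0.48","120.5"],["0.47","300"]]'  # made-up depth levels
    print(parse_depth(sample))
    print(condition_id(b"0x" + b"ab" * 32))
```

Parsing into Decimal rather than float keeps prices and sizes exact, in line with the decimal types used for the other numeric columns.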
Column population by event type
| Column | book | price_change | last_trade_price | tick_size_change |
|---|---|---|---|---|
| bids | ✓ | | | |
| asks | ✓ | | | |
| price | | ✓ | ✓ | |
| size | | ✓ | ✓ | |
| side | | ✓ | ✓ | |
| best_bid | | ✓ | | |
| best_ask | | ✓ | | |
| fee_rate_bps | | | ✓ | |
| transaction_hash | | | ✓ | |
| old_tick_size | | | | ✓ |
| new_tick_size | | | | ✓ |
Unmarked cells are NULL. timestamp_received, timestamp, market, event_type, asset_id are always populated.
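The matrix can be turned into a sanity check on loaded rows. The sketch below is not part of the archive tooling. Its mapping is an assumption: bids/asks on book, price/size/side on price_change and last_trade_price, best_bid/best_ask on price_change, fee_rate_bps and transaction_hash on last_trade_price, tick sizes on tick_size_change. Verify it against real files before relying on it.

```python
from typing import Dict, Set

# Assumed ownership of nullable columns per event type (see lead-in).
OWNED_COLUMNS: Dict[str, Set[str]] = {
    "book": {"bids", "asks"},
    "price_change": {"price", "size", "side", "best_bid", "best_ask"},
    "last_trade_price": {"price", "size", "side", "fee_rate_bps", "transaction_hash"},
    "tick_size_change": {"old_tick_size", "new_tick_size"},
}

# Union of all event-specific columns (columns 6-16 of the schema).
NULLABLE: Set[str] = set().union(*OWNED_COLUMNS.values())


def unexpected_columns(row: dict) -> Set[str]:
    """Return nullable columns that are non-NULL but not owned by the row's event type."""
    owned = OWNED_COLUMNS[row["event_type"]]
    return {col for col in NULLABLE - owned if row.get(col) is not None}


if __name__ == "__main__":
    row = {"event_type": "tick_size_change",
           "old_tick_size": "0.01", "new_tick_size": "0.001"}
    print(unexpected_columns(row))
```

An empty result means the row is consistent with the assumed matrix; any returned column name points at either a schema change or a wrong assumption in the mapping.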
Storage optimizations
Every choice below is backed by measurements on ~60M-row hourly partitions.
| Choice | Why |
|---|---|
| Sort (market, asset_id, timestamp_received) | Same-asset rows cluster; best quote rarely moves tick-to-tick → long RLE runs |
| DELTA_BINARY_PACKED on timestamp, timestamp_received, fee_rate_bps | Monotonic ints → tiny per-row deltas, bit-packed |
| Decimals / uint16 / fixed_size_binary(66) instead of strings | Fixed-width native types; no length prefixes |
| Dictionary encoding on low-cardinality columns | asset_id ~40k distinct, market ~20k, event_type 4, side 2 |
| ZSTD level 9 | Write-once archive; decompression speed unchanged |
| bids/asks as JSON strings (not parallel arrays) | Only 0.031% of rows are book; empty-array metadata would dominate |
| Nullable on event-specific columns | Cheaper than zero/empty placeholders; distinguishes absent from zero |
| data_page_version = 2.0 | Tighter per-page headers |
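For readers who want to write files with a comparable layout, the settings above map onto pyarrow's write_table arguments roughly as follows. This is a sketch under assumptions and not the exporter's actual code: the helper names are illustrative, and the table passed in is assumed to already have the 16-column schema and be sorted by (market, asset_id, timestamp_received).

```python
from typing import List

DELTA_COLUMNS: List[str] = ["timestamp_received", "timestamp", "fee_rate_bps"]
DICT_COLUMNS: List[str] = [
    "market", "event_type", "asset_id", "bids", "asks", "price", "size",
    "side", "best_bid", "best_ask", "transaction_hash",
    "old_tick_size", "new_tick_size",
]


def writer_options() -> dict:
    """Keyword arguments for pyarrow.parquet.write_table mirroring the table above."""
    return {
        "row_group_size": 1_048_576,
        "compression": "zstd",
        "compression_level": 9,
        # Dictionary encoding only where delta encoding is not requested;
        # pyarrow rejects an explicit encoding on a dictionary-enabled column.
        "use_dictionary": DICT_COLUMNS,
        "column_encoding": {col: "DELTA_BINARY_PACKED" for col in DELTA_COLUMNS},
        "data_page_version": "2.0",
    }


def write_hour(table, path: str) -> None:
    """Write an already-sorted pyarrow Table using the options above."""
    import pyarrow.parquet as pq  # third-party; imported lazily

    pq.write_table(table, path, **writer_options())


if __name__ == "__main__":
    print(writer_options())
```

Keeping the options in one plain dictionary makes it easy to diff them against the metadata of a downloaded file when checking that a rewrite preserved the layout.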
Grafana dashboard
Live archive health at grafana.pmxt.dev/monitor. Refreshes every 15s.
Market Coverage
| Panel | Description |
|---|---|
| Subscribed Markets | Live count of markets the exporter's WebSocket is subscribed to. |
| Events per Minute | Orderbook events written to ClickHouse in the last minute. |
| Market Subscribed Over Time | History of subscribed-market count — drops indicate subscription loss. |
Data Loss Events
| Panel | Description |
|---|---|
| Assets Currently Down | Assets with no recent data. Green at 0, red above. |
| Asset Down Events (current period) | Count of asset-down events in the selected time range. |
| Asset Down Events (lifetime) | Count of asset-down events over the last 30 days. |
| Asset Down Events | Bar chart of asset-down events per minute. |
| Recent Asset Data Gap Events | Table of recent gaps: asset, time, duration, recovery time. |