Solana State History
This writeup is dedicated to a single RPC method.

/// Retrieve account state at a historical transaction.
///
/// Returns the state of an account just before a given tx.
fn get_historical_account_info(address: Address, signature: Signature) -> AccountInfo
As of now, no service provides this simple query. Let's explore the technical background of why state history is notoriously hard. We also propose a practical design for implementing a state history archive.
Background
Solana's state layer is the accounts database, a key-value store mapping addresses to arbitrary byte buffers plus some metadata. Programs use these mutable buffers to persist data such as currency prices, token balances, or exchange orders. Just like a file on a computer, programs can change the bytes in accounts.
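Concretely, each entry maps an address to a record roughly like the following (a simplified sketch mirroring the Account type in solana-sdk):
type Pubkey = [u8; 32]; // 32-byte address

struct Account {
    lamports: u64,    // balance in lamports
    data: Vec<u8>,    // arbitrary byte buffer managed by the owner program
    owner: Pubkey,    // program allowed to modify `data`
    executable: bool, // whether the account holds a program
    rent_epoch: u64,  // metadata used by the rent mechanism
}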
Yet, it is impossible to get the state of an account at a previous point in time. For example, you might want to retrieve historical order books for backtesting or auditing, trace the ownership history of an NFT, or fetch past balances to analyze your asset holdings over time.
Solana's RPC API – the de-facto standard interface for accessing blockchain data – only has a getAccountInfo method. It returns the latest state of an account, but you cannot go back in time. That's also the case for other proprietary APIs (GraphQL).
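For illustration, fetching the latest state with the solana-client crate looks roughly like this (endpoint and address are just examples); note that there is no parameter for requesting an older revision:
use std::str::FromStr;

use solana_client::rpc_client::RpcClient;
use solana_sdk::pubkey::Pubkey;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Public RPC endpoint and a well-known account (the wrapped SOL mint).
    let client = RpcClient::new("https://api.mainnet-beta.solana.com".to_string());
    let address = Pubkey::from_str("So11111111111111111111111111111111111111112")?;

    // Returns only the *latest* revision of the account.
    let account = client.get_account(&address)?;
    println!("{} bytes, owned by {}", account.data.len(), account.owner);
    Ok(())
}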

Granted, workarounds enable these use cases today, mainly using specialized indexes fed by Geyser plugins. The Solana validator client even has hardcoded support for token program state history.
What's missing is a generic solution that works on the accounts level. This is a hard requirement for the bpf.wtf project: In order to debug arbitrary transactions, we need raw account inputs.
Several other projects have expressed interest in historical state over the past few weeks, too. Without further ado, let's design a time machine. 🤓
Challenges
Availability of historical data
Solana nodes prune their databases aggressively to only keep recent state and ledger data. Before we can index historical accounts, we need to acquire and verify this data externally.
The validator peer-to-peer interface only serves recent ledger data (repair protocol) and recent state snapshots (bootstrap nodes), making it useless for historical data.
For now, our best bet is to aggregate data from external sources. Triton has kindly offered to help; they keep a complete copy of the ledger in blockstore format and state snapshots in periodic intervals.
Verification
Just using their data would be too easy. Although they seem like nice people, we trust nobody, and want to ensure that nobody has tampered with the data (looking at you, bit rot). The industry broadly considers two approaches acceptable:
- Full verification by re-syncing ledger data
- Simplified verification by checking an authenticated data structure against the validator supermajority (a.k.a. light client)
On Solana v1.11, simplified state verification is impossible because validators do not aggregate-sign state commitments. This leaves only a re-sync from genesis, which relies solely on ledger data, implying we won't use external state snapshots to build our state database. 🧐
A convenient side-effect of full re-syncing is the generation of – you guessed it – historical state data! Let's revisit technical details of ledger replaying below after looking at the remaining problems.
Scale
Suppose you store every revision of every account that has ever existed. The amount of data would be immense. Every second, programs write-lock accounts totaling roughly one gigabyte. That's about 2.62 petabytes per month (1 GB/s × ~2.63 million seconds in a month).
This is not only a huge recurring investment in raw storage space. The harsh reality of the infrastructure business is that even a static amount of data incurs recurring costs: servers, RAID controllers, and drives will fail, and electricity isn't free.
Trying to be the first mover on state history is a deceptive opportunity for blockchain services companies. Did they give up after seeing these numbers?
Consistency
The latest few seconds of block history aren't finalized ("rooted") because of optimistic confirmation. A state archive would need to retain copies for every possible fork and eventually delete the stale forks that finalization didn't select.
Concepts
Ledger Replay
To fully verify old blocks, we use the solana-ledger-tool CLI.
USAGE:
solana-ledger-tool verify [FLAGS] [OPTIONS]
OPTIONS:
--accounts <PATHS> Persistent accounts location
-l, --ledger <DIR> Use DIR as ledger location
--halt-at-slot <SLOT> Halt processing at the given slot
In short, the ledger tool takes an accounts DB at a historical slot and executes blocks on top from the blockstore DB. It takes the same code path as syncing a full node.
The starting accounts DB is extracted from a state snapshot archive:
tar -Izstd -xvf snapshot-*.zst
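For example (paths and slot number are placeholders), a replay run on top of the extracted accounts DB, halting at a chosen slot, looks like this:
solana-ledger-tool verify \
    --ledger /data/ledger \
    --accounts /data/accounts \
    --halt-at-slot 100000000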
But there's a catch: Solana v1.11.5 can't actually export any state info that gets produced during replay. The first concrete piece of work arises here: the ledger tool needs a Geyser plugin integration. Fortunately, it only takes 36 lines of code to bolt the solana-geyser-plugin-manager crate onto solana-ledger-tool.
Blockdaemon's Kafka publisher is one of the many plugins that can be used with this new integration. For example, this config sends a message to the Kafka message broker whenever an account changes during replay.
{
"libpath": "/usr/opt/lib/libsolana_accountsdb_plugin_kafka.so",
"kafka": {
"bootstrap.servers": "localhost:9092",
"request.required.acks": "1",
"message.timeout.ms": "30000",
"compression.type": "lz4",
"partitioner": "murmur2_random"
},
"shutdown_timeout_ms": 30000,
"update_account_topic": "solana.mainnet.replay_1234.account_updates",
"publish_all_accounts": true
}
Version hell
The validator community approves breaking changes (e.g. new VM instructions) at a quick pace. These breaking changes cause "hard forks" – new state transition rules that require a software upgrade. Keep running an unsupported version and verification fails, "forking" the client away from the majority.
Standard stuff so far, except that Solana v1.11 cannot verify ledger history from two years ago, by design. While Bitcoin and Ethereum clients choose to support all possible ledger rules since genesis, the Solana team cleans up old and unused code paths a few months after each hard fork. We're therefore forced to use multiple client versions for blockchain verification.

That said, milestone complete: We have a solution to verify arbitrary ledger data and export state history.
Ad-hoc execution
Our Kafka topic above still stores one full copy per account update (optionally compressed). Tackling the scaling challenge requires reducing the amount of data in storage.
Geth famously introduced ad-hoc tracing via the debug_traceTransaction RPC. It lets you inspect state at every EVM instruction through partial replaying, demonstrating that omitting historical state works as long as we can reconstruct it on demand.
Let's apply this to Solana: The state transition function of a transaction transforms inherent input (execution context) and user extrinsics (accounts + tx data) into a new set of accounts. If one of the account inputs required for reconstruction is missing, no worries either; this approach can be applied recursively. Implicit state transitions between epochs are omitted for brevity – these involve inflation, slashing, and rent.
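As a conceptual sketch (type names are illustrative, not the actual runtime API), the state transition function has roughly this shape:
type Pubkey = [u8; 32];
type Signature = [u8; 64];

struct Account { lamports: u64, data: Vec<u8>, owner: Pubkey }

/// Inherent input: the context the transaction executed in.
struct ExecutionContext { slot: u64, blockhash: [u8; 32] }

/// User extrinsics: the serialized transaction and its signature.
struct Transaction { signature: Signature, payload: Vec<u8> }

/// (context, input accounts, tx) -> output accounts.
/// Any historical revision can be reproduced by re-running this function, as
/// long as the inputs are available; missing inputs can themselves be
/// reproduced the same way, recursively.
fn state_transition(
    _ctx: &ExecutionContext,
    _inputs: &[(Pubkey, Account)],
    _tx: &Transaction,
) -> Vec<(Pubkey, Account)> {
    unimplemented!("would invoke the Solana runtime; elided in this sketch")
}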

Space-time tradeoffs
Time machine builders would call this a space-time tradeoff.
Our data took too much space, so we traded it against time. This time is paid as query latency. Ad-hoc lookups spend time during three steps:
- Query planning: Finding all transactions that write-locked an account.
- Fetching dependencies: Loading all available account dependencies
- Partial replay: Actually re-executing transactions (probably the fastest part)

Exponential dependencies
Ad-hoc outputs can be modeled as a dependency tree. Start with your output and draw a line for every missing input. These inputs could also be missing, and in turn, depend on other preceding inputs. A query against a sparse database could quickly devolve into chaos. To get to your stem (output), you'll need to start at every leaf and execute every branch.

So, the fewer revisions a database stores, the higher the probability that any given revision needs ad-hoc replaying. Decreasing the number of stored account copies results in an exponentially greater average number of transaction replays. Plotting this tradeoff looks somewhat like this:

Finally, let's address practicality. The Rust Solana runtime wasn't built with point-select replaying in mind. Building a partial-replay database requires refactoring and maintaining a runtime that can simulate transactions on historical state. Not easy!
Delta compression
Modeling accounts as independent time series avoids a dependency explosion. We can capture the way in which accounts change over time. Specifically, instead of storing two versions of an account, we only store one and a difference.
The Serum program dominates in terms of write-locks weighted by account size. Yet, when we observe the live writes to an account, most data stays the same.
Virtually all on-chain transactions behave like this. Because programs use stable data layouts, they only change a few specific bytes per operation.
— Richard Patel (@terorie_dev) July 20, 2022
Pyth accounts, for example, host the data of multiple publishers. Each one just updates its respective entry.
Delta compression algorithms like VCDiff are designed to encode these changes efficiently. Deltas can also be arranged recursively to build an arbitrarily long chain of diffs that roots in a full copy.
The size of a diff is almost always equal to or smaller than a full copy of the account in its place. We can treat the diffs-per-copy ratio as a dial for the compression ratio of our database. This sacrifices query time yet again, though this time it only grows linearly (O(n)) with the length of the diff chain.
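As a read-path sketch, assuming a toy patch format of overwritten byte ranges instead of actual VCDiff, reconstructing a revision from a full copy plus a chain of diffs looks like this:
struct Patch { offset: usize, bytes: Vec<u8> }

/// Rebuild a revision from a full copy and a chain of diffs.
/// Query cost grows linearly with the length of the chain.
fn apply_chain(full_copy: &[u8], chain: &[Vec<Patch>]) -> Vec<u8> {
    let mut data = full_copy.to_vec();
    for diff in chain {
        for p in diff {
            data[p.offset..p.offset + p.bytes.len()].copy_from_slice(&p.bytes);
        }
    }
    data
}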
Sharding & Compaction
Generating and storing diffs is not trivial given thousands of concurrently active accounts. Applying delta compression in real time requires keeping a large accounts DB lookup cache around to generate diffs against. Cache requirements worsen if updates are received out of order, e.g. when indexing two parts of ledger history in parallel.
Counterintuitively, the most effective way to compress is to insert entries uncompressed but sorted, then fix them up later, repeatedly. Distributed compaction-oriented storage engines were popularized by Google Bigtable and are now ubiquitous.
A custom compaction strategy also allows us to generate deltas and index data simultaneously, further simplifying architecture.

Sharding reduces a large problem to smaller localized problems: We view addresses as 256-bit integers and create equal-sized shards each covering a range of this integer space. As long as our partitioning stays the same, each account has one assigned shard. When viewing state history as a 2D space, addresses are our horizontal axis.
The vertical axis is time. We split by slot numbers to chop account history into blocks. Now, these blocks contain a sample of updates of a subset of accounts (X-axis) in a certain time-frame (Y-axis).
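A sketch of this horizontal partitioning, assuming a configurable shard count and approximating an address's position in key space by its first eight bytes:
/// Assign an address to one of `num_shards` equal-sized ranges of the
/// 256-bit key space.
fn shard_for(address: &[u8; 32], num_shards: u64) -> u64 {
    let mut prefix_bytes = [0u8; 8];
    prefix_bytes.copy_from_slice(&address[..8]);
    let prefix = u64::from_be_bytes(prefix_bytes);
    // Scale the prefix from [0, 2^64) down to [0, num_shards).
    ((prefix as u128 * num_shards as u128) >> 64) as u64
}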
The first (and smallest) tier of blocks is created on ingestion (e.g. from Kafka) in a few simple steps, sketched in code after the list.
- Collect new updates for a small amount of time (~8 slots), up to ~2 gigabytes.
- Re-arrange accounts into their respective block.
- Sort updates by address and time.
- With account updates locally sorted, apply delta compression in a linear pass.
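A minimal sketch of steps 3 and 4, assuming a simplified Update record and a toy byte-run encoder in place of a real delta algorithm like VCDiff:
#[derive(Clone)]
struct Update {
    address: [u8; 32], // account address (horizontal axis)
    slot: u64,         // time (vertical axis)
    payload: Vec<u8>,  // full account data as received from ingestion
}

enum Stored {
    FullCopy(Vec<u8>),
    /// Byte runs that differ from the previous revision of the same account.
    Delta(Vec<(usize, Vec<u8>)>),
}

/// One linear pass over updates sorted by (address, slot): the first revision
/// of each account in the block is kept in full, later ones become deltas.
fn compress_block(mut updates: Vec<Update>) -> Vec<([u8; 32], u64, Stored)> {
    updates.sort_by(|a, b| (a.address, a.slot).cmp(&(b.address, b.slot)));
    let mut out = Vec::with_capacity(updates.len());
    for i in 0..updates.len() {
        let u = &updates[i];
        let entry = match updates.get(i.wrapping_sub(1)) {
            Some(p) if p.address == u.address && p.payload.len() == u.payload.len() => {
                Stored::Delta(changed_runs(&p.payload, &u.payload))
            }
            _ => Stored::FullCopy(u.payload.clone()),
        };
        out.push((u.address, u.slot, entry));
    }
    out
}

/// Toy delta encoder: collect maximal runs of bytes that differ.
fn changed_runs(prev: &[u8], next: &[u8]) -> Vec<(usize, Vec<u8>)> {
    let mut runs = Vec::new();
    let mut i = 0;
    while i < next.len() {
        if prev[i] != next[i] {
            let start = i;
            while i < next.len() && prev[i] != next[i] {
                i += 1;
            }
            runs.push((start, next[start..i].to_vec()));
        } else {
            i += 1;
        }
    }
    runs
}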

At this point, we have successfully compressed updates that lie within a few seconds of each other. Some accounts obviously change at a slower pace.
The process of compaction involves recursively aggregating blocks into larger and larger blocks. With each level, we bring updates belonging to the same accounts closer together and create longer diff chains. Aggregating two sorted collections is trivially doable in a single pass (low memory requirements).
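A sketch of that single-pass merge, reusing the Update record from the ingestion sketch above:
/// Merge two runs that are each already sorted by (address, slot) into one
/// larger sorted run, in a single linear pass.
fn merge_sorted(a: Vec<Update>, b: Vec<Update>) -> Vec<Update> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let mut ia = a.into_iter().peekable();
    let mut ib = b.into_iter().peekable();
    loop {
        let take_a = match (ia.peek(), ib.peek()) {
            (Some(x), Some(y)) => (x.address, x.slot) <= (y.address, y.slot),
            (Some(_), None) => true,
            (None, Some(_)) => false,
            (None, None) => break,
        };
        if take_a {
            out.push(ia.next().unwrap());
        } else {
            out.push(ib.next().unwrap());
        }
    }
    out
}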
The final component of our database is a metadata table: a small database that tracks the location of each block so we can identify where an account revision sits. Additionally, each block gets prefixed with a local immutable index (a compact B-tree) for point-select and range queries.
ClickHouse is a powerful analytics database that does it all out-of-the-box.
- Multi-server sharding
- Delta-compression
- Compaction
- Replication
- Query planning / execution
Thanks to Leo, an avid ClickHouse fan, I was saved from having to build my own storage engine.

What's next?
First of all, thanks so much for reading all of this. 🙌
I believe that state history is an essential service for the Solana ecosystem.
Essential services should be open source; permissionless ledger data must be accessible. Public goods push the web3 ecosystem forward in a way proprietary tech cannot, while leveling the playing field for competition in the hosted services space.