Blockchain Data: Time Series or Relational? Uncovering the Best Solutions by Use Cases

Sentio
12 min read · Jan 19, 2023


Authors: Kan Qiao at Sentio XYZ

TL;DR

  • Generally, time series data is better suited to real-time monitoring, such as in DevOps, while relational databases are better suited to analytics use cases.
  • Time series DBs make simple, time-range-bound queries and computations very fast, but often have issues with large label cardinality.
  • In crypto use cases, there is a significant overlap between monitoring and business analytics.

Introduction

Common data-related tasks are typically accomplished by storing data in a storage system designed for the workload at hand.

A widely used example of data storage is using a relational database management system, which can be further classified as OLAP (online analytical processing) and OLTP (online transactional processing). For instance, BigQuery and Snowflake are both well-known for their ability to handle analytical and data science workloads, such as finding out the top-selling products by month across different sales regions. On the other hand, systems like Oracle, SQL Server and Spanner are recognized for handling transactional workloads, such as processing bank transactions.

A Time Series Database is specifically designed to handle time-stamped data, also known as time series data. These data consist of measurements or events that are tracked, monitored, downsampled, and aggregated over time. Examples include server metrics, application performance monitoring, network telemetry, sensor data, events, clicks, and trades in a market, among many other types of data that will be used for analytical purposes.

This document focuses on the tradeoffs between these systems. To keep things simple, we restrict the discussion to storing simple metrics (i.e. numerical floating-point values) keyed by metric keys. Let’s dive into some more details.

Storage

Data in relational databases are physically organized into rows and columns. Historically, databases were primarily row-oriented. As demand for analytical workloads grew, more column-oriented systems emerged.

Row-oriented storage is optimized for reading and writing individual rows of data; these systems typically assume that one transaction reads, writes, or modifies a single row. Conversely, column-oriented storage is more efficient for data compression, scanning large amounts of data, and joining columns, making it highly suitable for OLAP workloads, where a query usually reads or writes a large group of rows.

Time series databases typically store data in chunks based on time intervals. This makes it very easy to scan through a specific range of data and allows users to easily remove old data or reduce the resolution of (often called ‘downsampling’) old data. The storage is often in the form of a key-value structure, where the key is a timestamp, and the value is a number (e.g. a float for a metric) or a string (e.g. for logs). In addition, users can also add labels to key-value pairs, to distinguish values (for example, coming from different sources), and often these labels are essentially dictionary-compressed by storing them in a different table, mapping these labels to simple numeric ids.

For example, to record the transaction volume of a blockchain protocol, where the protocol has X tokens and runs on Y different chains, this data could be recorded as:

txnVolume.record(volUSD, {"chain": chain, "token": token})

In this example, chain and token are labels. Note that every combination of chain and token becomes its own time series and is stored somewhat separately in the database.

The total number of combinations of all labels is called the cardinality of the labels. Time-series databases often use encoding techniques, such as a global dictionary, to efficiently store multiple time series. This method involves replacing labels with an integer id and storing the id in the database, along with an entry in the dictionary that maps the id to the actual label.

It is not uncommon for a time series database to have a cardinality of millions or even billions of labels. Dictionary encoding effectively reduces the storage need if you happen to have labels that occur more than once throughout the database.
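
To make this concrete, here is a minimal TypeScript sketch of dictionary-encoding label sets into numeric series ids. The class and field names are illustrative, not any particular database's implementation.

// Minimal sketch of dictionary-encoding label sets into numeric series ids.
// Each unique label combination maps to one series id; data points are then
// stored against the compact id instead of the full label strings.
type Labels = Record<string, string>;

class LabelDictionary {
  private ids = new Map<string, number>(); // canonical label string -> id
  private labels: Labels[] = [];           // id -> original labels

  // Returns the existing id for this label set, or assigns a new one.
  encode(labels: Labels): number {
    const key = Object.keys(labels)
      .sort()
      .map((k) => `${k}=${labels[k]}`)
      .join(",");
    let id = this.ids.get(key);
    if (id === undefined) {
      id = this.labels.length;
      this.ids.set(key, id);
      this.labels.push(labels);
    }
    return id;
  }

  decode(id: number): Labels {
    return this.labels[id];
  }
}

const dict = new LabelDictionary();
const seriesId = dict.encode({ chain: "ethereum", token: "USDC" });
// Points are stored as (timestamp, seriesId, value) instead of repeating labels.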

However, this is not always the case. Consider a typical scenario where you want to use user wallet addresses as labels. In this case, the cardinality of the labels is very large, and dictionary encoding is not very effective. We refer to scenarios like this as use cases with ‘large-cardinality’ labels, which cause issues even in modern time-series databases.

Query

Users can write queries to perform analytical work on top of the storage system. Normally the query is compiled/converted following this path:

Query String -> AST (abstract syntax tree) -> Logical Plan -> Physical Plan

A logical plan is a tree structure that represents how the query would be carried out logically, while a physical plan consists of the actual operators executed by the database. A component within the database called the query optimizer is responsible for converting the logical plan into a physical plan. Suppose we want to find the trade volume of each day over the past 90 days. Logically, this can be done by grouping all trades by day, then computing the sum of the volume within each day. This is a typical job that the query planner represents as a logical plan: roughly, scan the trades, filter to the last 90 days, group by day, and sum the volume within each group.

With the logical plan in hand, the query optimizer can choose the best physical plan to execute the query. One common optimization is to choose a hash-based join or aggregation over other operators when the optimizer knows that the key has low cardinality (here, date, since we are only taking 90 days’ worth of data). Problems arise when the cardinality is high, for example when we want the trade volume of each wallet per day and memory is insufficient to hold all the state needed for the computation. In that case, the intermediate state needs to be spilled to disk, which tends to be quite slow. Either a SQL database or a time-series database can perform this computation, expressed in different query languages and with different query plans. One great thing about time-series databases is that they can perform this particular computation very quickly, since the query pattern takes full advantage of the time-series data layout (neither sort nor join is necessary).
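
For intuition, here is a simplified TypeScript sketch of the in-memory hash aggregation described above (group trades by day, sum the volume). It is only a sketch: a real engine would spill the hash table to disk once it outgrows memory, which is exactly the step that becomes expensive at high cardinality.

// Simplified in-memory hash aggregation: daily trade volume over the last 90 days.
interface Trade {
  timestamp: number; // Unix seconds
  volumeUsd: number;
}

function dailyVolume(trades: Trade[]): Map<string, number> {
  const totals = new Map<string, number>(); // grouping key (day) -> summed volume
  for (const t of trades) {
    const day = new Date(t.timestamp * 1000).toISOString().slice(0, 10);
    totals.set(day, (totals.get(day) ?? 0) + t.volumeUsd);
  }
  return totals; // at most ~90 keys, so the hash table stays tiny
}

// Grouping by (wallet, day) instead would create one key per wallet per day,
// and the hash table could easily outgrow memory.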

Query Languages

Relational databases typically use SQL as the main language (with different dialects, of course), while time series databases use some form of specialized query language, like PromQL. One can perform joins, scans, and aggregations with either SQL or PromQL, but the two differ considerably in flexibility and efficiency.

Query Semantics

SQL has reasonably well-defined semantics: although dialects vary slightly, you get exactly what you described in the query string. Time series query languages such as PromQL have a sampling concept built in; for example, you need to specify a sampling step at query time. This is enough for users to see a trend with low query latency, for example spotting a large value change sustained over a given time period, but it might surprise you on tasks that require exact precision (e.g. locating a single, sparse data point).

For example, in Prometheus, you need to give a lookback_delta parameter at setup time. This is essentially the staleness of the data you are willing to accept. At query time, you need to specify:

  • start_time: the start time of the query
  • end_time: the end time of the query
  • step: the sampling time-step. The query is evaluated once per step, and at each evaluation timestamp the system uses the most recent sample within the lookback window.

The query returns, for each matching series, a list of data points, where each data point is a tuple of (timestamp, value). Because of the combination of lookback_delta and step, it is sometimes hard to locate a single data point.
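
As a concrete illustration, Prometheus exposes these parameters on its /api/v1/query_range HTTP endpoint. The sketch below is TypeScript; the server address and the metric name txn_volume_usd are placeholders.

// Range query: per-chain transaction volume, evaluated once per hour over 24 hours.
async function queryRange(): Promise<void> {
  const end = Math.floor(Date.now() / 1000); // end_time
  const start = end - 24 * 3600;             // start_time
  const params = new URLSearchParams({
    query: "sum by (chain) (txn_volume_usd)",
    start: String(start),
    end: String(end),
    step: "3600", // sampling step: one evaluation per hour
  });
  const res = await fetch(`http://localhost:9090/api/v1/query_range?${params}`);
  const body = await res.json();
  // body.data.result is a list of series, each with (timestamp, value) pairs.
  console.log(body.data.result);
}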

Query Pattern vs. Data Layout

SQL database workloads are categorized as OLAP and OLTP. Thus, column-oriented storage and row-oriented storage are designed to make the corresponding query perform better (in terms of both time and storage). Similarly, for time series use cases, the engines are optimized for timestamp-bounded range scan, and a limited set of aggregations within a time range. Therefore, the data is stored as timestamped chunks.

From these, we can understand that there is a tradeoff between storage layout and query pattern. We make conscious decisions to offload some query complexity to storage, thus making certain queries run more efficiently, while sacrificing performance of other queries (sometimes it is an engineering priority issue).

Comparing the 2 Systems

Simple Scan Queries

To scan the data in a given range, or perform simple aggregations, time series databases are very efficient since the data are already stored together physically, like the example we walked through in the Query section. You need far fewer I/O operations to fetch the data and perform the computation than with a SQL database.

Complex Join/Aggregation Queries

SQL systems invest a lot of effort in optimizing flexible, general queries, so they often perform better and more consistently than a time series database here. One typical example is ‘large-cardinality’ aggregation, which we discussed in the Storage section. This is a very common, well-known problem in time series databases, particularly for aggregation queries: most time series databases do not implement spilling to disk (so all aggregation is done in memory) or efficient aggregation optimizations, while many SQL databases have extensive optimizations for this type of task.

It is worth noting that large cardinality is not only an issue for time series databases; it is hard to solve in SQL databases too. The difference is that SQL databases have prioritized the engineering effort to handle this class of problems, and they also push much of the complexity down into the storage layer of the system.

Data Retention/Sampling

The PromQL execution engine has sampling built in, so data gets sampled automatically at query time. SQL also has certain sampling functions, but they must be applied explicitly. A big advantage of time series databases is that they leverage the data layout for retention: it is very easy to downsample or simply drop an entire chunk of old data.

Write performance

Relational systems tend to use a B-tree layout for data, giving consistent (but not amazing) write performance for any type of write workload (i.e. writing random rows vs. sequential rows might not perform all that differently). Most time-series databases, by contrast, are optimized for writes to many labels that all share roughly the same timestamp (i.e. the current time), which allows them to use a very efficient LSM-tree (log-structured merge-tree) style write path where data rarely, if ever, needs to be rewritten. However, if the write workload is different (i.e. every point goes to a new label, or points are written across a wide range of timestamps), performance is often much worse.
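
A rough TypeScript sketch of why this write pattern is cheap: points arriving at roughly the current timestamp are appended to an in-memory buffer and periodically flushed as an immutable, time-ordered chunk. The names are illustrative, not a specific engine's internals.

// Illustrative LSM-style write path for time series points.
interface Point {
  seriesId: number;
  timestamp: number;
  value: number;
}

class ChunkWriter {
  private buffer: Point[] = [];

  append(p: Point): void {
    // Points arrive roughly in timestamp order, so this is a pure append.
    this.buffer.push(p);
  }

  flush(): Point[] {
    // A real engine would compress the chunk and write it to disk as an
    // immutable file covering one time window; it is never rewritten, and
    // old chunks can later be downsampled or dropped wholesale.
    const chunk = this.buffer;
    this.buffer = [];
    return chunk;
  }
}

// Writing to old timestamps, or to an ever-growing set of new labels,
// breaks this assumption and forces slower paths.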

Traditional Time Series Use Cases

SQL databases are widely used for both analytical and transactional workloads. Typical time series workloads, such as server metrics, application performance monitoring, network telemetry, sensor data, events, clicks, and market trades, are all naturally timestamped.


For modern DevOps and monitoring, alerts are often set up based on metrics. For example:

  1. Alert when CPU or Memory usage is above a threshold.
  2. Alert when the 99th-percentile latency is above 1 second.
  3. Alert when the change rate over the last 5 minutes is above 30% (a minimal sketch follows this list).
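
As referenced in the last item, here is a minimal TypeScript sketch of what such a change-rate rule evaluates. The queryMetric helper is hypothetical; a real system would query the time series database and send a notification instead of logging.

type Sample = { timestamp: number; value: number };

// Hypothetical stand-in for a real time series query.
async function queryMetric(name: string, rangeSeconds: number): Promise<Sample[]> {
  return []; // placeholder
}

// Fire when the metric changed by more than 30% over the last 5 minutes.
async function checkChangeRateAlert(): Promise<void> {
  const samples = await queryMetric("txn_volume_usd", 5 * 60);
  if (samples.length < 2) return;
  const first = samples[0].value;
  const last = samples[samples.length - 1].value;
  if (first !== 0 && Math.abs(last - first) / first > 0.3) {
    console.log("ALERT: change rate over the last 5 minutes exceeded 30%");
  }
}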

These are all a perfect fit for time series databases. How about crypto/blockchain data?

Current Blockchain Data Infrastructure

Currently, most people put blockchain data into a SQL database. For example, there are tables for blocks, transactions, events, traces, etc. There might also be tables of decoded data for the top projects (Uniswap, Compound, etc.). A blocks table, for instance, would typically have one row per block, with columns such as block number, hash, timestamp, miner, and gas used.

The true pioneer of this space is Dune Analytics. They built an easy-to-understand schema, clean visualizations, and a great community. A few other companies are approaching the problem from a similar angle; some add a GraphQL query interface, which is very convenient for frontend developers querying the data.

Blockchain Data Case Studies

We will go over a few cases to determine which database system performs better.

Case 1: TVL (Total Value Locked)

First, one of the most common use cases on Dune Analytics is computing TVL (total value locked). One example is a Dune dashboard that computes Loopring token balances (TVL). The token balances are presented as a table, and it only captures the balance as of the latest date.

In practice, users often want to see how a token’s balance changes over time. This is certainly doable in SQL databases, but it is not very efficient: you would need to either join the table with a date dimension or perform a window aggregation.

With a time series database, however, you can store the data directly as you compute the balance over time and visualize it as a time series chart. Moreover, the latest data is instantly available.
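
In the spirit of the record() snippet earlier, here is a hedged TypeScript sketch of how that might look. The metric object and handler names are illustrative placeholders, not a specific SDK's API.

interface Metric {
  record(value: number, labels: Record<string, string>): void;
}

// Stand-in for a real time series client; a real one would write the point.
const tvl: Metric = {
  record(value, labels) {
    console.log("record", value, labels);
  },
};

// Called whenever the locked balance is recomputed, e.g. on every new block.
function onBalanceUpdate(chain: string, token: string, balanceUsd: number): void {
  // Each (chain, token) combination becomes its own series; charting TVL over
  // time is then a plain range scan, and the latest point is available immediately.
  tvl.record(balanceUsd, { chain, token });
}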

Case 2: Transaction Volume

There are 2 types of transaction volume.

Cumulative Transaction Volume

Cumulative transaction volume is the total amount of transactions over time. Taking Loopring’s total NFTs minted as an example, the Dune dashboard shows a single total number.

This has two issues:

  • Since it aggregates all historical data, it is very slow to compute: the query must reprocess every transaction each time it runs, which can lead to timeouts. (Using Dune Engine V2 can mitigate the issue.)
  • Similar to the TVL computation, computing cumulative volume this way is not very efficient; time series databases can do it much more cheaply.

Transaction Volume of a Given Time Period

This is another common use case. Taking Loopring’s deposit volume as an example, the Dune dashboard shows the deposits over the last 3 months. You probably noticed that the time range is fixed. (In theory, it is possible to make the time range a bound parameter that controls the window, but that is not very convenient for users.)

With a time series database, you get the benefit of viewing any given time range with very low latency.

Case 3: Real-time Alert

This is another use case where time series databases achieve the best performance. Say you’d like to set up an alert for when the trading volume of the last 5 minutes exceeds a threshold, and you expect to receive the alert instantly. Users set up an alert rule, and the alert is automatically triggered when the condition is met.

Case 4: Top Trading Users

Users might want to see the top trading users over a given time period. For this case, a time series DB may not be the best choice, even though it can solve the problem at a smaller scale, because of the large-cardinality problem mentioned earlier: every wallet address becomes its own series.

A Little About Sentio

Sentio is an observability platform for Web3. We are pushing the limits of both time series and relational databases, using them together to give our users a great experience. Sentio generates metrics, logs, and traces from existing smart contract data through our low-code solution, which can be used to build dashboards, set up alerts, analyze user behavior, create APIs/webhooks, simulate/debug transactions, and more. Sentio supports Ethereum, BSC, Polygon, Solana, and Aptos. Sentio is built by veteran engineers from Google, LinkedIn, Microsoft, and TikTok, and backed by top investors like Lightspeed Venture Partners, Hashkey Capital, and Canonical Crypto.

Visit Sentio at sentio.xyz. Follow us on Twitter for more updates.
