List of resources on testing distributed systems curated by Andrey Satarin (@asatarin).
If you are interested in my other stuff, check out talks page.
For any questions or suggestions you can reach out to me on Twitter (@asatarin),
Mastodon (https://discuss.systems/@asatarin)
or LinkedIn.
{% comment %}
Private notes https://docs.google.com/document/d/1xHt_PK9yGMTP6JNDMydQLF4SHIdlq-BF9IZeTOXtIOg/edit
{% endcomment %}
Table of Contents
- A Markdown unordered list which will be replaced with the ToC
{:toc}
Overview of Testing Approaches
Research Papers
Bugs
-
What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems—
study of actual bugs in different popular distributed systems (Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper
and Flume)
-
TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems—
comprehensive taxonomy of bugs in distributed systems (Cassandra, Hadoop MapReduce, HBase, ZooKeeper)
-
An Empirical Study on Crash Recovery Bugs in Large-Scale Distributed Systems —
based on bug database from "What Bugs Live in the Cloud?" paper researchers focus specifically on crash recovery bugs
in Hadoop MapReduce, HBase, Cassandra, ZooKeeper. There is review of this paper
by Murat Demirbas
in his blog.
-
An empirical study on the correctness of formally verified distributed systems—
study of bugs in formally verified distributed systems. Analysis includes
Microsoft's IronFleet distributed key-value store
built from formal model.
-
What bugs cause cloud production incidents? —
research focused on bugs (and their resolution strategies) that actually cause production incidents in large-scale
distributed services at Microsoft Azure.
Testing
-
Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems —
Great overview of how even simple testing can help a lot, you just need the right focus
-
Early detection of configuration errors to reduce failure damage—
why and how to test configuration files of your system
-
Why Is Random Testing Effective for Partition Tolerance Bugs? — just
what it says in a title, authors try to explain why random testing (Jepsen) is effective and introduce
notions of test coverage relating to network partition, see
also "The Morning Paper" review
or a video from POPL 2018.
-
FlyMC: Highly Scalable Testing of Complex Interleavings in Distributed Systems —
novel approach of systematically exploring interleavings in distributed systems augmented with static analysis and
prioritization. This approach is faster than previous techniques and found old and new bugs in several systems
(Cassandra, Ethereum Blockchain, Hadoop, Kudu, Raft LogCabin, Spark, ZooKeeper).
-
Torturing Databases for Fun and Profit —
checking ACID guarantees of open source and commercial databases under power
loss, additional material
-
Understanding and Detecting Software Upgrade Failures in Distributed Systems —
paper presents first study of upgrade failures in distributed systems (Cassandra, HBase, Kafka, Mesos, YARN,
ZooKeeper, etc).
Authors look at severity, symptoms, causes and triggers of these failures and summarize results in a number of
findings.
They propose two new tools to improve testing targeting upgrade failures specifically and apply those tools
to a few systems with good results (new bugs and potential bugs found).
I gave an overview talk of the
paper in September 2022.
Fault Tolerance
-
Redundancy does not imply fault tolerance: analysis of distributed storage reactions to single errors and corruptions—
study of several distributed systems (Redis, ZooKeeper, MongoDB, Cassandra, Kafka, RethinkDB) on how fault-tolerant
they are to data corruption and read/write errors
-
The Case for Limping-Hardware Tolerant Clouds— research on effect of limping
hardware on performance of a distributed systems (aka limplock), see also great blog post by Dan Luu on a similar
topic Distributed systems: when limping hardware is worse than dead hardware
-
Toward a Generic Fault Tolerance Technique for Partial Network Partitioning —
overview of network partition failures in various distributed systems (MongoDB, HBase, HDFS, Kafka, RabbitMQ,
Elasticsearch, Mesos, etc), common traits among them and strategies to mitigate those failures.
-
Understanding, Detecting and Localizing Partial Failures in Large System Software —
what happens if your system loses some functionality due to failure as opposed to full fail-stop?
Authors study how these partial failures manifest in distributed systems (ZooKeeper, Cassandra, HDFS, Mesos) and what
triggers them.
They propose runtime approach to detect those failure with mimic-style intrinsic watchdogs and show how these
watchdogs could be generated automatically.
They managed to reproduce 20 out of 22 real world partial failures and detect them using intrinsic watchdogs with
great code localization and reaction time within a few seconds.
See also overview talk of the paper.
Resilience In Complex Adaptive Systems
These materials are not directly related to testing distributed systems, but they greatly contribute to general
understanding of such systems.
Jepsen
State-of-the-art approach to testing stateful distributed systems.
Elle transactional consistency checker for black-box databases:
Some notable Jepsen analyses:
Jepsen is used by CockroachDB, VoltDB, Cassandra,
ScyllaDB and others.
TLA+
Companies using TLA+ to verify correctness of algorithms:
Deterministic Simulation
Pioneered by FoundationDB, deterministic simulation approach to testing distributed systems gained
more popularity in recent years.
More companies and systems adopt deterministic simulation as a primary testing strategy:
See also autonomous testing, FoundationDB.
Autonomous Testing
This approach is currently represented by Antithesis — pioneers in autonomous testing,
defining the space and the state of the art. Will Wilson (of FoundationDB fame) is one of the founders.
See also deterministic simulation, FoundationDB and fuzzing.
Lineage-driven Fault Injection
Netflix adopted lineage-driven fault injection techniques for testing microservices.
Chaos Engineering
Netflix pioneered chaos engineering discipline.
Fuzzing
There are two flavors of fuzzing. First, randomized concurrency testing, where the ordering of messages is fuzzed:
And input fuzzing, where message contents or user inputs are fuzzed:
See also autonomous testing.
Microservices
Amazing and comprehensive overview of different strategies to test systems built with microservices by Cindy Sridharan.
Series of blog posts specifically on testing in production — best practices, pitfalls, etc:
See also benchmarking tools.
Misc
Testing in a Distributed World
Great overview of techniques for testing distributed systems from practitioner,
the video did age well and still an excellent overview of the landscape.
Additional materials could be found in this GitHub repo
Game Days
Technologies for Testing Distributed Systems
Colin Scott shares his viewpoint from academia on testing distributed systems,
specifically regression testing for correctness and performance bugs.
Test Case Reduction
Specific Approaches in Different Distributed Systems
Google
Amazon Web Services
See also formal methods and deterministic simulation sections.
Netflix
Automated failure injection (see also Lineage-driven Fault Injection):
Random/manual failure injection testing:
See also chaos engineering and lineage-driven fault injection.
Microsoft
See also formal methods section.
-
BellJar: A new framework for testing system recoverability at scale —
BellJar
is a testing framework focused on answering question "What service dependencies are required for the service to
recover after large scale disaster?". BellJar puts service in a vacuum environment with only handful of direct
dependencies allow-listed to verify that recovery procedures succeed under those constraints. It checks those recovery
procedures in CI/CD pipeline preventing unconstrained growth of dependency graph and circular dependencies. Based on
BellJar tests one can construct the entire dependency graph of the services allowing to boostrap them in the
correct order from bottom to top.
-
Vacuum Testing for Resiliency: Verifying Disaster Recovery in Complex — talk on how
BellJar is used at Meta to test recovery of distributed systems
FoundationDB
See also deterministic simulation and autonomous testing.
Cassandra
ScyllaDB
They published series of blog posts on testing ScyllaDB:
Dropbox
-
Mysteries of Dropbox Property-Based Testing of a Distributed Synchronization Service—
example of how to use QuickCheck to test synchronization in Dropbox and similar tools (Google Drive). John Hughes gave
a talk on this. See also QuickCheck.
-
Data Checking at Dropbox — If you have lots of data, you have to verify that it did
not suffer from bit rot and protect it against rare bugs (e.g. race conditions) to guarantee long term durability.
This talks explains intricacies of building data consistency checker(s) at Dropbox scale.
-
Dropbox's Exabyte Storage System (aka Magic Pocket)
talk by James Cowling — describes number of strategies to achieve extremely high durability.
This includes:
- guard against faulty disks,
- guard against software defects,
- guard against black swan events,
- operational safeguards to reduce blast radius,
- safeguards against deletes with multi stage soft-delete,
- comprehensive testing strategy in-depth with increased scale,
- redundancy across various axis in software and hardware stacks,
- continuous data integrity validation on many levels,
- etc
-
Testing sync at Dropbox — comprehensive overview
of two test frameworks at Dropbox for new sync engine implementation. CanopyCheck — single threaded and fully
deterministic randomized testing framework with minimization for synchronization planner component of the engine. The
other framework Trinity focuses on concurrency and larger surface area of components. Great discussion on tradeoffs
between determinism, strength of test oracles vs width of coverage and size of the
system under test.
Elastic (Elasticsearch)
See also formal methods section.
MongoDB
See also formal methods section.
Confluent (Kafka)
See also formal methods section.
CockroachLabs (CockroachDB)
SingleStore
Formerly known as MemSQL.
LinkedIn
Salesforce
VoltDB
Series of post on testing at VoltDB:
Additional resources:
PingCap (TiDB)
See also formal methods section.
Cloudera
Wallaroo Labs
There is also talk from Sean T. Allen on testing stream processing system at Wallaroo Labs (ex. Sendence)
YugabyteDB
FaunaDB
Shopify
Hazelcast
Basho (Riak)
Etcd
Red Planet Labs
See also deterministic simulation section.
Atomix Copycat
Onyx
Druid.io
TigerBeetle
See also deterministic simulation section.
Convex
See also QuickCheck, FoundationDB, Dropbox, Jepsen,
deterministic simulation.
RisingWave
In a series of two blog posts, RisingWave team talks about their experience using deterministic simulation for testing
distributed SQL-based stream processing platform:
As a result of this work, they open sourced MadSim — Magical Deterministic
Simulator for the Rust language ecosystem.
See also deterministic simulation section.
Single Node Systems
These examples are not about distributed systems, but they demonstrate testing concurrency and level of sophistication
required in distributed systems.
SQLite
SQLite is not a distributed system by any stretch of the imagination, but provides good example of comprehensive testing
of a database implementation.
Sled
See also deterministic simulation section.
Clickhouse
Network Simulation
QuickCheck
Benchmarking
Linkbench
YCSB