How Walmart's Data Platform Handles 100x Scale During Peaks
Global Scalability and 99.999% Availability with Google Spanner
Hi, this is Samuel from Enginuity 👋 This post is part of the System Design track and focuses on globally scalable data platforms.
You can find all three tracks in the main menu of the Enginuity Newsletter.
As a company expands its operations worldwide, robust, scalable, and highly available databases become a must-have.
For Walmart, a retail giant with 11,000 stores in 20 countries and 265 million weekly active users, managing data efficiently is crucial.
The company has faced significant challenges during peak periods, such as Black Friday and Cyber Monday, when transaction volumes can increase by 100x.
In this post, we look at how Walmart solved its scalability challenges using Google Spanner, a globally distributed and horizontally scalable relational database that combines the benefits of traditional SQL databases with the scalability of NoSQL databases.
Walmart's Need for a Scalable Database System
Walmart processes vast amounts of data related to sales transactions, inventory levels, supply chain logistics, customer interactions, and much more.
Traditional database systems posed several challenges for Walmart. The original data platform used SQL engines such as MySQL and Oracle, as well as NoSQL systems such as Cassandra.
The challenges included limited scalability, high maintenance costs, and difficulties in ensuring high availability and fault tolerance.
Walmart required a database system that could:
📈 Scale horizontally
📚 Handle massive data loads
📉 Ensure high availability with minimal downtime
🔗 Provide strong consistency across globally distributed operations
That’s why Walmart decided to adopt Google Spanner.
The Power of SQL With The Scale of NoSQL
Google Spanner is designed to offer the best of both traditional SQL databases and modern NoSQL databases.
Horizontal Scalability
Traditional SQL databases, such as MySQL, face several challenges with horizontal scaling:
Manual data sharding and cross-shard queries
Distributed transactions with two-phase commits and increased latency
Synchronized schema management and data migrations between shards
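To make the first challenge concrete, here is a minimal sketch of application-level sharding. It is not Walmart's code: the shard count, the `shard_for` helper, and the in-memory dictionaries standing in for separate MySQL instances are all assumptions made for illustration.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count, for illustration only

# Each dict stands in for one independent database instance.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key: str) -> int:
    """Map a key to a shard using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def put_order(order_id: str, amount_cents: int) -> None:
    """Write an order to the one shard that owns its key."""
    shards[shard_for(order_id)][order_id] = amount_cents


def get_order(order_id: str) -> int:
    """A single-key lookup touches exactly one shard."""
    return shards[shard_for(order_id)][order_id]


def total_revenue() -> int:
    """A cross-shard query fans out to every shard and merges in the application."""
    return sum(sum(shard.values()) for shard in shards)


for i in range(100):
    put_order(f"order-{i}", 1000 + i)
```

Note that changing `NUM_SHARDS` would remap most keys, which is why resharding a manually sharded database usually means a data migration.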
On the other hand, Spanner natively scales horizontally by adding more nodes. This allows it to handle massive data volumes and high transaction rates.
At the same time, Spanner’s horizontal scalability allows scaling across multiple regions.
For Walmart, even a single minute of downtime can result in a huge revenue loss. Their platform cannot afford to wait several minutes for the system to fail over to another available data center or region.
Replication across regions is the core design factor that lets Spanner offer a 99.999% availability SLA for multi-region deployments and remain operational through hardware or network failures.
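To put "five nines" into perspective, the snippet below converts an availability percentage into a yearly downtime budget. It is plain arithmetic over a 365-day year; how a provider actually measures its SLA (for example, per month) is not covered here, and the function name is just an illustrative choice.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for sla in (99.9, 99.99, 99.999):
    print(f"{sla}% -> {downtime_budget_minutes(sla):.2f} minutes/year")
```

For 99.999%, the budget works out to roughly 5.26 minutes per year, compared with about 52.56 minutes at 99.99%.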
Global Consistency
Spanner automatically replicates data across multiple regions. This results in lower latency because users access data replicas nearest to them.
The key strength is its ability to maintain strong consistency across globally distributed data centers. This is achieved through the use of synchronized clocks and advanced algorithms (more about this later), ensuring that all replicas of the data are up-to-date and accurate.
The result is that Spanner supports ACID transactions, ensuring data accuracy and reliability across all replicas. On top of that, it offers full SQL support (including a PostgreSQL interface) for complex queries and analytics.
The Architecture of Google Spanner
The architecture is composed of several main components:
Spanservers:
They store data and handle read and write operations.
Each server manages a subset of the data.
Data is automatically partitioned and distributed across multiple servers.
Zones:
Zones are the units of deployment and physical isolation. Data is replicated across zones, which can be located in different data centers and geographic regions.
Each zone has one Zonemaster, which assigns data to Spanservers.
Each zone contains multiple Spanservers, providing redundancy and fault tolerance.
Directories:
Directories are the units of data distribution.
Each directory is a collection of data that is managed and replicated as a single unit.
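The relationship between these components can be sketched with a toy model. This is only an illustration of the description above: the class names, the directory names, and the least-loaded placement policy are assumptions, not Spanner's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Spanserver:
    name: str
    directories: list = field(default_factory=list)


@dataclass
class Zone:
    name: str
    spanservers: list

    def assign(self, directory: str) -> Spanserver:
        """Play the Zonemaster role: place a directory replica in this zone.

        Picking the least-loaded server is an assumption made for this sketch.
        """
        target = min(self.spanservers, key=lambda s: len(s.directories))
        target.directories.append(directory)
        return target


def replicate(directory: str, zones: list) -> dict:
    """Place one replica of the directory in every zone, as a single unit."""
    return {zone.name: zone.assign(directory).name for zone in zones}


zones = [
    Zone(f"zone-{z}", [Spanserver(f"zone-{z}/server-{i}") for i in range(2)])
    for z in ("a", "b", "c")
]
placement = {d: replicate(d, zones) for d in ("customers/1", "customers/2", "orders/1")}
```

Each entry in `placement` maps a directory to the server holding its replica in each zone, so losing one server or one zone still leaves other replicas available.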
To achieve global consistency, Spanner relies on a highly accurate time synchronization mechanism based on GPS and atomic clocks. This mechanism is exposed to the database through the TrueTime API.
TrueTime API
The core function of TrueTime is to provide accurate time information, which is critical for coordinating distributed transactions and maintaining data consistency.
Clock synchronization
Each data center is equipped with GPS clocks that provide precise time information directly from satellites.
In addition to GPS, atomic clocks serve as an independent time source, keeping time accurate even if GPS signals are unavailable.
Each server periodically synchronizes its clock with the TrueTime service within its data center.
Time Representation
TrueTime represents time as an interval [earliest, latest].
earliest and latest are the lower and upper bounds of the current time.
This accounts for any potential clock skew, ensuring that all operations are timestamped within a known uncertainty range.
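The interval idea can be sketched in a few lines. The method names `now`, `after`, and `before` mirror those described in Google's published Spanner design; everything else here (the class name, the `epsilon` value, the injectable clock) is an assumption for illustration, and a single local clock stands in for the real GPS and atomic clock infrastructure.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float  # lower bound of the current time
    latest: float    # upper bound of the current time


class TrueTimeSketch:
    """Toy stand-in for TrueTime; epsilon is an assumed uncertainty in seconds."""

    def __init__(self, epsilon: float = 0.004, clock=time.time):
        self.epsilon = epsilon
        self._clock = clock

    def now(self) -> TTInterval:
        """Return an interval assumed to contain the true current time."""
        t = self._clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        """True only if t has definitely passed."""
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        """True only if t has definitely not arrived yet."""
        return self.now().latest < t


# Usage with a fixed clock so the behavior is easy to follow.
tt = TrueTimeSketch(epsilon=0.005, clock=lambda: 100.0)
interval = tt.now()
```

With the fixed clock above, a timestamp of 100.0 is neither "definitely passed" nor "definitely in the future", because it falls inside the uncertainty interval.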
Achieving External Consistency
1️⃣ When a transaction is ready to commit, Spanner assigns it a single commit timestamp that is no earlier than the upper bound (latest) of the current TrueTime interval.
2️⃣ Spanner then waits until the lower bound (earliest) of TrueTime has passed this timestamp before making the commit visible. This "commit wait" ensures that the timestamp is already in the past on every node, regardless of clock skew.
3️⃣ Any transaction that starts after that point therefore receives a later timestamp, preserving the order of transactions globally.
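The wait-before-commit idea can be sketched as follows. This is a simplified, single-process illustration, not Spanner's implementation: `EPSILON`, `tt_now`, and `commit` are hypothetical names, one monotonic local clock stands in for TrueTime, and replication and locking are left out entirely.

```python
import time

EPSILON = 0.005  # assumed clock uncertainty in seconds


def tt_now():
    """Return (earliest, latest) bounds around the local clock reading."""
    t = time.monotonic()
    return (t - EPSILON, t + EPSILON)


def commit(apply_writes):
    """Pick a commit timestamp, wait out the uncertainty, then apply the writes."""
    # Choose the timestamp at the upper bound of the current interval.
    commit_ts = tt_now()[1]
    # Commit wait: block until the lower bound has moved past the timestamp.
    while tt_now()[0] <= commit_ts:
        time.sleep(0.001)
    apply_writes(commit_ts)
    return commit_ts


log = []
first = commit(log.append)
second = commit(log.append)
```

Because the second call starts only after the first has finished waiting, its timestamp is always later. The price is added commit latency of roughly twice the uncertainty, which is why keeping the clock error small matters so much.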
Key Insights from Walmart’s Digital Transformation
🎯 Aligning Technology with Business Objectives
Walmart ensured that its technology infrastructure could support its strategic objectives and drive growth. They didn’t just go for technology “upgrades” but for transformation that would unlock more business value.
At the same time, Walmart decided to adopt such a robust technology only when its business needs required it. Their original stack (MySQL, Oracle, Cassandra, etc.) handled the original requirements well enough until true global scale and adaptability were needed.
🤝 Building a Partnership with Technology Providers
A strong partnership with Google was crucial for Walmart’s successful adoption of Spanner.
This partnership allowed them to leverage Google’s expertise through close communication, joint planning, and tailoring the solution to Walmart’s specific needs.
🌱 Focusing on Customer Experience
Walmart’s choice was between building its own data platform on top of its existing stack or adopting a ready-made solution and focusing on its market differentiator.
They chose the latter and decided to focus their resources on delivering new capabilities within their product and to shorten the time-to-market for their new features.
Summary
Walmart’s scale, with 11,000 stores and 265 million weekly active users, required transitioning from homegrown solutions based on a combination of SQL and NoSQL engines to:
A unified data platform that combines SQL’s ACID properties with the horizontal scalability of NoSQL.
A solution that will allow them to shorten time-to-market so they can focus on their business differentiators.
A technology with high availability and dynamic scalability to sustain 10-100x increases in traffic.
And their choice of Google Spanner has satisfied these requirements.
Disclaimer: This case study showcases industry best practices based on publicly available information. The specific underlying structure of Walmart might be different.