Non-abstract Large Systems Design

The design and construction of any Internet site has to foresee fast growth on demand. This is done via scalability, automation using several concepts and tools. Find below some of them.


Caches are “everywhere” !!

Wide description of caches (hardware and software) at Wikipedia

If you have already read about the Translation look-aside buffer you know the it a cache based on the hardware.

Here we will see caching in large systems.

graph LR; A(Client) --> B(Server) B --> C(Database)

It can be placed in the client, in the server, among server and database and in the database itself. For instance, some static content caching happens on the browser.

To start with, cache information that is almost static. For instance, cache a Instagram profile for very demanded (famous) people. This caching normally occurs at the server level, where many servers responding followers requests can go to the same cache to retrieve the data.

When caching, there are “two sources of truth” (normally the data in the database and the data in the cache). What happens when data is updated (for instance, users update their previous posts)? There are a several write policies. Two if the most commons are:

Write-through: write is done synchronously both to the cache and to the backing store.

PROS: Cache and back-end always i sync
CONS: When writing, you need 2 write twice and the back-end is a time consuming operation
USAGE: better when there are many reads and few writes.

Write-back (also called write-behind): initially, writing is done only to the cache and the request returns to the client. The write to the backing store is postponed. It can be done on different ways, like until the modified content is about to be replaced by another cache block. Another way could be writing each ‘X’ time interval.

PROS: Fast return to the clients when writing/updating
CONS: Losing the cache makes data lost in the back-end

A write-back cache is more complex to implement, since it needs to track which of its locations have been written over, and mark them as dirty for later writing to the backing store. The data in these locations are written back to the backing store only when they are evicted from the cache, an effect referred to as a lazy write. For this reason, a read miss in a write-back cache (which requires a block to be replaced by another) will often require two memory accesses to service: one to write the replaced data from the cache back to the store, and then one to retrieve the needed data.

Replacement policies

As memory or space is not infinite, data in caches has to be replaced mainly due new data entering in the cache or data that is not up to date anymore.

There are many replacement policies. Some of the most common are

Least recently used (LRU)

Discards the least recently used items first. This algorithm requires keeping track of what was used when, which is expensive if one wants to make sure the algorithm always discards the least recently used item.

Least recumbently used (LFU)

Counts how often an item is needed. Those that are used least often are discarded first. This works very similar to LRU except that instead of storing the value of how recently a block was accessed, we store the value of how many times it was accessed.

Others polices are LIFO or even random

Examples (or not) of caching

Comments in Youtube videos

We want all the users see the most recent comments. The systems has several servers managing the comments. A design could be to let each server to have its own local cache, but this has the risk of old messages (original poster has edited after initial post) or stale message (original poster has deleted it) and sync among servers is complicates. A much better design is to have a single cache where all servers go to find the comments. Challenge: Remember though that if we think global, there has to be more than one cache point… My bet is for some geographical split (Chinese people will look for Chinese famous, Americans to Americans, etc.)

Video views

On the other hand, “video views” might not be so accurate. In this case, up to the nanosecond updates are not needed and even caching might no be so important (might depend, for instance only for highly demanded videos…)

As a general statement, is data is static or almost static, caching plays easy and well..


(Forward) proxy - sits between a client and a server and acts on behalf of the client (like when masking the client IP)

Reverse proxy - sits between a client and a server and acts on behalf of the server (normally for logging, load balancing or caching)

Load Balancer

graph LR A[Client] --> B[Load Balancer] B --> C[Server01] B --> D[Server02]

Related to horizontal scaling


Hashing uses a hash function that takes in a specific data type (such as a string or an identifier) and outputs fixed sized value, typically an integer value. Different inputs may have the same output, but a good hashing function attempts to minimize those hashing collisions (which is equivalent to maximizing uniformity).

Industry standard hashing functions are MD5 (Message-Digest algorithm - series 5), SHA (Secure Hash Algorithms) or decrypt()

In NALS, hashing is used to hash diverse types of data: IP addresses, user names or http requests (or a combination of them…).

We can use hashing in a Load Balancer in order to assign a request to a server (we suppose a system serving back-end servers). This is more clear if, for instance,some request are computationally expensive. We would probably like to address all request of the same type to the first server that did the calculation. To achieve this, we can also assume that a memory cache is used by each server, so that expensive data is stored in the cache. This is a simple example that, for instance, can be evolved by the usage of data aging.

Although simple, this method is more efficient than, for instance, using the Round-Robin approach in the LB, where request will go to servers that do no contain the data stored in the cache and therefore being re-calculated for each server!!

Try it !!

On a Linux box, you can test some hashing functions:

echo -n | sha1sum

echo -n | md5sum

You can ‘mod’ the hash result by, lets say, 4 (four servers) to see the assignment in action (you will see bash is not the best for this…)

echo $(( 16#c9a6b2f79f43e1d69ab4676f839a1061 % 4 ))
-3           <============   it is negative due integer overflow in bash...)

A (rather naive) server assignment strategy could be to hash the user name and “mod” the returned integer by the number of servers. Should any server “crash”, we only need to change the number of servers in the “mod” operation. This also applies when in the need to add servers.

BUT…, changing the number of servers will imply that a previously hashed user name will go to a different server !! Therefore, all cache hits will be also lost. Here is where consistent hashing and **rendezvous hashing” come into play.

Consistent hashing

A type of hashing that minimizes the number of keys that need to be remapped when a hash table gets resized. It’s often used by load balancers to distribute traffic to servers. It minimizes the number of requests that get forwarded to different servers when new servers are added or when existing servers are brought down.

This is achieved by giving each node a number (by hashing its own server name). These numbers mark the boundary for a range of numbers, so the numbers of the hashed elements (for instance the user names) will fall among the numbers given to servers. In many places this is explained as servers number falling in the “continuum” of all the possible numbers (represented by a circle). Server numbers mark the (random) start and end of zones. These zones are assigned to, for instance, the higher server (clockwise or to the right).

From a coding point of view, servers with their numbers are maintained in a table, so each time we get the hash value of an element, we compare it against the table to get the server number.

Should a server fail, we only need to assign a new server in the table to the range of the dead server. In a production environment, perhaps you prefer to have some “hot spare server” if server set-up is slow… In any case, only the affected hashes (representing what ever you selected to hash - users, IP or any other) will be affected.

Should a server be added, for instance reduce a hot-spot, the range of the hot-spot can be split and assigned to the new server. The opposite range management (joining ranges) can be done for removing a server and adding its range to and existing server.

Rendezvous hashing

A type of hashing also coined highest random weight hashing. It also allows for minimal re-distribution of mappings when a server goes down. There are several hash functions to implement it. They normally use some operation on both the server value and the element selected from the request.

An important advantage is that no preallocation (or multiple preallocation) is needed, and all nodes share the same ‘formula’. In other words, by the former, calculation is done on request. Strictly speaking, this is an O(N) regarding time, but in this case N is the number of nodes. So the “implicit” ‘for …’ search is up ‘max num of nodes’, which normally is a well and controlled finite number. By the latest, any node can calculate the destination as all results will point to the same point.

In fact, Consistent hashing can be viewed as a specific case of the more generic Rendezvous hashing

It sample of Rendezvous hashing (with weights) implementation can be found in Wikipedia


Relational databases

A type of structured database in which data is stored following a tabular format. It follows a well defined structure. They are also known as SQL Databases

SQL: Structured Query Language - Language to query Relational Databases (although there are very few with other query languages)

Non Relational databases: databases without such tabular format (they are also known as NonSQL databases)


A type of database transaction that accomplishes with this four properties

Database index: allows to search in a NON linear fashion. They can be seen as an auxiliary data structure optimized for searching in an specific field (column). They provide faster “read”, but a slightly slower writes in the database as the index also needs to be updated. Consider also that the index will also consume some space.

An advantage for SQL databases (important for distributed systems) is that when using SQL, you DO NOT need to store in memory all the data as you would need if querying via a, for instance Python, program. SQL manages directly the stored data. This is very true for distributed systems, with might have terabits of data in several machines. In any case, and going back to SQL, if data is distributed on several machines, we need an SQL layer that can handle distributed queries !!

Key-Value Stores

One of the most commonly used NoSQL paradigms today, the key-value store bases its data model on the associative array data type.

The result? A fast, flexible storage machine that resembles a hash table. That’s right folks, our favorite friendly neighborhood data structure strikes again!

A Key-Value store is a database that consists of Keys, typically strings. These keys map to arbitrary values.

They are flexible as they do not impose a structure. They are also simple, as they resemble a hash table, which is one of the most conceptually simple key-value mappings.

They fit really well for caching (as values are stored in hashed fashion) or dynamic configuration.

As the data is accessed via the key, no search or lookup is involved.

Some examples are Google BigTable, Amazon DynamoDB, Etcd, Redis or ZooKeeper. Some of them use disk to persist data, some other use memory for faster data management. Some provide strong consistency, some other provide eventual consistency.

Replication And Sharding

A system’s performance is often only as good as its database’s; optimize the latter, and watch as the former improves in tandem!

On that note, data redundancy and data partitioning techniques can be used to enhance a system’s fault tolerance, throughput, and overall reliability.

Replication and Sharing are best suited for systems confortable with “eventual consistency”, so they do not require the A.C.I.D protocol.


The act of duplicating the data from one database server to others. This is sometimes used to increase the redundancy of your system and tolerate regional failures for instance. Other times you can use replication to move data closer to your clients, thus decreasing the latency of accessing specific data.

A decreasing latency example is a “news system” (for example,Linkedin), where few people write post that are read by many. Replicas can serve the users reading, and replicas can be located closer to readers. Writers cna post also to databases close to them to get faster writes, while replicas are sent to the rest of replicated databases. Crearly, there are some compromises: an reader located in Indiua will “see” a bit later, in his regional replica, the post of a writer located in the USA, that has written in his USA replica. The “news system” does not have to be updated at the same time everywhere instantly. As an exercise to reader, think if for the comments on m¡news this might be the same or not…

A common model is that there is a main database and a replicated database. Updates to the replicated database can occur synchronously or asynchronously. In any case, we need to guarantee that the replica is always up to date with the main database. There are several mechanism explained in the Consensus section that cover some of them.


Data Partitioning sometimes called sharding, is the act of splitting a database into two or more pieces called shards and is typically one to increase the throughput of your database. Popular sharding strategies include:

Sharding uses a scaling architectural approach known as horizontal scaling (or also known as ‘scale-out’), where several ‘nodes’ (computers) to hold normally, but not necessarily, a shard/partition per computer. In contrast, vertical scaling (also known as scale-up) is based on scaling by incrementing the resources (normally CPU, RAM or disk) to add more throughput.

Depending on the strategy, hot-spots (overloded nodes) might appear. Here is where consistent hashing can help, for both, distribute the data evently and allow fast node insertion and deletion. But consistent hashing will not stop for a node (a shard) goes down. We will loose that data, so we can have a replica for each shard. In other words, hashing will help on the data distribution and replicas on data availability.

The selected strategy can by implemented in different ways. One can consist on putting the strategy logic at the Application Server level (tipically with ‘if then’ logic). But, probably, a better aproarch is to have an intermediate reverse proxy among the Application Server and the databases imlmenting the logic for selecting the shard.

The “Share Nothing” architecture is another name for sharding.

Much more on sharding at Wikipedia or DigitalOcean


(all from Wikipedia)

NewSQL is a class of relational database management systems that seek to provide the scalability of NoSQL systems for online transaction processing (OLTP) workloads while maintaining the ACID guarantees of a traditional database system.

These systems automatically split databases across multiple nodes (transparent sharding) using Raft or Paxos consensus algorithms.

H-Store was promoted as a new class of parallel database management systems (NewSQL), that provide the high-throughput and high-availability of NoSQL systems, but without giving up the transactional consistency of a traditional DBMS known as ACID (atomicity, consistency, isolation and durability). Such systems operate across multiple machines, as opposed to a single, more powerful, more expensive machine. (from Wikipedia)


A clear consequence of distributed systems is the any sense of “state” is shared among several machines. Imagine an invoicing subsystem of a subscription service (for instance, like Netflix). Netflix, as service provider, or you, as a customer, do no want to be charged twice in a given payment. You can imagine that Netflix has more than one machine invoicing millions of customers. As Netflix wants to provide the best service to its clients, has build a highly available and consistent billing subsystem. So, for sure, it implies at least 2 machines, but by the size of Netflix, there are probably tenth or hundreds machines.

So we need a mechanism to make the subsystem as consistent as possible, making sure that there is an agreement (consensus) about if you have been charged or not!! In other words, the participants of the subsystem need to have a way to know the “state” of your invoicing and be able to provide “strong consistency” (similar to ACID for databases). Strong consistency ensures that a client writing o reading data will always get “correct” value.

Consensus is a fundamental problem in fault-tolerant distributed systems. Consensus involves multiple servers agreeing on values. Once they reach a decision on a value, that decision is final. Typical consensus algorithms make progress when any majority of their servers is available; for example, a cluster of 5 servers can continue to operate even if 2 servers fail. If more servers fail, they stop making progress (but will never return an incorrect result).

Mico Maco likes very much the definition of consensus found at “In Search of an Understandable Consensus Algorithm” paper.

Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members.Because of this,they play a key role in building reliable large-scale software systems.

Consensus algorithms are hard to design and implement, but there are several well known algorithms available like Paxos and Raft

As described in Wikipedia Paxos link above, it implements a state machine approach:

Consensus protocols are the basis for the [state machine replication]( approach to distributed computing. State machine replication is a technique for converting an algorithm into a fault-tolerant, distributed implementation.

Raft uses a different approach, based on log replication:

Raft achieves consensus via an elected leader. [...] The leader is responsible for log replication to the followers.

The good news is that these algorithms are already implemented “under the hood” in many useful services/products, like Key-Value stores like Apache ZooKeeper or Etcd. ZooKeeper, developed before Raft, uses its own consensus algorithm called Zab. Etcd provides high availability and strong consistency using the leader election and log replication features of Raft. This means that many clients can write/read to the distributed machines/nodes of the Etcd database in a consistent manner, and guarantying it a single source of true for their data. By implementing Raft, there is a unique leader to achieve this. Leader election is, like log replication, and safety, key elements to achieve consensus. We will focus now in leader election, as it guarantees “who” (which machine/node) is responsible for (to lead) the group.

A word of warning: Some times you can achieve distributed agreement with algorithms like Rendezvous hashing, where n leader is elected. A “pre-established” agreement sets beforehand the consensus…

Leader election

Each of the previously commented consensus algorithms have a leader election mechanism. Raft uses a voting mechanism that you can visualize in The secret lives of data or at Raft home

You can use the etcd key-value db to store which is the leader among your servers. Because etcd is HA and strongly consistent, you know that the leader value is always correct. For example, you can store a key-value like theLeader -> node33, 5 where ‘5’ is a TTL for the “lease” to be leader. When the lease expires (the current leader has not renew it), a new election happens. Find a python example at

Some more resources about consensus
Raft video

In Search of an Understandable Consensus Algorithm Usenix conference.

C.A.P Theorem

(from Wikipedia)

Consistency: Every read receives the most recent write or an error Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes

Note that consistency as defined in the CAP theorem is quite different from the consistency guaranteed in ACID database transactions (defines as ensuring that a transaction can only bring the database from one valid state to another)

In the absence of network failure – that is, when the distributed system is running normally – both availability and consistency can be satisfied.

CAP is frequently misunderstood as if one has to choose to abandon one of the three guarantees at all times. In fact, the choice is really between consistency and availability only when a network partition or failure happens; at all other times, no trade-off has to be made.

Database systems designed with traditional ACID guarantees in mind, such as RDBMS, choose consistency over availability, whereas systems designed around the BASE philosophy (Basically Available, Soft state, Eventual consistency), common in the NoSQL movement for example, choose availability over consistency.

An interesting article about how CAP Twelve Years Later: How the “Rules” Have Changed


Database scalability

Distributed data stores

Distributed transaction