Replicated and partitioned storage — suitcases stacked somewhere in Doylestown, PA

Cheatsheet from “Designing Data-Intensive Applications” — Part I

Vishal Kumar

--

The book “Designing Data-Intensive Applications” (DDIA) by Martin Kleppmann is a must-read for software professionals. Many of us do not get much time for it or have never come across it. In a two- or three-part blog series, I hope to make you aware of the nomenclature and design principles behind data-intensive applications. I will follow the book's chapter sequencing and present the takeaways of each section. In Part I, we will discuss database choices, their physical layout, access patterns, schema evolution, and things to watch for when using replication and partitioning. The next posts will cover time, transactions, batch and event processing, and ways to keep heterogeneous systems in sync.

Database (DB) choices

Data stored in a DB and the structures used in code usually differ. NoSQL DBs regained popularity in the early 2000s by shrinking this gap, aided by an explosion of data volume that required support for partitioning and replication. There are three popular database choices:

a. Relational — Good for storing data where most relationships are one-to-one. Joins are optimized, but too many joins in a single query can slow things down. It is usually called a DB with a schema, but we can also think of it as a schema-on-write DB.

b. Document-based — Good for one-to-many relationships, where relational joins may start becoming a bottleneck. Data is kept denormalized to avoid joins. Many call it schema-less, but in reality it is more appropriate to think of it as schema-on-read.

c. Graph-based — For many-to-many relationships, when denormalization in a document store becomes hard to maintain. Imagine a single update that cascades and forces updates across multiple documents. In reality, most database designs can benefit from adopting graphs, where a single record can be reached from multiple records by following different relationships. It just feels natural. Recently I had a chance to prototype with Neo4j and use Cypher; it was fun and helped me realize how effective graph designs can be. A toy traversal sketch follows this list.
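To make the "reached via multiple relationships" idea concrete, here is a minimal Python sketch of multi-hop traversal over a plain adjacency dict. The names and relationships are hypothetical, and this is ordinary Python rather than Neo4j or Cypher:

```python
from collections import deque

# Toy property graph: each edge is a (relationship, target) pair.
graph = {
    "alice": [("FRIENDS_WITH", "bob"), ("LIVES_IN", "doylestown")],
    "bob": [("FRIENDS_WITH", "carol"), ("LIVES_IN", "philly")],
    "carol": [("LIVES_IN", "doylestown")],
    "doylestown": [],
    "philly": [],
}

def reachable(start):
    """Breadth-first traversal: every record reachable from `start`
    by following any relationship, over any number of hops."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for _rel, target in graph[node]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

print(reachable("alice"))  # all five records, via different relationship paths
```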

Storage and Retrieval

A note on how data is physically laid out inside a DB. Knowing the layout helps in choosing the right storage engine for a given context.

a. Log-based — An append-only sequence of records. Writes are fast, wicked fast, but a read needs a full scan unless a secondary data structure supports the search. Systems employ compaction to make things faster, de-duplicating multiple records with the same key and getting rid of tombstone records (used for deletes). A sketch combining this with a hash index appears after this list.

b. Indexed — Retrieval is faster but writes are slower, since the index metadata needs to be kept in sync with the actual data.

c. Hash-based — For a high volume of writes over a small number of keys: an in-memory hash map points each key to its latest location in the log.

d. In-memory AVL tree + log-based storage (the LSM-tree approach), which keeps keys sorted and allows range queries.

e. If in-memory AVL trees are not desired, on-disk B-trees provide crash protection along with the range-query benefit.

f. When queries are multidimensional, data structures like R-trees are well suited.
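To ground items (a) and (c), here is a minimal Python sketch of an append-only log with an in-memory hash index, tombstone deletes, and compaction, roughly in the spirit of Bitcask. The class name and file format are made up for illustration, and it ignores concurrency and crash recovery:

```python
import os

class HashIndexedLog:
    """Append-only log on disk plus an in-memory hash map from each
    key to the byte offset of its latest record."""

    TOMBSTONE = "__tombstone__"

    def __init__(self, path):
        self.path = path
        self.index = {}                       # key -> byte offset
        open(path, "a").close()               # create the log if missing

    def set(self, key, value):
        offset = os.path.getsize(self.path)
        with open(self.path, "a") as f:       # append only: sequential, fast
            f.write(f"{key},{value}\n")
        self.index[key] = offset              # the newest offset wins

    def get(self, key):
        if key not in self.index:
            return None
        with open(self.path) as f:            # one seek instead of a full scan
            f.seek(self.index[key])
            _, value = f.readline().rstrip("\n").split(",", 1)
        return None if value == self.TOMBSTONE else value

    def delete(self, key):
        self.set(key, self.TOMBSTONE)         # deletes are tombstone records

    def compact(self):
        """Keep only the latest record per key, drop tombstones,
        and rebuild the index over the fresh log."""
        latest = {k: self.get(k) for k in list(self.index)}
        os.remove(self.path)
        self.index = {}
        open(self.path, "a").close()
        for k, v in latest.items():
            if v is not None:
                self.set(k, v)
```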

Access patterns

a. Online transaction processing (OLTP) — Usually a single user is involved in the transaction. It typically needs a small set of records and can result in random writes.

b. Online analytic processing (OLAP) — Businesses making sense of their own data, or of third-party (3P) data about their customers. One can use a materialized view and keep it away from the OLTP flow. Such materialized views usually have hundreds of columns, so using a column-oriented DB like Snowflake is suggested; a tiny layout sketch follows.
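As a toy illustration of why column orientation suits such analytics, here is a sketch contrasting row and column layouts in plain Python (the sales fields are hypothetical; a real column store adds compression and much more):

```python
rows = [  # row-oriented: one record per entry (OLTP-friendly)
    {"order_id": 1, "customer": "a", "revenue": 10.0},
    {"order_id": 2, "customer": "b", "revenue": 25.0},
    {"order_id": 3, "customer": "a", "revenue": 7.5},
]

columns = {  # column-oriented: one array per attribute (OLAP-friendly)
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "revenue": [10.0, 25.0, 7.5],
}

total_row = sum(r["revenue"] for r in rows)  # touches every field of every row
total_col = sum(columns["revenue"])          # scans one contiguous array
assert total_row == total_col == 42.5
```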

Encoding and Evolution

a. Forward compatibility — old code reading new data. One thing to keep in mind: if old code reads new data and then writes it back, new fields can be nullified (see the sketch after this list).

b. Backward compatibility — new code reading old data.

c. JSON is king here. Use Swagger / OpenAPI to generate schemas for client interaction.

d. Binary encodings like Thrift or Protobuf identify fields by tag number, so renaming a field is not a concern. But when changing a tag number or a field's type, one should keep compatibility in mind.

e. Avro stores neither tag numbers nor types in the encoded data, making it the most compact. It is good for dynamically generated schemas.
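Here is a small sketch of the forward-compatibility pitfall from item (a), using JSON with hypothetical field names:

```python
import json

# "nickname" stands in for a field added by newer code.
new_record = '{"id": 7, "name": "Ada", "nickname": "ada99"}'

# BAD: old code decodes only the fields it knows, then re-encodes.
# The unknown "nickname" field is silently dropped on the write-back.
known_fields = {"id", "name"}
decoded = {k: v for k, v in json.loads(new_record).items() if k in known_fields}
written_back = json.dumps(decoded)       # {"id": 7, "name": "Ada"} (data lost)

# SAFER: keep the whole decoded dict, update only what this version
# understands, and re-encode everything, preserving unknown fields.
record = json.loads(new_record)
record["name"] = record["name"].upper()  # the only field old code touches
preserved = json.dumps(record)           # "nickname" survives the round trip
```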

Replication

To keep the DB running when one node fails, and for storage reliability, the dataset is replicated to more than one node, sometimes across countries and continents. These duplicated copies need to stay in sync and can produce undesired results if not well planned.

a. Single leader — For when write volume is not high but read volume is very high. Providing serializability is also easy. One should consider routing a client's reads of its own data to the leader.

b. Multi leader — For when write volume is also high. One should consider sticky routing in such a scenario.

c. Leaderless — These usually use quorum-based reads and writes; a small sketch follows below.

Synchronous vs asynchronous replication — that is, when to tell the client that a write is done. Making all followers synchronous does not make sense; a partial (some quorum) approach is good.
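A minimal sketch of the quorum idea from the leaderless item above, assuming n = 5 replicas and illustrative w and r values:

```python
import random

# With n replicas, writing to w nodes and reading from r nodes such that
# w + r > n guarantees the read set overlaps the write set.
n, w, r = 5, 3, 3
assert w + r > n, "read and write quorums must overlap"

# Each replica holds a (version, value) pair; two replicas lag behind
# because the latest write (version 2) only reached w = 3 of them.
replicas = [(2, "new"), (2, "new"), (2, "new"), (1, "old"), (1, "old")]

def quorum_read(replicas, r):
    """Ask any r replicas and keep the value with the highest version;
    at least one of the r answers is guaranteed to be up to date."""
    answers = random.sample(replicas, r)
    return max(answers, key=lambda vv: vv[0])[1]

print(quorum_read(replicas, r))  # always "new", whichever r nodes answer
```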

Adding new followers — Take a snapshot from the leader, and then the follower can sync with the leader for changes made after the snapshot.

Handling failover — How to detect that a leader is gone and how to choose a new one. Most systems use a timeout to detect the loss of a leader. One should be careful about the split-brain scenario, where two nodes consider themselves the leader.

Some choices for how to replicate

a. Statement-based — Very concise but problematic to implement; a function like now() can produce different values on different replicas.

b. Write-ahead log (WAL) shipping — It couples replication tightly to the storage engine.

c. Logical or row-based replication — This is the most desirable approach, and most databases now support change data capture (CDC).

Replication challenges

These are issues we should be aware of when multiple replicas are involved and different queries can interact with different replicas.

a. Read your own writes — Writing to the leader while reading from followers can cause a client to miss its own write; a logical timestamp can help fix this (see the sketch after this list).

b. Monotonic reads — Reading from two different replicas in succession can make time appear to go backwards. The use of sticky routing can help.

c. Consistent prefix reads — This one is challenging. The issue is that records may have different keys, and thus seem unrelated, but are connected by causality.
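As a sketch of the logical-timestamp fix for "read your own writes" (the class names and the redirect-to-leader fallback are illustrative assumptions, not the book's exact recipe):

```python
class Replica:
    def __init__(self):
        self.applied_ts = 0        # highest write timestamp replicated so far
        self.data = {}

    def apply(self, ts, key, value):
        self.data[key] = value
        self.applied_ts = max(self.applied_ts, ts)

def read(replica, key, min_ts, fallback_leader):
    """Serve from the follower only if it has caught up to the client's
    last write; otherwise redirect to the leader (or wait and retry)."""
    if replica.applied_ts >= min_ts:
        return replica.data.get(key)
    return fallback_leader.data.get(key)

leader, follower = Replica(), Replica()
leader.apply(ts=1, key="bio", value="hello")   # the client's own write, ts=1
# The client remembers ts=1; the lagging follower cannot serve this read yet.
print(read(follower, "bio", min_ts=1, fallback_leader=leader))  # "hello"
```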

How to handle write conflicts

a. By not having them — single-leader replication.

b. Converging towards a consistent state — 1. Last write wins. 2. Give a higher-ranked replica preference. 3. Merge the conflict (like Git). 4. Or surface the conflict and ask the client to resolve it.

c. Using CRDTs — conflict-free replicated data types; a tiny example follows.
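As a taste of CRDTs, here is a minimal grow-only counter (G-counter), one of the classic conflict-free designs:

```python
class GCounter:
    """Each replica increments only its own slot; merging takes the
    per-slot maximum, so concurrent updates converge in any order."""

    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.id] += 1          # each replica owns one slot

    def merge(self, other):
        # element-wise max is commutative, associative, and idempotent,
        # which is exactly what makes the merge conflict-free
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self):
        return sum(self.counts)

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()               # two writes land on replica A
b.increment()                              # a concurrent write on replica B
a.merge(b); b.merge(a)                     # sync in either order
assert a.value() == b.value() == 3
```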

Partitioning

When one replica cannot store the whole dataset, or cannot support the read or write volume, the data can be broken down, or partitioned, with different nodes serving as the source for each subset. Partitioning can be done in several ways:

a. Random — Write to any random partition. Reading becomes a nightmare, as every read query has to go to all partitions.

b. By range — Different partitions are responsible for different ranges of key values. Ranges need not be of equal size, and different ranges can hold different amounts of data.

c. Hash-based — Take a key and hash it to a partition/node id. Rather than hashing modulo the node count, one should always use consistent hashing.

d. Consistent hashing — All nodes are positioned on an abstract hash ring based on their node hash values. The same hash function is applied to the data key, and the next node clockwise on the ring stores the data. It is advised to assign more than one label to a node, so that a node appears multiple times on the hash ring. This allows better distribution and minimal shuffling on node failure. Also, machines with more computing power can be assigned a higher number of node labels. A sketch follows this list.

e. Sometimes no single key is perfect — Here one can use a composite key. In such a scenario, the first part finds the partition node, and the rest sorts the data within that node.
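A minimal sketch of item (d), a consistent-hash ring with virtual node labels; the node names, hash choice, and vnode count are all illustrative:

```python
import bisect
import hashlib

def h(s):
    """Map a string to a point on the ring (a 32-bit hash value)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    def __init__(self, nodes, vnodes=8):
        # each physical node appears `vnodes` times on the ring
        self.points = sorted(
            (h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )

    def owner(self, key):
        """The next virtual node clockwise from the key's hash owns it."""
        positions = [p for p, _ in self.points]
        i = bisect.bisect(positions, h(key)) % len(self.points)
        return self.points[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("user:42"))  # e.g. "node-b"
# Removing a node reassigns only the keys it owned; everything else
# stays put, which is the "minimal shuffling" property.
```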

How to route requests to partitions

There is a need to map a key to its partition node. One can use ZooKeeper to store this metadata.

Alternatively, one can use a gossip protocol, where any node can accept the request and forward it to the owning partition. This avoids any external mapping service. A toy forwarding sketch follows.
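A toy sketch of that forwarding mode, assuming each node holds a gossip-maintained copy of the partition map (the gossip exchange itself is not modeled):

```python
# Hypothetical key-range partition map; in a real system, gossip keeps
# every node's copy of this map up to date.
partition_map = {"a-m": "node-1", "n-z": "node-2"}

def owning_node(key):
    first = key[0].lower()
    return partition_map["a-m"] if first <= "m" else partition_map["n-z"]

def handle_request(this_node, key):
    """Any node accepts the request; if it is not the owner,
    it forwards the request one hop to the owning node."""
    owner = owning_node(key)
    if owner == this_node:
        return f"{this_node} serves {key!r} locally"
    return f"{this_node} forwards {key!r} to {owner}"

print(handle_request("node-1", "zebra"))  # node-1 forwards 'zebra' to node-2
```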

That is a wrap for this part. Hopefully you will give the DDIA book a try, if you have not already.

Happy reading !!!
