Designing Data-Intensive Applications - Why read it at least once

So, I’ve been reading Designing Data-Intensive Applications (DDIA), First Edition, by Martin Kleppmann. Definitely a best seller. I’m aware that a new edition is coming out soon, so if you are interested in this book I would recommend waiting for its release and supporting the author! In this article, I will talk about the reading experience I had and when to read it.

First of all, this is the first tech book I ever bought. The other books I read were mostly online PDFs. Now, this was almost a blind purchase. This year I was looking to level up a little bit, and saw in different social media posts that this book was a game changer or a must buy. It’s incredible that an 8-year-old edition, given the constant evolution of this field, still stays relevant in the public debate. So I just bought it right away from Amazon. Boom.

Once the book arrived, I read the table of contents and started to think that I had “heard of” or “knew” most of the concepts it covers. Or so I thought.

I jumped into different sections from Part II and Part III, and started reading. Turns out, I was kinda wrong. I know the high-level concepts behind modern tools and how to use them. But even with years of developing software, all the tutorials and YouTube videos, and all the classes I passed, I have to admit I still lack a deeper understanding of those concepts, and that’s normal for everyone. It’s not depressing, but rather a way to tell yourself you can still improve.

So, my first impression was “Oh, I kinda know what this book is talking about…but actually, I’m still missing some valuable core concepts”.

A good example would be the whole Part II and Part III of DDIA (at least for me).

Part II dives into how data can be distributed across different machines. Replication? These days, we can set up replication almost instantly: just type the commands ChatGPT tells you, tweak some configuration files for a Docker container, or buy a managed service, and you are mostly set.

Now, have you ever heard about Anti-entropy?

Well, it’s actually one of the concepts behind replication in distributed systems. I knew about Read repair before, but had never heard of Anti-entropy. In short, it’s a background process that looks for differences between replicas and copies any missing data from one node to another. As far as I understand, large storage services like Amazon S3 or CDNs like Cloudflare’s depend on this kind of mechanism for data integrity. As you can see…it’s that important, and we sometimes miss that these concepts exist in the tools we use every day. The point here is that you will definitely discover new content.
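
To make the idea a bit more concrete, here is a minimal sketch I put together in Python. It is not code from the book or from any real database: the replica model (a dict of key to `(version, value)`) and the function name `anti_entropy_pass` are my own assumptions, and real systems use smarter techniques to find differences without comparing everything.

```python
from typing import Dict, List, Tuple

# Hypothetical model: each replica maps key -> (version, value).
Replica = Dict[str, Tuple[int, str]]


def anti_entropy_pass(replicas: List[Replica]) -> int:
    """Copy the newest version of every key to replicas that lack it.

    Returns the number of entries that were copied.
    """
    # Step 1: find the newest (version, value) pair known for each key.
    newest: Replica = {}
    for replica in replicas:
        for key, entry in replica.items():
            if key not in newest or entry[0] > newest[key][0]:
                newest[key] = entry

    # Step 2: push missing or stale entries to each replica.
    copied = 0
    for replica in replicas:
        for key, entry in newest.items():
            if key not in replica or replica[key][0] < entry[0]:
                replica[key] = entry
                copied += 1
    return copied


# Three replicas that have drifted apart (made-up example data).
replicas = [
    {"user:1": (2, "alice@new.example")},
    {"user:1": (1, "alice@old.example"), "user:2": (1, "bob@example.com")},
    {},
]
copied = anti_entropy_pass(replicas)
```

After one pass, all three replicas hold the same data, and running it again copies nothing. Read repair does a similar fix, but only when a client happens to read a stale key, while this kind of process runs in the background regardless of reads.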

Do you know about MapReduce? In Part III, the book literally shows us the first steps of Big Data with Hadoop and how this framework processes log files for data analysis, reading from and writing to disk between steps. If you worked with legacy systems, you probably knew or used Hadoop at the time, but I didn’t…and that’s the good part. Nowadays, we have other tools that grew out of the MapReduce model but are usually much faster because they keep more of the intermediate data in memory, like Apache Spark. Learning where today’s abstraction layers come from is, in my opinion, worth it. No present without past.
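
If you have never seen the MapReduce model, this is roughly how I picture it, as a tiny single-machine sketch in plain Python. It is not Hadoop code: the log format (`<method> <url> <status>`) and the function names `map_phase`, `shuffle` and `reduce_phase` are assumptions I made for illustration, and a real framework would spread these steps across many machines and write the intermediate results to disk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Mapper: emit (url, 1) for every well-formed log line."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            # Assumed log format: "<method> <url> <status>".
            yield fields[1], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reducer: sum the counts collected for each URL."""
    return {key: sum(values) for key, values in groups.items()}


# Made-up log lines, just to have something to count.
log_lines = [
    "GET /home 200",
    "GET /about 200",
    "GET /home 200",
    "POST /login 302",
]
counts = reduce_phase(shuffle(map_phase(log_lines)))
```

Here `counts` ends up with the number of requests per URL. The nice part of the model is that the mapper and the reducer only ever look at one record or one key at a time, which is what lets a framework run them in parallel on huge files.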

Obviously, the previous examples are really basic, and you will eventually find more complex stuff you didn’t know about as you read. And that’s why this book is so highly praised.

Based on my experience with DDIA, I think that this book is a must read for everyone (at least once). I see three kinds of readers. If you are finishing college or just starting your internship, this book will help you solidify core concepts on how data integrity is maintained in applications. I do think you need some initial knowledge before you start reading it. It’s not necessary to read the whole book: find the chapters that catch your interest and start there. I assure you it’s very rewarding. If you have 10+ years of experience in the field and have never read DDIA, this applies to you too: read it. Finally, if you are a professor who needs a good reference for your distributed systems lectures, do yourself a favor and buy this book!

Since we engineers, programmers or developers build software every single day with tools that are already implemented, reading this book won’t feel exactly “productive”, but you will have much better control over the backend. In my opinion, more conceptual control is always well regarded and a big win against the competition.

I will eventually read it end to end. It’s great to have around to revisit definitions and how data manipulation works internally!

— asz