Are you lost in the jungle of (big-)data technologies? Are you wondering how Flink, Spark, Hive, Pig, Airflow and the like fit together? Then this is the right book!
I found it as a recommendation on Goodreads.com. I wanted to learn about data technologies, as I work with them more and more, and I felt it was a good match: It was recent enough (2017) and was available on Google Play within my budget (EUR 12).
In 616 pages, it covers a lot of ground, through three parts. Part I lays some foundations: It defines reliability, scalability, maintainability, data models, query languages and storage engines. Part II explores data distribution and covers replication, partitioning/sharding, transactions, consistency and consensus. Part III focuses on data processing and discusses batch processing and stream processing.
Martin Kleppmann is a researcher at Cambridge University, who focuses on data. He has also been an entrepreneur and speaks at conferences regularly.
I enjoyed the read: Despite a very technical subject, I liked the way Martin explains complicated issues with simple diagrams and intuitive ideas. No Greek symbols, no pedantic sentences. Yet, the book is full of pointers to both research articles and blog posts for those who would like to dig deeper. Each chapter starts with a map (you know, those maps of imaginary worlds we find in fantasy books or role-playing games). Here, those maps relate all the data concepts and technologies, and they taught me a lot! I saw connections I had never thought of. Overall, I feel I learned a great deal, and at the end, Martin discusses ethical issues with data and machine learning that made me think; I am still thinking.
I give it four stars! I definitely recommend it to those who want to enter the world of data-intensive systems. I am also convinced that the sheer number of pointers makes it an interesting reference for experts! Next, we have to choose a specific technology to dive deeper into. Martin gave us the map; we just have to follow our interests!