Tuesday, January 5, 2021

 

NoSQL Opensource Database: Architecture, Tools, Algorithm, Design, Interchange and Distributed architecture

Created on 2020-08-21 11:15

Published on 2020-08-22 14:05

Cloud platforms (AWS/Azure/Google) offer a range of non-hierarchical, non-RDBMS database services (not necessarily NoSQL): wide-column, document, time-series, graph, and analytics. Powered by open source and launched on cloud platforms, these saw growth of 52% in 2019, with revenues projected to exceed 25% by 2022.

Next-gen cloud-native application architectures and the hunger for data variety in Analytics/AI will drive rapid adoption of these data management services, provided one understands the underlying algorithms, architecture and design principles. Here is a compendium of the same:


Data Interchanges - Added 7th Sept 2020

Traditional, bulky interchange formats (JSON, XML, CSV) are used across services, web services, RPC, streams, messages, events and actor-based data pipelines. They may not be the best choice for data-intensive apps communicating over a network. Fast-forward and consider these options:

  1. Protobuf: If you are considering RPC as the interchange mechanism, potentially with gRPC
  2. Thrift: Developed at Facebook (social streaming)
  3. Apache Avro: Open source, built for Hadoop and large data sets; order-agnostic, because fields are matched by name between the writer schema and the reader schema

The challenge with all of these interchange formats is implementing forward and backward compatibility: old readers must cope with data written by newer code, and new readers with data written by older code.
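To make forward and backward compatibility concrete, here is a minimal sketch of Avro-style schema resolution in plain Python. It does not use the Avro library; the schema representation, the `REQUIRED` marker and the `encode`/`decode` helpers are all hypothetical names chosen for illustration. The idea it shows is that values go on the wire in writer-schema order, and the reader matches fields by name, fills in defaults for fields it expects but the writer did not send, and ignores fields it does not know.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema format: an ordered list of (field_name, default) pairs.
# REQUIRED marks a field that has no default value.
REQUIRED = object()
Schema = List[Tuple[str, Any]]


def encode(record: Dict[str, Any], writer_schema: Schema) -> List[Any]:
    """Serialize values positionally, in writer-schema order (no names on the wire)."""
    return [record[name] for name, _ in writer_schema]


def decode(payload: List[Any], writer_schema: Schema, reader_schema: Schema) -> Dict[str, Any]:
    """Resolve the writer schema against the reader schema by field name."""
    written = {name: value for (name, _), value in zip(writer_schema, payload)}
    result = {}
    for name, default in reader_schema:
        if name in written:
            result[name] = written[name]      # field known to both sides
        elif default is not REQUIRED:
            result[name] = default            # reader-only field: use its default
        else:
            raise ValueError(f"field {name!r} missing and has no default")
    return result                             # writer-only fields are ignored


V1: Schema = [("id", REQUIRED), ("name", REQUIRED)]
# V2 reorders the fields and adds one field with a default.
V2: Schema = [("name", REQUIRED), ("id", REQUIRED), ("email", None)]

# Backward compatibility: a new reader (V2) reads data written with V1.
old_payload = encode({"id": 1, "name": "Ada"}, V1)
new_reader_view = decode(old_payload, V1, V2)

# Forward compatibility: an old reader (V1) reads data written with V2.
new_payload = encode({"id": 2, "name": "Lin", "email": "lin@example.com"}, V2)
old_reader_view = decode(new_payload, V2, V1)
```

The sketch only works in both directions because the added field carries a default. Adding a field without a default would break new readers on old data, which is the kind of rule each format asks schema authors to follow.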

Architecture: Replication, Partitions, Transaction - Added 9th Sept 2020

Distributed databases will become increasingly common as we head to the cloud. It's important to understand and account for the challenges when devising applications/algorithms:

  1. Replication challenges and solutions: [image: replication challenges and solutions; no alt text available]

2. Partitioning/Sharding: Horizontal scaling is enabled by key-based, hash-based, primary/secondary-index-based partitioning, etc. The challenges are primarily around rebalancing and request routing; many large databases solve them efficiently.

3. Transactions: Applications have to account for inconsistent implementations of the Atomicity, Consistency, Isolation and Durability (ACID) properties. Weak isolation leads to dirty reads, dirty writes, lost updates, and read/write skew (timing anomalies). These hard problems are addressed by mechanisms that applications still need to apply themselves: snapshot isolation with "BEGIN TRANSACTION", and explicit locking or materialized conflicts with "SELECT FOR UPDATE". Beyond those are the serializable isolation methods: poorly performing serial execution (Redis); poorly scaling two-phase locking (MySQL, using predicate, index-range, shared and exclusive locks); and the more recent serializable snapshot isolation (PostgreSQL), where conflicting writes are detected and the offending transaction is aborted.
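The hash-based partitioning, routing and rebalancing from item 2 above can be sketched as follows. This is a minimal illustration, not how any particular database does it. It assumes a fixed number of partitions chosen up front, and the node names and the `assign`, `add_node` and `route` helpers are hypothetical. Keys hash to a partition, a routing table maps partitions to nodes, and rebalancing moves whole partitions rather than rehashing every key.

```python
import hashlib
from typing import Dict, List

NUM_PARTITIONS = 16  # assumption: fixed up front, much larger in a real system


def partition_for(key: str) -> int:
    """Map a key to a partition with a stable hash (Python's hash() is salted per process)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def assign(nodes: List[str]) -> Dict[int, str]:
    """Initial routing table: partitions handed out round-robin."""
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}


def add_node(routing: Dict[int, str], new_node: str) -> Dict[int, str]:
    """Rebalance by moving whole partitions to the new node until it holds a fair share."""
    routing = dict(routing)
    node_count = len(set(routing.values()) | {new_node})
    target = NUM_PARTITIONS // node_count
    for _ in range(target):
        # Group partitions by owner and take one from the most loaded node.
        load: Dict[str, List[int]] = {}
        for partition, node in routing.items():
            load.setdefault(node, []).append(partition)
        donor = max(load, key=lambda n: (len(load[n]), n))
        routing[load[donor][-1]] = new_node
    return routing


def route(routing: Dict[int, str], key: str) -> str:
    """Request routing: key -> partition -> node."""
    return routing[partition_for(key)]


routing = assign(["node-a", "node-b", "node-c"])
rebalanced = add_node(routing, "node-d")
moved = [p for p in routing if routing[p] != rebalanced[p]]
```

Because the key-to-partition mapping never changes, only the partitions listed in `moved` change owner when a node joins. A plain `hash(key) % number_of_nodes` scheme would instead remap most keys whenever the node count changes.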
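The lost-update anomaly from item 3 above can be reproduced in a few lines. The sketch below uses Python's built-in `sqlite3` module with an in-memory database and simulates two clients by interleaving their steps on one connection. SQLite has no "SELECT FOR UPDATE", so the fix shown here is a swapped-in technique, an atomic compare-and-set `UPDATE`, rather than the explicit locking named above. The table, column and helper names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER NOT NULL)")
conn.execute("INSERT INTO counters VALUES ('likes', 10)")
conn.commit()


def read_value(name: str) -> int:
    row = conn.execute("SELECT value FROM counters WHERE name = ?", (name,)).fetchone()
    return row[0]


def blind_write(name: str, new_value: int) -> None:
    """Unsafe half of a read-modify-write cycle: overwrite whatever is there."""
    conn.execute("UPDATE counters SET value = ? WHERE name = ?", (new_value, name))
    conn.commit()


def compare_and_set(name: str, expected: int, new_value: int) -> bool:
    """Write only if the value is still the one we read; report whether it applied."""
    cur = conn.execute(
        "UPDATE counters SET value = ? WHERE name = ? AND value = ?",
        (new_value, name, expected),
    )
    conn.commit()
    return cur.rowcount == 1


# Lost update: both clients read 10, both write 11, one increment disappears.
a = read_value("likes")
b = read_value("likes")
blind_write("likes", a + 1)
blind_write("likes", b + 1)
lost_update_result = read_value("likes")

# Same interleaving with compare-and-set: the stale write is rejected and retried.
blind_write("likes", 10)  # reset for the second run
a = read_value("likes")
b = read_value("likes")
a_applied = compare_and_set("likes", a, a + 1)
b_first_try = compare_and_set("likes", b, b + 1)
if not b_first_try:
    b = read_value("likes")  # re-read the fresh value, then retry
    compare_and_set("likes", b, b + 1)
cas_result = read_value("likes")
```

In the first run the counter ends at 11 after two increments; in the second it ends at 12, because the second client's stale write is rejected and retried. The point matches the paragraph above: the database provides the primitive, but the application has to use it and handle the retry.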