Distributed Systems
Distributed systems are collections of independent computers that appear to users as a single coherent system. They enable building scalable, fault-tolerant, and globally accessible applications.
What is a Distributed System?
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another.
Characteristics
Concurrency
Multiple components execute simultaneously.
Lack of Global Clock
No single, globally agreed-upon time reference.
Independent Failures
Individual components can fail independently; the rest of the system must be designed to keep operating.
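Because there is no global clock, distributed systems often order events with logical clocks instead of wall-clock time. Below is a minimal sketch of a Lamport clock; the class and method names are illustrative, not from any particular library:

```python
class LamportClock:
    """Logical clock: orders events without a shared physical clock."""
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the clock by one.
        self.time += 1
        return self.time

    def send(self):
        # Attach the current timestamp to an outgoing message.
        return self.tick()

    def receive(self, msg_time):
        # Merge rule: take the max of local and message time, then advance.
        self.time = max(self.time, msg_time) + 1
        return self.time

# Node A performs a local event, then sends a message to node B.
a, b = LamportClock(), LamportClock()
a.tick()
ts = a.send()     # A's clock is now 2
b.receive(ts)     # B jumps to 3, preserving the causal order
```

If event X causally precedes event Y, X's Lamport timestamp is always smaller than Y's, which is exactly the property the "lack of a global clock" would otherwise make hard to guarantee.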
Challenges
Network Communication
- Latency: Delays in message delivery
- Bandwidth: Limited data transfer capacity
- Reliability: Messages can be lost, duplicated, or reordered
Consistency
Ensuring all nodes have the same view of data.
Fault Tolerance
Keeping the system operational despite component failures.
Coordination
Managing interactions between distributed components.
Architectural Patterns
Client-Server Architecture
┌──────┐    ┌──────┐    ┌──────┐
│Client│───►│Server│◄───│Client│
└──────┘    └──────┘    └──────┘
Characteristics:
- Centralized server
- Multiple clients
- Request-response model
Use Cases:
- Web applications
- Database systems
- File servers
Peer-to-Peer Architecture
┌─────┐    ┌─────┐    ┌─────┐
│Peer │◄──►│Peer │◄──►│Peer │
│  A  │    │  B  │    │  C  │
└─────┘    └─────┘    └─────┘
Characteristics:
- No central server
- All nodes are equal
- Direct communication between peers
Use Cases:
- File sharing (BitTorrent)
- Blockchain networks
- Distributed databases
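A common way peers spread data without a central server is gossip: each round, every peer exchanges state with a randomly chosen neighbor. The sketch below is illustrative (the `Peer` class and push-pull merge are assumptions, not a specific protocol):

```python
import random

class Peer:
    """Each peer holds a set of known items and gossips with neighbors."""
    def __init__(self, name):
        self.name = name
        self.known = set()
        self.neighbors = []

    def gossip_round(self):
        # Push-pull gossip: exchange and merge state with one random neighbor.
        if not self.neighbors:
            return
        other = random.choice(self.neighbors)
        merged = self.known | other.known
        self.known = other.known = merged

# Three fully connected peers; one learns about an update.
peers = [Peer(n) for n in "ABC"]
for p in peers:
    p.neighbors = [q for q in peers if q is not p]
peers[0].known.add("update-1")

for _ in range(5):          # a few rounds suffice for full propagation here
    for p in peers:
        p.gossip_round()
```

No peer is special: any node can originate an update, and the data still reaches everyone even if individual exchanges are randomly ordered.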
Microservices Architecture
┌───────┐  ┌───────┐  ┌───────┐
│Service│  │Service│  │Service│
│   A   │  │   B   │  │   C   │
└───┬───┘  └───┬───┘  └───┬───┘
    │          │          │
    └──────────┼──────────┘
               │
        ┌──────┴──────┐
        │ API Gateway │
        └──────┬──────┘
               │
        ┌──────┴──────┐
        │   Client    │
        └─────────────┘
Characteristics:
- Independent services
- Own databases
- API communication
- Loose coupling
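The API gateway in the diagram routes each client request to the service that owns it. A minimal sketch, assuming prefix-based routing and in-process callables standing in for real services:

```python
class ApiGateway:
    """Routes requests to the service owning the path prefix."""
    def __init__(self, routes):
        self.routes = routes  # prefix -> service callable

    def handle(self, path, payload=None):
        for prefix, service in self.routes.items():
            if path.startswith(prefix):
                return service(path, payload)
        return {"status": 404}

# Hypothetical services; each could run in its own process with its own database.
def user_service(path, payload):
    return {"status": 200, "service": "users"}

def order_service(path, payload):
    return {"status": 200, "service": "orders"}

gateway = ApiGateway({"/users": user_service, "/orders": order_service})
resp = gateway.handle("/orders/123")   # routed to order_service
```

Because services are only reachable through the gateway's routes, each one can be deployed, scaled, or replaced independently — the loose coupling listed above.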
Consistency Models
Strong Consistency
All nodes see the same data at the same time.
```python
class StrongConsistencyStore:
    def write(self, key, value):
        # Synchronously replicate to all nodes
        for node in self.nodes:
            node.write(key, value)

        # Wait for acknowledgment from every node before returning
        self.wait_for_acks()

    def read(self, key):
        # Read from any node (all hold the same data)
        return self.nodes[0].read(key)
```
Eventual Consistency
System will become consistent over time, but may have temporary inconsistencies.
```python
class EventualConsistencyStore:
    def write(self, key, value):
        # Write to the primary node
        self.primary.write(key, value)

        # Asynchronously replicate to the other nodes
        for replica in self.replicas:
            self.async_replicate(replica, key, value)

    def read(self, key):
        # Read from the nearest node (may be slightly stale)
        return self.get_nearest_node().read(key)
```
Causal Consistency
Operations that are causally related are seen in the same order by all nodes.
```python
class CausalConsistencyStore:
    def write(self, key, value, dependencies):
        # Include causal dependencies so nodes can apply operations in order
        operation = {
            'key': key,
            'value': value,
            'dependencies': dependencies,
            'timestamp': self.get_logical_clock()
        }

        self.broadcast_operation(operation)
```
Consensus Algorithms
Paxos
Algorithm for achieving consensus in distributed systems.
Roles:
- Proposer: Proposes values
- Acceptor: Votes on proposals
- Learner: Learns the decided value
Phases:
- Prepare: The proposer asks acceptors to promise to reject proposals with lower numbers and to report any value they have already accepted
- Accept: The proposer asks acceptors to accept its numbered value
- Learn: Acceptors notify learners of the accepted value
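The acceptor side of these rules is compact enough to sketch. This is a simplified single-decree acceptor, not a complete Paxos implementation (no networking, no learners, and the proposer logic is reduced to a majority check):

```python
class PaxosAcceptor:
    """Single-decree acceptor: just the promise/accept rules."""
    def __init__(self):
        self.promised = -1          # highest proposal number promised
        self.accepted_n = -1        # number of the accepted proposal, if any
        self.accepted_value = None

    def prepare(self, n):
        # Phase 1: promise to ignore proposals numbered below n,
        # reporting any value already accepted.
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, None, None

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered promise was made meanwhile.
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False

acceptors = [PaxosAcceptor() for _ in range(3)]
# A proposer that gathers promises from a majority may then send accepts.
promises = [a.prepare(1) for a in acceptors]
if sum(ok for ok, _, _ in promises) > len(acceptors) // 2:
    accepted = sum(a.accept(1, "value-A") for a in acceptors)
```

A later proposer running `prepare(2)` would learn that `"value-A"` was already accepted and would be obliged to re-propose that same value, which is how Paxos keeps the decision stable.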
Raft
A consensus algorithm designed to be easier to understand and implement than Paxos.
States:
- Follower: Passive state, responds to leaders
- Candidate: Campaigning to become leader
- Leader: Handles all client requests
Election Process:
```python
class RaftNode:
    def __init__(self):
        self.state = 'follower'
        self.term = 0
        self.voted_for = None
        self.log = []

    def start_election(self):
        self.state = 'candidate'
        self.term += 1
        self.voted_for = self.id

        votes = 1  # Vote for self
        for peer in self.peers:
            if peer.request_vote(self.term, self.id):
                votes += 1

        # Majority of the full cluster (peers plus self)
        if votes > (len(self.peers) + 1) / 2:
            self.become_leader()
```
Distributed Databases
Sharding
Partition data across multiple nodes.
Sharding Strategies:
- Range-based: Partition by key ranges
- Hash-based: Partition by hash of key
- Directory-based: Central directory maps keys to nodes
```python
class ShardedDatabase:
    def __init__(self, shards):
        self.shards = shards

    def get_shard(self, key):
        # Hash-based sharding: the key's hash picks a shard index
        return self.shards[hash(key) % len(self.shards)]

    def get(self, key):
        shard = self.get_shard(key)
        return shard.get(key)

    def put(self, key, value):
        shard = self.get_shard(key)
        shard.put(key, value)
```
Replication
Maintain copies of data on multiple nodes.
Replication Strategies:
- Master-Slave: One master accepts writes; replicas serve reads
- Multi-Master: Several nodes accept writes, requiring conflict resolution
- Leaderless: Any replica can accept reads and writes, typically coordinated by quorums
```python
class ReplicatedDatabase:
    def __init__(self, nodes, replication_factor=3):
        self.nodes = nodes
        self.replication_factor = replication_factor

    def write(self, key, value):
        # Write to multiple nodes
        target_nodes = self.get_replication_nodes(key)

        for node in target_nodes:
            node.write(key, value)

        # Wait for a quorum of acknowledgments
        self.wait_for_quorum(target_nodes)

    def read(self, key):
        # Read from multiple nodes for consistency
        target_nodes = self.get_replication_nodes(key)
        responses = []

        for node in target_nodes:
            response = node.read(key)
            responses.append(response)

        # Resolve conflicts if any
        return self.resolve_conflicts(responses)
```
Message Passing
Remote Procedure Calls (RPC)
Call procedures on remote machines as if they were local.
```python
class RPCClient:
    def __init__(self, server_address):
        self.server_address = server_address

    def call(self, method, *args):
        request = {
            'method': method,
            'args': args,
            'id': self.generate_id()
        }

        response = self.send_request(request)
        return response['result']

# Usage
client = RPCClient('server:8080')
result = client.call('add', 5, 3)  # Returns 8
```
Message Queues
Asynchronous communication between components.
```python
class MessageQueue:
    def __init__(self):
        self.subscribers = {}

    def publish(self, topic, message):
        # Deliver the message to every subscriber of the topic
        for subscriber in self.subscribers.get(topic, []):
            subscriber.receive(message)

    def subscribe(self, topic, callback):
        if topic not in self.subscribers:
            self.subscribers[topic] = []
        self.subscribers[topic].append(callback)

# Usage
mq = MessageQueue()
mq.subscribe('orders', process_order)
mq.publish('orders', {'id': 123, 'amount': 100})
```
Fault Tolerance
Redundancy
Multiple copies of components to handle failures.
Failover
Automatic switching to backup components.
```python
class FailoverService:
    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.current = primary

    def call(self, method, *args):
        try:
            return self.current.call(method, *args)
        except Exception:
            if self.current == self.primary:
                # Fail over to the backup and retry once
                self.current = self.backup
                return self.current.call(method, *args)
            else:
                raise
```
Circuit Breaker
Prevent cascading failures.
```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'CLOSED'

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            if self._should_attempt_reset():
                self.state = 'HALF_OPEN'
            else:
                raise Exception("Circuit breaker is OPEN")

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _should_attempt_reset(self):
        # Allow a trial call once the timeout has elapsed
        return time.time() - self.last_failure_time >= self.timeout

    def _on_success(self):
        self.failure_count = 0
        self.state = 'CLOSED'

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = 'OPEN'
```
Distributed Caching
Consistent Hashing
Distribute cache keys evenly across nodes.
```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, replicas=100):
        self.ring = {}
        self.sorted_keys = []

        for node in nodes:
            # Virtual nodes (replicas) smooth out the key distribution
            for i in range(replicas):
                key = self.hash(f"{node}:{i}")
                self.ring[key] = node
                self.sorted_keys.append(key)

        self.sorted_keys.sort()

    def hash(self, value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_node(self, key):
        if not self.ring:
            return None

        # Walk clockwise to the first virtual node at or after the key's hash
        hash_key = self.hash(key)
        idx = bisect.bisect_right(self.sorted_keys, hash_key)

        if idx == len(self.sorted_keys):
            idx = 0  # wrap around the ring

        return self.ring[self.sorted_keys[idx]]
```
Monitoring Distributed Systems
Distributed Tracing
Track requests as they flow through multiple services.
```python
import time

class DistributedTracer:
    def __init__(self):
        self.spans = []

    def start_span(self, operation_name, parent_span=None):
        span = {
            # Child spans inherit the parent's trace ID
            'trace_id': parent_span['trace_id'] if parent_span
                        else self.generate_trace_id(),
            'span_id': self.generate_span_id(),
            'parent_span_id': parent_span['span_id'] if parent_span else None,
            'operation_name': operation_name,
            'start_time': time.time()
        }

        self.spans.append(span)
        return span

    def finish_span(self, span):
        span['end_time'] = time.time()
        span['duration'] = span['end_time'] - span['start_time']
```
Health Checks
Monitor the health of distributed components.
```python
class HealthChecker:
    def __init__(self, services):
        self.services = services

    def check_all(self):
        results = {}
        for service_name, service in self.services.items():
            try:
                health = service.health_check()
                results[service_name] = {
                    'status': 'healthy',
                    'details': health
                }
            except Exception as e:
                results[service_name] = {
                    'status': 'unhealthy',
                    'error': str(e)
                }
        return results
```
Best Practices
- Design for Failure: Assume components will fail
- Use Idempotent Operations: Safe retry mechanisms
- Implement Timeouts: Prevent hanging operations
- Monitor Everything: Comprehensive observability
- Use Circuit Breakers: Prevent cascading failures
- Implement Retry Logic: Handle transient failures
- Design for Scalability: Architecture should support growth
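Several of these practices combine naturally in a retry helper. The sketch below is one illustrative way to implement retries with exponential backoff and jitter; it is only safe when the wrapped operation is idempotent:

```python
import random
import time

def call_with_retry(func, retries=3, base_delay=0.1):
    """Retry a flaky, idempotent operation with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # exhausted retries: surface the failure
            # Exponential backoff with jitter, so many clients
            # retrying at once do not hammer the service in lockstep.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# A simulated transient failure: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = call_with_retry(flaky, retries=3, base_delay=0.01)
```

In production this would typically be combined with per-attempt timeouts and a circuit breaker, so that persistent failures stop consuming resources instead of being retried forever.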
Distributed systems enable building powerful, scalable applications but require careful design to handle the complexities of network communication, consistency, and fault tolerance.

