Distributed Systems

Distributed systems are collections of independent computers that appear to users as a single coherent system. They enable building scalable, fault-tolerant, and globally accessible applications.

What is a Distributed System?

A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another.

Characteristics

Concurrency

Multiple components execute simultaneously.

Lack of Global Clock

No single, globally agreed-upon time reference.
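
With no global clock, distributed systems often order events with logical clocks instead. A minimal sketch of a Lamport logical clock (illustrative; the class and method names are assumptions, not from the text above):

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the clock
        self.time += 1
        return self.time

    def send(self):
        # Attach the current time to an outgoing message
        return self.tick()

    def receive(self, message_time):
        # On receipt, jump ahead of the sender's timestamp so the
        # receive event is ordered after the send event
        self.time = max(self.time, message_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()   # a.time is now 1
b.receive(t)   # b.time is now 2: ordered after the send
```

This gives a partial ordering of events that is consistent with causality, without any shared physical time.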

Independent Failures

Components can fail independently; a well-designed system keeps operating when parts of it fail.

Challenges

Network Communication

  • Latency: Delays in message delivery
  • Bandwidth: Limited data transfer capacity
  • Reliability: Messages can be lost, duplicated, or reordered
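
Because messages can be lost or duplicated, senders typically retry and receivers deduplicate by message ID. A minimal sketch of the receiver side (names are illustrative assumptions):

```python
class DedupingReceiver:
    def __init__(self):
        self.seen = set()       # message IDs already handled
        self.processed = []     # payloads actually applied

    def receive(self, msg_id, payload):
        # Drop duplicates caused by sender retries
        if msg_id in self.seen:
            return False
        self.seen.add(msg_id)
        self.processed.append(payload)
        return True

r = DedupingReceiver()
r.receive(1, "hello")
r.receive(1, "hello")   # duplicate from a retry: ignored
```

Deduplication like this is what makes at-least-once delivery safe to combine with retries.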

Consistency

Ensuring all nodes have the same view of data.

Fault Tolerance

System continues operating despite component failures.

Coordination

Managing interactions between distributed components.

Architectural Patterns

Client-Server Architecture

┌──────┐      ┌──────┐      ┌──────┐
│Client│─────►│Server│◄─────│Client│
└──────┘      └──────┘      └──────┘

Characteristics:

  • Centralized server
  • Multiple clients
  • Request-response model

Use Cases:

  • Web applications
  • Database systems
  • File servers

Peer-to-Peer Architecture

┌─────┐     ┌─────┐     ┌─────┐
│Peer │◄───►│Peer │◄───►│Peer │
│  A  │     │  B  │     │  C  │
└─────┘     └─────┘     └─────┘

Characteristics:

  • No central server
  • All nodes are equal
  • Direct communication between peers

Use Cases:

  • File sharing (BitTorrent)
  • Blockchain networks
  • Distributed databases

Microservices Architecture

┌───────┐  ┌───────┐  ┌───────┐
│Service│  │Service│  │Service│
│   A   │  │   B   │  │   C   │
└───┬───┘  └───┬───┘  └───┬───┘
    │          │          │
    └──────────┼──────────┘
               │
       ┌───────┴───────┐
       │  API Gateway  │
       └───────┬───────┘
               │
       ┌───────┴───────┐
       │    Client     │
       └───────────────┘

Characteristics:

  • Independent services
  • Own databases
  • API communication
  • Loose coupling
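
The API gateway's routing role can be sketched as a prefix-to-service map (illustrative only; the route prefixes and service names are made up):

```python
class ApiGateway:
    def __init__(self):
        self.routes = {}

    def register(self, prefix, service):
        self.routes[prefix] = service

    def route(self, path, request):
        # Longest-prefix match so /orders/123 goes to the orders service
        for prefix in sorted(self.routes, key=len, reverse=True):
            if path.startswith(prefix):
                return self.routes[prefix](request)
        raise KeyError(f"no service for {path}")

gw = ApiGateway()
gw.register("/orders", lambda req: {"service": "orders"})
gw.register("/users", lambda req: {"service": "users"})
gw.route("/orders/123", {})   # handled by the orders service
```

Clients talk only to the gateway, so individual services can move or change without clients noticing.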

Consistency Models

Strong Consistency

Every read returns the most recent write; all nodes appear to hold the same data at the same time.

```python
class StrongConsistencyStore:
    def write(self, key, value):
        # Synchronously replicate to all nodes
        for node in self.nodes:
            node.write(key, value)

        # Wait for acknowledgment from all nodes before returning
        self.wait_for_acks()

    def read(self, key):
        # Read from any node (all hold the same data)
        return self.nodes[0].read(key)
```

Eventual Consistency

The system converges to a consistent state over time, but reads may temporarily return stale data.

```python
class EventualConsistencyStore:
    def write(self, key, value):
        # Write to the primary node
        self.primary.write(key, value)

        # Asynchronously replicate to the other nodes
        for replica in self.replicas:
            self.async_replicate(replica, key, value)

    def read(self, key):
        # Read from the nearest node (may briefly return stale data)
        return self.get_nearest_node().read(key)
```

Causal Consistency

Operations that are causally related are seen in the same order by all nodes.

```python
class CausalConsistencyStore:
    def write(self, key, value, dependencies):
        # Tag the operation with its causal dependencies
        operation = {
            'key': key,
            'value': value,
            'dependencies': dependencies,
            'timestamp': self.get_logical_clock()
        }

        self.broadcast_operation(operation)
```

Consensus Algorithms

Paxos

Algorithm for achieving consensus in distributed systems.

Roles:

  • Proposer: Proposes values
  • Acceptor: Votes on proposals
  • Learner: Learns the decided value

Phases:

  1. Prepare: The proposer asks acceptors to promise not to accept lower-numbered proposals
  2. Accept: The proposer asks acceptors to accept a value; an acceptor agrees unless it has promised a higher-numbered proposal
  3. Learn: Acceptors notify learners of the accepted value
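
The acceptor's side of the two phases above can be sketched for a single Paxos instance (no networking; the class shape is an illustrative assumption):

```python
class Acceptor:
    def __init__(self):
        self.promised = 0      # highest proposal number promised
        self.accepted = None   # (number, value) last accepted, if any

    def prepare(self, n):
        # Phase 1: promise to ignore proposals numbered below n,
        # reporting any previously accepted value back to the proposer
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered promise was made
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

a = Acceptor()
ok, _ = a.prepare(1)   # promise proposal 1
a.accept(1, "x")       # value accepted
a.prepare(1)           # rejected: a new proposal must use a higher number
```

A proposer that gathers promises from a majority of acceptors, then acceptances from a majority, has achieved consensus on the value.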

Raft

A consensus algorithm designed to be easier to understand and implement than Paxos.

States:

  • Follower: Passive state, responds to leaders
  • Candidate: Campaigning to become leader
  • Leader: Handles all client requests

Election Process:

```python
class RaftNode:
    def __init__(self):
        self.state = 'follower'
        self.term = 0
        self.voted_for = None
        self.log = []

    def start_election(self):
        self.state = 'candidate'
        self.term += 1
        self.voted_for = self.id

        votes = 1  # Vote for self
        for peer in self.peers:
            if peer.request_vote(self.term, self.id):
                votes += 1

        # Majority of the whole cluster (this node plus its peers)
        if votes > (len(self.peers) + 1) / 2:
            self.become_leader()
```

Distributed Databases

Sharding

Partition data across multiple nodes.

Sharding Strategies:

  • Range-based: Partition by key ranges
  • Hash-based: Partition by hash of key
  • Directory-based: Central directory maps keys to nodes

```python
class ShardedDatabase:
    def __init__(self, shards):
        self.shards = shards

    def get_shard(self, key):
        # Hash-based sharding: map the key's hash to a shard index.
        # A consistent-hash ring (shown later) avoids remapping most
        # keys when shards are added or removed.
        return self.shards[hash(key) % len(self.shards)]

    def get(self, key):
        return self.get_shard(key).get(key)

    def put(self, key, value):
        self.get_shard(key).put(key, value)
```

Replication

Maintain copies of data on multiple nodes.

Replication Strategies:

  • Master-Slave: One master, multiple slaves
  • Multi-Master: Multiple masters
  • Leaderless: No single master

```python
class ReplicatedDatabase:
    def __init__(self, nodes, replication_factor=3):
        self.nodes = nodes
        self.replication_factor = replication_factor

    def write(self, key, value):
        # Write to every replica responsible for this key
        target_nodes = self.get_replication_nodes(key)
        for node in target_nodes:
            node.write(key, value)

        # Wait until a quorum of replicas has acknowledged
        self.wait_for_quorum(target_nodes)

    def read(self, key):
        # Read from all replicas and reconcile any divergent values
        target_nodes = self.get_replication_nodes(key)
        responses = [node.read(key) for node in target_nodes]
        return self.resolve_conflicts(responses)
```

Message Passing

Remote Procedure Calls (RPC)

Call procedures on remote machines as if they were local.

```python
class RPCClient:
    def __init__(self, server_address):
        self.server_address = server_address

    def call(self, method, *args):
        request = {
            'method': method,
            'args': args,
            'id': self.generate_id()
        }

        response = self.send_request(request)
        return response['result']

# Usage
client = RPCClient('server:8080')
result = client.call('add', 5, 3)  # Returns 8
```

Message Queues

Asynchronous communication between components.

```python
class MessageQueue:
    def __init__(self):
        self.subscribers = {}

    def publish(self, topic, message):
        # Deliver the message to every subscriber callback for the topic
        for subscriber in self.subscribers.get(topic, []):
            subscriber(message)

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

# Usage
mq = MessageQueue()
mq.subscribe('orders', process_order)
mq.publish('orders', {'id': 123, 'amount': 100})
```

Fault Tolerance

Redundancy

Multiple copies of components to handle failures.

Failover

Automatic switching to backup components.

```python
class FailoverService:
    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.current = primary

    def call(self, method, *args):
        try:
            return self.current.call(method, *args)
        except Exception:
            if self.current is self.primary:
                # Fail over to the backup and retry once
                self.current = self.backup
                return self.current.call(method, *args)
            raise
```

Circuit Breaker

Stop calling a failing dependency so its failures do not cascade through the system.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'CLOSED'

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            if self._should_attempt_reset():
                self.state = 'HALF_OPEN'
            else:
                raise Exception("Circuit breaker is OPEN")

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _should_attempt_reset(self):
        # Allow a trial call once the timeout has elapsed
        return time.time() - self.last_failure_time >= self.timeout

    def _on_success(self):
        self.failure_count = 0
        self.state = 'CLOSED'

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = 'OPEN'
```

Distributed Caching

Consistent Hashing

Distribute cache keys evenly across nodes.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, replicas=100):
        self.ring = {}
        self.sorted_keys = []

        # Place several virtual nodes per physical node for even spread
        for node in nodes:
            for i in range(replicas):
                key = self.hash(f"{node}:{i}")
                self.ring[key] = node
                self.sorted_keys.append(key)

        self.sorted_keys.sort()

    def hash(self, value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_node(self, key):
        if not self.ring:
            return None

        # First ring position clockwise from the key's hash
        hash_key = self.hash(key)
        idx = bisect.bisect_right(self.sorted_keys, hash_key)
        if idx == len(self.sorted_keys):
            idx = 0

        return self.ring[self.sorted_keys[idx]]
```

Monitoring Distributed Systems

Distributed Tracing

Track requests as they flow through multiple services.

```python
import time

class DistributedTracer:
    def __init__(self):
        self.spans = []

    def start_span(self, operation_name, parent_span=None):
        span = {
            # Child spans inherit the parent's trace ID so the whole
            # request shares one trace
            'trace_id': parent_span['trace_id'] if parent_span
                        else self.generate_trace_id(),
            'span_id': self.generate_span_id(),
            'parent_span_id': parent_span['span_id'] if parent_span else None,
            'operation_name': operation_name,
            'start_time': time.time()
        }

        self.spans.append(span)
        return span

    def finish_span(self, span):
        span['end_time'] = time.time()
        span['duration'] = span['end_time'] - span['start_time']
```

Health Checks

Monitor the health of distributed components.

```python
class HealthChecker:
    def __init__(self, services):
        self.services = services

    def check_all(self):
        results = {}
        for service_name, service in self.services.items():
            try:
                health = service.health_check()
                results[service_name] = {
                    'status': 'healthy',
                    'details': health
                }
            except Exception as e:
                results[service_name] = {
                    'status': 'unhealthy',
                    'error': str(e)
                }
        return results
```

Best Practices

  1. Design for Failure: Assume components will fail
  2. Use Idempotent Operations: Safe retry mechanisms
  3. Implement Timeouts: Prevent hanging operations
  4. Monitor Everything: Comprehensive observability
  5. Use Circuit Breakers: Prevent cascading failures
  6. Implement Retry Logic: Handle transient failures
  7. Design for Scalability: Architecture should support growth
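
Several of the practices above (timeouts, retries, idempotency) combine naturally in a retry helper with exponential backoff; a minimal sketch, with the function name and delays as illustrative assumptions:

```python
import time

def retry(func, attempts=3, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure
            # Back off exponentially between attempts to avoid
            # hammering a struggling service
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}

def flaky():
    # Simulates a transient network failure on the first two calls
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = retry(flaky)  # succeeds on the third attempt
```

Retrying like this is only safe when the retried operation is idempotent, which is why the two practices are listed together.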

Distributed systems enable building powerful, scalable applications but require careful design to handle the complexities of network communication, consistency, and fault tolerance.