Distributed Systems
Distributed systems are collections of independent computers that appear to users as a single coherent system. They enable building scalable, fault-tolerant, and globally accessible applications.
What is a Distributed System?
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another.
Characteristics
Concurrency
Multiple components execute simultaneously.
Lack of Global Clock
No single, globally agreed-upon time reference.
Independent Failures
Individual components can fail independently; the rest of the system must be designed to keep operating.
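Because there is no global clock, distributed systems often order events with logical clocks instead of wall-clock time. Below is a minimal sketch of a Lamport clock; the class and method names are illustrative, not from any particular library:

```python
class LamportClock:
    """Logical clock: orders events without a shared physical clock."""
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the clock by one.
        self.time += 1
        return self.time

    def send(self):
        # Attach the current timestamp to an outgoing message.
        return self.tick()

    def receive(self, msg_time):
        # Merge rule: take the max of local and message time, then advance.
        self.time = max(self.time, msg_time) + 1
        return self.time

# Node A performs a local event, then sends a message to node B.
a, b = LamportClock(), LamportClock()
a.tick()
ts = a.send()     # A's clock is now 2
b.receive(ts)     # B jumps to 3, preserving the causal order
```

If event X causally precedes event Y, X's Lamport timestamp is always smaller than Y's, which is exactly the property the "lack of a global clock" would otherwise make hard to guarantee.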
Challenges
Network Communication
- Latency: Delays in message delivery
- Bandwidth: Limited data transfer capacity
- Reliability: Messages can be lost, duplicated, or reordered
Consistency
Ensuring all nodes have the same view of data.
Fault Tolerance
Keeping the system operational despite component failures.
Coordination
Managing interactions between distributed components.
Architectural Patterns
Client-Server Architecture
┌──────┐    ┌──────┐    ┌──────┐
│Client│───►│Server│◄───│Client│
└──────┘    └──────┘    └──────┘
Characteristics:
- Centralized server
- Multiple clients
- Request-response model
Use Cases:
- Web applications
- Database systems
- File servers
Peer-to-Peer Architecture
┌─────┐    ┌─────┐    ┌─────┐
│Peer │◄──►│Peer │◄──►│Peer │
│  A  │    │  B  │    │  C  │
└─────┘    └─────┘    └─────┘
Characteristics:
- No central server
- All nodes are equal
- Direct communication between peers
Use Cases:
- File sharing (BitTorrent)
- Blockchain networks
- Distributed databases
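A common way peers spread data without a central server is gossip: each round, every peer exchanges state with a randomly chosen neighbor. The sketch below is illustrative (the `Peer` class and push-pull merge are assumptions, not a specific protocol):

```python
import random

class Peer:
    """Each peer holds a set of known items and gossips with neighbors."""
    def __init__(self, name):
        self.name = name
        self.known = set()
        self.neighbors = []

    def gossip_round(self):
        # Push-pull gossip: exchange and merge state with one random neighbor.
        if not self.neighbors:
            return
        other = random.choice(self.neighbors)
        merged = self.known | other.known
        self.known = other.known = merged

# Three fully connected peers; one learns about an update.
peers = [Peer(n) for n in "ABC"]
for p in peers:
    p.neighbors = [q for q in peers if q is not p]
peers[0].known.add("update-1")

for _ in range(5):          # a few rounds suffice for full propagation here
    for p in peers:
        p.gossip_round()
```

No peer is special: any node can originate an update, and the data still reaches everyone even if individual exchanges are randomly ordered.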
Microservices Architecture
┌───────┐  ┌───────┐  ┌───────┐
│Service│  │Service│  │Service│
│   A   │  │   B   │  │   C   │
└───┬───┘  └───┬───┘  └───┬───┘
    │          │          │
    └──────────┼──────────┘
               │
        ┌──────┴──────┐
        │ API Gateway │
        └──────┬──────┘
               │
        ┌──────┴──────┐
        │   Client    │
        └─────────────┘
Characteristics:
- Independent services
- Own databases
- API communication
- Loose coupling
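The API gateway in the diagram routes each client request to the service that owns it. A minimal sketch, assuming prefix-based routing and in-process callables standing in for real services:

```python
class ApiGateway:
    """Routes requests to the service owning the path prefix."""
    def __init__(self, routes):
        self.routes = routes  # prefix -> service callable

    def handle(self, path, payload=None):
        for prefix, service in self.routes.items():
            if path.startswith(prefix):
                return service(path, payload)
        return {"status": 404}

# Hypothetical services; each could run in its own process with its own database.
def user_service(path, payload):
    return {"status": 200, "service": "users"}

def order_service(path, payload):
    return {"status": 200, "service": "orders"}

gateway = ApiGateway({"/users": user_service, "/orders": order_service})
resp = gateway.handle("/orders/123")   # routed to order_service
```

Because services are only reachable through the gateway's routes, each one can be deployed, scaled, or replaced independently — the loose coupling listed above.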
Consistency Models
Strong Consistency
All nodes see the same data at the same time.
```python
class StrongConsistencyStore:
    def write(self, key, value):
        # Synchronously replicate to all nodes
        for node in self.nodes:
            node.write(key, value)

        # Wait for acknowledgment from every node before returning
        self.wait_for_acks()

    def read(self, key):
        # Read from any node (all hold the same data)
        return self.nodes[0].read(key)
```
Eventual Consistency
System will become consistent over time, but may have temporary inconsistencies.
```python
class EventualConsistencyStore:
    def write(self, key, value):
        # Write to the primary node
        self.primary.write(key, value)

        # Asynchronously replicate to the other nodes
        for replica in self.replicas:
            self.async_replicate(replica, key, value)

    def read(self, key):
        # Read from the nearest node (may be slightly stale)
        return self.get_nearest_node().read(key)
```
Causal Consistency
Operations that are causally related are seen in the same order by all nodes.
```python
class CausalConsistencyStore:
    def write(self, key, value, dependencies):
        # Include causal dependencies so nodes can apply operations in order
        operation = {
            'key': key,
            'value': value,
            'dependencies': dependencies,
            'timestamp': self.get_logical_clock()
        }

        self.broadcast_operation(operation)
```
Consensus Algorithms
Paxos
Algorithm for achieving consensus in distributed systems.
Roles:
- Proposer: Proposes values
- Acceptor: Votes on proposals
- Learner: Learns the decided value
Phases:
- Prepare: The proposer asks acceptors to promise to reject proposals with lower numbers and to report any value they have already accepted
- Accept: The proposer asks acceptors to accept its numbered value
- Learn: Acceptors notify learners of the accepted value
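The acceptor side of these rules is compact enough to sketch. This is a simplified single-decree acceptor, not a complete Paxos implementation (no networking, no learners, and the proposer logic is reduced to a majority check):

```python
class PaxosAcceptor:
    """Single-decree acceptor: just the promise/accept rules."""
    def __init__(self):
        self.promised = -1          # highest proposal number promised
        self.accepted_n = -1        # number of the accepted proposal, if any
        self.accepted_value = None

    def prepare(self, n):
        # Phase 1: promise to ignore proposals numbered below n,
        # reporting any value already accepted.
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, None, None

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered promise was made meanwhile.
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False

acceptors = [PaxosAcceptor() for _ in range(3)]
# A proposer that gathers promises from a majority may then send accepts.
promises = [a.prepare(1) for a in acceptors]
if sum(ok for ok, _, _ in promises) > len(acceptors) // 2:
    accepted = sum(a.accept(1, "value-A") for a in acceptors)
```

A later proposer running `prepare(2)` would learn that `"value-A"` was already accepted and would be obliged to re-propose that same value, which is how Paxos keeps the decision stable.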
Raft
A consensus algorithm designed to be easier to understand and implement than Paxos.
States:
- Follower: Passive state, responds to leaders
- Candidate: Campaigning to become leader
- Leader: Handles all client requests
Election Process:
```python
class RaftNode:
    def __init__(self):
        self.state = 'follower'
        self.term = 0
        self.voted_for = None
        self.log = []

    def start_election(self):
        self.state = 'candidate'
        self.term += 1
        self.voted_for = self.id

        votes = 1  # Vote for self
        for peer in self.peers:
            if peer.request_vote(self.term, self.id):
                votes += 1

        # Majority of the full cluster (peers plus self)
        if votes > (len(self.peers) + 1) / 2:
            self.become_leader()
```
Distributed Databases
Sharding
Partition data across multiple nodes.
Sharding Strategies:
- Range-based: Partition by key ranges
- Hash-based: Partition by hash of key
- Directory-based: Central directory maps keys to nodes
```python
class ShardedDatabase:
    def __init__(self, shards):
        self.shards = shards

    def get_shard(self, key):
        # Hash-based sharding: the key's hash picks a shard index
        return self.shards[hash(key) % len(self.shards)]

    def get(self, key):
        shard = self.get_shard(key)
        return shard.get(key)

    def put(self, key, value):
        shard = self.get_shard(key)
        shard.put(key, value)
```
Replication
Maintain copies of data on multiple nodes.
Replication Strategies:
- Master-Slave: One master accepts writes; replicas serve reads
- Multi-Master: Several nodes accept writes, requiring conflict resolution
- Leaderless: Any replica can accept reads and writes, typically coordinated by quorums
```python
class ReplicatedDatabase:
    def __init__(self, nodes, replication_factor=3):
        self.nodes = nodes
        self.replication_factor = replication_factor

    def write(self, key, value):
        # Write to multiple nodes
        target_nodes = self.get_replication_nodes(key)

        for node in target_nodes:
            node.write(key, value)

        # Wait for a quorum of acknowledgments
        self.wait_for_quorum(target_nodes)

    def read(self, key):
        # Read from multiple nodes for consistency
        target_nodes = self.get_replication_nodes(key)
        responses = []

        for node in target_nodes:
            response = node.read(key)
            responses.append(response)

        # Resolve conflicts if any
        return self.resolve_conflicts(responses)
```
Message Passing
Remote Procedure Calls (RPC)
Call procedures on remote machines as if they were local.
```python
class RPCClient:
    def __init__(self, server_address):
        self.server_address = server_address

    def call(self, method, *args):
        request = {
            'method': method,
            'args': args,
            'id': self.generate_id()
        }

        response = self.send_request(request)
        return response['result']

# Usage
client = RPCClient('server:8080')
result = client.call('add', 5, 3)  # Returns 8
```
Message Queues
Asynchronous communication between components.
```python
class MessageQueue:
    def __init__(self):
        self.subscribers = {}

    def publish(self, topic, message):
        # Deliver the message to every subscriber of the topic
        for subscriber in self.subscribers.get(topic, []):
            subscriber.receive(message)

    def subscribe(self, topic, callback):
        if topic not in self.subscribers:
            self.subscribers[topic] = []
        self.subscribers[topic].append(callback)

# Usage
mq = MessageQueue()
mq.subscribe('orders', process_order)
mq.publish('orders', {'id': 123, 'amount': 100})
```
Fault Tolerance
Redundancy
Multiple copies of components to handle failures.
Failover
Automatic switching to backup components.
```python
class FailoverService:
    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.current = primary

    def call(self, method, *args):
        try:
            return self.current.call(method, *args)
        except Exception:
            if self.current == self.primary:
                # Fail over to the backup and retry once
                self.current = self.backup
                return self.current.call(method, *args)
            else:
                raise
```
Circuit Breaker
Prevent cascading failures.
```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'CLOSED'

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            if self._should_attempt_reset():
                self.state = 'HALF_OPEN'
            else:
                raise Exception("Circuit breaker is OPEN")

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _should_attempt_reset(self):
        # Allow a trial call once the timeout has elapsed
        return time.time() - self.last_failure_time >= self.timeout

    def _on_success(self):
        self.failure_count = 0
        self.state = 'CLOSED'

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = 'OPEN'
```
Distributed Caching
Consistent Hashing
Distribute cache keys evenly across nodes.
```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, replicas=100):
        self.ring = {}
        self.sorted_keys = []

        for node in nodes:
            # Virtual nodes (replicas) smooth out the key distribution
            for i in range(replicas):
                key = self.hash(f"{node}:{i}")
                self.ring[key] = node
                self.sorted_keys.append(key)

        self.sorted_keys.sort()

    def hash(self, value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_node(self, key):
        if not self.ring:
            return None

        # Walk clockwise to the first virtual node at or after the key's hash
        hash_key = self.hash(key)
        idx = bisect.bisect_right(self.sorted_keys, hash_key)

        if idx == len(self.sorted_keys):
            idx = 0  # wrap around the ring

        return self.ring[self.sorted_keys[idx]]
```
Monitoring Distributed Systems
Distributed Tracing
Track requests as they flow through multiple services.
```python
import time

class DistributedTracer:
    def __init__(self):
        self.spans = []

    def start_span(self, operation_name, parent_span=None):
        span = {
            # Child spans inherit the parent's trace ID
            'trace_id': parent_span['trace_id'] if parent_span
                        else self.generate_trace_id(),
            'span_id': self.generate_span_id(),
            'parent_span_id': parent_span['span_id'] if parent_span else None,
            'operation_name': operation_name,
            'start_time': time.time()
        }

        self.spans.append(span)
        return span

    def finish_span(self, span):
        span['end_time'] = time.time()
        span['duration'] = span['end_time'] - span['start_time']
```
Health Checks
Monitor the health of distributed components.
```python
class HealthChecker:
    def __init__(self, services):
        self.services = services

    def check_all(self):
        results = {}
        for service_name, service in self.services.items():
            try:
                health = service.health_check()
                results[service_name] = {
                    'status': 'healthy',
                    'details': health
                }
            except Exception as e:
                results[service_name] = {
                    'status': 'unhealthy',
                    'error': str(e)
                }
        return results
```
Best Practices
- Design for Failure: Assume components will fail
- Use Idempotent Operations: Safe retry mechanisms
- Implement Timeouts: Prevent hanging operations
- Monitor Everything: Comprehensive observability
- Use Circuit Breakers: Prevent cascading failures
- Implement Retry Logic: Handle transient failures
- Design for Scalability: Architecture should support growth
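Several of these practices combine naturally in a retry helper. The sketch below is one illustrative way to implement retries with exponential backoff and jitter; it is only safe when the wrapped operation is idempotent:

```python
import random
import time

def call_with_retry(func, retries=3, base_delay=0.1):
    """Retry a flaky, idempotent operation with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # exhausted retries: surface the failure
            # Exponential backoff with jitter, so many clients
            # retrying at once do not hammer the service in lockstep.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# A simulated transient failure: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = call_with_retry(flaky, retries=3, base_delay=0.01)
```

In production this would typically be combined with per-attempt timeouts and a circuit breaker, so that persistent failures stop consuming resources instead of being retried forever.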
Distributed systems enable building powerful, scalable applications but require careful design to handle the complexities of network communication, consistency, and fault tolerance.

