Parallel Data Laboratory Talk

— 1:00pm

Location:
Virtual Presentation - ET - Remote Access - Zoom

Speaker:
BIBEK WAGLE, Senior Member of Technical Staff, Oracle
https://www.linkedin.com/in/bibekwagle

Designing for consensus in data plane

Fault tolerance is an important facet of a highly available distributed system, and replicated state machines are a common way to build fault-tolerant systems. Raft is a popular consensus algorithm for state machine replication, largely because it is easy to understand and implement. Raft is based on the idea of a "leader" in the cluster: one node is elected leader and is responsible for replying to client requests. A reply can only be sent once the corresponding log entry has been replicated to a majority of the servers. Raft replicates state by exchanging messages, typically over a transport such as TCP. A vanilla Raft implementation may not be suitable for high-velocity data planes and applications requiring low tail latencies. In this talk we will focus on improvements to the original Raft algorithm such as:

  1. RDMA for log replication in the Raft cluster: RDMA improves throughput and performance because it allows direct access to remote memory, which frees up CPU cycles on the remote node. Zero-copy RDMA transfers data directly from the wire to application memory.
  2. Member Sets to reduce heartbeat messages in a multi-Raft cluster: A physical node can host multiple Raft clusters with multiple leaders, and a single node can participate in thousands of Raft clusters at a time. The amount of network traffic for heartbeats increases as the number of Raft clusters increases. Member Sets reduce the number of heartbeat and election messages that need to be sent by addressing a member set rather than the individual members of each Raft cluster.
  3. NVMe-oF for direct-to-disk log replication: In-memory log replication requires multi-way replication to prevent data loss. Instead of storing logs in memory, NVMe-oF can be used to replicate logs directly to storage. Aside from improving recovery, writing logs directly to disk reduces the number of replications needed.
  4. Leveraging synchronized clocks for increased availability: A leader in the Raft cluster is assumed dead when no heartbeat is received for a specified time plus the clock uncertainty period. Precisely synchronized clocks reduce the clock uncertainty period, which in turn reduces the total time between a leader's death and the election of a new leader. Aside from increasing availability by reducing downtime, embedding synchronized clock timestamps in the log record allows us to relax the constraint that reads be performed only on the leader, which improves read scaling.
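
The arithmetic behind items 2 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's implementation: all function names and numbers are hypothetical, and the member-set model (one message per distinct member set, covering every Raft cluster that shares that set) is an assumption made here for the sake of the example.

```python
def heartbeats_per_interval(clusters_led: int, followers_per_cluster: int) -> int:
    """Baseline: a leader sends one heartbeat to each follower in each cluster it leads."""
    return clusters_led * followers_per_cluster


def heartbeats_with_member_sets(distinct_member_sets: int) -> int:
    """Assumed model: one message per distinct member set per interval,
    shared by all clusters whose members form that set."""
    return distinct_member_sets


def leader_declared_dead_after_ms(heartbeat_timeout_ms: float,
                                  clock_uncertainty_ms: float) -> float:
    """A leader is assumed dead after the timeout plus the clock uncertainty period."""
    return heartbeat_timeout_ms + clock_uncertainty_ms


if __name__ == "__main__":
    # Hypothetical node leading 1000 clusters with 2 followers each,
    # where the clusters are spread over only 10 distinct member sets.
    print("per-cluster heartbeats:", heartbeats_per_interval(1000, 2))
    print("member-set heartbeats: ", heartbeats_with_member_sets(10))

    # Hypothetical 150 ms timeout with loosely vs. tightly synchronized clocks.
    print("detect (10 ms uncertainty):", leader_declared_dead_after_ms(150, 10), "ms")
    print("detect (1 ms uncertainty): ", leader_declared_dead_after_ms(150, 1), "ms")
```

With these made-up figures, heartbeat traffic scales with the number of member sets rather than the number of clusters, and tightening clock uncertainty shortens the window before a new election can start.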

Bibek Wagle is a senior member of technical staff at Oracle. His interests include parallel and distributed computing, asynchronous task-based programming, and distributed runtime systems. He has a Ph.D. in computer science from Louisiana State University.

Zoom Participation. See announcement.

Event Website:
https://pdl.cmu.edu/talk-series/2023/082423.shtml

