Palash Chauhan

Distributed Fault Tolerant Scheduler in Go

Graduate Course Project, UC San Diego · Distributed Systems · Scheduling

A fault-tolerant, low-latency cluster scheduler based on Sparrow, written in Go.

This project is a fault-tolerant distributed task scheduler written in Go and modeled on Berkeley's Sparrow. Like Sparrow, it places tasks across a cluster using decentralized, randomized sampling instead of a central scheduler with global state, which keeps scheduling latency low. It is built from the same set of components: schedulers, node monitors, executors, and frontends.

Where Sparrow assumes workers never fail, this project adds fault tolerance. It uses ZooKeeper for group membership of the worker nodes, detects worker failures, and recovers and reschedules the incomplete jobs that were running on a failed worker, so a failure does not silently lose work.
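The recovery path described above can be sketched as a membership diff. The sketch assumes that each worker registers an ephemeral znode under a group node, so that its entry disappears when its ZooKeeper session ends, and that a scheduler receives the current list of children whenever the group changes. That is a common ZooKeeper membership pattern, assumed here rather than taken from the project. To stay self-contained the code does not talk to ZooKeeper at all: `OnMembership` simply takes the member list as a slice, and `Tracker`, `Task`, `Assign`, and `MarkDone` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// Task is a hypothetical record of work handed to a worker.
type Task struct {
	ID   string
	Done bool
}

// Tracker remembers which tasks were placed on which worker and the
// last membership snapshot it saw.
type Tracker struct {
	assigned map[string][]Task // worker ID -> tasks placed there
	live     map[string]bool   // workers present in the last snapshot
}

// NewTracker starts from an initial membership snapshot.
func NewTracker(members []string) *Tracker {
	t := &Tracker{assigned: map[string][]Task{}, live: map[string]bool{}}
	for _, m := range members {
		t.live[m] = true
	}
	return t
}

// Assign records that task was placed on worker.
func (t *Tracker) Assign(worker string, task Task) {
	t.assigned[worker] = append(t.assigned[worker], task)
}

// MarkDone records that a task on worker has finished.
func (t *Tracker) MarkDone(worker, taskID string) {
	tasks := t.assigned[worker]
	for i := range tasks {
		if tasks[i].ID == taskID {
			tasks[i].Done = true
		}
	}
}

// OnMembership compares the new member list with the previous snapshot and
// returns the incomplete tasks of every worker that disappeared, so the
// caller can reschedule them. Assumption: members is what a children watch
// on the ZooKeeper group node would deliver.
func (t *Tracker) OnMembership(members []string) []Task {
	now := make(map[string]bool, len(members))
	for _, m := range members {
		now[m] = true
	}
	var failed []string
	for w := range t.live {
		if !now[w] {
			failed = append(failed, w)
		}
	}
	sort.Strings(failed) // deterministic order for the caller
	var requeue []Task
	for _, w := range failed {
		for _, task := range t.assigned[w] {
			if !task.Done {
				requeue = append(requeue, task)
			}
		}
		delete(t.assigned, w) // the failed worker no longer owns anything
	}
	t.live = now
	return requeue
}

func main() {
	t := NewTracker([]string{"w1", "w2", "w3"})
	t.Assign("w1", Task{ID: "job1/t0"})
	t.Assign("w2", Task{ID: "job1/t1"})
	t.Assign("w2", Task{ID: "job1/t2"})
	t.MarkDone("w2", "job1/t1")

	// w2 drops out of the group; only its unfinished task comes back.
	for _, task := range t.OnMembership([]string{"w1", "w3"}) {
		fmt.Println("reschedule", task.ID)
	}
}
```

In this run only `job1/t2` is returned for rescheduling, since `job1/t1` had already finished. A real scheduler would feed the returned tasks back into its normal placement path.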