As with the DRF paper, this addresses a real problem: heterogeneous scheduling in large compute clusters will only become more pressing as clusters grow larger and workloads more diverse.
The main idea is to run multiple, completely independent schedulers in parallel, each with access to the entire cluster. A scheduler fully claims a resource by committing a transaction against the shared cluster state, but schedulers are still allowed to preempt each other.
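To make the transaction idea concrete, here is a minimal Python sketch of shared-state scheduling, assuming each machine carries a version number that is bumped on every successful commit. A scheduler plans against a private snapshot and then tries to commit; the commit aborts as a whole if any machine it touched has changed in the meantime. `CellState`, `Machine`, `schedule`, and `commit` are hypothetical names for illustration, not Omega's actual interfaces, and preemption is not modeled.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Machine:
    free_cpus: int
    version: int = 0  # hypothetical: bumped on every successful commit touching this machine


@dataclass
class CellState:
    machines: dict = field(default_factory=dict)  # name -> Machine (shared state)

    def snapshot(self):
        # Each scheduler plans against its own private copy of the shared state.
        return copy.deepcopy(self.machines)

    def commit(self, claims):
        """Atomically apply claims of the form {name: (cpus, seen_version)}.

        All-or-nothing: if any machine changed since the snapshot, or no longer
        has enough free capacity, nothing is applied and the commit aborts.
        """
        for name, (cpus, seen_version) in claims.items():
            m = self.machines[name]
            if m.version != seen_version or m.free_cpus < cpus:
                return False  # conflict: transaction aborts
        for name, (cpus, _) in claims.items():
            m = self.machines[name]
            m.free_cpus -= cpus
            m.version += 1
        return True


def schedule(cell, cpus_needed):
    """Hypothetical first-fit scheduler: plan on a snapshot, return a claim."""
    snap = cell.snapshot()
    for name, m in snap.items():
        if m.free_cpus >= cpus_needed:
            return {name: (cpus_needed, m.version)}
    return None


if __name__ == "__main__":
    cell = CellState({"m1": Machine(free_cpus=4), "m2": Machine(free_cpus=4)})
    batch_claim = schedule(cell, 3)    # both schedulers plan on identical state
    service_claim = schedule(cell, 3)
    print(cell.commit(batch_claim))    # True
    print(cell.commit(service_claim))  # False: m1's version has moved on
```

The losing scheduler simply takes a fresh snapshot and plans again; in this toy state it would then land on `m2`.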
This differs from previous work partly because clusters are reaching a scale where the simplest design, a monolithic scheduler, is no longer viable. It also differs from more recent systems like Mesos: it is tailored specifically to Google's business-driven priorities and relies heavily on every scheduler in the system being written to cooperate with the others. That is not the usual view, which tends to put more emphasis on fairness between schedulers and users.
The paper identifies some hard trade-offs. For one, letting all schedulers try to claim resources in parallel means two of them sometimes go after the same resource; the loser's transaction aborts and its work has to be redone, which adds overhead. There is also an enormous trade-off between faster, more distributed scheduling and less centralized control. This works fine for Google, since they have written their schedulers to cooperate, but that will not always be the case.
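The abort overhead can be seen in a deliberately pessimistic toy model, a hypothetical sketch rather than the paper's simulator: every scheduler uses the same first-fit policy, all pending schedulers snapshot the state at the same moment, and their commits then arrive one after another. Only the first commit on a contended machine succeeds; the rest abort and repeat their scheduling work in the next round. `run_rounds` and `first_fit` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    free: int
    version: int = 0  # bumped on every successful commit


def first_fit(snapshot, need):
    # Hypothetical policy: every scheduler prefers the same machine,
    # which is the worst case for contention.
    for name, (free, version) in snapshot.items():
        if free >= need:
            return name, version
    return None


def run_rounds(num_schedulers, num_machines, need=1):
    """Return (commit_attempts, aborts) until every scheduler places its job or gives up."""
    machines = {f"m{i}": Machine(free=1) for i in range(num_machines)}
    pending = list(range(num_schedulers))  # schedulers that still have a job to place
    attempts = aborts = 0
    while pending:
        # All pending schedulers snapshot the shared state "at the same time".
        snap = {n: (m.free, m.version) for n, m in machines.items()}
        choices = {s: first_fit(snap, need) for s in pending}
        still_pending = []
        for s in pending:  # commits arrive one after another
            attempts += 1
            choice = choices[s]
            if choice is None:
                continue  # no capacity left anywhere: give up (not retried here)
            name, seen_version = choice
            m = machines[name]
            if m.version == seen_version and m.free >= need:
                m.free -= need
                m.version += 1
            else:
                aborts += 1
                still_pending.append(s)  # wasted work: plan again next round
        pending = still_pending
    return attempts, aborts


if __name__ == "__main__":
    print(run_rounds(4, 4))  # (10, 6)
```

With four schedulers racing for four single-slot machines, placing four jobs takes 10 commit attempts, 6 of them aborts, whereas a single scheduler placing the same four jobs one at a time would need 4 attempts and no aborts. Randomizing the placement choice would spread schedulers across machines and reduce contention in this model.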
I don't see this being influential outside of Google in 10 years: while the concepts it introduces are certainly interesting, I'm not sure they will extend far beyond Google's walls.