Tuesday, October 27, 2015

Review of "Scaling Distributed Machine Learning with the Parameter Server"

Machine learning is becoming more ubiquitous and more complex, and it runs over ever-larger datasets. Being able to run these algorithms (some of which don't lend themselves well to large-scale parallelism) quickly and in a distributed fashion is increasingly important, so this is definitely a real problem.

The main idea is to use a centralized set of parameter servers which maintain the shared state of the system, coupled with a large number of worker nodes which actually perform the computation. Worker nodes push state updates to the servers and pull down the current state. The system relies heavily on the fact that machine learning algorithms are generally convergent and are not significantly disrupted by stale state, so they don't require synchronous state updates.
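To make the push/pull pattern concrete, here is a minimal single-process sketch in Python. This is my own illustration, not the paper's actual API: the ParameterServer and Worker classes, the push/pull names, the least-squares objective, and all the constants are assumptions.

```python
# Sketch of the parameter-server push/pull pattern (illustrative, not the paper's API).
import numpy as np

class ParameterServer:
    """Holds the shared model state and applies gradient updates from workers."""
    def __init__(self, dim, lr=0.1):
        self.weights = np.zeros(dim)
        self.lr = lr

    def push(self, gradient):
        # Apply a (possibly stale) gradient; no locking or synchronization shown.
        self.weights -= self.lr * gradient

    def pull(self):
        # Return a snapshot of the current weights.
        return self.weights.copy()

class Worker:
    """Computes gradients on its own data shard using pulled weights."""
    def __init__(self, data, labels):
        self.data, self.labels = data, labels

    def compute_gradient(self, weights):
        # Least-squares gradient on this worker's shard.
        preds = self.data @ weights
        return self.data.T @ (preds - self.labels) / len(self.labels)

# Each worker pulls state, computes a gradient, and pushes an update
# independently; in a real system these steps would overlap asynchronously,
# so workers would often be computing against slightly stale weights.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
server = ParameterServer(dim=2)
workers = []
for _ in range(4):
    X = rng.normal(size=(100, 2))
    workers.append(Worker(X, X @ true_w + rng.normal(scale=0.01, size=100)))

for step in range(50):
    for w in workers:
        grad = w.compute_gradient(server.pull())  # pull current state
        server.push(grad)                         # push an update

print(server.weights)  # approaches true_w
```

In the real system the servers also shard the (potentially huge) parameter vector across machines and use flexible consistency models rather than the fully sequential loop shown here.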

One trade-off they discuss is convergence rate versus per-iteration performance. As state becomes more stale, the algorithm converges more slowly, but tolerating more staleness lets workers do more computation asynchronously instead of waiting on synchronization.
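To illustrate the trade-off, here is a hedged toy experiment (again my own construction, not from the paper): a gradient-descent loop where each gradient is computed against weights that are `staleness` updates old. With a fixed update budget, the final error grows with staleness, which is the price paid for the extra asynchrony that staleness would buy in a real system.

```python
# Toy illustration of staleness vs. convergence (assumed setup, not from the paper).
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(500, 2))
y = X @ true_w + rng.normal(scale=0.01, size=500)

def run(staleness, steps=200, lr=0.05):
    weights = np.zeros(2)
    history = [weights.copy()]  # past versions of the weights
    for _ in range(steps):
        # Read weights from `staleness` updates ago (or the oldest available).
        stale = history[max(0, len(history) - 1 - staleness)]
        grad = X.T @ (X @ stale - y) / len(y)  # gradient at the stale weights
        weights = weights - lr * grad
        history.append(weights.copy())
    return np.linalg.norm(weights - true_w)

for s in (0, 2, 8):
    print(f"staleness={s}  error after 200 updates={run(s):.5f}")
```

With the same number of updates, larger staleness leaves the model further from the optimum; the system-level argument is that each of those updates is cheaper because workers never block waiting for fresh state.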

