Skip to content

Latest commit

 

History

History
34 lines (26 loc) · 1.42 KB

README.md

File metadata and controls

34 lines (26 loc) · 1.42 KB

Harmony: A scheduling framework for multiple machine learning training jobs running on distributed resources.

This module includes

  1. a parameter-server (PS) implementation with shared runtime for multiple concurrent jobs and
  2. a long-running master that receives multiple PS job submissions and
  3. a global job scheduler with plugable policy, and
  4. a distributed table abstraction for elastic management of in-memory data across containers.

How to build?

$ mvn clean install (-DskipTests)

Requirements

How to run?

# 1. start a long-running job-server
$ jobserver/bin/start_jobserver.sh -local false -num_executors 5 -executor_mem_size 128 -executor_num_cores 1

# 2. submit applications (Example usages are described in submit scripts.)
$ jobserver/bin/submit_[app].sh -input [file_path]
 (Common parameters: -num_mini_batches -max_num_epoch)
 (App-specific parameters: 
    NMF: -rank -step_size -decay_period -decay_rate
    MLR: -init_step_size -classes -features -features_per_partition -model_gaussian -lambda -decay_period -decay_rate
    LDA: -num_topics -num_vocabs
    etc.)

# 3. stop the job-server
$ jobserver/bin/stop_jobserver.sh