The table below shows which search modes are available for each model type:
| | Single/Multi-Model | BLS | Ensemble | LLM |
|---|---|---|---|---|
| Brute | o | - | - | - |
| Quick | o | o | o | o |
| Optuna | o | o | o | o |
Multi-model concurrent search mode has the following limitations:
- Does not support detailed reporting, only summary reports
Multi-model concurrent search mode can be enabled by adding the `--run-config-profile-models-concurrently-enable` parameter to the CLI.
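For example, a minimal sketch of a CLI invocation (`model_A` and `model_B` are placeholder model names):

```bash
model-analyzer profile \
    --model-repository /path/to/model/repository/ \
    --profile-models model_A,model_B \
    --run-config-profile-models-concurrently-enable
```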
It uses Quick Search mode's hill-climbing algorithm to search all models' configuration spaces in parallel, looking for the maximal objective value within the specified constraints. Model Analyzer has observed positive outcomes towards finding the maximum objective value, with typical runtimes of around 20-30 minutes for a two- to three-model run (compared to the days it would take a brute force run to complete).
After it has found the best config(s), it will then sweep the top-N configurations found (specified by `--num-configs-per-model`) over the default concurrency range before generating the summary reports.
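The number of top configurations swept can also be set in the YAML config; a minimal sketch, assuming the CLI flag maps to the `num_configs_per_model` key following Model Analyzer's usual flag-to-config naming:

```yaml
run_config_profile_models_concurrently_enable: true
num_configs_per_model: 3  # sweep the top three configurations per model
```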
Note: The algorithm attempts to find the most fair and optimal result for all models by evaluating each model objective's gain/loss. In many cases this will result in the algorithm ranking a configuration higher even though it has a lower total combined throughput (if that was the objective), because it better balances the throughputs of all the models.
An example Model Analyzer YAML config that performs a multi-model search:
```yaml
model_repository: /path/to/model/repository/
run_config_profile_models_concurrently_enable: true
profile_models:
  - model_A
  - model_B
```
In addition to setting a model's objectives or constraints, in multi-model search mode you can also set a model's weighting. By default, each model has an equal weighting (value of 1), but in the YAML you can specify `weighting: <int>`, which will bias that model's objectives when evaluating for an optimal result.
An example where model A's objective gains (towards minimizing latency) will have three times the importance of model B's gains (towards maximizing throughput):
```yaml
model_repository: /path/to/model/repository/
run_config_profile_models_concurrently_enable: true
profile_models:
  model_A:
    weighting: 3
    objectives:
      perf_latency_p99: 1
  model_B:
    weighting: 1
    objectives:
      perf_throughput: 1
```
Profiling ensemble models has the following limitations:
- Only supports up to four composing models
- Composing models cannot be ensemble or BLS models
Ensemble models can be optimized using Quick Search mode's hill-climbing algorithm to search the composing models' configuration spaces in parallel, looking for the maximal objective value within the specified constraints. Model Analyzer has observed positive outcomes towards finding the maximum objective value, with runtimes under one hour for ensembles that contain up to four composing models (compared to the days it would take a brute force run to complete).
After Model Analyzer has found the best config(s), it will then sweep the top-N configurations found (specified by `--num-configs-per-model`) over the concurrency range before generating the summary reports.
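A minimal config sketch for profiling an ensemble (`ensemble_model` is a placeholder name, and the `run_config_search_mode` key is assumed to mirror the `--run-config-search-mode` CLI option):

```yaml
model_repository: /path/to/model/repository/
run_config_search_mode: quick  # Brute search is not supported for ensembles
profile_models:
  - ensemble_model  # placeholder; its composing models are searched in parallel
```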
Profiling BLS models has the following limitations:
- Only supports up to four composing models
- Composing models cannot be ensemble or BLS models
BLS models can be optimized using Quick Search mode's hill-climbing algorithm to search the BLS composing models' configuration spaces, as well as the BLS model's instance count, in parallel, looking for the maximal objective value within the specified constraints. Model Analyzer has observed positive outcomes towards finding the maximum objective value, with runtimes under one hour for BLS models that contain up to four composing models (compared to the days it would take a brute force run to complete).
After Model Analyzer has found the best config(s), it will then sweep the top-N configurations found (specified by `--num-configs-per-model`) over the concurrency range before generating the summary reports.
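A minimal sketch of a BLS profile invocation, assuming Model Analyzer's `--bls-composing-models` option for naming the composing models (all model names below are placeholders):

```bash
model-analyzer profile \
    --model-repository /path/to/model/repository/ \
    --profile-models bls_model \
    --bls-composing-models add_model,sub_model
```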
Profiling LLMs has the following limitations:
- Summary/detailed reports do not include the new LLM metrics (described below)
In order to profile LLMs you must tell Model Analyzer that the model type is LLM by setting `--model-type LLM` in the CLI/config file. You can specify CLI options to the GenAI-Perf tool using `genai_perf_flags`. See the GenAI-Perf CLI documentation for a list of the flags that can be specified.
LLMs can be optimized using either Quick or Brute search mode.
An example Model Analyzer YAML config for an LLM:
```yaml
model_repository: /path/to/model/repository/
model_type: LLM
client_protocol: grpc
genai_perf_flags:
  backend: vllm
  streaming: true
```
For LLMs, three new metrics are reported: Inter-token Latency, Time to First Token Latency, and Output Token Throughput.
These new metrics can be specified as either objectives or constraints.
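A sketch of how such an objective and constraint might look in the YAML config. The metric tag names below (`inter_token_latency_p99`, `output_token_throughput`) and the model name `my_llm` are assumptions; verify the exact tags against the Model Analyzer metrics documentation:

```yaml
model_repository: /path/to/model/repository/
model_type: LLM
profile_models:
  my_llm:  # placeholder model name
    objectives:
      inter_token_latency_p99: 1  # assumed metric tag; verify in the metrics docs
    constraints:
      output_token_throughput:
        min: 100  # assumed metric tag and units; verify in the metrics docs
```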
NOTE: In order to enable these new metrics you must enable `streaming` in `genai_perf_flags` and the client protocol must be set to gRPC.