[Doc] Fix typo in documentation (#14783)
Signed-off-by: yasu52 <tsuguro4649@gmail.com>
parent d47807ba08
commit 3fb17d26c8
@@ -4,9 +4,9 @@
 A Helm chart to deploy vLLM for Kubernetes

-Helm is a package manager for Kubernetes. It will help you to deploy vLLM on k8s and automate the deployment of vLLMm Kubernetes applications. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variables values.
+Helm is a package manager for Kubernetes. It will help you to deploy vLLM on k8s and automate the deployment of vLLM Kubernetes applications. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variable values.

-This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, steps for helm install and documentation on architecture and values file.
+This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, steps for helm installation and documentation on architecture and values file.

 ## Prerequisites
@@ -14,7 +14,7 @@ Alternatively, you can also deploy Kubernetes using [helm chart](https://docs.vl
 ## Pre-requisite

-Ensure that you have a running Kubernetes environment with GPU (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-medal GPU machine).
+Ensure that you have a running Kubernetes environment with GPU (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-metal GPU machine).

 ## Deployment using native K8s
@@ -419,7 +419,7 @@ List of `v_vec` for one thread
 which is also `V_VEC_SIZE` elements from `logits`. Overall, with
 multiple inner iterations, each warp will process one block of value
 tokens. And with multiple outer iterations, the whole context value
-tokens are processd
+tokens are processed

 ```cpp
 float accs[NUM_ROWS_PER_THREAD];
@@ -13,7 +13,7 @@ Ensure the v1 LLM Engine exposes a superset of the metrics available in v0.
 Metrics in vLLM can be categorized as follows:

 1. Server-level metrics: these are global metrics that track the state and performance of the LLM engine. These are typically exposed as Gauges or Counters in Prometheus.
-2. Request-level metrics: these are metrics that track the characteristics - e.g. size and timing - of individual requests. These are typically exposed as Histrograms in Prometheus, and are often the SLO that an SRE monitoring vLLM will be tracking.
+2. Request-level metrics: these are metrics that track the characteristics - e.g. size and timing - of individual requests. These are typically exposed as Histograms in Prometheus, and are often the SLO that an SRE monitoring vLLM will be tracking.

 The mental model is that the "Server-level Metrics" explain why the "Request-level Metrics" are what they are.
@@ -47,7 +47,7 @@ In v0, the following metrics are exposed via a Prometheus-compatible `/metrics`
 - `vllm:tokens_total` (Counter)
 - `vllm:iteration_tokens_total` (Histogram)
 - `vllm:time_in_queue_requests` (Histogram)
-- `vllm:model_forward_time_milliseconds` (Histogram
+- `vllm:model_forward_time_milliseconds` (Histogram)
 - `vllm:model_execute_time_milliseconds` (Histogram)
 - `vllm:request_params_n` (Histogram)
 - `vllm:request_params_max_tokens` (Histogram)
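As a side note on the metrics listed in this hunk: they are exported on the server's Prometheus-compatible `/metrics` endpoint, so a quick way to inspect them is to scrape that endpoint directly. A minimal sketch, assuming a vLLM server is already running on `localhost:8000` (that address and the choice of metrics to filter are assumptions):

```python
import requests

# Scrape the Prometheus-compatible endpoint of a running vLLM server
# and print a couple of the histograms mentioned above.
WANTED = ("vllm:iteration_tokens_total", "vllm:request_params_n")

metrics_text = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in metrics_text.splitlines():
    if line.startswith(WANTED):
        print(line)
```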
@@ -110,7 +110,7 @@ In addition to serving LoRA adapters at server startup, the vLLM server now supp
 LoRA adapters at runtime through dedicated API endpoints. This feature can be particularly useful when the flexibility
 to change models on-the-fly is needed.

-Note: Enabling this feature in production environments is risky as user may participate model adapter management.
+Note: Enabling this feature in production environments is risky as users may participate in model adapter management.

 To enable dynamic LoRA loading and unloading, ensure that the environment variable `VLLM_ALLOW_RUNTIME_LORA_UPDATING`
 is set to `True`. When this option is enabled, the API server will log a warning to indicate that dynamic loading is active.
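To make the runtime LoRA workflow described in this hunk concrete, here is a minimal sketch of calling the dynamic load/unload endpoints with `requests`. The server address, adapter name, and adapter path are assumptions; the endpoint paths follow vLLM's documented dynamic LoRA API:

```python
import requests

BASE = "http://localhost:8000"  # assumed server address

# Load an adapter at runtime (the server must be started with
# VLLM_ALLOW_RUNTIME_LORA_UPDATING=True). Name and path are hypothetical.
resp = requests.post(
    f"{BASE}/v1/load_lora_adapter",
    json={"lora_name": "sql_adapter", "lora_path": "/path/to/sql-lora"},
)
print(resp.status_code, resp.text)

# Later, unload the adapter by name.
resp = requests.post(f"{BASE}/v1/unload_lora_adapter", json={"lora_name": "sql_adapter"})
print(resp.status_code, resp.text)
```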
@@ -15,7 +15,7 @@ more are listed [here](#supported-models).
 By extracting hidden states, vLLM can automatically convert text generation models like [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B),
 [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) into embedding models,
-but they are expected be inferior to models that are specifically trained on embedding tasks.
+but they are expected to be inferior to models that are specifically trained on embedding tasks.

 ______________________________________________________________________
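For readers of the pooling-models passage in this hunk, a small sketch of what converting a generation model into an embedding model looks like in practice. The `task="embed"` argument, the `embed()` helper, and the model choice are assumptions based on recent vLLM releases and may differ across versions:

```python
from vllm import LLM

# Run a text-generation checkpoint as an embedding model by pooling its
# hidden states. Model name and the task argument are assumptions.
llm = LLM(model="meta-llama/Meta-Llama-3-8B", task="embed")

(result,) = llm.embed("vLLM can pool hidden states into a sentence embedding.")
print(len(result.outputs.embedding))  # embedding dimensionality (the model's hidden size)
```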
@@ -119,7 +119,7 @@ If you're observing the following error: `docker: Error response from daemon: Un
 ## Supported configurations

-The following configurations have been validated to be function with
+The following configurations have been validated to function with
 Gaudi2 devices. Configurations that are not listed may or may not work.

 - [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b)
@@ -19,7 +19,7 @@ Currently, there are no pre-built OpenVINO wheels.
 ### Build wheel from source

-First, install Python and ensure you lave the latest pip. For example, on Ubuntu 22.04, you can run:
+First, install Python and ensure you have the latest pip. For example, on Ubuntu 22.04, you can run:

 ```console
 sudo apt-get update -y
@@ -1,6 +1,6 @@
 # Installation

-vLLM initially supports basic model inferencing and serving on Intel GPU platform.
+vLLM initially supports basic model inference and serving on Intel GPU platform.

 :::{attention}
 There are no pre-built wheels or images for this device, so you must build vLLM from source.
@@ -65,7 +65,7 @@ $ docker run -it \
 ## Supported features

-XPU platform supports **tensor parallel** inference/serving and also supports **pipeline parallel** as a beta feature for online serving. We requires Ray as the distributed runtime backend. For example, a reference execution likes following:
+XPU platform supports **tensor parallel** inference/serving and also supports **pipeline parallel** as a beta feature for online serving. We require Ray as the distributed runtime backend. For example, a reference execution like following:

 ```console
 python -m vllm.entrypoints.openai.api_server \
@@ -78,6 +78,6 @@ python -m vllm.entrypoints.openai.api_server \
   -tp=8
 ```

-By default, a ray instance will be launched automatically if no existing one is detected in system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/online_serving/run_cluster.sh> helper script.
+By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/online_serving/run_cluster.sh> helper script.

-There are some new features coming with ipex-xpu 2.6, eg: **chunked prefill**, **V1 engine support**, **lora**, **MoE**, etc.
+There are some new features coming with ipex-xpu 2.6, e.g. **chunked prefill**, **V1 engine support**, **lora**, **MoE**, etc.
@@ -20,7 +20,7 @@ There is one edge case: if the model fits in a single node with multiple GPUs, b
 ## Running vLLM on a single node

-vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. Currently, we support [Megatron-LM's tensor parallel algorithm](https://arxiv.org/pdf/1909.08053.pdf). We manage the distributed runtime with either [Ray](https://github.com/ray-project/ray) or python native multiprocessing. Multiprocessing can be used when deploying on a single node, multi-node inferencing currently requires Ray.
+vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. Currently, we support [Megatron-LM's tensor parallel algorithm](https://arxiv.org/pdf/1909.08053.pdf). We manage the distributed runtime with either [Ray](https://github.com/ray-project/ray) or python native multiprocessing. Multiprocessing can be used when deploying on a single node, multi-node inference currently requires Ray.

 Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured `tensor_parallel_size`, otherwise Ray will be used. This default can be overridden via the `LLM` class `distributed_executor_backend` argument or `--distributed-executor-backend` API server argument. Set it to `mp` for multiprocessing or `ray` for Ray. It's not required for Ray to be installed for the multiprocessing case.
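As a quick illustration of the `distributed_executor_backend` setting discussed in this hunk, here is a sketch that explicitly forces the single-node multiprocessing backend. The model name and GPU count are assumptions:

```python
from vllm import LLM

# Force the single-node multiprocessing backend instead of Ray; assumes 2 local GPUs.
llm = LLM(
    "facebook/opt-125m",
    tensor_parallel_size=2,
    distributed_executor_backend="mp",  # "ray" would select the Ray backend instead
)
print(llm.generate("Tensor parallelism splits each layer")[0].outputs[0].text)
```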
@@ -29,7 +29,7 @@ To run multi-GPU inference with the `LLM` class, set the `tensor_parallel_size`
 ```python
 from vllm import LLM
 llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
-output = llm.generate("San Franciso is a")
+output = llm.generate("San Francisco is a")
 ```

 To run multi-GPU serving, pass in the `--tensor-parallel-size` argument when starting the server. For example, to run API server on 4 GPUs:
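Once an API server has been started with `--tensor-parallel-size` as described in this hunk, clients talk to it exactly as they would to a single-GPU server. A sketch using the OpenAI Python client; the base URL, placeholder API key, and model name are assumptions:

```python
from openai import OpenAI

# Query a vLLM OpenAI-compatible server that was launched with --tensor-parallel-size 4.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="facebook/opt-13b",
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```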
@@ -1,6 +1,6 @@
 # Reinforcement Learning from Human Feedback

-Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models using human-generated preference data to align model outputs with desired behaviours.
+Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models using human-generated preference data to align model outputs with desired behaviors.

 vLLM can be used to generate the completions for RLHF. The best way to do this is with libraries like [TRL](https://github.com/huggingface/trl), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) and [verl](https://github.com/volcengine/verl).
@@ -127,7 +127,7 @@ configuration for the root vLLM logger and for the logger you wish to silence:
   "vllm": {
     "handlers": ["vllm"],
     "level": "DEBUG",
-    "propagage": false
+    "propagate": false
   },
   "vllm.example_noisy_logger": {
     "propagate": false
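The `propagate` flag touched in this hunk is plain Python `logging` behavior. The following standalone sketch (not vLLM's actual configuration file) shows the effect of `"propagate": false` on a noisy child logger:

```python
import logging
import logging.config

# Minimal sketch: "propagate": False keeps a child logger's records from
# reaching the parent's handler, so the noisy logger is silenced.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler", "level": "DEBUG"},
    },
    "loggers": {
        "vllm": {"handlers": ["console"], "level": "DEBUG", "propagate": False},
        "vllm.example_noisy_logger": {"propagate": False},
    },
}

logging.config.dictConfig(LOGGING)

logging.getLogger("vllm").debug("visible: handled by the 'console' handler")
logging.getLogger("vllm.example_noisy_logger").debug("suppressed: no handler, no propagation")
```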
@@ -24,6 +24,6 @@ NOTE: If you want to not only transfer KV caches, but adjust the model execution
 The example usage is in [this file](../../../examples/online_serving/disaggregated_prefill.sh).

-Here is the diagram of how we run disaggretgated prefilling.
+Here is the diagram of how we run disaggregated prefilling.

 