diff --git a/docs/source/deployment/frameworks/helm.md b/docs/source/deployment/frameworks/helm.md
index e4fc5e13..7320d727 100644
--- a/docs/source/deployment/frameworks/helm.md
+++ b/docs/source/deployment/frameworks/helm.md
@@ -4,9 +4,9 @@
 
 A Helm chart to deploy vLLM for Kubernetes
 
-Helm is a package manager for Kubernetes. It will help you to deploy vLLM on k8s and automate the deployment of vLLMm Kubernetes applications. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variables values.
+Helm is a package manager for Kubernetes. It will help you to deploy vLLM on k8s and automate the deployment of vLLM Kubernetes applications. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variable values.
 
-This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, steps for helm install and documentation on architecture and values file.
+This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, steps for helm installation and documentation on architecture and values file.
 
 ## Prerequisites
 
diff --git a/docs/source/deployment/k8s.md b/docs/source/deployment/k8s.md
index 64071ba0..dd3769c4 100644
--- a/docs/source/deployment/k8s.md
+++ b/docs/source/deployment/k8s.md
@@ -14,7 +14,7 @@ Alternatively, you can also deploy Kubernetes using [helm chart](https://docs.vl
 
 ## Pre-requisite
 
-Ensure that you have a running Kubernetes environment with GPU (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-medal GPU machine).
+Ensure that you have a running Kubernetes environment with GPU (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-metal GPU machine).
 
 ## Deployment using native K8s
 
diff --git a/docs/source/design/kernel/paged_attention.md b/docs/source/design/kernel/paged_attention.md
index 5f258287..e1770c82 100644
--- a/docs/source/design/kernel/paged_attention.md
+++ b/docs/source/design/kernel/paged_attention.md
@@ -419,7 +419,7 @@ List of `v_vec` for one thread
   which is also `V_VEC_SIZE` elements from `logits`. Overall, with multiple
   inner iterations, each warp will process one block of value tokens. And with
   multiple outer iterations, the whole context value
-  tokens are processd
+  tokens are processed
 
 ```cpp
 float accs[NUM_ROWS_PER_THREAD];
diff --git a/docs/source/design/v1/metrics.md b/docs/source/design/v1/metrics.md
index bed40516..b3981b2d 100644
--- a/docs/source/design/v1/metrics.md
+++ b/docs/source/design/v1/metrics.md
@@ -13,7 +13,7 @@ Ensure the v1 LLM Engine exposes a superset of the metrics available in v0.
 Metrics in vLLM can be categorized as follows:
 
 1. Server-level metrics: these are global metrics that track the state and performance of the LLM engine. These are typically exposed as Gauges or Counters in Prometheus.
-2. Request-level metrics: these are metrics that track the characteristics - e.g. size and timing - of individual requests. These are typically exposed as Histrograms in Prometheus, and are often the SLO that an SRE monitoring vLLM will be tracking.
+2. Request-level metrics: these are metrics that track the characteristics - e.g. size and timing - of individual requests. These are typically exposed as Histograms in Prometheus, and are often the SLO that an SRE monitoring vLLM will be tracking.
 
 The mental model is that the "Server-level Metrics" explain why the "Request-level Metrics" are what they are.
 
@@ -47,7 +47,7 @@ In v0, the following metrics are exposed via a Prometheus-compatible `/metrics`
 - `vllm:tokens_total` (Counter)
 - `vllm:iteration_tokens_total` (Histogram)
 - `vllm:time_in_queue_requests` (Histogram)
-- `vllm:model_forward_time_milliseconds` (Histogram
+- `vllm:model_forward_time_milliseconds` (Histogram)
 - `vllm:model_execute_time_milliseconds` (Histogram)
 - `vllm:request_params_n` (Histogram)
 - `vllm:request_params_max_tokens` (Histogram)
diff --git a/docs/source/features/lora.md b/docs/source/features/lora.md
index dff7e916..a71da72e 100644
--- a/docs/source/features/lora.md
+++ b/docs/source/features/lora.md
@@ -110,7 +110,7 @@ In addition to serving LoRA adapters at server startup, the vLLM server now supp
 LoRA adapters at runtime through dedicated API endpoints. This feature can be particularly useful when the flexibility
 to change models on-the-fly is needed.
 
-Note: Enabling this feature in production environments is risky as user may participate model adapter management.
+Note: Enabling this feature in production environments is risky as users may participate in model adapter management.
 
 To enable dynamic LoRA loading and unloading, ensure that the environment variable `VLLM_ALLOW_RUNTIME_LORA_UPDATING`
 is set to `True`. When this option is enabled, the API server will log a warning to indicate that dynamic loading is active.
diff --git a/docs/source/getting_started/faq.md b/docs/source/getting_started/faq.md
index 4751b325..c1bb2893 100644
--- a/docs/source/getting_started/faq.md
+++ b/docs/source/getting_started/faq.md
@@ -15,7 +15,7 @@ more are listed [here](#supported-models).
 By extracting hidden states, vLLM can automatically convert text generation models
 like [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B),
 [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) into embedding models,
-but they are expected be inferior to models that are specifically trained on embedding tasks.
+but they are expected to be inferior to models that are specifically trained on embedding tasks.
 
 ______________________________________________________________________
 
diff --git a/docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md b/docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
index 7e52f604..e91ed6fb 100644
--- a/docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
+++ b/docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
@@ -119,7 +119,7 @@ If you're observing the following error: `docker: Error response from daemon: Un
 
 ## Supported configurations
 
-The following configurations have been validated to be function with
+The following configurations have been validated to function with
 Gaudi2 devices. Configurations that are not listed may or may not work.
 
 - [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b)
diff --git a/docs/source/getting_started/installation/ai_accelerator/openvino.inc.md b/docs/source/getting_started/installation/ai_accelerator/openvino.inc.md
index 5641c156..ab0db479 100644
--- a/docs/source/getting_started/installation/ai_accelerator/openvino.inc.md
+++ b/docs/source/getting_started/installation/ai_accelerator/openvino.inc.md
@@ -19,7 +19,7 @@ Currently, there are no pre-built OpenVINO wheels.
 
 ### Build wheel from source
 
-First, install Python and ensure you lave the latest pip. For example, on Ubuntu 22.04, you can run:
+First, install Python and ensure you have the latest pip. For example, on Ubuntu 22.04, you can run:
 
 ```console
 sudo apt-get update -y
diff --git a/docs/source/getting_started/installation/gpu/xpu.inc.md b/docs/source/getting_started/installation/gpu/xpu.inc.md
index 5a47b16f..84a9b387 100644
--- a/docs/source/getting_started/installation/gpu/xpu.inc.md
+++ b/docs/source/getting_started/installation/gpu/xpu.inc.md
@@ -1,6 +1,6 @@
 # Installation
 
-vLLM initially supports basic model inferencing and serving on Intel GPU platform.
+vLLM initially supports basic model inference and serving on the Intel GPU platform.
 
 :::{attention}
 There are no pre-built wheels or images for this device, so you must build vLLM from source.
@@ -65,7 +65,7 @@ $ docker run -it \
 
 ## Supported features
 
-XPU platform supports **tensor parallel** inference/serving and also supports **pipeline parallel** as a beta feature for online serving. We requires Ray as the distributed runtime backend. For example, a reference execution likes following:
+XPU platform supports **tensor parallel** inference/serving and also supports **pipeline parallel** as a beta feature for online serving. We require Ray as the distributed runtime backend. For example, a reference execution looks like the following:
 
 ```console
 python -m vllm.entrypoints.openai.api_server \
@@ -78,6 +78,6 @@ python -m vllm.entrypoints.openai.api_server \
      -tp=8
 ```
 
-By default, a ray instance will be launched automatically if no existing one is detected in system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the helper script.
+By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equal to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the helper script.
 
-There are some new features coming with ipex-xpu 2.6, eg: **chunked prefill**, **V1 engine support**, **lora**, **MoE**, etc.
+There are some new features coming with ipex-xpu 2.6, e.g. **chunked prefill**, **V1 engine support**, **lora**, **MoE**, etc.
diff --git a/docs/source/serving/distributed_serving.md b/docs/source/serving/distributed_serving.md
index e6be644b..b36a3dcb 100644
--- a/docs/source/serving/distributed_serving.md
+++ b/docs/source/serving/distributed_serving.md
@@ -20,7 +20,7 @@ There is one edge case: if the model fits in a single node with multiple GPUs, b
 
 ## Running vLLM on a single node
 
-vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. Currently, we support [Megatron-LM's tensor parallel algorithm](https://arxiv.org/pdf/1909.08053.pdf). We manage the distributed runtime with either [Ray](https://github.com/ray-project/ray) or python native multiprocessing. Multiprocessing can be used when deploying on a single node, multi-node inferencing currently requires Ray.
+vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. Currently, we support [Megatron-LM's tensor parallel algorithm](https://arxiv.org/pdf/1909.08053.pdf). We manage the distributed runtime with either [Ray](https://github.com/ray-project/ray) or python native multiprocessing. Multiprocessing can be used when deploying on a single node; multi-node inference currently requires Ray.
 
 Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured `tensor_parallel_size`, otherwise Ray will be used. This default can be overridden via the `LLM` class `distributed_executor_backend` argument or `--distributed-executor-backend` API server argument. Set it to `mp` for multiprocessing or `ray` for Ray. It's not required for Ray to be installed for the multiprocessing case.
 
@@ -29,7 +29,7 @@ To run multi-GPU inference with the `LLM` class, set the `tensor_parallel_size`
 ```python
 from vllm import LLM
 llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
-output = llm.generate("San Franciso is a")
+output = llm.generate("San Francisco is a")
 ```
 
 To run multi-GPU serving, pass in the `--tensor-parallel-size` argument when starting the server. For example, to run API server on 4 GPUs:
diff --git a/docs/source/training/rlhf.md b/docs/source/training/rlhf.md
index 00822aef..72e89c0c 100644
--- a/docs/source/training/rlhf.md
+++ b/docs/source/training/rlhf.md
@@ -1,6 +1,6 @@
 # Reinforcement Learning from Human Feedback
 
-Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models using human-generated preference data to align model outputs with desired behaviours.
+Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models using human-generated preference data to align model outputs with desired behaviors.
 
 vLLM can be used to generate the completions for RLHF. The best way to do this is with libraries like [TRL](https://github.com/huggingface/trl), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) and [verl](https://github.com/volcengine/verl).
 
diff --git a/examples/other/logging_configuration.md b/examples/other/logging_configuration.md
index c70b853c..fbdbce6a 100644
--- a/examples/other/logging_configuration.md
+++ b/examples/other/logging_configuration.md
@@ -127,7 +127,7 @@ configuration for the root vLLM logger and for the logger you wish to silence:
     "vllm": {
       "handlers": ["vllm"],
       "level": "DEBUG",
-      "propagage": false
+      "propagate": false
     },
     "vllm.example_noisy_logger": {
       "propagate": false
diff --git a/vllm/distributed/kv_transfer/README.md b/vllm/distributed/kv_transfer/README.md
index c408d4a6..349d3dfb 100644
--- a/vllm/distributed/kv_transfer/README.md
+++ b/vllm/distributed/kv_transfer/README.md
@@ -24,6 +24,6 @@ NOTE: If you want to not only transfer KV caches, but adjust the model execution
 
 The example usage is in [this file](../../../examples/online_serving/disaggregated_prefill.sh).
 
-Here is the diagram of how we run disaggretgated prefilling.
+Here is the diagram of how we run disaggregated prefilling.
 
 ![Disaggregated prefill workflow](./disagg_prefill_workflow.jpg)
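
For a quick sanity check of the corrected snippet in `docs/source/serving/distributed_serving.md`, the sketch below runs it end to end and prints the completion. This is a minimal example outside the patch itself: it assumes a host with 4 GPUs, reuses the `facebook/opt-13b` model and `tensor_parallel_size=4` from the doc, and the added result handling follows vLLM's usual offline-inference API (`generate()` returns one `RequestOutput` per prompt).

```python
from vllm import LLM

# Tensor-parallel inference across 4 GPUs, as in the corrected doc snippet.
llm = LLM("facebook/opt-13b", tensor_parallel_size=4)

# generate() returns a list of RequestOutput objects, one per prompt;
# each holds one or more CompletionOutput entries with the generated text.
outputs = llm.generate("San Francisco is a")
print(outputs[0].outputs[0].text)
```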