[Doc] Convert list tables to MyST (#11594)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Author: Cyrus Leung, 2024-12-29 15:56:22 +08:00; committed by GitHub
parent 4fb8e329fd
commit 32b4c63f02
GPG Key ID: B5690EEEBB952194 (no known key found for this signature in database)
6 changed files with 951 additions and 965 deletions


@@ -197,4 +197,4 @@ if __name__ == '__main__':
## Known Issues
- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000), which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm`, which includes the [fix](gh-pr:6759).
- To circumvent an NCCL [bug](https://github.com/NVIDIA/nccl/issues/1234), all vLLM processes set the environment variable `NCCL_CUMEM_ENABLE=0` to disable NCCL's `cuMem` allocator. This does not affect performance; the allocator only provides memory benefits. External processes that set up an NCCL connection with vLLM's processes should also set this environment variable; otherwise, the inconsistent environment setup will cause NCCL to hang or crash, as observed in the [RLHF integration](https://github.com/OpenRLHF/OpenRLHF/pull/604) and the [discussion](gh-issue:5723#issuecomment-2554389656).
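For example, an external process that joins an NCCL group with vLLM workers can mirror the setting before initializing any NCCL communication. This is a minimal sketch; the helper name is illustrative and not part of vLLM:

```python
import os

# Must match vLLM's setting, and must be exported before NCCL is initialized;
# a mismatched cuMem configuration can make NCCL hang or crash.
os.environ["NCCL_CUMEM_ENABLE"] = "0"


def nccl_env_consistent() -> bool:
    """Check that this process disables NCCL's cuMem allocator like vLLM does."""
    return os.environ.get("NCCL_CUMEM_ENABLE") == "0"


if __name__ == "__main__":
    assert nccl_env_consistent()
    # ... torch.distributed.init_process_group(backend="nccl", ...) would go here
```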


@@ -141,26 +141,25 @@ Gaudi2 devices. Configurations that are not listed may or may not work.
vLLM for HPU currently supports four execution modes, depending on the selected HPU PyTorch Bridge backend (set via the `PT_HPU_LAZY_MODE` environment variable) and the `--enforce-eager` flag.
```{list-table} vLLM execution modes
:widths: 25 25 50
:header-rows: 1

* - `PT_HPU_LAZY_MODE`
  - `enforce_eager`
  - execution mode
* - 0
  - 0
  - torch.compile
* - 0
  - 1
  - PyTorch eager mode
* - 1
  - 0
  - HPU Graphs
* - 1
  - 1
  - PyTorch lazy mode
```
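The mode selection in the table above can be sketched as a small shell helper. The `hpu_mode` function is purely illustrative (not part of vLLM); in practice you set the environment variable when launching the server, e.g. `PT_HPU_LAZY_MODE=1 vllm serve <model> [--enforce-eager]`:

```shell
# Illustrative mapping: (PT_HPU_LAZY_MODE, enforce_eager) -> execution mode.
hpu_mode() {
  case "$1,$2" in
    0,0) echo "torch.compile" ;;
    0,1) echo "PyTorch eager mode" ;;
    1,0) echo "HPU Graphs" ;;
    1,1) echo "PyTorch lazy mode" ;;
    *)   echo "unknown" ;;
  esac
}

hpu_mode 1 0   # prints "HPU Graphs"
```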
```{warning}


@@ -68,33 +68,32 @@ gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
  --service-account SERVICE_ACCOUNT
```
```{list-table} Parameter descriptions
:header-rows: 1

* - Parameter name
  - Description
* - QUEUED_RESOURCE_ID
  - The user-assigned ID of the queued resource request.
* - TPU_NAME
  - The user-assigned name of the TPU, which is created when the queued resource request is allocated.
* - PROJECT_ID
  - Your Google Cloud project.
* - ZONE
  - The GCP zone where you want to create your Cloud TPU. The value you use depends on the version of TPUs you are using. For more information, see [TPU regions and zones](https://cloud.google.com/tpu/docs/regions-zones).
* - ACCELERATOR_TYPE
  - The TPU version you want to use. Specify the TPU version; for example, `v5litepod-4` specifies a v5e TPU with 4 cores. For more information, see [TPU versions](https://cloud.devsite.corp.google.com/tpu/docs/system-architecture-tpu-vm#versions).
* - RUNTIME_VERSION
  - The TPU VM runtime version to use. For more information, see [TPU VM images](https://cloud.google.com/tpu/docs/runtimes).
* - SERVICE_ACCOUNT
  - The email address of your service account. You can find it in the IAM Cloud Console under *Service Accounts*. For example: `tpu-service-account@<your_project_ID>.iam.gserviceaccount.com`
```
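A filled-in invocation might look like the following. All values are illustrative placeholders (pick the zone, accelerator type, and runtime version for your own project); the final `echo` only prints the assembled command, so drop it to actually run `gcloud`:

```shell
# Illustrative parameter values for the gcloud command above.
QUEUED_RESOURCE_ID=my-queued-resource
TPU_NAME=my-tpu
PROJECT_ID=my-gcp-project
ZONE=us-central1-a
ACCELERATOR_TYPE=v5litepod-4
RUNTIME_VERSION=v2-alpha-tpuv5-lite
SERVICE_ACCOUNT=tpu-service-account@${PROJECT_ID}.iam.gserviceaccount.com

# Print the command instead of executing it; remove `echo` to run for real.
echo gcloud alpha compute tpus queued-resources create "$QUEUED_RESOURCE_ID" \
  --node-id "$TPU_NAME" \
  --project "$PROJECT_ID" \
  --zone "$ZONE" \
  --accelerator-type "$ACCELERATOR_TYPE" \
  --runtime-version "$RUNTIME_VERSION" \
  --service-account "$SERVICE_ACCOUNT"
```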
Connect to your TPU using SSH:

File diff suppressed because it is too large.


@@ -4,121 +4,120 @@
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
```{list-table}
:header-rows: 1
:widths: 20 8 8 8 8 8 8 8 8 8 8

* - Implementation
  - Volta
  - Turing
  - Ampere
  - Ada
  - Hopper
  - AMD GPU
  - Intel GPU
  - x86 CPU
  - AWS Inferentia
  - Google TPU
* - AWQ
  - ✗
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✅︎
  - ✅︎
  - ✗
  - ✗
* - GPTQ
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✅︎
  - ✅︎
  - ✗
  - ✗
* - Marlin (GPTQ/AWQ/FP8)
  - ✗
  - ✗
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
* - INT8 (W8A8)
  - ✗
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✅︎
  - ✗
  - ✗
* - FP8 (W8A8)
  - ✗
  - ✗
  - ✗
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
* - AQLM
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
* - bitsandbytes
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
* - DeepSpeedFP
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
* - GGUF
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
```
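The matrix above can also be read programmatically. The following is a hypothetical lookup helper mirroring a few rows of the table; the names and structure are illustrative only and are not part of vLLM's API:

```python
# Hypothetical compatibility lookup for a few rows of the table above.
# Keys and platform names are illustrative, not vLLM identifiers.
SUPPORTED = {
    # method -> set of hardware platforms marked with a check in the table
    "awq": {"turing", "ampere", "ada", "hopper", "intel_gpu", "x86_cpu"},
    "gptq": {"volta", "turing", "ampere", "ada", "hopper", "intel_gpu", "x86_cpu"},
    "fp8_w8a8": {"ada", "hopper", "amd_gpu"},
}


def is_supported(method: str, hardware: str) -> bool:
    """Return True if the quantization method is listed as supported on the hardware."""
    return hardware in SUPPORTED.get(method, set())


print(is_supported("fp8_w8a8", "hopper"))  # True
print(is_supported("awq", "volta"))        # False
```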
## Notes:


@@ -43,209 +43,208 @@ chart **including persistent volumes** and deletes the release.
## Values
```{list-table}
:widths: 25 25 25 25
:header-rows: 1

* - Key
  - Type
  - Default
  - Description
* - autoscaling
  - object
  - {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80}
  - Autoscaling configuration
* - autoscaling.enabled
  - bool
  - false
  - Enable autoscaling
* - autoscaling.maxReplicas
  - int
  - 100
  - Maximum replicas
* - autoscaling.minReplicas
  - int
  - 1
  - Minimum replicas
* - autoscaling.targetCPUUtilizationPercentage
  - int
  - 80
  - Target CPU utilization for autoscaling
* - configs
  - object
  - {}
  - Configmap
* - containerPort
  - int
  - 8000
  - Container port
* - customObjects
  - list
  - []
  - Custom Objects configuration
* - deploymentStrategy
  - object
  - {}
  - Deployment strategy configuration
* - externalConfigs
  - list
  - []
  - External configuration
* - extraContainers
  - list
  - []
  - Additional containers configuration
* - extraInit
  - object
  - {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true}
  - Additional configuration for the init container
* - extraInit.pvcStorage
  - string
  - "50Gi"
  - Storage size of the s3
* - extraInit.s3modelpath
  - string
  - "relative_s3_model_path/opt-125m"
  - Path of the model on the s3 which hosts model weights and config files
* - extraInit.awsEc2MetadataDisabled
  - boolean
  - true
  - Disables the use of the Amazon EC2 instance metadata service
* - extraPorts
  - list
  - []
  - Additional ports configuration
* - gpuModels
  - list
  - ["TYPE_GPU_USED"]
  - Type of gpu used
* - image
  - object
  - {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"}
  - Image configuration
* - image.command
  - list
  - ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"]
  - Container launch command
* - image.repository
  - string
  - "vllm/vllm-openai"
  - Image repository
* - image.tag
  - string
  - "latest"
  - Image tag
* - livenessProbe
  - object
  - {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10}
  - Liveness probe configuration
* - livenessProbe.failureThreshold
  - int
  - 3
  - Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not alive
* - livenessProbe.httpGet
  - object
  - {"path":"/health","port":8000}
  - Configuration of the Kubelet http request on the server
* - livenessProbe.httpGet.path
  - string
  - "/health"
  - Path to access on the HTTP server
* - livenessProbe.httpGet.port
  - int
  - 8000
  - Name or number of the port to access on the container, on which the server is listening
* - livenessProbe.initialDelaySeconds
  - int
  - 15
  - Number of seconds after the container has started before liveness probe is initiated
* - livenessProbe.periodSeconds
  - int
  - 10
  - How often (in seconds) to perform the liveness probe
* - maxUnavailablePodDisruptionBudget
  - string
  - ""
  - Disruption Budget Configuration
* - readinessProbe
  - object
  - {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5}
  - Readiness probe configuration
* - readinessProbe.failureThreshold
  - int
  - 3
  - Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready
* - readinessProbe.httpGet
  - object
  - {"path":"/health","port":8000}
  - Configuration of the Kubelet http request on the server
* - readinessProbe.httpGet.path
  - string
  - "/health"
  - Path to access on the HTTP server
* - readinessProbe.httpGet.port
  - int
  - 8000
  - Name or number of the port to access on the container, on which the server is listening
* - readinessProbe.initialDelaySeconds
  - int
  - 5
  - Number of seconds after the container has started before readiness probe is initiated
* - readinessProbe.periodSeconds
  - int
  - 5
  - How often (in seconds) to perform the readiness probe
* - replicaCount
  - int
  - 1
  - Number of replicas
* - resources
  - object
  - {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}}
  - Resource configuration
* - resources.limits."nvidia.com/gpu"
  - int
  - 1
  - Number of gpus used
* - resources.limits.cpu
  - int
  - 4
  - Number of CPUs
* - resources.limits.memory
  - string
  - "16Gi"
  - CPU memory configuration
* - resources.requests."nvidia.com/gpu"
  - int
  - 1
  - Number of gpus used
* - resources.requests.cpu
  - int
  - 4
  - Number of CPUs
* - resources.requests.memory
  - string
  - "16Gi"
  - CPU memory configuration
* - secrets
  - object
  - {}
  - Secrets configuration
* - serviceName
  - string
  -
  - Service name
* - servicePort
  - int
  - 80
  - Service port
* - labels.environment
  - string
  - test
  - Environment name
* - labels.release
  - string
  - test
  - Release name
```
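As a sketch of how the keys above are typically overridden, a custom values file might look like the following. The file name, release name, and figures are placeholders, not defaults shipped with the chart:

```yaml
# custom-values.yaml -- example overrides for keys listed in the table above
replicaCount: 2
servicePort: 8080
image:
  repository: "vllm/vllm-openai"
  tag: "latest"
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 4
  targetCPUUtilizationPercentage: 80
```

It would then be applied with something like `helm install my-vllm <chart-path> -f custom-values.yaml`, where `my-vllm` and the chart path are placeholders.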