[Doc] neuron documentation update (#8671)
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>
commit 7c8566aa4f
parent b4e4eda92e
@@ -3,8 +3,8 @@
 Installation with Neuron
 ========================
 
-vLLM 0.3.3 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK.
-At the moment Paged Attention is not supported in Neuron SDK, but naive continuous batching is supported in transformers-neuronx.
+vLLM 0.3.3 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK with continuous batching.
+Paged Attention and Chunked Prefill are currently in development and will be available soon.
 Data types currently supported in Neuron SDK are FP16 and BF16.
 
 Requirements
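
For context, continuous batching on Neuron is exercised through vLLM's ordinary offline-inference API rather than any Neuron-specific entrypoint. Below is a minimal sketch in the style of vLLM's Neuron offline-inference example; the model name and sizing parameters are illustrative assumptions, not part of this commit:

    from vllm import LLM, SamplingParams

    # Illustrative prompts; a batch of requests is continuously batched on device.
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Assumed model and sizing; on Neuron, max_model_len and block_size are
    # typically set to the maximum sequence length the model is compiled for.
    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        max_num_seqs=8,
        max_model_len=128,
        block_size=128,
        device="neuron",
        tensor_parallel_size=2,
    )

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
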
@@ -43,7 +43,7 @@ vLLM is flexible and easy to use with:
 * Tensor parallelism and pipeline parallelism support for distributed inference
 * Streaming outputs
 * OpenAI-compatible API server
-* Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Neuron.
+* Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Trainium and Inferentia Accelerators.
 * Prefix caching support
 * Multi-lora support
 
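
The OpenAI-compatible API server listed above can be queried with any standard OpenAI client once it is running. A minimal sketch, assuming a server started locally with "python -m vllm.entrypoints.openai.api_server --model <model>" and an illustrative model name:

    # Sketch assuming a vLLM OpenAI-compatible server is already running locally.
    from openai import OpenAI

    # base_url points at the local vLLM server; the api_key is unused but required.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    completion = client.completions.create(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # assumed: must match the served model
        prompt="Hello, my name is",
        max_tokens=16,
    )
    print(completion.choices[0].text)
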