Bump up version to v0.3.0 (#2656)

Zhuohan Li 2024-01-31 00:07:07 -08:00 committed by GitHub
parent 3dad944485
commit 1af090b57d
3 changed files with 7 additions and 3 deletions

@@ -46,7 +46,7 @@ vLLM is fast with:
 - Efficient management of attention key and value memory with **PagedAttention**
 - Continuous batching of incoming requests
 - Fast model execution with CUDA/HIP graph
-- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629)
+- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629), FP8 KV Cache
 - Optimized CUDA kernels

 vLLM is flexible and easy to use with:
@@ -57,6 +57,8 @@ vLLM is flexible and easy to use with:
 - Streaming outputs
 - OpenAI-compatible API server
 - Support NVIDIA GPUs and AMD GPUs
+- (Experimental) Prefix caching support
+- (Experimental) Multi-lora support

 vLLM seamlessly supports many Hugging Face models, including the following architectures:
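
For context, the hunks above advertise the offline inference entrypoints that the package re-exports in `vllm/__init__.py` (see the last hunk of this commit). A minimal usage sketch, not part of the commit itself; the model name is only a placeholder and any supported Hugging Face model can be substituted:

```python
from vllm import LLM, SamplingParams

# Placeholder model chosen for illustration; swap in any model vLLM supports.
llm = LLM(model="facebook/opt-125m")

# SamplingParams controls decoding (temperature, nucleus sampling, output length).
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() returns one RequestOutput per prompt; each RequestOutput holds its
# generated CompletionOutput candidates in .outputs.
for request_output in llm.generate(["Hello, my name is"], params):
    print(request_output.outputs[0].text)
```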

@@ -31,7 +31,7 @@ vLLM is fast with:
 * Efficient management of attention key and value memory with **PagedAttention**
 * Continuous batching of incoming requests
 * Fast model execution with CUDA/HIP graph
-* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_
+* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_, FP8 KV Cache
 * Optimized CUDA kernels

 vLLM is flexible and easy to use with:
@@ -42,6 +42,8 @@ vLLM is flexible and easy to use with:
 * Streaming outputs
 * OpenAI-compatible API server
 * Support NVIDIA GPUs and AMD GPUs
+* (Experimental) Prefix caching support
+* (Experimental) Multi-lora support

 For more information, check out the following:

@@ -8,7 +8,7 @@ from vllm.entrypoints.llm import LLM
 from vllm.outputs import CompletionOutput, RequestOutput
 from vllm.sampling_params import SamplingParams

-__version__ = "0.2.7"
+__version__ = "0.3.0"

 __all__ = [
     "LLM",