1190 Commits

Author SHA1 Message Date
Norman Mu
2f30e7c72f
[Frontend] Add --log-level option to api server (#4377) 2024-04-26 05:36:01 +00:00
Cyrus Leung
a74dee9b62
[Bugfix] Fix parameter name in get_tokenizer (#4107) 2024-04-25 19:10:48 -07:00
Hongxia Yang
cf29b7eda4
[ROCm][Hardware][AMD][Doc] Documentation update for ROCm (#4376)
Co-authored-by: WoosukKwon <woosuk.kwon@berkeley.edu>
2024-04-25 18:12:25 -07:00
Nick Hill
efffb63f58
[Core] Move function tracing setup to util function (#4352) 2024-04-25 16:45:12 -07:00
Nick Hill
15e7c675b0
[Core] Add shutdown() method to ExecutorBase (#4349) 2024-04-25 16:32:48 -07:00
Roy
b6dcb4d442
[Misc] Fix flash attention backend log (#4368) 2024-04-25 12:43:32 -07:00
SangBin Cho
b5b4a398a7
[Mypy] Typing lora folder (#4337) 2024-04-25 19:13:50 +00:00
Kunshang Ji
f4bc4de1b1
[Core] refactor aqlm quant ops (#4351) 2024-04-25 15:03:56 -04:00
Caio Mendes
bd7a8eef25
[Doc] README Phi-3 name fix. (#4372)
Co-authored-by: Caio Mendes <caiocesart@microsoft.com>
2024-04-25 10:32:00 -07:00
Alexei-V-Ivanov-AMD
7ee82bef1e
[CI/Build] Adding functionality to reset the node's GPUs before processing. (#4213) 2024-04-25 09:37:20 -07:00
Isotr0py
fbf152d976
[Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 (#4324)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-25 09:35:56 -07:00
Nick Hill
479d69fad0
[Core] Move ray_utils.py from engine to executor package (#4347) 2024-04-25 06:52:22 +00:00
Caio Mendes
96e90fdeb3
[Model] Adds Phi-3 support (#4298) 2024-04-25 03:06:57 +00:00
zifeitong
a395a638c2
[Misc] Use public API in benchmark_throughput (#4300) 2024-04-24 21:10:24 +00:00
youkaichao
2768884ac4
[Doc] Add note for docker user (#4340)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-24 21:09:44 +00:00
alexm-nm
aae08249ac
[Bugfix] Fix marlin kernel crash on H100 (#4218)
This PR addresses the Marlin kernel H100 crash that was reported here: neuralmagic#187.
The crash was caused by inline PTX assembly that introduced async_copy with streaming behavior. The fix is to use the more standard async_copy PTX (without the fractional L2 cache policy for "evict_first"). There is no performance difference between the standard async_copy PTX and the previous version.
2024-04-24 10:35:01 -07:00
Roger Wang
7923dcad12
[Misc] Update ShareGPT Dataset Sampling in Serving Benchmark (#4279) 2024-04-24 09:49:13 -07:00
youkaichao
3cd9b5bb2d
[Core][Distributed] use existing torch.cuda.device context manager (#4318)
2024-04-24 09:00:20 -07:00
Woosuk Kwon
468d761b32
[Misc] Reduce supported Punica dtypes (#4304) 2024-04-23 18:54:33 -07:00
youkaichao
e4bf860a54
[CI][Build] change pynvml to nvidia-ml-py (#4302) 2024-04-23 18:33:12 -07:00
youkaichao
91f50a6fe2
[Core][Distributed] use cpu/gloo to initialize pynccl (#4248) 2024-04-23 18:32:19 -07:00
Robert Shaw
79a268c4ab
[BUG] fixed fp8 conflict with aqlm (#4307)
Fixes the fp8 interface, which broke in the AQLM merge.
2024-04-23 18:26:33 -07:00
Philipp Moritz
eace8bf0b9
[Kernel] FP8 support for MoE kernel / Mixtral (#4244)
This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users do not need to compute activation scales on a calibration dataset or convert their model checkpoints; it is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

**Performance**: For this PR, the focus is on keeping the code clean while still achieving reasonable performance. A number of optimizations that significantly improve performance will be submitted in a follow-up PR (with numbers similar to https://github.com/vllm-project/vllm/pull/3954). With this PR, the results are as follows:

<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">


**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
This compares favorably with the FP16 results, which are:
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```

Happy hacking!
2024-04-24 01:18:23 +00:00
Cyrus Leung
1e8f4252aa
[Bugfix][Frontend] Raise exception when file-like chat template fails to be opened (#4292) 2024-04-23 18:19:03 +00:00
James Fleming
2b7949c1c2
AQLM CUDA support (#3287)
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-23 13:59:33 -04:00
Simon Mo
62b5166bd4
[CI] Add ccache for wheel builds job (#4281) 2024-04-23 09:51:41 -07:00
youkaichao
d86285a4a4
[Core][Logging] Add last frame information for better debugging (#4278) 2024-04-23 09:45:52 -07:00
DefTruth
d87f39e9a9
[Bugfix] Add init_cached_hf_modules to RayWorkerWrapper (#4286) 2024-04-23 09:28:35 -07:00
Jack Gordley
d3c8180ac4
[Bugfix] Fixing max token error message for openai compatible server (#4016) 2024-04-23 19:06:29 +08:00
Cade Daniel
62b8aebc6f
[Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951) 2024-04-23 08:02:36 +00:00
SangBin Cho
050f285ff6
[Core] Scheduling optimization 2 (#4280) 2024-04-23 08:02:11 +00:00
Nick Hill
8f2ea22bde
[Core] Some simplification of WorkerWrapper changes (#4183) 2024-04-23 07:49:08 +00:00
SangBin Cho
0ae11f78ab
[Mypy] Part 3 fix typing for nested directories for most of directory (#4161) 2024-04-22 21:32:44 -07:00
Harry Mellor
34128a697e
Fix autodoc directives (#4272)
Co-authored-by: Harry Mellor <hmellor@oxts.com>
2024-04-23 01:53:01 +00:00
youkaichao
c1b4e4157c
[Core][Distributed] use absolute path for library file (#4271) 2024-04-22 17:21:48 -07:00
Zhanghao Wu
ceaf4ed003
[Doc] Update the SkyPilot doc with serving and Llama-3 (#4276) 2024-04-22 15:34:31 -07:00
SangBin Cho
ad8d696a99
[Core] Scheduler perf fix (#4270) 2024-04-22 21:11:06 +00:00
Harry Mellor
3d925165f2
Add example scripts to documentation (#4225)
Co-authored-by: Harry Mellor <hmellor@oxts.com>
2024-04-22 16:36:54 +00:00
alexm-nm
1543680691
[Bugfix] Ensure download_weights_from_hf(..) inside loader is using the revision parameter (#4217) 2024-04-22 09:10:48 -07:00
Tao He
077f0a2e8a
[Frontend] Enable support for CPU backend in AsyncLLMEngine. (#3993)
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-04-22 09:19:51 +00:00
Woosuk Kwon
e73ed0f1c6
[Bugfix] Fix type annotations in CPU model runner (#4256) 2024-04-22 00:54:16 -07:00
Isotr0py
296cdf8ac7
[Misc] Add vision language model support to CPU backend (#3968) 2024-04-22 00:44:16 -07:00
youkaichao
747b1a7147
[Core][Distributed] fix _is_full_nvlink detection (#4233) 2024-04-21 23:04:16 -07:00
Hongxia Yang
95e5b087cf
[AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI fixes and refactoring (#4129) 2024-04-21 21:57:24 -07:00
GeauxEric
a37d815b83
Make initialization of tokenizer and detokenizer optional (#3748)
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-21 22:06:46 +00:00
xiaoji
7f2593b164
[Doc]: Update the doc of adding new models (#4236) 2024-04-21 09:57:08 -07:00
Harry Mellor
fe7d648fe5
Don't show default value for flags in EngineArgs (#4223)
Co-authored-by: Harry Mellor <hmellor@oxts.com>
2024-04-21 09:15:28 -07:00
Noam Gat
cc74b2b232
Updating lm-format-enforcer version and adding links to decoding libraries in docs (#4222) 2024-04-20 08:33:16 +00:00
nunjunj
91528575ec
[Frontend] multiple sampling params support (#3570) 2024-04-20 00:11:57 -07:00
Cody Yu
a22cdea371
[Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118)
This PR provides initial support for FP8 computation. It is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.

Algorithm:
We still load the model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod computes the per-tensor scaling factor for the weights and quantizes them accordingly; the scaling factor is then stored for future use. The per-tensor scaling factor for activations, in contrast, is computed on every forward pass.
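
As an illustration of this algorithm, here is a minimal PyTorch sketch of dynamic per-tensor FP8 scaling. The helper `per_tensor_fp8_quantize` is hypothetical and only mirrors the description above; it is not vLLM's actual `Fp8LinearMethod`, and it assumes a PyTorch build with float8 dtypes (2.1 or later).

```python
import torch

def per_tensor_fp8_quantize(x: torch.Tensor):
    """Hypothetical helper mirroring the description above; not vLLM's
    actual Fp8LinearMethod. Assumes PyTorch with float8 dtypes (>= 2.1)."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # ~448 for e4m3
    scale = x.abs().max().float().clamp(min=1e-12) / fp8_max
    x_fp8 = (x.float() / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return x_fp8, scale

# Weights: quantized once after the checkpoint is loaded; the scale is stored.
weight = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_fp8, w_scale = per_tensor_fp8_quantize(weight)

# Activations: the scale is recomputed dynamically on every forward pass.
activation = torch.randn(8, 4096, dtype=torch.bfloat16)
a_fp8, a_scale = per_tensor_fp8_quantize(activation)

# An FP8 matmul kernel would multiply w_fp8 and a_fp8 and fold
# w_scale * a_scale back into the higher-precision output.
```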

Initial Results:
Currently tested with Mistral-7B on 1x H100, with prompt length ~5 and decoding length 128:

BF16: 1.47s
FP8: 1.66s
I'll try larger models and look for further performance bottlenecks. Meanwhile, you're welcome to try this code.
2024-04-20 04:28:57 +00:00