vllm/kernels at 2a052011ca473a9dc8160f3daa1f5f63a2ad1fe3 - vllm

20231088/vllm

Michael Goin 2a052011ca

[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)

Follow-up to #4332 that enables FP8 checkpoint loading for Mixtral; supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:

- Supports loading FP8 checkpoints for Mixtral, such as the test model "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8"
- Supports static or dynamic activation quantization with static weight quantization (all per tensor)
- Supports different scales for each expert weight
- Supports FP8 in the QKV layer
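
The difference between static and dynamic per-tensor activation scales can be illustrated with a minimal sketch. This is not the vLLM code: it assumes the FP8 e4m3 format (largest finite value 448), works on plain Python lists instead of tensors, and omits rounding to the FP8 grid. The names `dynamic_scale`, `quantize`, and the value of `static_scale` are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 e4m3 format (assumed here)


def dynamic_scale(activations):
    """Per-tensor scale derived from the activations seen at runtime."""
    amax = max(abs(x) for x in activations)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale):
    """Divide by the scale and clamp to the FP8 range (grid rounding omitted)."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


activations = [0.5, -2.0, 1.25]

# Static: one scale fixed ahead of time (hypothetical value, as if read from a checkpoint).
static_scale = 0.01
q_static = quantize(activations, static_scale)

# Dynamic: the scale is recomputed from each activation tensor.
q_dynamic = quantize(activations, dynamic_scale(activations))
```

With a dynamic scale the largest activation always maps to the edge of the FP8 range, whereas a static scale depends on how well the stored value matches the activations actually seen.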
Notes:

- The expert gate/router always runs at half or full precision for now.
- If the separate Q, K, and V weights have different weight scales, they are re-quantized using layer.weight_scale.max() so that the QKV projection can run as a single GEMM for performance.
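
The re-quantization note above can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: it uses plain Python lists in place of tensors, assumes the FP8 e4m3 range (±448), and skips rounding to the FP8 grid, which in a real kernel would cost some precision. The function name `requantize_to_shared_scale` is hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 e4m3 format (assumed here)


def requantize_to_shared_scale(shards, scales):
    """Re-express each shard's quantized values under the largest scale.

    shards: one list of quantized values per projection (Q, K, V)
    scales: the per-tensor scale that each shard was quantized with
    """
    max_scale = max(scales)
    merged = []
    for q_values, scale in zip(shards, scales):
        for q in q_values:
            w = q * scale              # dequantize with the shard's own scale
            requant = w / max_scale    # requantize with the shared scale
            # Since max_scale >= scale, this clamp never triggers; kept for clarity.
            merged.append(max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, requant)))
    return merged, max_scale


q = [448.0, -100.0]
k = [200.0, 50.0]
v = [-448.0, 10.0]
scales = [0.01, 0.02, 0.005]

merged, shared_scale = requantize_to_shared_scale([q, k, v], scales)
```

Choosing the maximum scale means no shard can overflow the FP8 range after re-quantization; the trade-off is that shards with smaller original scales use fewer of the available FP8 values.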

2024-05-04 11:45:16 -07:00

allclose_default.py

[ROCm] Fix some kernels failed unit tests (#2498)

2024-02-05 14:25:36 -08:00

conftest.py

[Kernel] Use flashinfer for decoding (#4353)

2024-05-03 15:51:27 -07:00

test_activation.py

[CI] Try introducing isort. (#3495)

2024-03-25 07:59:47 -07:00

test_attention.py

[Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518)