LLM Inference

1. FlexGen

Its main optimizations are offloading to CPU memory, overlapping CPU and GPU computation, model quantization, and multi-GPU parallelism.
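
To make the offloading idea concrete, here is a minimal sketch of how a placement plan could be computed: keep as many layers on the GPU as the memory budget allows and spill the rest to CPU memory. The helper names `plan_offload` and `weight_bytes`, the layer sizes, and the greedy rule are all illustrative assumptions; FlexGen's own placement policy is more sophisticated than this.

```python
def plan_offload(layer_bytes, gpu_budget_bytes):
    """Greedily place layers on the GPU until the budget is exhausted.

    Returns one "gpu"/"cpu" label per layer (hypothetical helper).
    """
    placement = []
    used = 0
    for size in layer_bytes:
        if used + size <= gpu_budget_bytes:
            placement.append("gpu")
            used += size
        else:
            placement.append("cpu")
    return placement


def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_weight // 8


# Toy model: 8 layers of 100M parameters each (made-up numbers).
layers_fp16 = [weight_bytes(100_000_000, 16)] * 8
layers_int4 = [weight_bytes(100_000_000, 4)] * 8
budget = 1_000_000_000  # pretend the GPU has 1 GB free for weights

plan_fp16 = plan_offload(layers_fp16, budget)
plan_int4 = plan_offload(layers_int4, budget)
print(plan_fp16.count("gpu"), "of 8 FP16 layers stay on the GPU")
print(plan_int4.count("gpu"), "of 8 4-bit layers stay on the GPU")
```

The same budget holds more layers once the weights are quantized, which is why offloading and quantization are usually combined.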

2. DeepSpeed

GitHub - microsoft/DeepSpeed: DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

GitHub - microsoft/DeepSpeedExamples: Example models using DeepSpeed

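DeepSpeed accelerates large-model inference through system-level optimizations.

It targets the following existing problems:

Lack of multi-GPU support for large-scale models that also meets latency requirements;
Limited GPU kernel performance when running inference with small batch sizes;
Difficulty exploiting quantization, both in quantizing the model to reduce its size and in serving the quantized model with high performance to reduce latency without specialized hardware.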

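Proposed solutions:

Inference-adapted parallelism: lets users serve large models efficiently by choosing the best parallelism strategy for multi-GPU inference, taking both inference latency and cost into account.
Inference-optimized CUDA kernels: make full use of GPU resources through deep fusion and novel kernel scheduling, improving per-GPU efficiency.
Effective quantize-aware training: supports inference with quantized models, such as INT8 inference. Quantization saves memory and reduces latency without hurting accuracy.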
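
As a rough illustration of what INT8 quantization means, the sketch below applies symmetric per-tensor quantization to a short list of weights and measures the round-trip error. The function names and the per-tensor scheme are assumptions chosen for brevity; DeepSpeed's actual quantization and kernels are more elaborate.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.03, 0.9, -0.41]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("codes:", codes)
print("scale:", scale, "max round-trip error:", max_error)
```

Each weight now needs one byte instead of two (FP16) or four (FP32), and the rounding error per weight is bounded by half of `scale`.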

Example code

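import os
from argparse import ArgumentParser

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
# DSPipeline is a helper from a local utils.py (presumably the one shipped
# with DeepSpeedExamples), not an installable package.
from utils import DSPipeline

# The snippet below reads `args` but never defines it. This parser is a minimal
# reconstruction covering only the attributes used further down; the default
# values are assumptions.
parser = ArgumentParser()
parser.add_argument("--name", required=True, type=str, help="model name")
parser.add_argument("--checkpoint_path", default=None, type=str)
parser.add_argument("--save_mp_checkpoint_path", default=None, type=str)
parser.add_argument("--use_kernel", action="store_true")
parser.add_argument("--use_meta_tensor", action="store_true")
parser.add_argument("--replace_method", default="auto", type=str)
parser.add_argument("--max_tokens", default=1024, type=int)
parser.add_argument("--max_new_tokens", default=50, type=int)
parser.add_argument("--greedy", action="store_true")
parser.add_argument("--local_rank", default=int(os.getenv("LOCAL_RANK", "0")), type=int)
parser.add_argument("--world_size", default=int(os.getenv("WORLD_SIZE", "1")), type=int)
args = parser.parse_args()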

inputs = [
         "DeepSpeed is a machine learning framework",
         "He is working on",
         "He has a",
         "He got all",
         "Everyone is happy and I can",
         "The new movie that got Oscar this year",
         "In the far far distance from our galaxy,",
         "Peace is the only way"
]

pipe = DSPipeline(model_name=args.name,
                  dtype=torch.float16,
                  is_meta=args.use_meta_tensor,
                  device=args.local_rank,
                  checkpoint_path=args.checkpoint_path)

ds_kwargs = dict()

pipe.model = deepspeed.init_inference(pipe.model,
                                dtype=torch.int8,
                                mp_size=args.world_size,
                                replace_with_kernel_inject=args.use_kernel,
                                replace_method=args.replace_method,
                                max_tokens=args.max_tokens,
                                save_mp_checkpoint_path=args.save_mp_checkpoint_path,
                                **ds_kwargs
                                )

torch.cuda.synchronize()
outputs = pipe(inputs, num_tokens=args.max_new_tokens, do_sample=(not args.greedy))
torch.cuda.synchronize()
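
The `mp_size` argument above sets how many GPUs the model is sharded across. As a toy illustration of the math behind this kind of model (tensor) parallelism, the sketch below splits a weight matrix by columns across simulated devices and concatenates the partial results. It uses plain Python lists and hypothetical helper names; it is not DeepSpeed's implementation.

```python
def matmul(x, w):
    """Multiply a vector x (length k) by a matrix w (k rows, n columns)."""
    n = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n)]


def split_columns(w, num_shards):
    """Split the columns of w into equal slices, one per simulated device."""
    n = len(w[0])
    assert n % num_shards == 0, "columns must divide evenly in this sketch"
    width = n // num_shards
    return [[row[s * width:(s + 1) * width] for row in w]
            for s in range(num_shards)]


def parallel_matmul(x, w, num_shards):
    """Each simulated device computes its slice; results are concatenated."""
    outputs = []
    for shard in split_columns(w, num_shards):
        outputs.extend(matmul(x, shard))  # on real hardware: one GPU per shard
    return outputs


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
full = matmul(x, w)
sharded = parallel_matmul(x, w, num_shards=2)
print(full, sharded)
```

Each device only has to hold its own slice of the weights, which is what lets a model too large for one GPU fit across several.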

3. FasterTransformer

https://github.com/NVIDIA/FasterTransformer
https://github.com/cameronfr/FasterTransformer
https://zhuanlan.zhihu.com/p/626008090

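To reduce the number of kernel launches, all kernels other than matrix multiplication are fused as far as possible
Kernels are optimized separately for large batches
Supports selecting the best-performing matrix multiplication (GEMM) implementation
With FP16, the half2 type is used, giving twice the memory bandwidth and compute throughput of half
Optimized implementations of GELU, softmax, and layernorm, plus tricks such as using rsqrt

The FT framework is written in C++/CUDA and relies on the highly optimized cuBLAS, cuBLASLt, and cuSPARSELt libraries, which enables fast Transformer inference on GPUs.
It is fairly cumbersome to call. I only got the C++ demo for LLaMA running, and modifying it is difficult.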
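
The fusion idea in the first item can be sketched in plain Python. The unfused version makes two passes and materializes an intermediate list, which on a GPU corresponds to two kernel launches and an extra round trip through memory; the fused version does both steps per element in a single pass. The function names are illustrative, and this shows only the principle, not FasterTransformer's CUDA code.

```python
import math


def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def bias_gelu_unfused(x, bias):
    """Two passes: the intermediate list stands in for a global-memory buffer."""
    added = [xi + bi for xi, bi in zip(x, bias)]  # "kernel" 1: bias add
    return [gelu(v) for v in added]               # "kernel" 2: activation


def bias_gelu_fused(x, bias):
    """One pass: bias add and activation are applied per element together."""
    return [gelu(xi + bi) for xi, bi in zip(x, bias)]


x = [0.0, 1.0, -2.0, 3.5]
bias = [0.1, -0.1, 0.2, 0.0]
unfused = bias_gelu_unfused(x, bias)
fused = bias_gelu_fused(x, bias)
print(unfused == fused)
```

Both versions compute the same values; the gain from fusion comes from fewer launches and less memory traffic, not from different arithmetic.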

4. ExLlama

Zhihu: LLM Inference 1: Studying ExLlama

Implemented in Python/C++/CUDA and used with 4-bit GPTQ weights, it aims to be fast and memory-efficient on modern GPUs.
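
To show why 4-bit weights are memory-efficient, here is a simplified sketch of packing eight 4-bit codes into one 32-bit word and mapping them back to floats with a scale and zero point. The nibble order, the helper names, and the `scale * (q - zero_point)` convention are assumptions for illustration; the actual GPTQ and ExLlama storage layouts may differ.

```python
def pack_int4(values):
    """Pack 4-bit integers (0..15) into 32-bit words, eight values per word."""
    assert len(values) % 8 == 0, "this sketch requires a multiple of 8 values"
    words = []
    for start in range(0, len(values), 8):
        word = 0
        for offset, v in enumerate(values[start:start + 8]):
            assert 0 <= v <= 15
            word |= v << (4 * offset)  # the lowest nibble holds the first value
        words.append(word)
    return words


def unpack_int4(words):
    """Extract the eight 4-bit codes from each 32-bit word."""
    values = []
    for word in words:
        for offset in range(8):
            values.append((word >> (4 * offset)) & 0xF)
    return values


def dequantize(codes, scale, zero_point):
    """Map 4-bit codes back to floats: w = scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in codes]


codes = [0, 1, 2, 3, 12, 13, 14, 15]
packed = pack_int4(codes)
print(hex(packed[0]), unpack_int4(packed))
```

Eight weights fit in four bytes here, versus sixteen bytes in FP16, before counting the small overhead of the scales and zero points.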

5. vLLM

https://github.com/vllm-project/vllm
https://vllm.ai/

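LLM Inference 2: Studying the vLLM Source Code

From Theory to Practice: Understanding FlashAttention in Depth

vLLM was developed at UC Berkeley. Equipped with PagedAttention, it redefines the state of the art in LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers without requiring any changes to the model architecture.

It is implemented in Python/C++/CUDA.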
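
PagedAttention stores each sequence's KV cache in fixed-size blocks that need not be contiguous, much like pages in an operating system's virtual memory. The sketch below is a toy block allocator that shows the bookkeeping involved. The class name `BlockAllocator` and its methods are hypothetical and are not vLLM's API.

```python
class BlockAllocator:
    """Hands out fixed-size KV-cache blocks from a shared pool (toy model)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.num_tokens = {}    # sequence id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve cache space for one more token of the given sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        count = self.num_tokens.get(seq_id, 0)
        if count == len(table) * self.block_size:  # last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop(0))
        self.num_tokens[seq_id] = count + 1

    def locate(self, seq_id, token_index):
        """Translate a logical token position to (physical block, offset)."""
        block = self.block_tables[seq_id][token_index // self.block_size]
        return block, token_index % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.num_tokens[seq_id]


alloc = BlockAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")  # 3 tokens need 2 blocks of size 2
alloc.append_token("b")      # a second sequence takes the next free block
print(alloc.block_tables, alloc.locate("a", 2))
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block instead of reserving space for its maximum possible length up front.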

Example code

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="facebook/opt-125m")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

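6. llama.cpp/koboldcpp

呵呵哒: LLM Inference Frameworks 3: Studying llama.cpp/koboldcpp

An inference framework based on GGML models, written in pure C/C++. Its advantages are:

No extra dependencies: unlike Python code, which requires libraries such as PyTorch, C/C++ compiles directly to an executable, skipping the tedious per-hardware setup;
ARM NEON acceleration on Apple Silicon, with AVX2 used instead on x86;
Mixed F16 and F32 precision;
4-bit quantization support;
No GPU required; it can run on CPU alone.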
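
The mixed-precision item can be illustrated with the standard library alone: weights are stored as 2-byte IEEE 754 half-precision values and converted to a wider float when they are used in arithmetic. The helper names are hypothetical, and Python's 64-bit floats stand in for the F32 accumulator here, so this is only a sketch of the idea and not how llama.cpp is implemented.

```python
import struct


def to_f16_bytes(values):
    """Store floats as IEEE 754 half precision (2 bytes each, little-endian)."""
    return struct.pack(f"<{len(values)}e", *values)


def from_f16_bytes(data):
    """Read half-precision values back as Python floats."""
    return list(struct.unpack(f"<{len(data) // 2}e", data))


def dot(weights_f16, activations):
    """Weights come from F16 storage; the sum is accumulated in a wider float."""
    weights = from_f16_bytes(weights_f16)
    return sum(w * a for w, a in zip(weights, activations))


weights = [0.1, -0.25, 1.5, 3.0]
stored = to_f16_bytes(weights)
restored = from_f16_bytes(stored)
print(len(stored), "bytes:", restored)
```

Values such as -0.25 or 1.5 survive the round trip exactly, while 0.1 picks up a small rounding error; storing in F16 halves the memory footprint at the cost of that precision.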
