Getting Started with DeepSpeed for Inferencing Transformer-Based Models

DeepSpeed-Inference v2 has been released under the name DeepSpeed-FastGen! For the best performance, the latest features, and the newest model support, please see our DeepSpeed-FastGen release blog!

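DeepSpeed-Inference introduces several features to efficiently serve Transformer-based PyTorch models. It supports model parallelism (MP) to fit large models that would otherwise not fit in GPU memory. Even for smaller models, MP can be used to reduce inference latency. To further reduce latency and cost, we introduce inference-customized kernels. Finally, we propose a novel approach to model quantization, called MoQ, which shrinks the model and reduces the inference cost in production. For more details on the inference-related optimizations in DeepSpeed, please refer to our blog post.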

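DeepSpeed provides a seamless inference mode for compatible Transformer-based models trained using DeepSpeed, Megatron, and HuggingFace. No change is required on the modeling side, such as exporting the model or creating a different checkpoint from your trained checkpoints. To run inference on multiple GPUs for a compatible model, provide the model parallelism degree and either the checkpoint information or a model that has already been loaded from a checkpoint, and DeepSpeed will do the rest. It automatically partitions the model as necessary, injects compatible high-performance kernels into your model, and manages the inter-GPU communication. For the list of compatible models, please see here.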

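Initializing for Inference

To run inference with DeepSpeed, use the init_inference API to load the model. Here you can specify the MP degree, and if the model has not already been loaded with the appropriate checkpoint, you can also provide the checkpoint description using a JSON file or the checkpoint path.

To inject the high-performance kernels, set replace_with_kernel_inject to True for a compatible model. For models not supported by DeepSpeed, users can submit a PR that defines a new policy in the replace_policy class, specifying the different parameters of a Transformer layer, such as the attention and feed-forward parts. The policy classes in DeepSpeed create a mapping between the parameters of the original user-supplied layer implementation and DeepSpeed's inference-optimized Transformer layer.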

import torch
import deepspeed
from transformers import pipeline

# model_class, tokenizer_class, args, and world_size are placeholders
# that the surrounding script is expected to define.

# create the model
if args.pre_load_checkpoint:
    model = model_class.from_pretrained(args.model_name_or_path)
else:
    model = model_class()

# create the tokenizer (loaded from the tokenizer class, not the model class)
tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
...

# Initialize the DeepSpeed-Inference engine
ds_engine = deepspeed.init_inference(model,
                                     tensor_parallel={"tp_size": world_size},
                                     dtype=torch.half,
                                     checkpoint=None if args.pre_load_checkpoint else args.checkpoint_json,
                                     replace_with_kernel_inject=True)
# The engine wraps the model; the kernel-injected module is exposed as .module
model = ds_engine.module
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
output = pipe('Input String')

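For models that DeepSpeed does not provide kernels for, you can still run inference with model parallelism alone by passing an injection policy that names the two specific linear layers on a Transformer encoder/decoder layer: 1) the attention output GeMM and 2) the layer output GeMM. DeepSpeed needs these parts of the layer in order to add the all-reduce communication between GPUs that merges the partial results across model-parallel ranks. Below, we provide an example that shows how to use DeepSpeed inference with a T5 model. Its policy has three entries because a T5 block can contain both a self-attention and an encoder-decoder attention output projection, in addition to the feed-forward output projection.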

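import os
import torch
import deepspeed
from transformers import pipeline
from transformers.models.t5.modeling_t5 import T5Block

# The DeepSpeed launcher sets these environment variables for each process
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# create the pipeline (this also loads the model)
pipe = pipeline("text2text-generation", model="google/t5-v1_1-small", device=local_rank)
# Initialize the DeepSpeed-Inference engine
pipe.model = deepspeed.init_inference(
    pipe.model,
    tensor_parallel={"tp_size": world_size},
    dtype=torch.float,
    # output projections after which the all-reduce is added
    injection_policy={T5Block: ('SelfAttention.o', 'EncDecAttention.o', 'DenseReluDense.wo')}
)
output = pipe('Input String')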
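
To see why those output GeMMs need an all-reduce, the following toy sketch simulates two model-parallel ranks in plain Python. It assumes the input dimension of an output projection is sliced across ranks, so each rank produces only a partial result; the helper names (matvec, split_columns, all_reduce_sum) are hypothetical and this is not DeepSpeed's implementation.

```python
# Toy illustration: slicing the input dimension of a linear layer across
# simulated ranks yields partial results that must be summed (all-reduce).

def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def split_columns(weight, x, num_ranks):
    """Give each simulated rank a contiguous slice of the input dimension."""
    cols = len(x)
    assert cols % num_ranks == 0, "input dimension must divide evenly"
    step = cols // num_ranks
    shards = []
    for rank in range(num_ranks):
        lo, hi = rank * step, (rank + 1) * step
        shards.append(([row[lo:hi] for row in weight], x[lo:hi]))
    return shards


def all_reduce_sum(partials):
    """Stand-in for the collective: element-wise sum across ranks."""
    return [sum(values) for values in zip(*partials)]


weight = [[1, 2, 3, 4],
          [5, 6, 7, 8]]
x = [1, 0, -1, 2]

full = matvec(weight, x)  # single-device reference result
partials = [matvec(w, xs) for w, xs in split_columns(weight, x, 2)]
combined = all_reduce_sum(partials)

print("per-rank partial results:", partials)
print("after all-reduce:", combined, "reference:", full)
```

No single rank holds the correct output on its own; only the summed result matches the single-device computation, which is why the policy has to tell DeepSpeed where these layers are.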

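Loading Checkpoints

For models trained using HuggingFace, the model checkpoint can be pre-loaded using the from_pretrained API as shown above. For Megatron-LM models trained with model parallelism, we require a list of all the model-parallel checkpoints passed in a JSON config. Below we show how to load a Megatron-LM checkpoint trained using MP=2.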

"checkpoint.json":
{
    "type": "Megatron",
    "version": 0.0,
    "checkpoints": [
        "mp_rank_00/model_optim_rng.pt",
        "mp_rank_01/model_optim_rng.pt"
    ]
}
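
For larger MP degrees, writing the list by hand gets tedious. A minimal sketch, assuming the shards follow the mp_rank_XX/model_optim_rng.pt naming shown above, that builds the descriptor with Python's json module; megatron_checkpoint_descriptor is a hypothetical helper, not part of DeepSpeed.

```python
import json


def megatron_checkpoint_descriptor(mp_degree, filename="model_optim_rng.pt"):
    """Build the descriptor dict for `mp_degree` model-parallel shards.

    Assumes the directory layout mp_rank_00, mp_rank_01, ... shown in the
    tutorial; adjust the pattern if your checkpoints are named differently.
    """
    return {
        "type": "Megatron",
        "version": 0.0,
        "checkpoints": [
            f"mp_rank_{rank:02d}/{filename}" for rank in range(mp_degree)
        ],
    }


descriptor = megatron_checkpoint_descriptor(2)
text = json.dumps(descriptor, indent=4)  # json.dumps emits strictly valid JSON
print(text)
```

Generating the file this way also avoids syntax slips such as trailing commas, which Python's JSON parser rejects.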

For models trained using DeepSpeed, the checkpoint JSON file only needs to store the path to the model checkpoints.

"checkpoint.json":
{
    "type": "ds_model",
    "version": 0.0,
    "checkpoints": "path_to_checkpoints"
}

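DeepSpeed supports running inference with a different MP degree than the one used in training. For example, a model trained without any MP can be run with MP=2, or a model trained with MP=4 can be run for inference without any MP. DeepSpeed automatically merges or splits checkpoints during initialization as necessary.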
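
Conceptually, changing the MP degree amounts to merging the existing shards and re-splitting them. The toy sketch below illustrates the idea on a single weight matrix sliced by rows; the function names are hypothetical, and real checkpoints slice different parameters along different dimensions, so treat this as an illustration rather than DeepSpeed's actual logic.

```python
def split_rows(matrix, num_shards):
    """Split a matrix (list of rows) into equally sized row shards."""
    assert len(matrix) % num_shards == 0, "rows must divide evenly"
    size = len(matrix) // num_shards
    return [matrix[i * size:(i + 1) * size] for i in range(num_shards)]


def merge_rows(shards):
    """Concatenate row shards back into one matrix."""
    return [row for shard in shards for row in shard]


def repartition(shards, new_degree):
    """Merge the shards from training, then split for the inference degree."""
    return split_rows(merge_rows(shards), new_degree)


matrix = [[1, 2], [3, 4], [5, 6], [7, 8]]
trained_mp4 = split_rows(matrix, 4)        # one row per training rank
infer_mp2 = repartition(trained_mp4, 2)    # two rows per inference rank
infer_mp1 = repartition(trained_mp4, 1)    # no model parallelism

print(infer_mp2)
print(infer_mp1)
```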

Launching

Use the DeepSpeed launcher, deepspeed, to launch inference on multiple GPUs:

deepspeed --num_gpus 2 inference.py

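End-to-End GPT NEO 2.7B Inference

DeepSpeed inference can be used in conjunction with the HuggingFace pipeline. Below is the end-to-end client code that combines DeepSpeed inference with the HuggingFace pipeline to generate text using the GPT-NEO-2.7B model.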

# Filename: gpt-neo-2.7b-generation.py
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)



generator.model = deepspeed.init_inference(generator.model,
                                           tensor_parallel={"tp_size": world_size},
                                           dtype=torch.float,
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)

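The above script modifies the model in the HuggingFace text-generation pipeline to use DeepSpeed inference. Note that even though the original model was trained without any model parallelism and the checkpoint is a single-GPU checkpoint, we can still run inference on multiple GPUs through model-parallel tensor slicing across GPUs. To run the client, simply run: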

deepspeed --num_gpus 2 gpt-neo-2.7b-generation.py

Below is the generated text output. You can try other prompts and see how this model generates text.

[{
    'generated_text': 'DeepSpeed is a blog about the future. We will consider the future of work, the future of living, and the future of society. We will focus in particular on the evolution of living conditions for humans and animals in the Anthropocene and its repercussions'
}]
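
The script above runs a 2.7B-parameter model in fp32 on two GPUs. As a rough, weights-only estimate of what each GPU has to hold, the sketch below assumes exactly 2.7e9 parameters split evenly across ranks and ignores activations, the generation cache, and framework overhead; weight_gb_per_gpu is a hypothetical helper.

```python
# Bytes needed to store one parameter in each supported datatype
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_gb_per_gpu(num_params, dtype, mp_degree):
    """Weights-only memory per GPU in GB (1 GB = 1e9 bytes), evenly sharded."""
    return num_params * BYTES_PER_PARAM[dtype] / mp_degree / 1e9


for dtype in ("fp32", "fp16", "int8"):
    gb = weight_gb_per_gpu(2.7e9, dtype, mp_degree=2)
    print(f"{dtype}: about {gb:.2f} GB of weights per GPU with MP=2")
```

Both a narrower datatype and a higher MP degree reduce the per-GPU footprint, which is what lets models that do not fit on one device be served.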

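Datatypes and Quantized Models

DeepSpeed inference supports fp32, fp16, and int8 parameters. Set the appropriate datatype using dtype in init_inference, and DeepSpeed will choose the kernels optimized for that datatype. For quantized int8 models, if the model was quantized using DeepSpeed's quantization approach (MoQ), the settings with which the quantization was applied need to be passed to init_inference. These settings include the number of groups used for quantization and whether the MLP part of the Transformer is quantized with extra grouping. For more information on these parameters, please visit our quantization tutorial.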

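import torch
import deepspeed

# quantize_groups and mlp_extra_grouping are placeholders; they must match
# the settings that were used when the model was quantized with MoQ
model = deepspeed.init_inference(model,
                                 checkpoint='./checkpoint.json',
                                 dtype=torch.int8,
                                 quantization_setting=(quantize_groups,
                                                       mlp_extra_grouping)
                                )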
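
To give some intuition for what the number of quantization groups controls, here is a toy sketch of symmetric int8 quantization with one scale per group. It is a generic illustration in plain Python, not the MoQ algorithm, and all function names are hypothetical.

```python
def quantize_groupwise(weights, num_groups):
    """Symmetric int8 quantization with one scale per contiguous group."""
    assert len(weights) % num_groups == 0, "weights must divide evenly"
    size = len(weights) // num_groups
    q_values, scales = [], []
    for g in range(num_groups):
        group = weights[g * size:(g + 1) * size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        scales.append(scale)
        # round to the nearest integer and clamp to the int8 range
        q_values.extend(max(-127, min(127, round(w / scale))) for w in group)
    return q_values, scales


def dequantize_groupwise(q_values, scales):
    """Map int8 values back to floats using each group's scale."""
    size = len(q_values) // len(scales)
    return [q * scales[i // size] for i, q in enumerate(q_values)]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Small and large weights side by side: a single scale serves the small ones badly
weights = [0.01, -0.02, 0.03, -0.04, 1.0, -2.0, 3.0, -4.0]
errors = {}
for groups in (1, 2):
    q, s = quantize_groupwise(weights, groups)
    errors[groups] = mean_abs_error(weights, dequantize_groupwise(q, s))
    print(f"groups={groups}: mean absolute error {errors[groups]:.5f}")
```

With more groups, each scale covers a narrower range of values, so the reconstruction error drops at the cost of storing more scales. Inference has to use the same grouping that was applied at quantization time, which is why it is passed to init_inference.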

Congratulations! You have completed the DeepSpeed inference tutorial.