Automatic Tensor Parallelism for HuggingFace Models

Introduction

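This tutorial demonstrates the new automatic tensor parallelism feature for inference. Previously, the user needed to provide an injection policy to DeepSpeed to enable tensor parallelism. DeepSpeed now supports automatic tensor parallelism for HuggingFace models by default, as long as kernel injection is not enabled and no injection policy is provided. This lets users improve the performance of models that are not currently supported by kernel injection, without having to write an injection policy. Below is an example of the new method: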

# ---------------------------------------
# New automatic tensor parallelism method
# ---------------------------------------
import os
import torch
import transformers
import deepspeed
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
# create the model pipeline
pipe = transformers.pipeline(task="text2text-generation", model="google/t5-v1_1-small", device=local_rank)
# Initialize the DeepSpeed-Inference engine
pipe.model = deepspeed.init_inference(
    pipe.model,
    mp_size=world_size,
    dtype=torch.float
)
output = pipe('Input String')

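Previously, to run inference with only tensor parallelism for models without kernel injection support, you could pass an injection policy that identified two specific linear layers in a Transformer encoder/decoder layer: 1) the attention output GeMM and 2) the layer output GeMM. DeepSpeed needs to know these parts of the layer in order to add the required all-reduce communication between GPUs, which merges the partial results across model-parallel ranks. Below is an example of this previous method: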

# ----------------------------------
# Previous tensor parallelism method
# ----------------------------------
import os
import torch
import transformers
import deepspeed
from transformers.models.t5.modeling_t5 import T5Block
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
# create the model pipeline
pipe = transformers.pipeline(task="text2text-generation", model="google/t5-v1_1-small", device=local_rank)
# Initialize the DeepSpeed-Inference engine
pipe.model = deepspeed.init_inference(
    pipe.model,
    mp_size=world_size,
    dtype=torch.float,
    injection_policy={T5Block: ('SelfAttention.o', 'EncDecAttention.o', 'DenseReluDense.wo')}
)
output = pipe('Input String')

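With automatic tensor parallelism, we do not need to provide the injection policy for supported models. The injection policy is determined at runtime and applied automatically.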
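To make the role of the all-reduce more concrete, here is a minimal sketch in plain Python (no DeepSpeed or GPUs involved). It assumes the common arrangement in which the first linear layer is split by columns and the output GeMM is split by rows, so each rank produces only a partial sum of the final output. All function names and the toy weights are hypothetical and chosen for illustration; this is not how DeepSpeed implements the feature internally.

```python
# Toy simulation of tensor parallelism across 2 "ranks" using Python lists.
# Hypothetical helper names; this is an illustration, not DeepSpeed code.

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU."""
    return [[max(0, v) for v in row] for row in m]

def split_columns(w, parts):
    """Split a weight matrix into equal column shards, one per rank."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

def split_rows(w, parts):
    """Split a weight matrix into equal row shards, one per rank."""
    step = len(w) // parts
    return [w[i * step:(i + 1) * step] for i in range(parts)]

def all_reduce_sum(partials):
    """Element-wise sum of per-rank partial outputs (stands in for all-reduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

x = [[1, 2, 3, 4]]                      # one token, hidden size 4
w_in = [[1, 0, 2, -1],
        [0, 1, -1, 2],
        [1, 1, 0, 0],
        [-1, 0, 1, 1]]                  # first linear layer (4 x 4)
w_out = [[1, 2],
         [0, 1],
         [3, 0],
         [1, 1]]                        # output GeMM (4 x 2)

# Reference result computed on a single device.
full = matmul(relu(matmul(x, w_in)), w_out)

# Each rank holds a column shard of w_in and the matching row shard of w_out.
world_size = 2
partials = [
    matmul(relu(matmul(x, w_in_shard)), w_out_shard)
    for w_in_shard, w_out_shard in zip(split_columns(w_in, world_size),
                                       split_rows(w_out, world_size))
]

# Without the all-reduce, every rank only has a partial result.
reduced = all_reduce_sum(partials)
print(partials, reduced, full)
```

Each rank's partial output differs from the full result, and only their sum matches it. This is why the output GeMMs have to be identified: they mark the points where the partial results must be combined.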

Example Script

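We can observe the performance improvement from automatic tensor parallelism using the inference test suite. This script is for testing text-generation models and includes per-token latency, bandwidth, throughput, and memory checks for comparison. See the README for more information.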

Launching

Use the following command to run without DeepSpeed and without tensor parallelism. Set the test_performance flag to collect performance data:

deepspeed --num_gpus <num_gpus> DeepSpeedExamples/inference/huggingface/text-generation/inference-test.py --name <model> --batch_size <batch_size> --test_performance

To enable tensor parallelism, add the ds_inference flag for compatible models:

deepspeed --num_gpus <num_gpus> DeepSpeedExamples/inference/huggingface/text-generation/inference-test.py --name <model> --batch_size <batch_size> --test_performance --ds_inference
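When comparing several configurations, it can help to generate the two command lines above from a script. The sketch below only assembles argument lists from the flags shown in this section. The helper name `build_launch_command` and the example model name are assumptions made for illustration, and the function does not launch anything itself.

```python
import shlex

# Script path as given in the commands above.
SCRIPT = "DeepSpeedExamples/inference/huggingface/text-generation/inference-test.py"

def build_launch_command(model, num_gpus, batch_size, use_ds_inference=False):
    """Assemble the launcher arguments (hypothetical helper, illustration only)."""
    args = [
        "deepspeed", "--num_gpus", str(num_gpus), SCRIPT,
        "--name", model,
        "--batch_size", str(batch_size),
        "--test_performance",
    ]
    if use_ds_inference:
        # Enables DeepSpeed inference, and with it tensor parallelism.
        args.append("--ds_inference")
    return args

# "facebook/opt-13b" is only an example model name.
baseline = build_launch_command("facebook/opt-13b", num_gpus=1, batch_size=2)
tensor_parallel = build_launch_command("facebook/opt-13b", num_gpus=2, batch_size=2,
                                       use_ds_inference=True)

print(shlex.join(baseline))
print(shlex.join(tensor_parallel))
```

The resulting lists can be passed to `subprocess.run` if the runs should be started from Python rather than from the shell.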

T5 11B Inference Performance Comparison

The following results were collected using V100 SXM2 32GB GPUs.

Latency

T5 Latency Graph

Throughput

T5 Throughput Graph

Memory

Test	Memory Allocated per GPU	Max Batch Size	Max Throughput per GPU
No TP or 1 GPU	21.06 GB	64	9.29 TFLOPS
2 GPU TP	10.56 GB	320	13.04 TFLOPS
4 GPU TP	5.31 GB	768	14.04 TFLOPS
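The numbers in the table can be checked with a little arithmetic. The sketch below compares the measured per-GPU memory against an ideal 1/N split and derives an aggregate throughput. The aggregate figure assumes that the per-GPU throughput applies equally to every GPU in the run, which the table does not state explicitly.

```python
# Rows from the T5 11B table:
# (label, num_gpus, memory_gb_per_gpu, max_batch_size, tflops_per_gpu)
t5_rows = [
    ("No TP or 1 GPU", 1, 21.06, 64, 9.29),
    ("2 GPU TP", 2, 10.56, 320, 13.04),
    ("4 GPU TP", 4, 5.31, 768, 14.04),
]

baseline_memory = t5_rows[0][2]

summary = []
for label, gpus, memory_gb, max_batch, tflops in t5_rows:
    summary.append({
        "label": label,
        # Measured per-GPU memory relative to the single-GPU run.
        "memory_ratio": round(memory_gb / baseline_memory, 3),
        # Ratio expected if the weights were split perfectly evenly.
        "ideal_ratio": round(1 / gpus, 3),
        # Assumption: every GPU sustains the reported per-GPU throughput.
        "aggregate_tflops": round(tflops * gpus, 2),
    })

for row in summary:
    print(row)
```

The memory ratios (about 0.501 and 0.252) sit very close to the ideal 0.5 and 0.25, so the per-GPU allocation shrinks almost in proportion to the number of GPUs. That freed memory is what allows the larger maximum batch sizes.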

OPT 13B Inference Performance Comparison

The following results were collected using V100 SXM2 32GB GPUs.

OPT Throughput Graph

Test	Memory Allocated per GPU	Max Batch Size	Max Throughput per GPU
No TP	23.94 GB	2	1.65 TFLOPS
2 GPU TP	12.23 GB	20	4.61 TFLOPS
4 GPU TP	6.36 GB	56	4.90 TFLOPS

Supported Models

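The following model families have been successfully tested with automatic tensor parallelism. Other models may work but have not been tested yet.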

albert
arctic
baichuan
bert
bigbird_pegasus
bloom
camembert
chatglm2
chatglm3
codegen
codellama
deberta_v2
electra
ernie
esm
falcon
glm
gpt-j
gpt-neo
gpt-neox
longt5
luke
llama
llama2
m2m_100
marian
mistral
mixtral
mpt
mvp
nezha
openai
opt
pegasus
perceiver
phi
plbart
qwen
qwen2
qwen2-moe
reformer
roberta
roformer
splinter
starcode
t5
xglm
xlm_roberta
yoso
yuan

Unsupported Models

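The following models are not currently supported with automatic tensor parallelism. They may still be compatible with other DeepSpeed features (e.g., kernel injection for Bloom):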

deberta
flaubert
fsmt
gpt2
led
longformer
xlm
xlnet
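For scripts that pick a code path per model, the two lists can be turned into a simple lookup. This is a hypothetical helper built only from the names as they appear on this page. Those names may not match the `model_type` strings used in HuggingFace configs, so treat the mapping as an assumption to verify for your model.

```python
# Family names copied from the supported and unsupported lists on this page.
SUPPORTED = set("""
albert arctic baichuan bert bigbird_pegasus bloom camembert chatglm2 chatglm3
codegen codellama deberta_v2 electra ernie esm falcon glm gpt-j gpt-neo gpt-neox
longt5 luke llama llama2 m2m_100 marian mistral mixtral mpt mvp nezha openai opt
pegasus perceiver phi plbart qwen qwen2 qwen2-moe reformer roberta roformer
splinter starcode t5 xglm xlm_roberta yoso yuan
""".split())

UNSUPPORTED = set("deberta flaubert fsmt gpt2 led longformer xlm xlnet".split())

def auto_tp_status(family):
    """Classify a model family name (hypothetical helper, illustration only)."""
    name = family.strip().lower()
    if name in SUPPORTED:
        return "supported"
    if name in UNSUPPORTED:
        return "unsupported"
    # Not on either list: it may work, but it has not been tested.
    return "untested"

for family in ("t5", "gpt2", "deberta_v2", "some-new-model"):
    print(family, auto_tp_status(family))
```

A model reported as "untested" can still be tried with automatic tensor parallelism; the lists only record what has been verified so far.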
