Flops Profiler
In this tutorial, we introduce the DeepSpeed Flops Profiler and provide examples of its usage.
Overview
Effective use of hardware resources is critical to good performance, but performance inefficiency in existing implementations for large-scale model training and inference is often hard to detect and hard to attribute to specific module components. The DeepSpeed Flops Profiler helps users easily measure both the model training/inference speed (latency, throughput) and efficiency (floating-point operations per second, i.e., FLOPS) of a model and its submodules, with the goal of eliminating inefficiencies in existing implementations.
Below is an example output for BERT-Large (NVIDIA) running on an A100 GPU with batch size 80:
-------------------------- DeepSpeed Flops Profiler --------------------------
Profile Summary at step 10:
Notations:
data parallel size (dp_size), model parallel size(mp_size),
number of parameters (params), number of multiply-accumulate operations(MACs),
number of floating-point operations (flops), floating-point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)
world size: 1
data parallel size: 1
model parallel size: 1
batch size per GPU: 80
params per gpu: 336.23 M
params of model = params per GPU * mp_size: 336.23 M
fwd MACs per GPU: 3139.93 G
fwd flops per GPU: 6279.86 G
fwd flops of model = fwd flops per GPU * mp_size: 6279.86 G
fwd latency: 76.67 ms
bwd latency: 108.02 ms
fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 81.9 TFLOPS
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency: 116.27 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency): 102.0 TFLOPS
step latency: 34.09 us
iter latency: 184.73 ms
samples/second: 433.07
----------------------------- Aggregated Profile per GPU -----------------------------
Top modules in terms of params, MACs or fwd latency at different model depths:
depth 0:
params - {'BertForPreTrainingPreLN': '336.23 M'}
MACs - {'BertForPreTrainingPreLN': '3139.93 GMACs'}
fwd latency - {'BertForPreTrainingPreLN': '76.39 ms'}
depth 1:
params - {'BertModel': '335.15 M', 'BertPreTrainingHeads': '32.34 M'}
MACs - {'BertModel': '3092.96 GMACs', 'BertPreTrainingHeads': '46.97 GMACs'}
fwd latency - {'BertModel': '34.29 ms', 'BertPreTrainingHeads': '3.23 ms'}
depth 2:
params - {'BertEncoder': '302.31 M', 'BertLMPredictionHead': '32.34 M'}
MACs - {'BertEncoder': '3092.88 GMACs', 'BertLMPredictionHead': '46.97 GMACs'}
fwd latency - {'BertEncoder': '33.45 ms', 'BertLMPredictionHead': '2.61 ms'}
depth 3:
params - {'ModuleList': '302.31 M', 'Embedding': '31.79 M', 'Linear': '31.26 M'}
MACs - {'ModuleList': '3092.88 GMACs', 'Linear': '36.23 GMACs'}
fwd latency - {'ModuleList': '33.11 ms', 'BertPredictionHeadTransform': '1.83 ms'}
depth 4:
params - {'BertLayer': '302.31 M', 'LinearActivation': '1.05 M'}
MACs - {'BertLayer': '3092.88 GMACs', 'LinearActivation': '10.74 GMACs'}
fwd latency - {'BertLayer': '33.11 ms', 'LinearActivation': '1.43 ms'}
depth 5:
params - {'BertAttention': '100.76 M', 'BertIntermediate': '100.76 M'}
MACs - {'BertAttention': '1031.3 GMACs', 'BertIntermediate': '1030.79 GMACs'}
fwd latency - {'BertAttention': '19.83 ms', 'BertOutput': '4.38 ms'}
depth 6:
params - {'LinearActivation': '100.76 M', 'Linear': '100.69 M'}
MACs - {'LinearActivation': '1030.79 GMACs', 'Linear': '1030.79 GMACs'}
fwd latency - {'BertSelfAttention': '16.29 ms', 'LinearActivation': '3.48 ms'}
------------------------------ Detailed Profile per GPU ------------------------------
Each module profile is listed after its name in the following order:
params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS
BertForPreTrainingPreLN(
  336.23 M, 100.00% Params, 3139.93 GMACs, 100.00% MACs, 76.39 ms, 100.00% latency, 82.21 TFLOPS,
  (bert): BertModel(
    335.15 M, 99.68% Params, 3092.96 GMACs, 98.50% MACs, 34.29 ms, 44.89% latency, 180.4 TFLOPS,
    (embeddings): BertEmbeddings(...)
    (encoder): BertEncoder(
      302.31 M, 89.91% Params, 3092.88 GMACs, 98.50% MACs, 33.45 ms, 43.79% latency, 184.93 TFLOPS,
      (FinalLayerNorm): FusedLayerNorm(...)
      (layer): ModuleList(
        302.31 M, 89.91% Params, 3092.88 GMACs, 98.50% MACs, 33.11 ms, 43.35% latency, 186.8 TFLOPS,
        (0): BertLayer(
          12.6 M, 3.75% Params, 128.87 GMACs, 4.10% MACs, 1.29 ms, 1.69% latency, 199.49 TFLOPS,
          (attention): BertAttention(
            4.2 M, 1.25% Params, 42.97 GMACs, 1.37% MACs, 833.75 us, 1.09% latency, 103.08 TFLOPS,
            (self): BertSelfAttention(
              3.15 M, 0.94% Params, 32.23 GMACs, 1.03% MACs, 699.04 us, 0.92% latency, 92.22 TFLOPS,
              (query): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 182.39 us, 0.24% latency, 117.74 TFLOPS,...)
              (key): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 57.22 us, 0.07% latency, 375.3 TFLOPS,...)
              (value): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 53.17 us, 0.07% latency, 403.91 TFLOPS,...)
              (dropout): Dropout(...)
              (softmax): Softmax(...)
            )
            (output): BertSelfOutput(
              1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 114.68 us, 0.15% latency, 187.26 TFLOPS,
              (dense): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 64.13 us, 0.08% latency, 334.84 TFLOPS, ...)
              (dropout): Dropout(...)
            )
          )
          (PreAttentionLayerNorm): FusedLayerNorm(...)
          (PostAttentionLayerNorm): FusedLayerNorm(...)
          (intermediate): BertIntermediate(
            4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 186.68 us, 0.24% latency, 460.14 TFLOPS,
            (dense_act): LinearActivation(4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 175.0 us, 0.23% latency, 490.86 TFLOPS,...)
          )
          (output): BertOutput(
            4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 116.83 us, 0.15% latency, 735.28 TFLOPS,
            (dense): Linear(4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 65.57 us, 0.09% latency, 1310.14 TFLOPS,...)
            (dropout): Dropout(...)
          )
        )
        ...
        (23): BertLayer(...)
      )
    )
    (pooler): BertPooler(...)
  )
  (cls): BertPreTrainingHeads(...)
)
------------------------------------------------------------------------------
In the summary profile, the DeepSpeed Flops Profiler outputs the number of parameters, floating-point operations (flops), FLOPS, latency, and throughput in samples/second of the model. This profile shows how much performance gap (compared to the peak hardware performance) the current model execution has, and helps users tune the training or inference setup (e.g., hyperparameters, data parallelism, model parallelism, system configurations, etc.) for better performance.
The DeepSpeed Flops Profiler also measures significant modules at different model depths (aggregated profile) and module-specific profiles in the model architecture (detailed profile). Using these profiles, DeepSpeed users can understand how each layer or submodule contributes to the overall model complexity/performance. Users can then adjust or refactor the model design to improve performance. For example, using the profiler, DeepSpeed users can quantitatively tell whether stacking smaller layers is lighter or more performant than using bigger ones. The aggregated and detailed profiles also allow users to quickly identify bottleneck modules. In the BERT-Large example above, using the DeepSpeed Flops Profiler, we find that BertLayer is the most significant layer and contains quite a few dropout, softmax, and layer norm modules along with linear modules. These modules are not heavy in flops, yet they trigger many GPU kernel invocations and create excessive read/write requests to memory. The pattern shown in the detailed profile suggests that this is a perfect match for kernel fusion, and we developed fused transformer kernels to reduce the data movement (see DeepSpeedBert). After applying our optimizations, we see a 25% improvement in FLOPS per GPU and in overall training samples/second reported by the DeepSpeed Flops Profiler.
The DeepSpeed Flops Profiler can be used with the DeepSpeed runtime without any user code change, or independently from DeepSpeed as a standalone package. When using DeepSpeed for model training, the profiler can be enabled in the DeepSpeed configuration file. As a standalone package, the profiler API can be used in both training and inference code. The DeepSpeed profiler is still under active development and includes only initial features. Stay tuned; more exciting features will be added soon.
Flops Measurement
Similar to existing flops calculation tools or methods, the DeepSpeed Flops Profiler measures the flops of the forward pass of a module, and the flops of the backward pass is estimated as 2 times that of the forward pass. Different from the PyTorch profiler, which calculates the flops of PyTorch operators, the DeepSpeed Flops Profiler measures the flops within modules in a model and provides users with more insight into model execution. The flops estimation is partly inspired by ptflops, with the major difference being that the DeepSpeed Flops Profiler not only supports flops computation directly at the module level, but also captures the torch.nn.functional calls invoked inside a module to estimate the flops. Thus the DeepSpeed Flops Profiler allows for customized modules in the model, e.g., ParallelTransformerLayer, ParallelSelfAttention, RowParallelLinear, etc. in Megatron-LM. This is in contrast to ptflops, which requires users to write a customized flops calculation function for each customized module.
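To make the torch.nn.functional capture concrete, here is a minimal, hypothetical sketch (the SimpleFunctionalMLP module, its sizes, and the input shape are illustrative assumptions, not part of DeepSpeed or Megatron-LM): the module does all of its math through functional calls rather than nn.Linear submodules, yet get_model_profile still attributes flops, MACs, and parameters to it without any custom counting function.
import torch
import torch.nn as nn
import torch.nn.functional as F
from deepspeed.profiling.flops_profiler import get_model_profile

class SimpleFunctionalMLP(nn.Module):
    # A toy custom module that computes through torch.nn.functional
    # instead of holding nn.Linear submodules.
    def __init__(self, hidden, ffn):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(ffn, hidden))
        self.w2 = nn.Parameter(torch.randn(hidden, ffn))

    def forward(self, x):
        # F.linear and F.gelu are functional calls; the profiler patches
        # torch.nn.functional, so their flops are counted and attributed
        # to this module.
        return F.linear(F.gelu(F.linear(x, self.w1)), self.w2)

model = SimpleFunctionalMLP(hidden=1024, ffn=4096)
flops, macs, params = get_model_profile(model,
                                         input_shape=(8, 128, 1024),
                                         print_profile=False,
                                         as_string=True)
print(flops, macs, params)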
Multi-GPU, Multi-node, Data Parallelism, and Model Parallelism
The DeepSpeed Flops Profiler outputs the per-GPU profile as well as the world size, data parallel size, and model parallel size.
For models running on multiple GPUs or multiple nodes, only changes in model parallelism (e.g., --model-parallel-size in Megatron-LM) affect the number of flops and parameters profiled, i.e., model_parallel_size * flops = total_flops and model_parallel_size * parameters = total_parameters. The data parallel size or world size (related to the number of GPUs or nodes) does not affect the per-GPU profile.
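As a worked example (the model-parallel numbers are hypothetical, extrapolated from the BERT-Large run above): with model parallel size 1 the profiler reports 336.23 M params and 3139.93 GMACs of forward MACs per GPU, which are also the full-model numbers. If the same model were split with model parallel size 2, each GPU would hold roughly 168 M params and about 1570 GMACs of forward MACs, and multiplying the per-GPU numbers by mp_size recovers the full-model totals. Adding more data-parallel replicas, by contrast, would leave the per-GPU profile unchanged.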
Usage
The DeepSpeed Flops Profiler can be used with the DeepSpeed runtime or as a standalone package. When using DeepSpeed for model training, the profiler can be configured in the deepspeed configuration file without any user code change. To use the flops profiler outside of the DeepSpeed runtime, install DeepSpeed and import the flops_profiler package to use the APIs directly. Examples of each usage are given below.
Usage With the DeepSpeed Runtime
When using DeepSpeed for model training, the profiler can be configured in the deepspeed configuration file. No explicit API calls are needed to use the profiler; it is enabled by adding the following field to deepspeed's configuration file. Refer to the flops profiler configuration documentation for details.
{
  "flops_profiler": {
    "enabled": true,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
  }
}
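No further code is needed: when these fields are present in the configuration passed to DeepSpeed, the profiler runs automatically at the configured step. As a minimal sketch under that assumption (the file name ds_config.json and the surrounding training objects are illustrative, not part of this tutorial):
import deepspeed

# ds_config.json contains the "flops_profiler" section shown above alongside
# the usual training fields (train_batch_size, optimizer, etc.).
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,                          # an existing torch.nn.Module
    model_parameters=model.parameters(),
    config="ds_config.json",              # a path or an equivalent Python dict
)
# The profile is then printed (or written to output_file) at profile_step of
# the normal model_engine forward/backward/step loop.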
Example: Megatron-LM
For information on running Megatron-LM with DeepSpeed, please refer to our tutorial Megatron-LM.
An example output for a 12-layer Megatron-LM model (hidden_size = 8192, num_attention_heads = 32, batch_size = 1024, seq_length = 1024) is shown below.
-------------------------- DeepSpeed Flops Profiler --------------------------
Profile Summary at step 10:
Notations:
data parallel size (dp_size), model parallel size(mp_size),
number of parameters (params), number of multiply-accumulate operations(MACs),
number of floating-point operations (flops), floating-point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)
world size: 1
data parallel size: 1
model parallel size: 1
batch size per GPU: 1024
params per gpu: 1.29 M
params of model = params per GPU * mp_size: 1.29 M
fwd MACs per GPU: 41271.95 G
fwd flops per GPU: 82543.9 G
fwd flops of model = fwd flops per GPU * mp_size: 82543.9 G
fwd latency: 1.89 s
bwd latency: 5.38 s
fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 43.68 TFLOPS
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency: 30.7 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency): 34.07 TFLOPS
step latency: 34.12 s
iter latency: 41.39 s
samples/second: 24.74
----------------------------- Aggregated Profile per GPU -----------------------------
Top 1 modules in terms of params, MACs or fwd latency at different model depths:
depth 0:
params - {'GPT2Model': '1.29 M'}
MACs - {'GPT2Model': '41271.95 GMACs'}
fwd latency - {'GPT2Model': '1.84 s'}
depth 1:
params - {'TransformerLanguageModel': '1.29 M'}
MACs - {'TransformerLanguageModel': '39584.03 GMACs'}
fwd latency - {'TransformerLanguageModel': '1.83 s'}
depth 2:
params - {'ParallelTransformer': '1.29 M'}
MACs - {'ParallelTransformer': '39584.03 GMACs'}
fwd latency - {'ParallelTransformer': '1.81 s'}
depth 3:
params - {'ModuleList': '1.28 M'}
MACs - {'ModuleList': '39584.03 GMACs'}
fwd latency - {'ModuleList': '1.3 s'}
depth 4:
params - {'ParallelTransformerLayerPart2': '688.15 k'}
MACs - {'ParallelTransformerLayerPart2': '26388.28 GMACs'}
fwd latency - {'ParallelTransformerLayerPart2': '865.73 ms'}
depth 5:
params - {'ParallelMLP': '491.54 k'}
MACs - {'ParallelMLP': '26388.28 GMACs'}
fwd latency - {'ParallelMLP': '849.4 ms'}
------------------------------ Detailed Profile per GPU ------------------------------
Each module profile is listed after its name in the following order:
params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS
Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). They are not counted as submodules, thus not to be printed out. However they make up the difference between a parent's MACs(or latency) and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.
3. The fwd latency listed in the top module's profile is directly captured at the module forward function in PyTorch, thus it's less than the fwd latency shown above which is captured in DeepSpeed.
GPT2Model(
  1.29 M, 100.00% Params, 41271.95 GMACs, 100.00% MACs, 1.84 s, 100.00% latency, 44.78 TFLOPS,
  (language_model): TransformerLanguageModel(
    1.29 M, 100.00% Params, 39584.03 GMACs, 95.91% MACs, 1.83 s, 99.11% latency, 43.34 TFLOPS,
    (embedding): Embedding(
      2, 0.00% Params, 0 MACs, 0.00% MACs, 18.1 ms, 0.98% latency, 0.0 FLOPS,
      (word_embeddings): VocabParallelEmbedding(1, 0.00% Params, 0 MACs, 0.00% MACs, 164.75 us, 0.01% latency, 0.0 FLOPS, )
      (position_embeddings): Embedding(1, 0.00% Params, 0 MACs, 0.00% MACs, 489.23 us, 0.03% latency, 0.0 FLOPS, 1024, 8192)
      (embedding_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 93.94 us, 0.01% latency, 0.0 FLOPS, p=0.1, inplace=False)
    )
    (transformer): ParallelTransformer(
      1.29 M, 100.00% Params, 39584.03 GMACs, 95.91% MACs, 1.81 s, 98.11% latency, 43.78 TFLOPS,
      (layers): ModuleList(
        1.28 M, 98.73% Params, 39584.03 GMACs, 95.91% MACs, 1.3 s, 70.66% latency, 60.79 TFLOPS,
        (0): ParallelTransformerLayerPart1(
          49.15 k, 3.80% Params, 1099.65 GMACs, 2.66% MACs, 23.5 ms, 1.27% latency, 93.6 TFLOPS,
          (input_layernorm): FusedLayerNorm(16.38 k, 1.27% Params, 0 MACs, 0.00% MACs, 128.75 us, 0.01% latency, 0.0 FLOPS, torch.Size([8192]), eps=1e-05, elementwise_affine=True)
          (attention): ParallelSelfAttention(
            32.77 k, 2.53% Params, 1099.65 GMACs, 2.66% MACs, 22.8 ms, 1.24% latency, 96.46 TFLOPS,
            (query_key_value): ColumnParallelLinear(24.58 k, 1.90% Params, 824.63 GMACs, 2.00% MACs, 8.93 ms, 0.48% latency, 184.7 TFLOPS, )
            (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 134.22 MMACs, 0.00% MACs, 151.16 us, 0.01% latency, 1.78 TFLOPS, )
            (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 79.63 us, 0.00% latency, 0.0 FLOPS, p=0.1, inplace=False)
            (dense): RowParallelLinear(8.19 k, 0.63% Params, 274.88 GMACs, 0.67% MACs, 2.67 ms, 0.14% latency, 205.81 TFLOPS, )
          )
        )
        (1): ParallelTransformerLayerPart2(
          57.35 k, 4.43% Params, 2199.02 GMACs, 5.33% MACs, 77.53 ms, 4.21% latency, 56.73 TFLOPS,
          (post_attention_layernorm): FusedLayerNorm(16.38 k, 1.27% Params, 0 MACs, 0.00% MACs, 116.11 us, 0.01% latency, 0.0 FLOPS, torch.Size([8192]), eps=1e-05, elementwise_affine=True)
          (mlp): ParallelMLP(
            40.96 k, 3.16% Params, 2199.02 GMACs, 5.33% MACs, 76.19 ms, 4.13% latency, 57.72 TFLOPS,
            (dense_h_to_4h): ColumnParallelLinear(32.77 k, 2.53% Params, 1099.51 GMACs, 2.66% MACs, 10.79 ms, 0.59% latency, 203.81 TFLOPS, )
            (dense_4h_to_h): RowParallelLinear(8.19 k, 0.63% Params, 1099.51 GMACs, 2.66% MACs, 14.38 ms, 0.78% latency, 152.95 TFLOPS, )
          )
        )
        ...
        (23): ParallelTransformerLayerPart2(...)
      )
      (final_layernorm): FusedLayerNorm(16.38 k, 1.27% Params, 0 MACs, 0.00% MACs, 110.86 us, 0.01% latency, 0.0 FLOPS, torch.Size([8192]), eps=1e-05, elementwise_affine=True)
    )
  )
)
------------------------------------------------------------------------------
Usage Outside the DeepSpeed Runtime
The profiler can be used as a standalone package outside of the DeepSpeed runtime. Simply install DeepSpeed and import the flops_profiler package to use the APIs directly. Refer to the DeepSpeed installation guide for how to install DeepSpeed.
In Model Inference
To profile a trained model in inference, use the get_model_profile function. Examples are given below.
Example: AlexNet
The following example shows how to profile AlexNet with the DeepSpeed Flops Profiler.
import torchvision.models as models
import torch
from deepspeed.profiling.flops_profiler import get_model_profile
from deepspeed.accelerator import get_accelerator
with get_accelerator().device(0):
    model = models.alexnet()
    batch_size = 256
    flops, macs, params = get_model_profile(model=model, # model
                                    input_shape=(batch_size, 3, 224, 224), # input shape to the model. If specified, the model takes a tensor with this shape as the only positional argument.
                                    args=None, # list of positional arguments to the model.
                                    kwargs=None, # dictionary of keyword arguments to the model.
                                    print_profile=True, # prints the model graph with the measured profile attached to each module
                                    detailed=True, # print the detailed profile
                                    module_depth=-1, # depth into the nested modules, with -1 being the inner most modules
                                    top_modules=1, # the number of top modules to print aggregated profile
                                    warm_up=10, # the number of warm-ups before measuring the time of each module
                                    as_string=True, # print raw numbers (e.g. 1000) or as human-readable strings (e.g. 1k)
                                    output_file=None, # path to the output file. If None, the profiler prints to stdout.
                                    ignore_modules=None) # the list of modules to ignore in the profiling
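get_model_profile returns the measured flops, MACs, and parameter count of the model. With as_string=True they come back as human-readable strings like the values in the outputs above (e.g. '3139.93 GMACs'), while as_string=False returns raw numbers that are easier to log or compare programmatically. A trivial, purely illustrative follow-up:
# The profile itself goes to stdout (or output_file); the return values can be
# used separately, e.g. recorded alongside other experiment metadata.
print(f"flops: {flops}, macs: {macs}, params: {params}")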
Example: Bert
from functools import partial
import torch
from transformers import BertForSequenceClassification, BertTokenizer
from deepspeed.profiling.flops_profiler import get_model_profile
from deepspeed.accelerator import get_accelerator
def bert_input_constructor(batch_size, seq_len, tokenizer):
    fake_seq = ""
    for _ in range(seq_len - 2):  # ignore the two special tokens [CLS] and [SEP]
        fake_seq += tokenizer.pad_token
    inputs = tokenizer([fake_seq] * batch_size,
                       padding=True,
                       truncation=True,
                       return_tensors="pt")
    labels = torch.tensor([1] * batch_size)
    inputs = dict(inputs)
    inputs.update({"labels": labels})
    return inputs

with get_accelerator().device(0):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
    batch_size = 4
    seq_len = 128
    enable_profile = True
    if enable_profile:
        flops, macs, params = get_model_profile(
            model,
            kwargs=bert_input_constructor(batch_size, seq_len, tokenizer),
            print_profile=True,
            detailed=True,
        )
    else:
        inputs = bert_input_constructor(batch_size, seq_len, tokenizer)
        outputs = model(**inputs)
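Here kwargs (rather than input_shape) is used because BertForSequenceClassification expects a dictionary of keyword tensor inputs (input_ids, attention_mask, labels, and so on) instead of a single positional tensor; the constructed dictionary is passed through to the model's forward call during profiling.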
In Model Training Workflow
To profile the model's forward pass in a training workflow, use the FlopsProfiler class. The FlopsProfiler class provides the following methods:
start_profile() - starts profiling
get_total_flops(as_string=False) - returns the total number of floating-point operations in the model
get_total_macs(as_string=False) - returns the total number of MACs in the model
get_total_params(as_string=False) - returns the total number of parameters in the model
print_model_profile(profile_step=1, module_depth=-1, top_modules=3, detailed=True, output_file=None) - prints the model profile
stop_profile() - stops profiling. This stops the flops counting in the model.
end_profile() - cleans up. This removes the profiling attributes added to the model during profiling. It should be invoked at the end of profiling, after calling get_total_flops, get_total_params, or print_model_profile.
Example Training Workflow
Below is an example of this usage in a typical training workflow.
from deepspeed.profiling.flops_profiler import FlopsProfiler
model = Model()
prof = FlopsProfiler(model)
profile_step = 5
print_profile = True

for step, batch in enumerate(data_loader):
    # start profiling at training step "profile_step"
    if step == profile_step:
        prof.start_profile()

    # forward() method
    loss = model(batch)

    # end profiling and print output
    if step == profile_step: # if using multi nodes, check global_rank == 0 as well
        prof.stop_profile()
        flops = prof.get_total_flops()
        macs = prof.get_total_macs()
        params = prof.get_total_params()
        if print_profile:
            prof.print_model_profile(profile_step=profile_step)
        prof.end_profile()

    # runs backpropagation
    loss.backward()

    # weight update
    optimizer.step()