Using PyTorch Profiler with DeepSpeed for performance debugging

This tutorial describes how to use PyTorch Profiler with DeepSpeed.

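PyTorch Profiler is an open-source tool that provides accurate and efficient performance analysis and troubleshooting for large-scale deep learning models. The profiling results can be written to a .json trace file and viewed in Google's Perfetto trace viewer (https://ui.perfetto.dev). The Python extension for Microsoft Visual Studio Code integrates TensorBoard into the code editor, including support for the PyTorch Profiler.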

For more details, refer to the PyTorch Profiler documentation.
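Because the trace is an ordinary JSON file, it can also be inspected with a few lines of Python. The sketch below is an illustration, not part of the profiler. It assumes the file follows the Chrome trace event layout (a top-level "traceEvents" list in which complete events carry "ph": "X" and a "dur" field in microseconds). The helper name summarize_trace and the sample events are hypothetical.

```python
import json
import os
import tempfile
from collections import defaultdict


def summarize_trace(path):
    """Sum the duration (microseconds) of complete events, grouped by name."""
    with open(path, "r", encoding="utf-8") as f:
        trace = json.load(f)
    # Assumption: events live under "traceEvents"; some tools write a bare list.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        # "X" marks a complete event that carries its own duration.
        if event.get("ph") == "X":
            totals[event.get("name", "<unnamed>")] += event.get("dur", 0)
    # Longest-running names first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample trace, written to a temporary file for the demo.
sample = {"traceEvents": [
    {"name": "aten::addmm", "ph": "X", "ts": 0, "dur": 120},
    {"name": "aten::relu", "ph": "X", "ts": 130, "dur": 30},
    {"name": "aten::addmm", "ph": "X", "ts": 170, "dur": 100},
    {"name": "process_name", "ph": "M"},  # metadata event, ignored
]}
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "trace.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    summary = summarize_trace(sample_path)

for name, total_us in summary:
    print(f"{name}: {total_us:.0f} us")
```

To use it on a real run, point summarize_trace at a trace file written by the profiler instead of the sample.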

Profile the model training loop

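The example below shows how to profile a training loop by wrapping the code in the profiler context manager. The profiler assumes that training consists of a sequence of steps, numbered from zero. It accepts a number of parameters, such as schedule, on_trace_ready, and with_stack.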

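In the example below, the profiler skips the first 5 steps, uses the next 2 steps as warm-up, and actively records the following 6 steps. Because repeat is set to 2, the profiler stops recording after the first two cycles. For detailed usage of schedule, see "Using profiler to analyze long-running jobs".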

import torch
from torch.profiler import profile, record_function, ProfilerActivity

# model_engine and data_loader are assumed to come from your existing
# DeepSpeed setup (for example, deepspeed.initialize).
with torch.profiler.profile(
    schedule=torch.profiler.schedule(
        wait=5,    # During this phase the profiler is not active.
        warmup=2,  # During this phase the profiler starts tracing, but the results are discarded.
        active=6,  # During this phase the profiler traces and records data.
        repeat=2), # Specifies an upper bound on the number of cycles.
    # "./log" is a placeholder output directory; choose your own.
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log"),
    with_stack=True  # Enable stack tracing; adds extra profiling overhead.
) as profiler:
    for step, batch in enumerate(data_loader):
        print("step:{}".format(step))

        # Forward pass
        loss = model_engine(batch)

        # Backpropagation
        model_engine.backward(loss)

        # Weight update
        model_engine.step()
        profiler.step()  # Signal to the profiler that the next step has started.
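To see which training steps such a schedule actually records, the phases can be worked out by hand. The sketch below is a simplified pure-Python model of the wait / warmup / active / repeat behaviour described above, not the library's implementation. The helper name phase_for_step is hypothetical.

```python
def phase_for_step(step, wait, warmup, active, repeat=0):
    """Return "idle", "warmup", or "active" for a zero-based step index."""
    cycle_length = wait + warmup + active
    # With repeat > 0 the profiler stays idle once all cycles are done.
    if repeat > 0 and step >= repeat * cycle_length:
        return "idle"
    position = step % cycle_length
    if position < wait:
        return "idle"
    if position < wait + warmup:
        return "warmup"
    return "active"


# Same values as the schedule in the training-loop example.
settings = dict(wait=5, warmup=2, active=6, repeat=2)
phases = [phase_for_step(step, **settings) for step in range(30)]
recorded_steps = [step for step, phase in enumerate(phases) if phase == "active"]
print(recorded_steps)
```

Under this model each cycle spans 13 steps, so two cycles cover steps 0 through 25 and every later step is left unprofiled.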

Label arbitrary code ranges

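The record_function context manager can be used to label arbitrary code ranges with user-provided names. For example, the following code labels the forward pass as "model_forward":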

with profile(record_shapes=True) as prof:  # record_shapes indicates whether to record shapes of the operator inputs.
    with record_function("model_forward"):
        model_engine(inputs)
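The same pattern can be tried without a DeepSpeed engine. The following is a minimal CPU-only sketch, assuming a small torch.nn.Linear model as a stand-in for model_engine. The second label, "loss_compute", and the tensor sizes are made up for illustration.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Stand-in for the DeepSpeed engine: a tiny CPU model (assumption for the demo).
model = torch.nn.Linear(16, 4)
inputs = torch.randn(8, 16)
targets = torch.randn(8, 4)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_forward"):
        outputs = model(inputs)
    with record_function("loss_compute"):
        loss = torch.nn.functional.mse_loss(outputs, targets)

# Each label appears in the aggregated results next to the PyTorch operators.
labels = {event.key for event in prof.key_averages()}
print(sorted(label for label in labels if not label.startswith("aten::")))
```

Separate labels for separate phases make it easier to tell in the summary which part of a step dominates.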

Profile CPU or GPU activities

The activities parameter passed to the profiler specifies a list of activities to profile during execution of the code range wrapped by the profiler context manager:

ProfilerActivity.CPU - PyTorch operators, TorchScript functions, and user-defined code labels (record_function).
ProfilerActivity.CUDA - on-device CUDA kernels. Note that CUDA profiling incurs non-negligible overhead.

The example below profiles both the CPU and GPU activities in the model forward pass and prints a summary table sorted by total CUDA time.

with profile(activities=[
        ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("model_forward"):
        model_engine(inputs)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
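On a machine without a GPU, the activity list and the sort column can be chosen at run time. The sketch below is one possible way to do this. It assumes a small torch.nn.Linear model as a stand-in for model_engine, and the names activities, sort_key, and summary_table are chosen for illustration.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Always profile the CPU; add CUDA only when a GPU is actually present.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

device = "cuda" if torch.cuda.is_available() else "cpu"
# Stand-in for the DeepSpeed engine (assumption for the demo).
model = torch.nn.Linear(32, 8).to(device)
inputs = torch.randn(4, 32, device=device)

with profile(activities=activities, record_shapes=True) as prof:
    with record_function("model_forward"):
        model(inputs)

# Sort by the column that matches what was profiled.
sort_key = "cuda_time_total" if ProfilerActivity.CUDA in activities else "cpu_time_total"
summary_table = prof.key_averages().table(sort_by=sort_key, row_limit=10)
print(summary_table)
```

Leaving CUDA out of the list when it is not needed also avoids the profiling overhead noted above.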

Profile memory consumption

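Passing profile_memory=True to the PyTorch profiler enables memory profiling, which records the amount of memory (used by the model's tensors) that was allocated or released during the execution of the model's operators. For example: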

with profile(activities=[ProfilerActivity.CUDA],
        profile_memory=True, record_shapes=True) as prof:
    model(inputs)

print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))

Self memory corresponds to the memory allocated (or released) by an operator itself, excluding its child calls to other operators.
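The difference between self memory and total memory can be illustrated with a toy call tree. The sketch below is a conceptual model only and does not use the profiler. The OpNode class, the operator names, and the byte counts are all made up.

```python
from dataclasses import dataclass, field


@dataclass
class OpNode:
    """Hypothetical record of one operator call and the calls it makes."""
    name: str
    self_bytes: int  # memory allocated directly by this operator
    children: list = field(default_factory=list)


def total_bytes(node):
    """Total memory = the operator's own allocations plus all child calls."""
    return node.self_bytes + sum(total_bytes(child) for child in node.children)


# Made-up numbers: a wrapper operator that allocates nothing itself.
addmm = OpNode("addmm", self_bytes=4096)
transpose = OpNode("transpose", self_bytes=0)
linear = OpNode("linear", self_bytes=0, children=[transpose, addmm])

for node in (linear, transpose, addmm):
    print(f"{node.name}: self={node.self_bytes} total={total_bytes(node)}")
```

In this toy tree the wrapper has a large total but zero self memory, which is why sorting by a self column points at the operator that actually performs the allocation.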