Getting Started with DeepSpeed-Ulysses: Training Transformer Models with Extremely Long Sequences

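In this tutorial we describe how to enable DeepSpeed-Ulysses. DeepSpeed-Ulysses is a simple but communication- and memory-efficient sequence parallelism mechanism for training large Transformer models with massive sequence lengths. It partitions input tensors along the sequence dimension and uses a communication-efficient all-to-all collective for distributed attention computation. In addition, DeepSpeed-Ulysses incorporates advanced modeling and system optimizations, such as Flash attention, sparse attention, and the ZeRO optimizer, to optimize both computational efficiency and memory usage. With DeepSpeed sequence parallelism, both model size and sequence length can scale almost indefinitely, unconstrained by the memory of a single GPU, while sustaining a high fraction of peak compute performance. Currently, DeepSpeed-Ulysses can handle sequences of up to 1 million tokens (about 10 times the length of a complete Harry Potter book!) on 64 A100 GPUs. Please read our DeepSpeed-Ulysses blog to learn more!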

1. Installation

You need DeepSpeed v0.10.2 or higher to use the DeepSpeed Sequence feature. Installing DeepSpeed is as simple as running pip install deepspeed; see the DeepSpeed installation instructions for more details.
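If you want a training script to fail early on an older installation, a small version check can help. The following is a minimal sketch and not part of DeepSpeed: the helper name meets_minimum_version is hypothetical, and it assumes plain dotted version strings (a local suffix such as +cu118 is ignored).

```python
import re

MIN_DEEPSPEED_VERSION = "0.10.2"  # minimum version stated in this tutorial


def _release_tuple(version):
    """Turn '0.10.2' or '0.12.6+cu118' into (0, 10, 2) or (0, 12, 6)."""
    numeric = re.match(r"\d+(?:\.\d+)*", version)
    if numeric is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in numeric.group(0).split("."))


def meets_minimum_version(installed, minimum=MIN_DEEPSPEED_VERSION):
    """Hypothetical helper: True if `installed` is at least `minimum`."""
    # Compare numerically; plain string comparison would rank '0.9.5' above '0.10.2'.
    return _release_tuple(installed) >= _release_tuple(minimum)
```

In a training script you would typically pass deepspeed.__version__ as the installed argument and raise an error when the check returns False.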

2. How to use DeepSpeed-Ulysses in your application?

Integrating DS-Seq into your training code is easy. In this section we describe how to integrate DeepSpeed-Ulysses through our Megatron-DeepSpeed code repository.

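Replace the attention module: First, update your attention module to use the DeepSpeed-Ulysses DistributedAttention. Here we use the attention from Megatron-DeepSpeed, which is the causal attention used in training GPT-3 style models. Rewrite the attention block: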

def __init__(self, ...):
    ...
    self.local_attn = CoreAttention(self.layer_number, config, self.attn_mask_type)
    # Without sequence parallelism, the local attention is used directly.
    self.core_attention = self.local_attn
    ...

def forward(self, ...):
    ...
    context_layer = self.core_attention(
        query_layer, key_layer, value_layer, attention_mask)
    ...

to:

from deepspeed.sequence.layer import DistributedAttention

def __init__(self, ...):
    ...
    self.local_attn = CoreAttention(self.layer_number, config, self.attn_mask_type)
    # Wrap the local attention so it runs across the sequence parallel group.
    self.dist_attn = DistributedAttention(self.local_attn, parallel_state.get_sequence_parallel_group())
    ...

def forward(self, ...):
    ...
    context_layer = self.dist_attn(query_layer, key_layer, value_layer, attention_mask)
    ...
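To build intuition for what happens around local_attn, the sketch below simulates the data movement of the all-to-all on plain Python lists, with no GPUs or torch.distributed involved. It illustrates the idea from the introduction (inputs partitioned along the sequence dimension, then exchanged so that attention can run over the full sequence) and is not the DeepSpeed implementation. The function name and the list-based layout are assumptions made for illustration.

```python
def seq_to_head_all_to_all(seq_shards, num_heads):
    """Simulate the all-to-all that runs before local attention.

    seq_shards[r] is rank r's chunk of the sequence: a list of tokens,
    each token being a list with one entry per attention head.
    Returns head_shards, where head_shards[r] holds every token of the
    full sequence but only the heads assigned to rank r.
    """
    world = len(seq_shards)
    if num_heads % world != 0:
        raise ValueError("number of heads must be divisible by the group size")
    heads_per_rank = num_heads // world

    head_shards = []
    for dst in range(world):
        lo, hi = dst * heads_per_rank, (dst + 1) * heads_per_rank
        full_sequence = []
        # Visiting source ranks in order restores the original token order.
        for src in range(world):
            for token in seq_shards[src]:
                full_sequence.append(token[lo:hi])
        head_shards.append(full_sequence)
    return head_shards


# Toy input: 4 tokens, 4 heads, 2 ranks; entry (t, h) marks token t, head h.
tokens = [[(t, h) for h in range(4)] for t in range(4)]
seq_shards = [tokens[0:2], tokens[2:4]]
head_shards = seq_to_head_all_to_all(seq_shards, num_heads=4)
```

After the exchange each rank sees all 4 tokens but only 2 of the 4 heads, so an ordinary attention kernel can run locally. A second all-to-all in the opposite direction then restores the sequence-partitioned layout.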

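Add the sequence parallel communication group: Note that DistributedAttention takes local_attn and sequence_parallel_group as arguments, where local_attn can be your original attention block. You also need to build the sequence parallel communication group and pass it to DistributedAttention. One way to do this is to build the sequence parallel group during model initialization.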

import torch

# Module-level handle to the sequence parallel group of the current rank.
_SEQUENCE_PARALLEL_GROUP = None

def initialize_model_parallel(
    ...
    sequence_parallel_size,
    ...
):
    ...
    # world_size and rank are assumed to be set in the elided code above.
    num_sequence_parallel_groups: int = world_size // sequence_parallel_size
    num_sequence_data_parallel_groups: int = world_size // sequence_parallel_size // data_parallel_size
    ...
    global _SEQUENCE_PARALLEL_GROUP
    # Consecutive ranks form one sequence parallel group.
    for i in range(num_sequence_parallel_groups):
        ranks = range(i * sequence_parallel_size,
                      (i + 1) * sequence_parallel_size)
        # new_group must be called by every rank, including non-members.
        group = torch.distributed.new_group(ranks)
        if rank in ranks:
            _SEQUENCE_PARALLEL_GROUP = group

def get_sequence_parallel_group():
    """Get the sequence parallel group the caller rank belongs to."""
    return _SEQUENCE_PARALLEL_GROUP
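The grouping loop above assigns consecutive ranks to the same sequence parallel group. The following sketch reproduces only that rank arithmetic in plain Python, so the layout can be inspected without launching a distributed job. The helper names are hypothetical, and no torch.distributed calls are made.

```python
def sequence_parallel_rank_groups(world_size, sequence_parallel_size):
    """Return one list of ranks per sequence parallel group."""
    if world_size % sequence_parallel_size != 0:
        raise ValueError("world_size must be divisible by sequence_parallel_size")
    num_groups = world_size // sequence_parallel_size
    return [
        list(range(i * sequence_parallel_size, (i + 1) * sequence_parallel_size))
        for i in range(num_groups)
    ]


def group_of(rank, groups):
    """Return the group (list of ranks) that contains `rank`."""
    for ranks in groups:
        if rank in ranks:
            return ranks
    raise ValueError(f"rank {rank} is not in any group")


# Example: 8 processes with a sequence parallel degree of 4.
groups = sequence_parallel_rank_groups(world_size=8, sequence_parallel_size=4)
```

With these example values, ranks 0 to 3 share one group and ranks 4 to 7 share the other, so each sequence is split across four consecutive ranks.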

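In the Megatron-DeepSpeed example, to enable sequence parallelism, set the degree of parallelism with the --ds-sequence-parallel-size argument. You also need to ensure that the number of attention heads is divisible by this value. We have prepared scripts to help you quickly get started with training GPT-3 style models with very long sequences: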

Megatron-DeepSpeed/examples_deepspeed/sequence_parallel$ bash ds_pretrain_gpt_1.3B_seq_parallel_32k.sh
Megatron-DeepSpeed/examples_deepspeed/sequence_parallel$ bash ds_pretrain_gpt_30B_seq_parallel_32k.sh
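The divisibility requirement on attention heads can be checked before launching a job. The helper below is a hypothetical pre-launch check, not part of Megatron-DeepSpeed. The second condition (world size divisible by the sequence parallel degree) is an assumption inferred from the integer division in the group construction code.

```python
def check_sequence_parallel_config(num_attention_heads, sequence_parallel_size, world_size):
    """Hypothetical check; returns a list of problems (empty if the config looks fine)."""
    problems = []
    if num_attention_heads % sequence_parallel_size != 0:
        problems.append(
            f"num_attention_heads={num_attention_heads} is not divisible by "
            f"sequence_parallel_size={sequence_parallel_size}"
        )
    # Assumption: groups are formed by splitting all ranks evenly.
    if world_size % sequence_parallel_size != 0:
        problems.append(
            f"world_size={world_size} is not divisible by "
            f"sequence_parallel_size={sequence_parallel_size}"
        )
    return problems
```

Returning a list rather than raising on the first failure lets a launcher report every misconfiguration at once.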

Please note that our sequence parallelism feature is currently incompatible with Megatron-LM's tensor or pipeline parallelism.

3. How to enable DeepSpeed-Ulysses with FlashAttention?

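DeepSpeed's sequence parallelism can be combined with different types of attention implementations to further improve the memory and compute efficiency of long-sequence training: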

Classic attention: attention mechanism implemented via PyTorch.

FlashAttention: the implementation from FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Enabled by --use-flash-attn.

FlashAttention + Triton: FlashAttention in Triton (tested with triton==2.0.0.dev20221202). Enabled by --use-flash-attn-triton.

For the best performance, we recommend using FlashAttention + Triton. Below are the installation steps. Note that FlashAttention is compatible only with NVIDIA Turing, Ampere, Ada, or Hopper GPUs.

# install triton
git clone -b legacy-backend https://github.com/openai/triton
cd triton/python/
pip install cmake
pip install .

# install FlashAttention
cd ${WORK_DIR}
git clone -b v1.0.4 https://github.com/HazyResearch/flash-attention
cd flash-attention
python -m pip install .

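You may also want to ensure that your model configuration complies with FlashAttention's requirements. For instance, to achieve optimal performance, the head size should be divisible by 8. Refer to the FlashAttention documentation for more details.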
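The head-size guideline is easy to verify from a model configuration. This sketch assumes the common Transformer convention that the head size equals hidden_size divided by the number of attention heads; the function name is hypothetical and not part of FlashAttention or DeepSpeed.

```python
def head_size_is_flash_friendly(hidden_size, num_attention_heads, multiple=8):
    """Hypothetical check: True if the per-head size is divisible by `multiple`."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    # Assumption: head size = hidden_size / num_attention_heads.
    head_size = hidden_size // num_attention_heads
    return head_size % multiple == 0
```

For example, a hidden size of 2048 with 16 heads gives a head size of 128, which passes, while a hidden size of 2000 with 20 heads gives 100, which does not.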