Fast Transformer Inference With Better Transformer

Posted on 2023-10-10 at 17:53 / Category: study / Views: 1361

This post walks through the PyTorch "Better Transformer (BT)" tutorial.
Better Transformer is a fastpath that speeds up the deployment of Transformer models and delivers high performance on both CPU and GPU.
The Better Transformer fastpath can accelerate models that are built from the following torch.nn modules:

TransformerEncoder
TransformerEncoderLayer
MultiheadAttention
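Better Transformer provides two kinds of acceleration.

A native Multi-Head Attention (MHA) implementation for CPU and GPU, which improves overall execution efficiency.
Exploitation of sparsity in natural language processing.
Because input lengths vary, a batch of input tokens may contain a large number of padding tokens. Processing of those tokens is skipped, which yields a significant speedup.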
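To get a feel for how much work padding can represent, the following minimal sketch computes the share of a padded batch that is padding. The helper name `padding_fraction` and the token counts are hypothetical and chosen only for illustration; they are not measured from the XLM-R inputs used later.

```python
def padding_fraction(lengths):
    """Return the fraction of positions in a padded batch that are padding."""
    if not lengths or max(lengths) == 0:
        raise ValueError("lengths must contain at least one non-empty sequence")
    max_len = max(lengths)                # every sequence is padded to this length
    total_slots = max_len * len(lengths)  # size of the rectangular batch
    real_tokens = sum(lengths)
    return (total_slots - real_tokens) / total_slots


# Hypothetical token counts: two short sentences and one long passage.
lengths = [4, 6, 200]
frac = padding_fraction(lengths)
print(f"{frac:.1%} of the padded batch is padding")  # prints "65.0% of the padded batch is padding"
```

In this made-up batch, roughly two thirds of the positions carry no information, which is the kind of work the sparsity path is meant to skip.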

Fastpath execution is subject to a few conditions.
The model must run in inference mode (model.eval()), and it must run under torch.no_grad so that no gradient information is recorded.
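The two conditions above can be sketched on a small stand-in encoder. This is only an illustration: the layer sizes, the random input, and the padding mask are arbitrary assumptions, and a tiny `nn.TransformerEncoder` is used in place of XLM-R so that the snippet runs without any download. Whether the fastpath is actually taken also depends on the PyTorch version and the model configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in encoder (sizes are arbitrary assumptions, not XLM-R's).
layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, dim_feedforward=64, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

x = torch.rand(2, 5, 32)  # (batch, sequence, features)
# True marks padding positions; the second sequence has two padded tokens.
padding_mask = torch.tensor([[False, False, False, False, False],
                             [False, False, False, True, True]])

encoder.eval()             # condition 1: inference mode
with torch.no_grad():      # condition 2: no gradient recording
    out = encoder(x, src_key_padding_mask=padding_mask)

print(out.shape, out.requires_grad)
```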

1. Setup

1.1 Load pretrained models

Following torchtext.models, download the XLM-R model from the predefined torchtext models, and set DEVICE.

import torch
import torch.nn as nn

print(f"torch version: {torch.__version__}")

DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

print(f"torch cuda available: {torch.cuda.is_available()}")

import torch, torchtext
from torchtext.models import RobertaClassificationHead
from torchtext.functional import to_tensor

# XLM-R large encoder bundle with a 2-class classification head
xlmr_large = torchtext.models.XLMR_LARGE_ENCODER
classifier_head = torchtext.models.RobertaClassificationHead(num_classes=2, input_dim=1024)
model = xlmr_large.get_model(head=classifier_head)
transform = xlmr_large.transform()

1.2 Dataset Setup

Two inputs are prepared here: a small batch and a large batch.

small_input_batch = [
                "Hello world",
                "How are you!"
]
big_input_batch = [
                "Hello world",
                "How are you!",
                """`Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.`
 
 It was in July, 1805, and the speaker was the well-known Anna
 Pavlovna Scherer, maid of honor and favorite of the Empress Marya
 Fedorovna. With these words she greeted Prince Vasili Kuragin, a man
 of high rank and importance, who was the first to arrive at her
 reception. Anna Pavlovna had had a cough for some days. She was, as
 she said, suffering from la grippe; grippe being then a new word in
 St. Petersburg, used only by the elite."""
 ]

Next, choose either the small or the large input batch, preprocess it, and test the model.

input_batch = big_input_batch

model_input = to_tensor(transform(input_batch), padding_value=1)
output = model(model_input)
print(output.shape)
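In the call above, to_tensor pads every tokenized sentence to the length of the longest one in the batch, using padding_value=1 (the padding index of the XLM-R vocabulary). The idea can be sketched in plain Python as follows; `pad_batch` is a hypothetical helper, not part of torchtext, and the token ids are made up.

```python
def pad_batch(token_ids, padding_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in token_ids)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in token_ids]


# Hypothetical token ids for three sentences of different lengths.
batch = [[0, 35378, 2], [0, 11249, 621, 2], [0, 5, 6, 7, 8, 9, 2]]
padded = pad_batch(batch, padding_value=1)
for row in padded:
    print(row)
```

With the large input batch, the two short sentences are padded up to the length of the long passage, so most of their positions hold the padding value.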

Finally, set the iteration count used for the benchmark.
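The post does not show the line that sets this count, although the benchmark loops in section 2 refer to ITERATIONS. A minimal sketch, assuming a value of 10 (an assumption, not a value stated in this post), together with a hypothetical helper `mean_seconds` that shows how such a count is typically used for simple wall-clock timing:

```python
import time

ITERATIONS = 10  # assumed value; the post does not state the count it used


def mean_seconds(fn, iterations=ITERATIONS):
    """Call fn() `iterations` times and return the mean wall-clock time per call."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations


# Usage with a stand-in workload (hypothetical; replace with a model call).
elapsed = mean_seconds(lambda: sum(range(10_000)))
print(f"mean time per call: {elapsed:.6f} s")
```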

2. Execution

2.1 Run and benchmark inference on CPU with and without BT fastpath (native MHA only)

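Run the model on CPU and collect profile information.

First, run it as usual.
Next, run it on the BT fastpath by using model.eval() and torch.no_grad().

Running the model on CPU shows an improvement (how large it is depends on the CPU model).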

print("slow path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=False) as prof:
    for i in range(ITERATIONS):
        output = model(model_input)
print(prof)

model.eval()
print("fast path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=False) as prof:
    with torch.no_grad():
        for i in range(ITERATIONS):
            output = model(model_input)
print(prof)

torch.autograd.profiler.profile collects profile information for the forward pass and, when gradients are computed, the backward pass.
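Printing the profiler object lists every recorded event, which can be long. As a hedged sketch, the events can instead be aggregated per operator with key_averages() and printed as a short table. The matrix multiplication below is only a stand-in workload so that the snippet runs without the XLM-R model.

```python
import torch

a = torch.rand(64, 64)
b = torch.rand(64, 64)

# Profile a few matrix multiplications on CPU as a stand-in workload.
with torch.autograd.profiler.profile() as prof:
    with torch.no_grad():
        for _ in range(5):
            torch.matmul(a, b)

# Aggregate events by operator name and show the five most expensive ones.
summary = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5)
print(summary)
```

Comparing such summaries for the slow path and the fast path makes it easier to see which operators account for the difference.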

2.2 Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA only)

Check the BT sparsity setting.

model.encoder.transformer.layers.enable_nested_tensor

Disable BT sparsity.

model.encoder.transformer.layers.enable_nested_tensor = False

Run the model on DEVICE and collect profile information for native MHA execution.
When running on a GPU, a significant speedup is observed, especially for small inputs.

model.to(DEVICE)
model_input = model_input.to(DEVICE)

print("slow path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for i in range(ITERATIONS):
        output = model(model_input)
print(prof)

model.eval()

print("fast path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    with torch.no_grad():
        for i in range(ITERATIONS):
            output = model(model_input)
print(prof)

2.3 Run and benchmark inference on (configurable) DEVICE with and without BT fastpath (native MHA + sparsity)

Enable sparsity.

model.encoder.transformer.layers.enable_nested_tensor = True

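Run the model on DEVICE and collect profile information for execution with native MHA plus sparsity support.
When running on a GPU, a significant speedup is observed, especially for large inputs that contain sparsity.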

model.to(DEVICE)
model_input = model_input.to(DEVICE)
                
print("slow path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for i in range(ITERATIONS):
        output = model(model_input)
print(prof)
                
model.eval()
                
print("fast path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    with torch.no_grad():
        for i in range(ITERATIONS):
            output = model(model_input)
print(prof)

Author

今西　渉

Graduate of the Graduate School of Frontier Biosciences, Osaka University
