Run on Snapdragon devices

Run SLMs on Snapdragon devices with NPUs

Learn how to run SLMs on Snapdragon devices with ONNX Runtime.

Models

The following models are currently supported:

Phi-3.5 mini instruct
Llama 3.2 3B

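Devices with Snapdragon NPUs require models of a specific size and format.

Instructions for generating models in this format are in Build models for Snapdragon.

Once you have built or downloaded the model, place the model assets in a known location. The assets consist of the following files: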

genai_config.json
tokenizer.json
tokenizer_config.json
special_tokens_map.json
quantizer.onnx
dequantizer.onnx
position-processor.onnx
A set of transformer model binaries
- Qualcomm context binaries (*.bin)
- Context binary metadata (*.json)
- ONNX wrapper models (*.onnx)
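
Before running a model, it can help to confirm that the asset folder is complete. The following is a minimal sketch, not part of ONNX Runtime: `find_missing_assets` is a hypothetical helper, and it assumes that all assets sit directly in one folder. The fixed file names come from the list above; because the transformer binaries have model-specific names, the sketch only checks that at least one `*.bin` file is present.

```python
from pathlib import Path

# Fixed file names taken from the asset list above.
REQUIRED_FILES = [
    "genai_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "quantizer.onnx",
    "dequantizer.onnx",
    "position-processor.onnx",
]


def find_missing_assets(model_dir):
    """Return descriptions of the assets that are missing from model_dir."""
    folder = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (folder / name).is_file()]
    # The transformer binaries have model-specific names, so only check that
    # at least one Qualcomm context binary (*.bin) exists (assumed flat layout).
    if not any(folder.glob("*.bin")):
        missing.append("*.bin (Qualcomm context binary)")
    return missing
```

Calling `find_missing_assets(r"models\Phi-3.5-mini-instruct")` returns an empty list when nothing is missing.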

Python application

If Python is installed on your device, you can run a simple question-answering script to query the model.

Install the runtime

pip install onnxruntime-genai

Download the script

curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-qa.py -o model-qa.py

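Run the script

The following command assumes that the model assets are in a folder named models\Phi-3.5-mini-instruct. Adjust the -m argument so that it points to that folder on your machine.

python .\model-qa.py -e cpu -g -v --system_prompt "You are a helpful assistant. Be brief and concise." --chat_template "<|user|>\n{input} <|end|>\n<|assistant|>" -m ..\..\models\Phi-3.5-mini-instruct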
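
The --chat_template argument contains an {input} placeholder that stands for the user's prompt. The sketch below shows one way such a template can be filled in. It assumes ordinary Python string formatting, and `apply_chat_template` is a hypothetical helper name, not a function from the script or the library.

```python
# Template string copied from the command above. In Python source the "\n"
# sequences are real newlines; how they arrive from a shell depends on quoting.
CHAT_TEMPLATE = "<|user|>\n{input} <|end|>\n<|assistant|>"


def apply_chat_template(template, user_text):
    """Substitute the user's text for the {input} placeholder."""
    if "{input}" not in template:
        raise ValueError("chat template must contain the {input} placeholder")
    return template.format(input=user_text)
```

The formatted string, not the raw user text, is what would then be tokenized and passed to the model.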

Inside the Python script

The complete Python script is published at https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/model-qa.py. The script uses the API in the following standard way.

Load the model
```
model = og.Model(config)
```
This loads the model into memory.

Create the pre-processor and tokenize the system prompt
```
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Optional: tokenize the system prompt
system_tokens = tokenizer.encode(system_prompt)
```
This creates a tokenizer and a tokenizer stream. The stream decodes tokens as they are generated, so they can be returned to the user immediately.

Interactive input loop

while True:
    # Read the prompt
    # Run generation, streaming the output tokens
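
The loop outline above can be made concrete as follows. This is a sketch under stated assumptions rather than the script's actual code: `run_chat`, `read_prompt`, and `generate_reply` are hypothetical names, and stopping on an empty prompt or end of input is a choice made for this example.

```python
def run_chat(read_prompt, generate_reply):
    """Read prompts until an empty line or end of input; return the turn count."""
    turns = 0
    while True:
        try:
            prompt = read_prompt()
        except EOFError:
            # Input was closed (for example Ctrl+Z on Windows); stop cleanly.
            break
        if not prompt.strip():
            # An empty prompt ends the session in this sketch.
            break
        # generate_reply is expected to stream the model output for the prompt.
        generate_reply(prompt)
        turns += 1
    return turns
```

In an interactive session, `read_prompt` could be `lambda: input("Prompt: ")`, and `generate_reply` would wrap the generation loop described in the next step.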

Generation loop

# 1. Pre-process the prompt into tokens
input_tokens = tokenizer.encode(prompt)

# 2. Create the parameters and the generator (KV cache and so on), then process the prompt
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)
# Append the system prompt tokens first, then the user prompt tokens
generator.append_tokens(system_tokens)
generator.append_tokens(input_tokens)

# 3. Loop until all output tokens have been generated,
# printing each decoded token as it arrives
while not generator.is_done():
    generator.generate_next_token()

    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)

print()

# Delete the generator to free the captured graph before creating another one
del generator

C++ application

To run the model on a Snapdragon NPU from a C++ application, use the code at https://github.com/microsoft/onnxruntime-genai/tree/main/examples/c.

Building and running this application requires a Windows PC with a Snapdragon NPU, as well as the following:

cmake
Visual Studio 2022

Clone the repository

git clone https://github.com/microsoft/onnxruntime-genai
cd onnxruntime-genai\examples\c

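Install onnxruntime

Currently, a nightly build of onnxruntime is required, because it contains recent changes to QNN support for language models.

Download a nightly version of the ONNX Runtime QNN package (a .nupkg file), then unpack it and copy the native binaries: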

mkdir onnxruntime-win-arm64-qnn
move Microsoft.ML.OnnxRuntime.QNN.1.22.0-dev-20250225-0548-e46c0d8.nupkg onnxruntime-win-arm64-qnn
cd onnxruntime-win-arm64-qnn
tar xvzf Microsoft.ML.OnnxRuntime.QNN.1.22.0-dev-20250225-0548-e46c0d8.nupkg
copy runtimes\win-arm64\native\* ..\..\..\lib
cd ..
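
A .nupkg file is a ZIP archive, so the unpack-and-copy step can also be scripted. The following sketch uses Python's standard `zipfile` module as an alternative to the `tar` and `copy` commands above. `extract_native_libs` is a hypothetical helper, and the default prefix assumes the `runtimes\win-arm64\native` layout used in those commands.

```python
import zipfile
from pathlib import Path


def extract_native_libs(nupkg_path, dest_dir, prefix="runtimes/win-arm64/native/"):
    """Copy the files stored under prefix in a .nupkg archive into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    with zipfile.ZipFile(nupkg_path) as archive:
        for info in archive.infolist():
            # Skip directory entries and anything outside the native folder.
            if info.is_dir() or not info.filename.startswith(prefix):
                continue
            # Flatten the path: only the file name is kept in dest_dir.
            target = dest / Path(info.filename).name
            target.write_bytes(archive.read(info))
            copied.append(target.name)
    return sorted(copied)
```

The function returns the names of the files it copied, which makes it easy to confirm that the expected DLLs were found in the package.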

Install onnxruntime-genai

curl -L https://github.com/microsoft/onnxruntime-genai/releases/download/v0.6.0/onnxruntime-genai-0.6.0-win-arm64.zip -o onnxruntime-genai-win-arm64.zip
tar xvf onnxruntime-genai-win-arm64.zip
cd onnxruntime-genai-0.6.0-win-arm64
copy include\* ..\include
copy lib\* ..\lib
cd ..

Build the sample

cmake -A arm64 -S . -B build -DPHI3-QA=ON
cd build
cmake --build . --config Release

Run the sample

cd Release
.\phi3_qa.exe <path to model>

Inside the C++ sample

The C++ application is published at https://github.com/microsoft/onnxruntime-genai/blob/main/examples/c/src/phi3_qa.cpp. The application uses the API in the following standard way.

Load the model
```
auto model = OgaModel::Create(*config);
```
This loads the model into memory.

Create the pre-processor
```
auto tokenizer = OgaTokenizer::Create(*model);
auto tokenizer_stream = OgaTokenizerStream::Create(*tokenizer);
```
This creates a tokenizer and a tokenizer stream. The stream decodes tokens as they are generated, so they can be returned to the user immediately.

Interactive input loop

while (true) {
  // Read the prompt
  // Run generation, streaming the output tokens
}

Generation loop

// 1. Pre-process the prompt into tokens
auto sequences = OgaSequences::Create();
tokenizer->Encode(prompt.c_str(), *sequences);

// 2. Create the parameters and the generator (KV cache and so on), then process the prompt
auto params = OgaGeneratorParams::Create(*model);
params->SetSearchOption("max_length", 1024);
auto generator = OgaGenerator::Create(*model, *params);
generator->AppendTokenSequences(*sequences);

// 3. Loop until all output tokens have been generated,
// printing each decoded token as it arrives
while (!generator->IsDone()) {
  generator->GenerateNextToken();

  if (is_first_token) {
    timing.RecordFirstTokenTimestamp();
    is_first_token = false;
  }

  const auto num_tokens = generator->GetSequenceCount(0);
  const auto new_token = generator->GetSequenceData(0)[num_tokens - 1];
  std::cout << tokenizer_stream->Decode(new_token) << std::flush;
}