Machine Learning on Ryzen - Ingage Developer Blog

Hello, this is masm11.

I studied machine learning a while back and have implemented it both on a CPU and on NVIDIA GPUs, but never on a Ryzen APU. Since I happen to have one on hand, I gave it a try.

Hardware environment

The PC is an HP ENVY x360 13 (ay0000), which is equipped with an AMD Ryzen 7 4700U APU.

This time I'll run MNIST (handwritten digit classification) with PyTorch on this compact 13.3" laptop.

Preparing and starting the container

docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx rocm/pytorch

That's all. Downloading the image takes a little while, but this single command starts a container that already has a ROCm-accelerated PyTorch installed.
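The command above packs a lot of flags into one line, so here is a small Python sketch that rebuilds the same argument list with one flag per line. The comments reflect my understanding of what each flag is for, not an official description, and the function name rocm_docker_command and its parameters are made up for illustration.

```python
import os
import shlex


def rocm_docker_command(image="rocm/pytorch", share_dir=None):
    """Rebuild the `docker run` argument list used above (hypothetical helper)."""
    if share_dir is None:
        # Same location as $HOME/dockerx in the original command.
        share_dir = os.path.join(os.path.expanduser("~"), "dockerx")
    return [
        "docker", "run", "-it",
        "--network=host",          # use the host's network stack
        "--device=/dev/kfd",       # AMD compute (kernel fusion driver) device
        "--device=/dev/dri",       # GPU render/display device nodes
        "--group-add=video",       # group that usually owns those device nodes
        "--ipc=host",              # use the host IPC namespace (shared memory)
        "--cap-add=SYS_PTRACE",    # allow ptrace inside the container
        "--security-opt", "seccomp=unconfined",  # turn off the default seccomp filter
        "-v", f"{share_dir}:/dockerx",           # mount a host directory at /dockerx
        image,
    ]


if __name__ == "__main__":
    # Print a copy-pasteable shell command.
    print(shlex.join(rocm_docker_command()))
```

Printing the list through shlex.join gives back a single shell line, so the flags can be edited one at a time and pasted into a terminal.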

ROCm originally stood for Radeon Open Compute. It is a collection of libraries for using AMD GPUs for general-purpose computation (GPGPU).
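To confirm from inside the container that PyTorch really is a ROCm build and can see a GPU, something like the following should work. It is a minimal sketch: describe_backend is a name I made up, and it relies on ROCm builds of PyTorch exposing the GPU through the usual torch.cuda API, with torch.version.hip set to a version string (it is None on non-ROCm builds).

```python
def describe_backend():
    """Return a one-line summary of the PyTorch build and the visible GPU, if any."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    hip = getattr(torch.version, "hip", None)

    # ROCm builds reuse the torch.cuda namespace for AMD GPUs.
    if not torch.cuda.is_available():
        return f"torch {torch.__version__}, HIP {hip}: no GPU visible"
    return f"torch {torch.__version__}, HIP {hip}: {torch.cuda.get_device_name(0)}"


if __name__ == "__main__":
    print(describe_backend())
```

Seeing a device name here only shows that the runtime found the GPU. It does not guarantee that kernels will actually run on it.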

Running MNIST

Run the following inside the container.

git clone https://github.com/pytorch/examples.git
cd examples/mnist
pip3 install -r requirements.txt
HSA_OVERRIDE_GFX_VERSION=9.0.0 python3 main.py

This just downloads the PyTorch examples, installs the required Python modules, and runs the MNIST program.

HSA_OVERRIDE_GFX_VERSION=9.0.0 needs some explanation, though.

AMD GPUs appear to carry a version identifier. Running the following command inside the container shows the version of the GPU.

root@luna:/var/lib/jenkins# rocminfo | grep gfx
  Name:                    gfx90c                             
      Name:                    amdgcn-amd-amdhsa--gfx90c:xnack-   
root@luna:/var/lib/jenkins#

gfx90c is that version.

Ideally the libraries in the container would recognize gfx90c and get the most out of it, but gfx90c turns out to be unsupported, so I run it as gfx900 instead. That is what HSA_OVERRIDE_GFX_VERSION=9.0.0 specifies.
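The relationship between gfx90c and 9.0.0 can be sketched in a few lines of Python. This assumes the naming convention I believe these targets follow: the last two characters are single hex digits for the minor version and the stepping, and whatever comes before them is the major version. The helper names are hypothetical, and whether a given override actually works depends on the hardware and the ROCm version.

```python
import re


def parse_gfx_target(name):
    """Split a target such as 'gfx90c' into (major, minor, stepping).

    Assumption: the last two characters are hex digits for minor and
    stepping, and the digits before them are the decimal major version.
    """
    m = re.fullmatch(r"gfx(\d+)([0-9a-f])([0-9a-f])", name)
    if m is None:
        raise ValueError(f"unrecognized target name: {name!r}")
    return int(m.group(1)), int(m.group(2), 16), int(m.group(3), 16)


def override_value(name, stepping=0):
    """Format a HSA_OVERRIDE_GFX_VERSION value with the same major.minor
    as `name` but a different stepping (0 by default)."""
    major, minor, _ = parse_gfx_target(name)
    return f"{major}.{minor}.{stepping}"


def targets_from_rocminfo(text):
    """Collect the distinct gfx target names found in `rocminfo` output."""
    return sorted(set(re.findall(r"\bgfx[0-9a-f]+\b", text)))


if __name__ == "__main__":
    sample = (
        "  Name:                    gfx90c\n"
        "      Name:                    amdgcn-amd-amdhsa--gfx90c:xnack-\n"
    )
    for target in targets_from_rocminfo(sample):
        print(target, parse_gfx_target(target), override_value(target))
```

Read this way, gfx90c is version 9.0 with stepping 0xc, and the override simply swaps in stepping 0, which is gfx900.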

So, the long-awaited result:

RuntimeError: HIP out of memory. Tried to allocate 142.00 MiB (GPU 0; 512.00 MiB total capacity; 103.83 MiB already allocated; 258.00 MiB free; 126.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF

Out of memory, of all things.

The main.py I just ran defaults to 14 epochs, and the out-of-memory error occurred in the very first one: after the training pass, while computing accuracy on the test data.
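A rough back-of-the-envelope estimate suggests why the evaluation step is the one that fails. The sketch below assumes the network in main.py is two 3x3 convolutions (1 to 32 channels, then 32 to 64) on 28x28 inputs, that tensors are float32, and that the default test batch size is 1000. All of that is my reading of the example and may not match the version in use.

```python
MIB = 1024 * 1024
BYTES_PER_FLOAT32 = 4


def conv_out(size, kernel=3, stride=1):
    """Output width/height of an unpadded convolution."""
    return (size - kernel) // stride + 1


def activation_mib(batch, channels, height, width):
    """Size in MiB of one float32 activation tensor."""
    return batch * channels * height * width * BYTES_PER_FLOAT32 / MIB


def estimate(batch):
    """Per-layer activation sizes (MiB) for the assumed MNIST network."""
    h1 = conv_out(28)  # 28 -> 26 after the first 3x3 convolution
    h2 = conv_out(h1)  # 26 -> 24 after the second 3x3 convolution
    return {
        "input": activation_mib(batch, 1, 28, 28),
        "conv1": activation_mib(batch, 32, h1, h1),
        "conv2": activation_mib(batch, 64, h2, h2),
    }


if __name__ == "__main__":
    for batch in (1000, 100):
        sizes = estimate(batch)
        print(batch, {name: round(mib, 2) for name, mib in sizes.items()})
```

Under these assumptions the second convolution's output alone is about 140.6 MiB for a batch of 1000. That is close to the 142.00 MiB in the error message, which would fit if the allocator rounds large requests up to a multiple of 2 MiB (again an assumption on my part). For a batch of 100 the same tensor is about 14 MiB.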

The GPU only has 512 MiB available, so this is probably unavoidable. I'll reduce the batch size used for testing (the default is 1000) by specifying --test-batch-size 100.

Train Epoch: 14 [58240/60000 (97%)]  Loss: 0.051543
Train Epoch: 14 [58880/60000 (98%)] Loss: 0.002071
Train Epoch: 14 [59520/60000 (99%)] Loss: 0.002886
/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py:138: UserWarning: An output with one or more elements was resized since it had shape [25088], which does not match the required output shape [32, 1, 28, 28].This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at  /var/lib/jenkins/pytorch/aten/src/ATen/native/Resize.cpp:17.)
  return torch.stack(batch, 0, out=out)
/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py:138: UserWarning: An output with one or more elements was resized since it had shape [78400], which does not match the required output shape [100, 1, 28, 28].This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at  /var/lib/jenkins/pytorch/aten/src/ATen/native/Resize.cpp:17.)
  return torch.stack(batch, 0, out=out)

Test set: Average loss: 0.0261, Accuracy: 9916/10000 (99%)

root@luna:/var/lib/jenkins/examples/mnist#

(Only the last few lines are shown.)

There are warnings, but it ran to completion!

Passing --no-cuda to main.py apparently makes it compute on the CPU only (--help lists the available options), so let's compare how long each run takes.

root@luna:/var/lib/jenkins/examples/mnist# time HSA_OVERRIDE_GFX_VERSION=9.0.0 python3 main.py --test-batch-size 100
(omitted)
real    15m40.876s
user    20m16.960s
sys     0m21.668s
root@luna:/var/lib/jenkins/examples/mnist# time HSA_OVERRIDE_GFX_VERSION=9.0.0 python3 main.py --test-batch-size 100 --no-cuda
(omitted)
real    15m23.574s
user    113m29.616s
sys     1m18.621s
root@luna:/var/lib/jenkins/examples/mnist#

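Surprisingly, even though the GPU run is supposed to be the accelerated one, it lost by a narrow margin. Training is clearly faster on the GPU, but computing accuracy on the test data takes a long time there, and that seems to be what let the CPU run come out ahead.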
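To put numbers on the comparison, the `time` output above can be converted to seconds with a short script. This is only arithmetic on the figures already shown; parse_time and summarize are names I made up for the sketch.

```python
import re


def parse_time(value):
    """Convert a bash `time` field such as '15m40.876s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if m is None:
        raise ValueError(f"unexpected time format: {value!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


# Figures copied from the two runs above.
RUNS = {
    "GPU": {"real": "15m40.876s", "user": "20m16.960s", "sys": "0m21.668s"},
    "CPU": {"real": "15m23.574s", "user": "113m29.616s", "sys": "1m18.621s"},
}


def summarize(run):
    """Return (wall-clock seconds, average number of busy CPU cores)."""
    real = parse_time(run["real"])
    cpu = parse_time(run["user"]) + parse_time(run["sys"])
    return real, cpu / real


if __name__ == "__main__":
    for name, run in RUNS.items():
        real, cores = summarize(run)
        print(f"{name}: {real:.1f} s wall clock, about {cores:.1f} cores busy")
```

The wall-clock gap is only about 17 seconds out of roughly 15 minutes, but the CPU-only run keeps about seven and a half cores busy on average while the GPU run uses a little over one. So the GPU run reaches nearly the same elapsed time with far less CPU work.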