
I've been using a Minisforum UM870 Slim as my primary server machine, but until now I had barely been using the integrated GPU's resources. Letting the AMD Ryzen 8000 series' Radeon 780M integrated GPU sit idle felt like quite a waste.
So, I decided to try using ROCm to leverage GPU inference, including for PyTorch-based models.
Whisper, a popular speech-to-text transcription tool, is typically used with NVIDIA GPUs. However, with AMD's integrated GPUs, it should be possible to achieve fast transcription by utilizing ROCm.
In this experiment, I set up a transcription environment on Ubuntu Server 24.04 using ROCm 6.4.1 combined with whisper.cpp, leveraging the AMD integrated GPU.
Environment
| Item | Details |
|---|---|
| Product | Minisforum UM870 Slim |
| OS | Ubuntu Server 24.04 |
| CPU | AMD Ryzen 7 8745H (16 threads) |
| GPU | AMD Radeon 780M (Phoenix3 iGPU, gfx1103) |
| RAM | 32GB |
| ROCm | 6.4.1 |
Why whisper.cpp?
There are several implementations of OpenAI's Whisper, but I chose whisper.cpp for the following reasons:
- Lightweight: Fewer dependencies in the C++ implementation
- Fast: Supports quantized models and is memory efficient
- Flexible: Easy to use from the command line
Setup Procedure
1. Add the ROCm repository
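The repository entries below are signed against a keyring at /usr/share/keyrings/rocm-archive-keyring.gpg. If that keyring is not already present, one way to create it is to import AMD's public ROCm signing key; this is a sketch that assumes wget and gpg are available and that the output path matches the signed-by path used in the sources entries:

```bash
# Fetch AMD's ROCm signing key and store it where the repository entries expect it
wget -qO - https://repo.radeon.com/rocm/rocm.gpg.key | \
  gpg --dearmor | sudo tee /usr/share/keyrings/rocm-archive-keyring.gpg > /dev/null
```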
```bash
# Add ROCm repository
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/rocm-archive-keyring.gpg] https://repo.radeon.com/rocm/apt/6.4.1 noble main" | sudo tee /etc/apt/sources.list.d/rocm.list
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/rocm-archive-keyring.gpg] https://repo.radeon.com/amdgpu/6.4.1/ubuntu noble main" | sudo tee /etc/apt/sources.list.d/amdgpu.list
sudo apt update
sudo apt upgrade
```
2. Installing ROCm-related packages
```bash
# Basic packages
sudo apt install -y rocm-core rocm-hip-runtime rocminfo
# Development tools
sudo apt install -y hip-dev rocm-dev hipblas-dev rocblas-dev
# Build tools
sudo apt install -y cmake build-essential ffmpeg
```
3. Setting Environment Variables
The Radeon 780M uses the gfx1103 architecture, which ROCm does not officially support, so it needs to be presented to ROCm as gfx1100 (GFX version 11.0.0).
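This is done with the HSA_OVERRIDE_GFX_VERSION override; setting it in the shell (and optionally persisting it in ~/.bashrc) is enough for the commands later in this article:

```bash
# Make ROCm treat the gfx1103 iGPU as gfx1100
export HSA_OVERRIDE_GFX_VERSION=11.0.0

# Optional: persist the setting for future shells
echo 'export HSA_OVERRIDE_GFX_VERSION=11.0.0' >> ~/.bashrc
```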
4. Build whisper.cpp
```bash
# Clone the repository
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
# Build with HIP/ROCm enabled
export CMAKE_PREFIX_PATH=/opt/rocm
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release -j$(nproc)
```
5. Download the Model
```bash
# Quantized medium model (recommended)
curl -L -o models/ggml-medium-q5_0.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-medium-q5_0.bin
# Other models
bash ./models/download-ggml-model.sh small
bash ./models/download-ggml-model.sh large
```
Usage
Basic Usage Examples
```bash
# Basic transcription
./build/bin/whisper-cli -m models/ggml-medium-q5_0.bin -f input.mp3 -l ja
# Output in SRT subtitle format
./build/bin/whisper-cli -m models/ggml-medium-q5_0.bin -f input.mp3 -l ja -osrt
# Output in JSON format
./build/bin/whisper-cli -m models/ggml-medium-q5_0.bin -f input.mp3 -l ja -oj
# Specify the output destination
./build/bin/whisper-cli -m models/ggml-medium-q5_0.bin -f input.mp3 -l ja -oj -of ./output/result
```
Frequently Used Options
- -l ja: Process as Japanese
- -osrt: Output in SRT subtitle format
- -oj: Output in JSON format
- -of <path>: Specify the output file path
- -t <threads>: Specify the number of threads to use
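These options can be combined. For example, the following produces Japanese SRT subtitles under ./output using 8 threads (the input file, thread count, and output path are just illustrative values):

```bash
# Japanese transcription, SRT output, 8 threads, custom output path
./build/bin/whisper-cli -m models/ggml-medium-q5_0.bin \
  -f input.mp3 -l ja -osrt -t 8 -of ./output/result
```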
Troubleshooting
When GPU hang errors occur
Check whether you are running a non-quantized model. For integrated GPUs, I strongly recommend using a quantized version (q5_0, q8_0, etc.).
What is a quantized model?
Quantization is a technique that converts model weight parameters to lower precision (e.g., 32-bit → 5-bit). This provides the following benefits:
- Reduced memory usage: From approximately 1.5GB for the medium model to about 500MB
- Improved inference speed: Reduced memory bandwidth load leads to faster processing
- Avoids GPU hang: Can operate even with limited VRAM on integrated GPUs
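If you already have a full-precision GGML model on disk, whisper.cpp also includes a quantize tool that can convert it locally. A minimal sketch, assuming the tool was built alongside whisper-cli (the binary is named quantize in older builds and whisper-quantize in newer ones):

```bash
# Convert the FP32 medium model to 5-bit quantization
# (use ./build/bin/quantize instead if your build produces that name)
./build/bin/whisper-quantize models/ggml-medium.bin models/ggml-medium-q5_0.bin q5_0
```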
Quantized Model Types
| Model Name | Bits | Approx. Size | Accuracy | Recommended Use |
|---|---|---|---|---|
| ggml-medium.bin | 32-bit (FP32) | ~1.5GB | Best | Dedicated GPUs ⚠️ Not recommended for integrated GPUs |
| ggml-medium-q8_0.bin | 8-bit | ~900MB | High | Balance-focused |
| ggml-medium-q5_0.bin | 5-bit | ~500MB | Medium to high | Recommended for integrated GPUs ✅ |
| ggml-medium-q4_0.bin | 4-bit | ~400MB | Medium | When memory constraints are severe |
Why Do We Need Quantization for Integrated GPUs?
The Radeon 780M uses shared memory (a portion of system RAM), so its memory management differs from dedicated GPUs. In this setup, we had 32GB of RAM installed, yet we still encountered GPU hang errors even with the non-quantized medium model (1.5GB).
This is not simply a memory shortage; the following factors may be contributing:
- Integrated GPU memory bandwidth limitations: Shared memory access is slower than dedicated GPU GDDR memory
- ROCm driver optimization: Large model processing for integrated GPUs is not fully optimized
- gfx1103 architecture limitations: the GPU is emulated as GFX version 11.0.0 via HSA_OVERRIDE_GFX_VERSION
In actual experience, despite having 32GB of RAM, I encountered a "GPU Hang" error with non-quantized models. However, switching to the q5_0 model (approximately 500MB) allowed stable operation. I attribute this improvement to the reduced memory bandwidth requirements and decreased computational complexity of the smaller model size.
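To see how much of the shared memory the iGPU is actually using during a run, rocm-smi can report both the dedicated carve-out and the shared (GTT) pool, assuming the rocm-smi utility is present on the system:

```bash
# Dedicated VRAM carve-out usage
rocm-smi --showmeminfo vram
# Shared system memory (GTT) usage
rocm-smi --showmeminfo gtt
```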
Impact on Precision
Even the q5_0 model provides sufficient accuracy for practical applications. In Japanese transcription, we didn't notice any significant quality differences between models. For absolute maximum accuracy, consider the large model's q5_0 version.
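If you do want to try the large model, the same download script used earlier can fetch a pre-quantized variant; this assumes a large-v3-q5_0 build is published in the ggerganov/whisper.cpp Hugging Face repository (adjust the name to whatever the script currently lists):

```bash
# Pre-quantized large model (name assumed; check models/download-ggml-model.sh for available variants)
bash ./models/download-ggml-model.sh large-v3-q5_0
./build/bin/whisper-cli -m models/ggml-large-v3-q5_0.bin -f input.mp3 -l ja
```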
If the GPU is not recognized
```bash
# Confirm GPU detection
rocminfo | grep -E "Name|gfx"
# Check the environment variable
echo $HSA_OVERRIDE_GFX_VERSION # should be 11.0.0
```
When build errors occur
```bash
# Check ROCm version
apt show rocm-core | grep Version
# Verify that HIP_PATH is properly configured
echo $HIP_PATH # should be /opt/rocm
```
Verification Commands
```bash
# Verify ROCm version
apt show rocm-core | grep Version
# Confirm GPU detection
rocminfo | grep -E "Name|gfx"
# Verify GPU detection in PyTorch (if PyTorch is installed)
python3 -c "import torch; print(torch.cuda.is_available())"
```
Conclusion
Even with the integrated GPU (Radeon 780M) in AMD's Ryzen 8000 series, transcription was possible at sufficiently practical speeds when using proper settings and quantized models.
The key points are:
- HSA_OVERRIDE_GFX_VERSION=11.0.0 must be set
- Use a quantized model (q5_0, etc.)
- Works reliably with ROCm 6.4.1
Even without an NVIDIA GPU, having a modern AMD CPU allows you to set up a high-speed transcription environment. We encourage you to give it a try.