
I've been using a Minisforum UM870 Slim as my primary server machine, but until now I had barely been using the integrated GPU's resources. Letting the AMD Ryzen 8000 series' Radeon 780M integrated GPU sit idle felt like quite a waste.
So, I decided to try using ROCm to leverage GPU inference, including for PyTorch-based models.
Whisper, a popular speech-to-text transcription tool, is typically used with NVIDIA GPUs. However, with AMD's integrated GPUs, it should be possible to achieve fast transcription by utilizing ROCm.
In this experiment, I set up a transcription environment on Ubuntu Server 24.04 using ROCm 6.4.1 combined with whisper.cpp, leveraging the AMD integrated GPU.
Environment
| Item | Details |
|---|---|
| Product | Minisforum UM870 Slim |
| OS | Ubuntu Server 24.04 |
| CPU | AMD Ryzen 7 8745H (16 threads) |
| GPU | AMD Radeon 780M (Phoenix3 iGPU, gfx1103) |
| RAM | 32GB |
| ROCm | 6.4.1 |
Why whisper.cpp?
There are several implementations of OpenAI's Whisper, but I chose whisper.cpp for the following reasons:
- Lightweight: Fewer dependencies in the C++ implementation
- Fast: Supports quantized models and is memory efficient
- Flexible: Easy to use from the command line
Setup Procedure
1. Add the ROCm repository
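The repository entries below are signed against a keyring at /usr/share/keyrings/rocm-archive-keyring.gpg. If that keyring is not already present, one way to create it is to import AMD's public ROCm signing key; this is a sketch that assumes wget and gpg are available and that the output path matches the signed-by path used in the sources entries:

```bash
# Fetch AMD's ROCm signing key and store it where the repository entries expect it
wget -qO - https://repo.radeon.com/rocm/rocm.gpg.key | \
  gpg --dearmor | sudo tee /usr/share/keyrings/rocm-archive-keyring.gpg > /dev/null
```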
```bash
# Add ROCm repository
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/rocm-archive-keyring.gpg] https://repo.radeon.com/rocm/apt/6.4.1 noble main" | sudo tee /etc/apt/sources.list.d/rocm.list
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/rocm-archive-keyring.gpg] https://repo.radeon.com/amdgpu/6.4.1/ubuntu noble main" | sudo tee /etc/apt/sources.list.d/amdgpu.list
sudo apt update
sudo apt upgrade
```
2. Installing ROCm-related packages
```bash
# Basic packages
sudo apt install -y rocm-core rocm-hip-runtime rocminfo
# Development tools
sudo apt install -y hip-dev rocm-dev hipblas-dev rocblas-dev
# Build tools
sudo apt install -y cmake build-essential ffmpeg
```
3. Setting Environment Variables
The Radeon 780M uses the gfx1103 architecture, which ROCm does not officially support, so it needs to be presented to ROCm as gfx1100 (GFX version 11.0.0).
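This is done with the HSA_OVERRIDE_GFX_VERSION override; setting it in the shell (and optionally persisting it in ~/.bashrc) is enough for the commands later in this article:

```bash
# Make ROCm treat the gfx1103 iGPU as gfx1100
export HSA_OVERRIDE_GFX_VERSION=11.0.0

# Optional: persist the setting for future shells
echo 'export HSA_OVERRIDE_GFX_VERSION=11.0.0' >> ~/.bashrc
```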
4. Build whisper.cpp
```bash
# Clone the repository
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
# Build with HIP/ROCm enabled
export CMAKE_PREFIX_PATH=/opt/rocm
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release -j$(nproc)
```
5. Download the Model
```bash
# Quantized medium model (recommended)
curl -L -o models/ggml-medium-q5_0.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-medium-q5_0.bin
# Other models
bash ./models/download-ggml-model.sh small
bash ./models/download-ggml-model.sh large
```
Usage
Basic Usage Examples
```bash
# Basic transcription
./build/bin/whisper-cli -m models/ggml-medium-q5_0.bin -f input.mp3 -l ja
# Output in SRT subtitle format
./build/bin/whisper-cli -m models/ggml-medium-q5_0.bin -f input.mp3 -l ja -osrt
# Output in JSON format
./build/bin/whisper-cli -m models/ggml-medium-q5_0.bin -f input.mp3 -l ja -oj
# Specify the output destination
./build/bin/whisper-cli -m models/ggml-medium-q5_0.bin -f input.mp3 -l ja -oj -of ./output/result
```
Frequently Used Options
- -l ja: Process as Japanese
- -osrt: Output in SRT subtitle format
- -oj: Output in JSON format
- -of <path>: Specify the output file path
- -t <threads>: Specify the number of threads to use
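These options can be combined. For example, the following produces Japanese SRT subtitles under ./output using 8 threads (the input file, thread count, and output path are just illustrative values):

```bash
# Japanese transcription, SRT output, 8 threads, custom output path
./build/bin/whisper-cli -m models/ggml-medium-q5_0.bin \
  -f input.mp3 -l ja -osrt -t 8 -of ./output/result
```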
Troubleshooting
When GPU hang errors occur
Check whether you are running a non-quantized model. For integrated GPUs, I strongly recommend using a quantized version (q5_0, q8_0, etc.).
What is a quantized model?
Quantization is a technique that converts model weight parameters to lower precision (e.g., 32-bit → 5-bit). This provides the following benefits:
- Reduced memory usage: From approximately 1.5GB for the medium model to about 500MB
- Improved inference speed: Reduced memory bandwidth load leads to faster processing
- Avoids GPU hang: Can operate even with limited VRAM on integrated GPUs
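If you already have a full-precision GGML model on disk, whisper.cpp also includes a quantize tool that can convert it locally. A minimal sketch, assuming the tool was built alongside whisper-cli (the binary is named quantize in older builds and whisper-quantize in newer ones):

```bash
# Convert the FP32 medium model to 5-bit quantization
# (use ./build/bin/quantize instead if your build produces that name)
./build/bin/whisper-quantize models/ggml-medium.bin models/ggml-medium-q5_0.bin q5_0
```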
Quantized Model Types
| Model Name | Bits | Approx. Size | Accuracy | Recommended Use |
|---|---|---|---|---|
| ggml-medium.bin | 32-bit (FP32) | ~1.5GB | Best | Dedicated GPUs ⚠️ Not recommended for integrated GPUs |
| ggml-medium-q8_0.bin | 8-bit | ~900MB | High | Balance-focused |
| ggml-medium-q5_0.bin | 5-bit | ~500MB | Medium to high | Recommended for integrated GPUs ✅ |
| ggml-medium-q4_0.bin | 4-bit | ~400MB | Medium | When memory constraints are severe |
Why Do We Need Quantization for Integrated GPUs?
The Radeon 780M uses shared memory (a portion of system RAM), so its memory management differs from dedicated GPUs. In this setup, we had 32GB of RAM installed, yet we still encountered GPU hang errors even with the non-quantized medium model (1.5GB).
This is not simply a memory shortage; the following factors may be contributing:
- Integrated GPU memory bandwidth limitations: Shared memory access is slower than dedicated GPU GDDR memory
- ROCm driver optimization: Large model processing for integrated GPUs is not fully optimized
- gfx1103 architecture limitations: the GPU is emulated as GFX version 11.0.0 via HSA_OVERRIDE_GFX_VERSION
In actual experience, despite having 32GB of RAM, I encountered a "GPU Hang" error with non-quantized models. However, switching to the q5_0 model (approximately 500MB) allowed stable operation. I attribute this improvement to the reduced memory bandwidth requirements and decreased computational complexity of the smaller model size.
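To see how much of the shared memory the iGPU is actually using during a run, rocm-smi can report both the dedicated carve-out and the shared (GTT) pool, assuming the rocm-smi utility is present on the system:

```bash
# Dedicated VRAM carve-out usage
rocm-smi --showmeminfo vram
# Shared system memory (GTT) usage
rocm-smi --showmeminfo gtt
```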
Impact on Precision
Even the q5_0 model provides sufficient accuracy for practical applications. In Japanese transcription, we didn't notice any significant quality differences between models. For absolute maximum accuracy, consider the large model's q5_0 version.
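If you do want to try the large model, the same download script used earlier can fetch a pre-quantized variant; this assumes a large-v3-q5_0 build is published in the ggerganov/whisper.cpp Hugging Face repository (adjust the name to whatever the script currently lists):

```bash
# Pre-quantized large model (name assumed; check models/download-ggml-model.sh for available variants)
bash ./models/download-ggml-model.sh large-v3-q5_0
./build/bin/whisper-cli -m models/ggml-large-v3-q5_0.bin -f input.mp3 -l ja
```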
If the GPU is not recognized
```bash
# Confirm GPU detection
rocminfo | grep -E "Name|gfx"
# Check the environment variable
echo $HSA_OVERRIDE_GFX_VERSION # should be 11.0.0
```
When build errors occur
```bash
# Check ROCm version
apt show rocm-core | grep Version
# Verify that HIP_PATH is properly configured
echo $HIP_PATH # should be /opt/rocm
```
Verification Commands
```bash
# Verify ROCm version
apt show rocm-core | grep Version
# Confirm GPU detection
rocminfo | grep -E "Name|gfx"
# Verify GPU detection in PyTorch (if PyTorch is installed)
python3 -c "import torch; print(torch.cuda.is_available())"
```
Conclusion
Even with the integrated GPU (Radeon 780M) in AMD's Ryzen 8000 series, transcription was possible at sufficiently practical speeds when using proper settings and quantized models.
The key points are:
- HSA_OVERRIDE_GFX_VERSION=11.0.0 must be set
- Use a quantized model (q5_0, etc.)
- Works reliably with ROCm 6.4.1
Even without an NVIDIA GPU, having a modern AMD CPU allows you to set up a high-speed transcription environment. We encourage you to give it a try.