Achieving Lightning-Fast Transcription with AMD Ryzen Integrated GPU and Whisper.cpp

Created on December 17, 2025 at 11:37 AM

I've been using the Minisforum UM870 Slim as my primary server machine, but until now I've barely touched its integrated GPU. Letting the AMD Ryzen 8000 series' Radeon 780M sit idle is quite a waste.

So I decided to try using ROCm to run inference on the GPU, including for PyTorch-based models.

Whisper, a popular speech-to-text transcription tool, is typically used with NVIDIA GPUs. However, with AMD's integrated GPUs, it should be possible to achieve fast transcription by utilizing ROCm.

In this experiment, I set up a transcription environment on Ubuntu Server 24.04, combining ROCm 6.4.1 with whisper.cpp to leverage the AMD integrated GPU.

Environment Information 

| Item    | Value                                    |
| ------- | ---------------------------------------- |
| Product | Minisforum UM870 Slim                    |
| OS      | Ubuntu Server 24.04                      |
| CPU     | AMD Ryzen 7 8745H (16 threads)           |
| GPU     | AMD Radeon 780M (Phoenix3 iGPU, gfx1103) |
| RAM     | 32GB                                     |
| ROCm    | 6.4.1                                    |

Why whisper.cpp? 

There are multiple implementations of OpenAI's Whisper; I chose whisper.cpp for the following reasons:

  • Lightweight: Fewer dependencies in the C++ implementation
  • Fast: Supports quantized models and is memory efficient
  • Flexible: Easy to use from the command line

Setup Procedure 

1. Add the ROCm repository 

commandline
# Fetch the repository signing key used by signed-by below
wget -qO- https://repo.radeon.com/rocm/rocm.gpg.key | \
  gpg --dearmor | sudo tee /usr/share/keyrings/rocm-archive-keyring.gpg > /dev/null

# Add ROCm repository
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/rocm-archive-keyring.gpg] https://repo.radeon.com/rocm/apt/6.4.1 noble main" | sudo tee /etc/apt/sources.list.d/rocm.list
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/rocm-archive-keyring.gpg] https://repo.radeon.com/amdgpu/6.4.1/ubuntu noble main" | sudo tee /etc/apt/sources.list.d/amdgpu.list

sudo apt update
sudo apt upgrade

2. Install ROCm-related packages 

commandline
# Basic Package
sudo apt install -y rocm-core rocm-hip-runtime rocminfo

# Development Tools
sudo apt install -y hip-dev rocm-dev hipblas-dev rocblas-dev

# Build Tools
sudo apt install -y cmake build-essential ffmpeg
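
Depending on the installation, the user running whisper.cpp may also need permission to access the GPU device nodes (/dev/kfd and /dev/dri). On Ubuntu this is typically granted through the render and video groups; a step that may be required on some setups:

commandline
# Grant GPU device access to the current user (log out and back in to apply)
sudo usermod -aG render,video $USER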

3. Set environment variables 

The Radeon 780M uses the gfx1103 architecture, which ROCm does not officially support, so it must be reported to the runtime as gfx1100, as set below.
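
Set the override before building and running whisper.cpp; adding it to ~/.bashrc makes it persistent across sessions:

commandline
# Report the gfx1103 iGPU to ROCm as gfx1100
export HSA_OVERRIDE_GFX_VERSION=11.0.0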

4. Build whisper.cpp 

commandline
# clone
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp

# Build with HIP/ROCm enabled
export CMAKE_PREFIX_PATH=/opt/rocm
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release -j$(nproc)

5. Download the Model 

commandline
# Quantized medium model (recommended)
curl -L -o models/ggml-medium-q5_0.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-medium-q5_0.bin

# Other Models
bash ./models/download-ggml-model.sh small
bash ./models/download-ggml-model.sh large

Usage 

Basic Usage Examples 

commandline
# Basic Transcription
./build/bin/whisper-cli -m models/ggml-medium-q5_0.bin -f input.mp3 -l ja

# Output in SRT subtitle format
./build/bin/whisper-cli -m models/ggml-medium-q5_0.bin -f input.mp3 -l ja -osrt

# Output in JSON format
./build/bin/whisper-cli -m models/ggml-medium-q5_0.bin -f input.mp3 -l ja -oj

# Specify output destination
./build/bin/whisper-cli -m models/ggml-medium-q5_0.bin -f input.mp3 -l ja -oj -of ./output/result

Frequently Used Options 

  • -l ja: Process as Japanese
  • -osrt: Output in SRT subtitle format
  • -oj: Output in JSON format
  • -of <path>: Specify output file path
  • -t <threads>: Specify number of threads to use
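
To confirm that inference is actually running on the GPU, a CPU-only run makes a useful baseline. A quick comparison, assuming your build includes the -ng/--no-gpu flag (present in recent whisper.cpp builds):

commandline
# GPU run (default) vs. CPU-only run; compare the wall-clock times
time ./build/bin/whisper-cli -m models/ggml-medium-q5_0.bin -f input.mp3 -l ja
time ./build/bin/whisper-cli -m models/ggml-medium-q5_0.bin -f input.mp3 -l ja -ng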

Troubleshooting 

When GPU hang errors occur 

First check whether you are using a non-quantized model. For integrated GPUs, we strongly recommend a quantized version (such as q5_0 or q8_0).

What is a quantized model? 

Quantization is a technique that converts model weight parameters to lower precision (e.g., 32-bit → 5-bit); see the conversion sketch after this list. It provides the following benefits:

  • Reduced memory usage: From approximately 1.5GB for the medium model to about 500MB
  • Improved inference speed: Reduced memory bandwidth load leads to faster processing
  • Avoids GPU hang: Can operate even with limited VRAM on integrated GPUs
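
You can also produce quantized models locally instead of downloading them, using the quantize tool that ships with whisper.cpp; a sketch, assuming the default CMake build placed it at build/bin/quantize:

commandline
# Convert an FP32 model to q5_0 locally
./build/bin/quantize models/ggml-medium.bin models/ggml-medium-q5_0.bin q5_0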

Quantized Model Types 

| Model Name           | Bits         | Approx. Size | Precision      | Recommended Use                                        |
| -------------------- | ------------ | ------------ | -------------- | ------------------------------------------------------ |
| ggml-medium.bin      | 32bit (FP32) | ~1.5GB       | Best           | Dedicated GPUs ⚠️ Not recommended for integrated GPUs  |
| ggml-medium-q8_0.bin | 8bit         | ~900MB       | High           | Balance-focused                                        |
| ggml-medium-q5_0.bin | 5bit         | ~500MB       | Middle to high | Integrated GPUs ✅ Recommended                          |
| ggml-medium-q4_0.bin | 4bit         | ~400MB       | Middle         | When memory constraints are severe                     |

Why Do We Need Quantization for Integrated GPUs? 

The Radeon 780M uses shared memory (a portion of system RAM), so its memory management differs from that of dedicated GPUs. In this setup, 32GB of RAM was installed, yet GPU hang errors still occurred with the non-quantized medium model (1.5GB).

This is not simply a memory shortage; the following factors may be contributing:

  • Integrated GPU memory bandwidth limitations: Shared memory access is slower than the GDDR memory on dedicated GPUs
  • ROCm driver optimization: Large-model processing on integrated GPUs is not fully optimized
  • gfx1103 architecture limitations: The GPU runs as gfx1100 (11.0.0) via HSA_OVERRIDE_GFX_VERSION rather than with native support

In practice, switching to the q5_0 model (approximately 500MB) eliminated the "GPU Hang" errors and allowed stable operation. I attribute this to the smaller model's reduced memory bandwidth requirements and computational load.
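
To see how much memory the driver actually gives the iGPU, rocm-smi (installed with the ROCm packages) can report both pools; a quick check, assuming these flags are available in your ROCm version:

commandline
# Dedicated VRAM carve-out reserved for the iGPU
rocm-smi --showmeminfo vram

# GTT: the shared system-memory pool the iGPU can borrow from
rocm-smi --showmeminfo gtt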

Impact on Precision 

Even the q5_0 model provides sufficient accuracy for practical applications. In Japanese transcription, we didn't notice any significant quality differences between models. For absolute maximum accuracy, consider the large model's q5_0 version.
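
The same Hugging Face repository hosts quantized large models as well; a download sketch, assuming the ggml-large-v3-q5_0.bin filename (verify the exact name on the repository page):

commandline
# Quantized large-v3 model (filename assumed; check the repository page)
curl -L -o models/ggml-large-v3-q5_0.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-q5_0.bin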

If the GPU is not recognized 

commandline
# Check GPU detection
rocminfo | grep -E "Name|gfx"

# Check environment variables
echo $HSA_OVERRIDE_GFX_VERSION  # should be 11.0.0

When build errors occur 

commandline
# Check ROCm version
apt show rocm-core | grep Version

# Verify that HIP_PATH is set correctly
echo $HIP_PATH  # should be /opt/rocm

Operation Verification Commands 

commandline
# Check ROCm version
apt show rocm-core | grep Version

# Check GPU detection
rocminfo | grep -E "Name|gfx"

# Check GPU detection in PyTorch (ROCm builds of PyTorch expose the GPU through the torch.cuda API)
python3 -c "import torch; print(torch.cuda.is_available())"

Conclusion 

Even with the integrated GPU (Radeon 780M) in AMD's Ryzen 8000 series, transcription was possible at sufficiently practical speeds when using proper settings and quantized models.

The key points are:

  1. HSA_OVERRIDE_GFX_VERSION=11.0.0 must be set
  2. Use a quantized model (q5_0, etc.)
  3. ROCm 6.4.1 works reliably with this setup

Even without an NVIDIA GPU, having a modern AMD CPU allows you to set up a high-speed transcription environment. We encourage you to give it a try.
