
🎤 Audio to Text Transcriber

Whisper AI · ONNX Runtime · DirectML · Python 3.13 · AMD GPU · FFmpeg

Project Category: Individual Project (Personal Productivity Tool)

I built a custom Python pipeline that leverages Whisper AI with AMD GPU acceleration via DirectML. The architecture is purposefully restrictive, targeting only AMD Radeon RX 6000/7000/8000 series GPUs on Windows to maximize DirectML optimization. My implementation handles long-form audio (30+ minutes) through intelligent chunking and achieves a 3-5x speedup over CPU processing.

🧰 Technologies & Tools Used

Understanding the core technologies behind this project:

Whisper AI (OpenAI)

A state-of-the-art speech recognition model trained on 680,000 hours of multilingual data. Whisper uses a transformer architecture with encoder-decoder design, converting audio spectrograms into text tokens. It supports 99 languages and handles background noise, accents, and technical terminology remarkably well.
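To make the encoder-decoder flow concrete, here is a minimal sketch (an illustration assuming the Hugging Face transformers package, not the project's exact code) of how Whisper's processor turns raw audio into the log-mel spectrogram the model consumes:

# A sketch of Whisper's input representation, assuming Hugging Face transformers
import numpy as np
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
audio = np.zeros(16000 * 30, dtype=np.float32)  # 30s of silence at 16 kHz
features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
print(features.shape)  # torch.Size([1, 80, 3000]): 80 mel bins x 3000 frames (30s)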

DirectML (Microsoft)

A low-level DirectX 12 API that provides GPU acceleration for machine learning workloads on Windows. Unlike CUDA (NVIDIA-only) or ROCm (AMD on Linux), DirectML is hardware-agnostic, which makes it the natural route to AMD GPU acceleration on Windows. It enables neural network inference on consumer GPUs without requiring specialized ML hardware.
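Before running inference, the presence of the DirectML execution provider can be verified with ONNX Runtime's standard API (a minimal check, assuming onnxruntime-directml is installed):

import onnxruntime as ort

# DirectML shows up as an ONNX Runtime execution provider when available
providers = ort.get_available_providers()
print(providers)  # e.g. ['DmlExecutionProvider', 'CPUExecutionProvider']
if "DmlExecutionProvider" not in providers:
    print("DirectML not available; inference would fall back to CPU")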

ONNX Runtime

ONNX Runtime is a cross-platform inference engine that executes machine learning models in the Open Neural Network Exchange (ONNX) format. PyTorch/TensorFlow models are converted to this optimized format once, after which execution providers (like DirectML) supply hardware acceleration. The conversion happens on the first run and is cached for subsequent runs.
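The convert-once, cache-forever pattern looks roughly like this with the Optimum library (a sketch; the model name and cache directory are illustrative, not the project's actual values):

from pathlib import Path
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

onnx_dir = Path("onnx_cache/whisper-base")
if onnx_dir.exists():
    # Cached: load the previously exported ONNX graph directly
    model = ORTModelForSpeechSeq2Seq.from_pretrained(onnx_dir, provider="DmlExecutionProvider")
else:
    # First run: export the PyTorch checkpoint to ONNX, then cache it
    model = ORTModelForSpeechSeq2Seq.from_pretrained(
        "openai/whisper-base", export=True, provider="DmlExecutionProvider"
    )
    model.save_pretrained(onnx_dir)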

Librosa

A Python library for audio analysis. I use it to load audio files, extract duration metadata, and split long audio into 30-second chunks. Librosa handles sample rate conversion and audio format normalization, ensuring consistent input for the Whisper model.
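In practice that boils down to a few librosa calls (a sketch with an illustrative file path):

import math
import librosa

audio, sr = librosa.load("recording.mp3", sr=16000, mono=True)  # resample + downmix
duration = librosa.get_duration(y=audio, sr=sr)                 # seconds of audio
num_chunks = math.ceil(duration / 30)                           # 30-second chunks
print(f"{duration:.1f}s of audio -> {num_chunks} chunks")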

🔧 My Implementation

Custom Chunking Pipeline

I engineered a 30-second audio segmentation system that overcomes Whisper's token limits, enabling full transcription of 20+ minute files.

GPU Memory Management

I wrote explicit tensor-to-device transfers (input_features.to(device)) to force GPU utilization, reducing CPU load from 70% to 20-30%.

ONNX Conversion Layer

My code automatically converts Whisper models to ONNX format on first run, then caches them for 5-10x faster subsequent executions.

Real-Time Progress System

I built a custom progress bar with per-chunk timing analytics, displaying the real-time factor and GPU confirmation.

Automated Setup Script

I created setup.bat, which validates the Python version, installs DirectML dependencies, and verifies GPU acceleration in one command.

Error Handling & Fallbacks

My pipeline gracefully handles missing GPUs, corrupted audio, and model download failures with detailed error messages.
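The GPU fallback, for example, follows a standard try/except pattern (a hedged sketch; the helper name is illustrative, not the project's exact code):

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

def load_model_with_fallback(model_dir: str):
    """Try DirectML first; fall back to CPU with a clear message."""
    try:
        return ORTModelForSpeechSeq2Seq.from_pretrained(model_dir, provider="DmlExecutionProvider")
    except Exception as exc:
        print(f"DirectML unavailable ({exc}); falling back to CPU execution")
        return ORTModelForSpeechSeq2Seq.from_pretrained(model_dir, provider="CPUExecutionProvider")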

⚠️ Design Decision: Hardware Restrictions

I intentionally limited this tool to:
• AMD Radeon RX 6000/7000/8000 series GPUs only
• Windows 10/11 (native, no WSL/Linux/macOS)
• Python 3.13.x specifically

This restriction allows me to optimize exclusively for DirectML without maintaining CUDA/ROCm branches, resulting in cleaner code and better AMD performance.

📊 Performance Benchmarks (AMD RX 6800 XT)

| Model  | First Run | Cached Runs | VRAM Usage | Use Case                   |
|--------|-----------|-------------|------------|----------------------------|
| tiny   | 5-10 sec  | 2-3 sec     | ~1GB       | Testing/fast transcription |
| base   | 10-20 sec | 3-5 sec     | ~2GB       | Recommended                |
| small  | 30-60 sec | 8-12 sec    | ~4GB       | Better accuracy            |
| medium | 2-3 min   | 15-25 sec   | ~8GB       | High accuracy              |
| large  | 3-5 min   | 30-60 sec   | ~16GB      | Best accuracy              |

Note: Times are for 30 seconds of audio. First run includes ONNX conversion (one-time cost). Models cache automatically after first use.

🚀 Setup (One-Command)

cd "C:\Program Files (x86)\helper_tools\Audio_to_Text_Transcriber"
setup.bat

What setup.bat does:

  1. Verifies Python 3.13.x installation
  2. Installs PyTorch (CPU version)
  3. Installs ONNX Runtime DirectML (AMD GPU acceleration)
  4. Installs Optimum (ONNX optimization)
  5. Installs Librosa (audio processing)
  6. Verifies DirectML GPU acceleration
  7. Optional: Pre-downloads Whisper models

💻 Usage

Interactive Mode (Recommended)

# From helper_tools root
mp3-to-txt.bat "path/to/audio.mp3"

Command Line

py audio_to_text.py "audio.mp3" --model base --language en
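Those flags map onto a conventional argparse interface (a hypothetical sketch of the CLI wiring, not the project's exact code):

import argparse

parser = argparse.ArgumentParser(description="Transcribe audio to text with Whisper")
parser.add_argument("audio", help="Path to the input audio file")
parser.add_argument("--model", default="base",
                    choices=["tiny", "base", "small", "medium", "large"])
parser.add_argument("--language", default=None, help="Language code, e.g. 'en'")
args = parser.parse_args()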

Example Output

======================================================================
✅ TRANSCRIPTION COMPLETE
======================================================================
Total time: 85.42s
Audio duration: 1242.8s (20.7 minutes)
Average per chunk: 2.03s
Real-time factor: 0.07x (lower is faster)
Total characters: 18547
======================================================================

🔧 Engineering Evolution: v2.0 Rewrite

Problem I Solved: Token Limit Bottleneck

Whisper's original implementation processed only the first 30 seconds of audio due to token limits. I re-architected the entire pipeline to handle audio of any length:
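In condensed form, the chunked pipeline looks like this (a sketch, not the project's exact code; processor, model, and device come from the setup described earlier, and the file path is illustrative):

import librosa

CHUNK_SECONDS = 30
SAMPLE_RATE = 16000

audio, _ = librosa.load("audio.mp3", sr=SAMPLE_RATE, mono=True)
chunk_len = CHUNK_SECONDS * SAMPLE_RATE

texts = []
for start in range(0, len(audio), chunk_len):
    chunk = audio[start:start + chunk_len]  # final chunk may be shorter
    features = processor(chunk, sampling_rate=SAMPLE_RATE, return_tensors="pt").input_features
    features = features.to(device)          # keep inference on the GPU
    ids = model.generate(features)
    texts.append(processor.batch_decode(ids, skip_special_tokens=True)[0])

transcript = " ".join(texts)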

My Custom Progress Bar System

[████████░░░░░░░░░░░░░░░░░░░░░░] 50.0% - Chunk 21/42 (30.0s audio)
Processing time: 2.03s | GPU: DirectML (AMD Radeon RX 6800 XT)

I wrote a real-time progress visualization that displays:
• A visual progress bar with percentage complete
• The current chunk and total chunk count
• Per-chunk processing time
• The active GPU device, confirming DirectML acceleration
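The rendering itself is a few lines of string formatting (a sketch with a hypothetical helper, not the project's exact code):

def print_progress(chunk_idx, total_chunks, chunk_seconds, elapsed, gpu_name):
    # Build a 30-character bar proportional to progress
    filled = int(30 * chunk_idx / total_chunks)
    bar = "█" * filled + "░" * (30 - filled)
    pct = 100 * chunk_idx / total_chunks
    print(f"[{bar}] {pct:.1f}% - Chunk {chunk_idx}/{total_chunks} ({chunk_seconds:.1f}s audio)")
    print(f"Processing time: {elapsed:.2f}s | GPU: {gpu_name}")

print_progress(21, 42, 30.0, 2.03, "DirectML (AMD Radeon RX 6800 XT)")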

GPU Optimization I Implemented

I discovered that Whisper's default behavior processes features on the CPU even when a GPU is available. My fix:

# My explicit GPU transfer code (processor, model, and device are created during setup)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device)  # Force DirectML device
predicted_ids = model.generate(input_features)  # Now runs on GPU
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]  # Tokens -> text
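Where device comes from is not shown above; a common way to obtain such a device (an assumption on where this project gets it, which may differ) is the torch-directml package:

# Assumption: device may come from the torch-directml package; the project
# could construct it differently.
import torch_directml

device = torch_directml.device()  # first DirectML-capable GPU adapter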

📁 Supported Formats

🔧 Troubleshooting

"DmlExecutionProvider not available"

py -m pip uninstall onnxruntime-directml -y
py -m pip install onnxruntime-directml --force-reinstall

AMD GPU not detected

View on GitHub →