
🎤 Audio to Text Transcriber

Whisper AI · ONNX Runtime · DirectML · Python 3.13 · AMD GPU · FFmpeg

Project Category: Individual Project (Personal Productivity Tool)

I built a custom Python pipeline that leverages Whisper AI with AMD GPU acceleration via DirectML. The architecture is purposefully restrictive, targeting only AMD Radeon RX 6000/7000/8000 series GPUs on Windows to maximize DirectML optimization. My implementation handles long-form audio (30+ minutes) through intelligent chunking and achieves a 3-5x speedup over CPU processing.

🧰 Technologies & Tools Used

Understanding the core technologies behind this project:

Whisper AI (OpenAI)

A state-of-the-art speech recognition model trained on 680,000 hours of multilingual data. Whisper uses a transformer architecture with encoder-decoder design, converting audio spectrograms into text tokens. It supports 99 languages and handles background noise, accents, and technical terminology remarkably well.
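To make the encoder-decoder flow concrete, here is a minimal sketch (an illustration assuming the Hugging Face transformers package, not the project's exact code) of how Whisper's processor turns raw audio into the log-mel spectrogram the model consumes:

# A sketch of Whisper's input representation, assuming Hugging Face transformers
import numpy as np
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
audio = np.zeros(16000 * 30, dtype=np.float32)  # 30s of silence at 16 kHz
features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
print(features.shape)  # torch.Size([1, 80, 3000]): 80 mel bins x 3000 frames (30s)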

DirectML (Microsoft)

A low-level DirectX 12 API that provides GPU acceleration for machine learning workloads on Windows. Unlike CUDA (NVIDIA-only) or ROCm (AMD on Linux), DirectML is hardware-agnostic, which makes it the natural route to AMD GPU acceleration on Windows. It enables neural network inference on consumer GPUs without requiring specialized ML hardware.
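Before running inference, the presence of the DirectML execution provider can be verified with ONNX Runtime's standard API (a minimal check, assuming onnxruntime-directml is installed):

import onnxruntime as ort

# DirectML shows up as an ONNX Runtime execution provider when available
providers = ort.get_available_providers()
print(providers)  # e.g. ['DmlExecutionProvider', 'CPUExecutionProvider']
if "DmlExecutionProvider" not in providers:
    print("DirectML not available; inference would fall back to CPU")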

ONNX Runtime

ONNX Runtime is a cross-platform inference engine that executes machine learning models in the Open Neural Network Exchange (ONNX) format. PyTorch/TensorFlow models are converted to this optimized format once, after which execution providers (like DirectML) supply hardware acceleration. The conversion happens on the first run and is cached for subsequent runs.
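The convert-once, cache-forever pattern looks roughly like this with the Optimum library (a sketch; the model name and cache directory are illustrative, not the project's actual values):

from pathlib import Path
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

onnx_dir = Path("onnx_cache/whisper-base")
if onnx_dir.exists():
    # Cached: load the previously exported ONNX graph directly
    model = ORTModelForSpeechSeq2Seq.from_pretrained(onnx_dir, provider="DmlExecutionProvider")
else:
    # First run: export the PyTorch checkpoint to ONNX, then cache it
    model = ORTModelForSpeechSeq2Seq.from_pretrained(
        "openai/whisper-base", export=True, provider="DmlExecutionProvider"
    )
    model.save_pretrained(onnx_dir)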

Librosa

A Python library for audio analysis. I use it to load audio files, extract duration metadata, and split long audio into 30-second chunks. Librosa handles sample rate conversion and audio format normalization, ensuring consistent input for the Whisper model.
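In practice that boils down to a few librosa calls (a sketch with an illustrative file path):

import math
import librosa

audio, sr = librosa.load("recording.mp3", sr=16000, mono=True)  # resample + downmix
duration = librosa.get_duration(y=audio, sr=sr)                 # seconds of audio
num_chunks = math.ceil(duration / 30)                           # 30-second chunks
print(f"{duration:.1f}s of audio -> {num_chunks} chunks")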

🔧 My Implementation

Custom Chunking Pipeline

I engineered a 30-second audio segmentation system that overcomes Whisper's token limits, enabling full transcription of 20+ minute files.

GPU Memory Management

I wrote explicit tensor-to-device transfers (input_features.to(device)) to force GPU utilization, reducing CPU load from 70% to 20-30%.

ONNX Conversion Layer

My code automatically converts Whisper models to ONNX format on first run, then caches them for 5-10x faster subsequent executions.

Real-Time Progress System

I built a custom progress bar with per-chunk timing analytics, displaying the real-time factor and GPU confirmation.

Automated Setup Script

I created setup.bat, which validates the Python version, installs DirectML dependencies, and verifies GPU acceleration in one command.

Error Handling & Fallbacks

My pipeline gracefully handles missing GPUs, corrupted audio, and model download failures with detailed error messages.
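The GPU fallback, for example, follows a standard try/except pattern (a hedged sketch; the helper name is illustrative, not the project's exact code):

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

def load_model_with_fallback(model_dir: str):
    """Try DirectML first; fall back to CPU with a clear message."""
    try:
        return ORTModelForSpeechSeq2Seq.from_pretrained(model_dir, provider="DmlExecutionProvider")
    except Exception as exc:
        print(f"DirectML unavailable ({exc}); falling back to CPU execution")
        return ORTModelForSpeechSeq2Seq.from_pretrained(model_dir, provider="CPUExecutionProvider")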

⚠️ Design Decision: Hardware Restrictions

I intentionally limited this tool to:
• AMD Radeon RX 6000/7000/8000 series GPUs only
• Windows 10/11 (native, no WSL/Linux/macOS)
• Python 3.13.x specifically

This restriction allows me to optimize exclusively for DirectML without maintaining CUDA/ROCm branches, resulting in cleaner code and better AMD performance.

📊 Performance Benchmarks (AMD RX 6800 XT)

| Model  | First Run | Cached Runs | VRAM Usage | Use Case                   |
|--------|-----------|-------------|------------|----------------------------|
| tiny   | 5-10 sec  | 2-3 sec     | ~1GB       | Testing/fast transcription |
| base   | 10-20 sec | 3-5 sec     | ~2GB       | Recommended                |
| small  | 30-60 sec | 8-12 sec    | ~4GB       | Better accuracy            |
| medium | 2-3 min   | 15-25 sec   | ~8GB       | High accuracy              |
| large  | 3-5 min   | 30-60 sec   | ~16GB      | Best accuracy              |

Note: Times are for 30 seconds of audio. First run includes ONNX conversion (one-time cost). Models cache automatically after first use.

🚀 Setup (One-Command)

cd "C:\Program Files (x86)\helper_tools\Audio_to_Text_Transcriber"
setup.bat

What setup.bat does:

  1. Verifies Python 3.13.x installation
  2. Installs PyTorch (CPU version)
  3. Installs ONNX Runtime DirectML (AMD GPU acceleration)
  4. Installs Optimum (ONNX optimization)
  5. Installs Librosa (audio processing)
  6. Verifies DirectML GPU acceleration
  7. Optional: Pre-downloads Whisper models

💻 Usage

Interactive Mode (Recommended)

# From helper_tools root
mp3-to-txt.bat "path/to/audio.mp3"

Command Line

py audio_to_text.py "audio.mp3" --model base --language en
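Those flags map onto a conventional argparse interface (a hypothetical sketch of the CLI wiring, not the project's exact code):

import argparse

parser = argparse.ArgumentParser(description="Transcribe audio to text with Whisper")
parser.add_argument("audio", help="Path to the input audio file")
parser.add_argument("--model", default="base",
                    choices=["tiny", "base", "small", "medium", "large"])
parser.add_argument("--language", default=None, help="Language code, e.g. 'en'")
args = parser.parse_args()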

Example Output

======================================================================
✅ TRANSCRIPTION COMPLETE
======================================================================
Total time: 85.42s
Audio duration: 1242.8s (20.7 minutes)
Average per chunk: 2.03s
Real-time factor: 0.07x (lower is faster)
Total characters: 18547
======================================================================

🔧 Engineering Evolution: v2.0 Rewrite

Problem I Solved: Token Limit Bottleneck

Whisper's original implementation processed only the first 30 seconds of audio due to token limits. I re-architected the entire pipeline to handle audio of any length:
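In condensed form, the chunked pipeline looks like this (a sketch, not the project's exact code; processor, model, and device come from the setup described earlier, and the file path is illustrative):

import librosa

CHUNK_SECONDS = 30
SAMPLE_RATE = 16000

audio, _ = librosa.load("audio.mp3", sr=SAMPLE_RATE, mono=True)
chunk_len = CHUNK_SECONDS * SAMPLE_RATE

texts = []
for start in range(0, len(audio), chunk_len):
    chunk = audio[start:start + chunk_len]  # final chunk may be shorter
    features = processor(chunk, sampling_rate=SAMPLE_RATE, return_tensors="pt").input_features
    features = features.to(device)          # keep inference on the GPU
    ids = model.generate(features)
    texts.append(processor.batch_decode(ids, skip_special_tokens=True)[0])

transcript = " ".join(texts)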

My Custom Progress Bar System

[████████░░░░░░░░░░░░░░░░░░░░░░] 50.0% - Chunk 21/42 (30.0s audio)
Processing time: 2.03s | GPU: DirectML (AMD Radeon RX 6800 XT)

I wrote a real-time progress visualization that displays:
• A visual progress bar with percentage complete
• The current chunk and total chunk count
• Per-chunk processing time
• The active GPU device, confirming DirectML acceleration
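The rendering itself is a few lines of string formatting (a sketch with a hypothetical helper, not the project's exact code):

def print_progress(chunk_idx, total_chunks, chunk_seconds, elapsed, gpu_name):
    # Build a 30-character bar proportional to progress
    filled = int(30 * chunk_idx / total_chunks)
    bar = "█" * filled + "░" * (30 - filled)
    pct = 100 * chunk_idx / total_chunks
    print(f"[{bar}] {pct:.1f}% - Chunk {chunk_idx}/{total_chunks} ({chunk_seconds:.1f}s audio)")
    print(f"Processing time: {elapsed:.2f}s | GPU: {gpu_name}")

print_progress(21, 42, 30.0, 2.03, "DirectML (AMD Radeon RX 6800 XT)")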

GPU Optimization I Implemented

I discovered that Whisper's default behavior processes features on the CPU even when a GPU is available. My fix:

# My explicit GPU transfer code (processor, model, and device are created during setup)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device)  # Force DirectML device
predicted_ids = model.generate(input_features)  # Now runs on GPU
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]  # Tokens -> text
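Where device comes from is not shown above; a common way to obtain such a device (an assumption on where this project gets it, which may differ) is the torch-directml package:

# Assumption: device may come from the torch-directml package; the project
# could construct it differently.
import torch_directml

device = torch_directml.device()  # first DirectML-capable GPU adapter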

📁 Supported Formats

🔧 Troubleshooting

"DmlExecutionProvider not available"

py -m pip uninstall onnxruntime-directml -y
py -m pip install onnxruntime-directml --force-reinstall

AMD GPU not detected

View on GitHub →