Project Category: Individual Project (Personal Productivity Tool)
I built a custom Python pipeline that uses Whisper AI with AMD GPU acceleration via DirectML. The architecture is purposefully restrictive: it targets only AMD Radeon RX 6000/7000/8000 series GPUs on Windows to maximize DirectML optimization. My implementation handles long-form audio (30+ minutes) through intelligent chunking and achieves a 3-5x speedup over CPU processing.
Understanding the core technologies behind this project:
**OpenAI Whisper:** A state-of-the-art speech recognition model trained on 680,000 hours of multilingual data. Whisper uses a transformer encoder-decoder architecture, converting audio spectrograms into text tokens. It supports 99 languages and handles background noise, accents, and technical terminology remarkably well.
**DirectML:** A low-level DirectX 12 API that provides GPU acceleration for machine learning workloads on Windows. Unlike CUDA (NVIDIA-only) or ROCm (AMD, primarily Linux), DirectML is hardware-agnostic and runs on any DirectX 12-capable GPU, which makes it the practical acceleration path for AMD cards on Windows. It enables neural network inference on consumer GPUs without requiring specialized ML hardware.
**ONNX Runtime:** Open Neural Network Exchange (ONNX) Runtime is a cross-platform inference engine for machine learning models. Models built in PyTorch or TensorFlow are converted to the optimized ONNX format, then executed through pluggable execution providers (like DirectML) for hardware acceleration. In this pipeline, that conversion happens once and is cached for subsequent runs.
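As an illustration, here is a minimal sketch of pointing an ONNX Runtime session at DirectML. The model file name is a placeholder, not the pipeline's actual cache path:

```python
import onnxruntime as ort

# Prefer DirectML, fall back to CPU if no compatible GPU is present.
# "whisper_encoder.onnx" is a placeholder name, not the tool's real cache file.
session = ort.InferenceSession(
    "whisper_encoder.onnx",
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # Shows which provider was actually selected
```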
**Librosa:** A Python library for audio analysis. I use it to load audio files, extract duration metadata, and split long audio into 30-second chunks. Librosa handles sample rate conversion and audio format normalization, ensuring consistent input for the Whisper model.
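A minimal sketch of that load-and-chunk step, assuming Whisper's 16 kHz input rate (the file name is illustrative):

```python
import librosa

# Whisper expects 16 kHz mono audio; librosa resamples and downmixes on load.
audio, sr = librosa.load("audio.mp3", sr=16000, mono=True)
duration = librosa.get_duration(y=audio, sr=sr)

# Split into 30-second chunks, matching the model's fixed input window.
chunk_samples = 30 * sr
chunks = [audio[i:i + chunk_samples] for i in range(0, len(audio), chunk_samples)]
```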
• I engineered a 30-second audio segmentation system that works around Whisper's fixed input window, enabling full transcription of 20+ minute files (see the sketch after this list)
• I wrote explicit tensor-to-device transfers (input_features.to(device)) to force GPU utilization, reducing CPU load from 70% to 20-30%
• My code automatically converts Whisper models to ONNX format on first run, then caches them for 5-10x faster subsequent executions
• I built a custom progress bar with per-chunk timing analytics, displaying real-time factor and GPU confirmation
• I created setup.bat, which validates the Python version, installs DirectML dependencies, and verifies GPU acceleration in one command
• My pipeline gracefully handles missing GPUs, corrupted audio, and model download failures with detailed error messages
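A hypothetical outline of the chunked transcription loop from the first bullet; `processor`, `model`, and `device` stand in for the objects created during pipeline setup, and the exact names are assumptions:

```python
# chunks comes from the librosa segmentation step; names here are illustrative.
texts = []
for chunk in chunks:
    features = processor(chunk, sampling_rate=16000, return_tensors="pt").input_features
    features = features.to(device)  # keep inference on the DirectML device
    predicted_ids = model.generate(features)
    texts.append(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
transcript = " ".join(texts)
```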
I intentionally limited this tool to:
• AMD Radeon RX 6000/7000/8000 series GPUs only
• Windows 10/11 (native, no WSL/Linux/macOS)
• Python 3.13.x specifically
This restriction allows me to optimize exclusively for DirectML without maintaining CUDA/ROCm branches, resulting in cleaner code and better AMD performance.
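For example, with the torch-directml package the hardware check can fail fast on unsupported machines; this is a sketch under that assumption, not the tool's exact code:

```python
import torch_directml

# Refuse to fall back silently to CPU on unsupported hardware.
if not torch_directml.is_available():
    raise RuntimeError("No DirectML-capable AMD GPU detected")
device = torch_directml.device()
print(f"Using GPU: {torch_directml.device_name(0)}")
```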
| Model | First Run | Cached Runs | VRAM Usage | Use Case |
|---|---|---|---|---|
| tiny | 5-10 sec | 2-3 sec | ~1GB | Testing/fast transcription |
| base | 10-20 sec | 3-5 sec | ~2GB | Recommended |
| small | 30-60 sec | 8-12 sec | ~4GB | Better accuracy |
| medium | 2-3 min | 15-25 sec | ~8GB | High accuracy |
| large | 3-5 min | 30-60 sec | ~16GB | Best accuracy |
Note: Times are for 30 seconds of audio. First run includes ONNX conversion (one-time cost). Models cache automatically after first use.
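The cache-on-first-run pattern behind these numbers looks roughly like this; the cache path and both helper functions are hypothetical, not the tool's real names:

```python
from pathlib import Path

# Illustrative cache check; export_model_to_onnx and create_directml_session
# are hypothetical helpers, and the cache layout is an assumption.
cache = Path.home() / ".cache" / "whisper_onnx" / "base.onnx"
if not cache.exists():
    cache.parent.mkdir(parents=True, exist_ok=True)
    export_model_to_onnx("base", cache)   # one-time first-run cost
session = create_directml_session(cache)  # cached runs start here
```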
cd "C:\Program Files (x86)\helper_tools\Audio_to_Text_Transcriber"
setup.bat
# From helper_tools root
mp3-to-txt.bat "path/to/audio.mp3"
py audio_to_text.py "audio.mp3" --model base --language en
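The command-line surface implied above could be wired up like this; the flag names match the usage shown, while the defaults and help text are assumptions:

```python
import argparse

parser = argparse.ArgumentParser(description="DirectML-accelerated Whisper transcription")
parser.add_argument("audio", help="Path to the input audio file")
parser.add_argument("--model", default="base",
                    choices=["tiny", "base", "small", "medium", "large"])
parser.add_argument("--language", default="en", help="Language hint passed to Whisper")
args = parser.parse_args()
```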
======================================================================
✅ TRANSCRIPTION COMPLETE
======================================================================
Total time: 85.42s
Audio duration: 1242.8s (20.7 minutes)
Average per chunk: 2.03s
Real-time factor: 0.07x (lower is faster)
Total characters: 18547
======================================================================
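The summary stats relate to each other directly; using the sample run's numbers:

```python
total_time = 85.42        # seconds of processing
audio_duration = 1242.8   # seconds of audio (20.7 minutes)
chunk_count = 42

rtf = total_time / audio_duration         # 0.069 -> reported as 0.07x
avg_per_chunk = total_time / chunk_count  # ~2.03s per 30-second chunk
```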
Out of the box, Whisper processes only the first 30 seconds of audio because the model has a fixed input window. I re-architected the entire pipeline to handle unlimited audio length:
[████████░░░░░░░░░░░░░░░░░░░░░░] 50.0% - Chunk 21/42 (30.0s audio)
Processing time: 2.03s | GPU: DirectML (AMD Radeon RX 6800 XT)
I wrote a real-time progress visualization that displays:
• Percent complete with a visual bar
• Current chunk index and total chunk count
• Per-chunk processing time and real-time factor
• Confirmation that DirectML and the specific GPU are in use
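A minimal sketch of how such a progress line can be rendered (simplified relative to the actual implementation):

```python
def show_progress(done, total, chunk_secs, elapsed, gpu_name):
    # 30-character bar matching the sample output above
    filled = int(30 * done / total)
    bar = "█" * filled + "░" * (30 - filled)
    print(f"[{bar}] {100 * done / total:.1f}% - Chunk {done}/{total} ({chunk_secs:.1f}s audio)")
    print(f"Processing time: {elapsed:.2f}s | GPU: DirectML ({gpu_name})")
```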
I discovered that Whisper's default behavior processes features on CPU even with GPU available. My fix:
# My explicit GPU transfer code (processor, model, and device are created during setup)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device)      # Force tensors onto the DirectML device
predicted_ids = model.generate(input_features)  # Decoding now runs on the GPU
If DirectML acceleration fails or is no longer detected, reinstall the runtime:
py -m pip uninstall onnxruntime-directml -y
py -m pip install onnxruntime-directml --force-reinstall
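After reinstalling, you can confirm the DirectML build is the one ONNX Runtime sees:

```python
import onnxruntime as ort

# "DmlExecutionProvider" should appear in this list on a working install.
print(ort.get_available_providers())
```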