Synesthesia Machine | Ahmad Ali

The Problem

The challenge of real-time audio-to-visual translation sits at the intersection of signal processing, creative coding, and performance engineering. Sound is temporal and ephemeral. Visual representation gives it persistence, makes it something you can study and share. If you could see music, what would it actually look like? Not as a graph or meter, but as an immersive visual experience.

I've always been fascinated by the boundaries between senses. Synesthesia is the neurological phenomenon where stimulation of one sense triggers automatic perception in another. Some people genuinely see colors when they hear music. The Synesthesia Machine simulates that cross-sensory experience computationally, creating a visual language for audio that's consistent, learnable, and genuinely beautiful.

The computational challenge is substantial. Real-time analysis requires extracting the right audio features, mapping them to visual parameters, and rendering output at interactive frame rates (30+ fps) while keeping latency under 30 milliseconds. A bass-heavy electronic track must look fundamentally different from solo acoustic guitar, not because I hard-coded those mappings, but because the underlying audio features naturally distinguish them.

Approach

The pipeline starts with raw audio input (live stream or uploaded file) flowing through a Python backend that performs feature extraction using librosa. Instead of rendering a waveform visualizer or spectrum analyzer (which show sound as raw data), I extract high-level spectral features that capture the perceptual character of the audio.

Four discriminative features drive the analysis:

Spectral centroid (brightness): the weighted mean frequency of the spectrum
Spectral bandwidth (richness): energy spread across the frequency range
Spectral rolloff (high-frequency content): the frequency below which 95% of energy concentrates
Onset strength (percussive energy): detects beats and transients for reactive visuals
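
To make the three spectral definitions above concrete, here is a minimal NumPy sketch that computes them for a single magnitude spectrum. It is an illustration under stated assumptions, not the project's code: the function names, the Hann-windowed test tone, and the 0.95 rolloff fraction (taken from the list above) are all choices made for this example.

```python
import numpy as np


def spectral_descriptors(magnitude, freqs, roll_percent=0.95):
    """Return (centroid, bandwidth, rolloff) in Hz for one magnitude spectrum."""
    magnitude = np.asarray(magnitude, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    total = magnitude.sum()
    if total == 0.0:
        # Silent frame: there is no spectral shape to describe
        return 0.0, 0.0, 0.0
    weights = magnitude / total
    # Centroid: magnitude-weighted mean frequency
    centroid = float(np.sum(freqs * weights))
    # Bandwidth: magnitude-weighted standard deviation around the centroid
    bandwidth = float(np.sqrt(np.sum(weights * (freqs - centroid) ** 2)))
    # Rolloff: lowest frequency where the cumulative magnitude reaches
    # roll_percent of the total
    cumulative = np.cumsum(magnitude)
    idx = int(np.searchsorted(cumulative, roll_percent * total))
    rolloff = float(freqs[min(idx, len(freqs) - 1)])
    return centroid, bandwidth, rolloff


def tone_spectrum(freq_hz, sr=22050, n_fft=2048):
    """Magnitude spectrum of a Hann-windowed sine (hypothetical test input)."""
    t = np.arange(n_fft) / sr
    frame = np.sin(2 * np.pi * freq_hz * t) * np.hanning(n_fft)
    return np.abs(np.fft.rfft(frame)), np.fft.rfftfreq(n_fft, d=1.0 / sr)
```

Feeding it a 100 Hz tone and a 4 kHz tone from tone_spectrum gives a clearly lower centroid and rolloff for the low tone, which is the kind of separation the visual mapping relies on.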

These four features are combined into a feature vector computed at approximately 30 frames per second. The 30 fps rate targets real-time visual synchronization: fast enough to feel responsive, and achievable within the browser's requestAnimationFrame budget.

The Python prototype proves that these features are discriminative: a bass-heavy electronic track produces a fundamentally different feature trajectory than solo acoustic guitar. That's the foundation the visual rendering system will build on.

The next development phase will focus on the visual rendering engine and the audio-to-visual mapping system (the creative core). The plan is to start with HTML Canvas for a 2D prototype, then evaluate WebGL if visual complexity and performance targets demand it.

Architecture

Real-time audio-to-visual translation requires two processing stages running side by side: analysis and rendering. Audio input enters the Python backend, where spectral feature extraction produces a structured feature vector at 30 fps. That feature vector is transmitted to the browser, where a Canvas or WebGL renderer consumes it and generates responsive visuals in real time.

Audio Input (Microphone/File)
    ↓
[Python Backend: Feature Extraction]
    ├─ Spectral Centroid
    ├─ Spectral Bandwidth
    ├─ Spectral Rolloff
    └─ Onset Strength
    ↓
Feature Vector (~30fps)
    ↓
[Browser: Visual Rendering]
    ├─ Canvas 2D (planned first prototype)
    └─ WebGL (future optimization)
    ↓
Real-Time Visual Output
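
The hand-off between the two stages in the diagram can be sketched as a small message type. The source does not say how the feature vector travels to the browser, so everything here is an assumption for illustration: the FeatureFrame name, its field names, and the choice of JSON as the wire format are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FeatureFrame:
    """One feature vector, emitted roughly every 1/30 s (hypothetical field names)."""
    timestamp_s: float
    centroid_hz: float
    bandwidth_hz: float
    rolloff_hz: float
    onset_strength: float


def encode_frame(frame: FeatureFrame) -> str:
    """Serialize a frame to compact JSON for transmission to the browser."""
    return json.dumps(asdict(frame), separators=(",", ":"))


def decode_frame(payload: str) -> FeatureFrame:
    """Inverse of encode_frame, handy for round-trip checks on the Python side."""
    return FeatureFrame(**json.loads(payload))
```

Including a timestamp in each frame is one way to let the renderer detect late or dropped frames; whether that is needed depends on the transport eventually chosen.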

The bottleneck is latency. Browser audio APIs add buffering latency of their own. The Python feature extraction must complete in under 30 milliseconds per frame, and the rendering loop must keep pace with the browser's refresh rate (typically 60 Hz) so the visuals stay synchronized with the audio.
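
One way to check the per-frame extraction budget is to time the extraction function directly. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the 30 ms constant is the budget stated above.

```python
import time

FRAME_BUDGET_S = 0.030  # per-frame extraction budget stated in the text


def time_extraction(extract, buffer, repeats=50):
    """Return the worst-case wall-clock time (seconds) of extract(buffer)."""
    worst = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        extract(buffer)
        worst = max(worst, time.perf_counter() - start)
    return worst


def within_budget(elapsed_s, budget_s=FRAME_BUDGET_S):
    """True when a measured extraction time fits inside the per-frame budget."""
    return elapsed_s < budget_s
```

Tracking the worst case rather than the mean is a deliberate choice here: a single slow frame is what shows up as a visible stutter.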

Key Technical Details

Spectral feature extraction with the librosa library provides the discriminative foundation. Here's the core pattern:

import librosa
import numpy as np

def extract_features(audio_signal, sr=22050):
    """Extract discriminative spectral features for visualization.

    Returns an array of shape (n_frames, 3): centroid, bandwidth, rolloff (Hz).
    """
    # Magnitude of the Short-Time Fourier Transform
    # (librosa defaults: n_fft=2048, hop_length=512)
    magnitude = np.abs(librosa.stft(audio_signal))

    # Spectral centroid: brightness of the audio
    centroid = librosa.feature.spectral_centroid(S=magnitude, sr=sr)[0]

    # Spectral bandwidth: richness and spread
    bandwidth = librosa.feature.spectral_bandwidth(S=magnitude, sr=sr)[0]

    # Spectral rolloff: frequency below which 95% of the spectral energy lies
    # (librosa's default roll_percent is 0.85, so it is set explicitly)
    rolloff = librosa.feature.spectral_rolloff(
        S=magnitude, sr=sr, roll_percent=0.95
    )[0]

    # Stack features into a matrix (time x features)
    return np.column_stack([centroid, bandwidth, rolloff])

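The output is a feature matrix with one row per analysis frame and one column per spectral descriptor. As written, with librosa's default hop length of 512 samples, rows are about 23 ms apart at a 22,050 Hz sample rate; matching the 30 fps target exactly would require a hop of 1/30 of a second. This compact representation captures enough information to drive visually distinct outputs across different audio types.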
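
The snippet above stacks three of the four features. Onset strength is the fourth, and one simple way to approximate it is spectral flux: the summed positive change in magnitude between consecutive frames. This sketch is a simplified stand-in written for illustration, not the project's method and not librosa's onset_strength, which by default works on a log-power mel spectrogram.

```python
import numpy as np


def spectral_flux(magnitude):
    """Onset-strength proxy: summed positive magnitude change between frames.

    magnitude has shape (n_bins, n_frames), e.g. np.abs(librosa.stft(...)).
    Returns one non-negative value per frame.
    """
    magnitude = np.asarray(magnitude, dtype=float)
    # Frame-to-frame change in every frequency bin
    diff = np.diff(magnitude, axis=1)
    # Keep only increases in energy; decays are not onsets
    flux = np.maximum(diff, 0.0).sum(axis=0)
    # Prepend a zero so the output has one value per input frame
    return np.concatenate([[0.0], flux])
```

Because the result has one value per STFT frame, it can be appended as a fourth column alongside the centroid, bandwidth, and rolloff.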

Real-time performance: The Python prototype achieves approximately 30 frames per second feature output. For a 44.1 kHz audio stream, I process a 1,470-sample buffer (about 33.3 ms of audio, one frame at 30 fps) on each extraction cycle. The current implementation meets this target in single-threaded Python, with room for optimization (NumPy vectorization, Cython compilation).
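
The buffer size follows directly from the sample rate and the target frame rate. A quick sketch of the arithmetic, with hypothetical helper names:

```python
def samples_per_frame(sample_rate_hz, fps):
    """Number of audio samples covered by one feature frame."""
    return round(sample_rate_hz / fps)


def frame_duration_ms(n_samples, sample_rate_hz):
    """Duration of a buffer of n_samples, in milliseconds."""
    return 1000.0 * n_samples / sample_rate_hz
```

At 44.1 kHz and 30 fps this gives 1,470 samples, which last about 33.3 ms. That leaves a few milliseconds of headroom if extraction itself finishes within the 30 ms budget.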

What's proven: The audio analysis pipeline is complete and working. Spectral features discriminate reliably across music and speech. The 30fps throughput is achievable and sustainable under the latency constraint.

What's not yet built: Visual rendering engine (Canvas/WebGL). Audio-to-visual mapping system (the creative core, determining which audio features drive which visual parameters). Real-time synchronization pipeline in the browser. User-configurable mapping profiles.
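
To give a sense of what the mapping system might look like once built, here is a purely hypothetical sketch. The specific pairings (brightness to hue, richness to saturation, onset strength to size), the value ranges, and all names are assumptions for illustration; the actual mapping is still an open design question.

```python
import colorsys


def normalize(value, lo, hi):
    """Clamp value into [lo, hi] and scale it to [0, 1]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(max((value - lo) / (hi - lo), 0.0), 1.0)


def map_features(centroid_hz, bandwidth_hz, onset_strength,
                 max_freq_hz=11025.0, max_onset=10.0):
    """Hypothetical mapping from audio features to visual parameters."""
    # Brightness drives hue: dark sounds lean blue, bright sounds lean red
    hue = 0.66 * (1.0 - normalize(centroid_hz, 0.0, max_freq_hz))
    # Richness drives saturation, with a floor so colors never wash out fully
    saturation = 0.3 + 0.7 * normalize(bandwidth_hz, 0.0, max_freq_hz / 2)
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, 1.0)
    # Percussive energy drives the size of the drawn shape (in pixels)
    radius = 10.0 + 90.0 * normalize(onset_strength, 0.0, max_onset)
    return {
        "rgb": (round(r * 255), round(g * 255), round(b * 255)),
        "radius": radius,
    }
```

Keeping the mapping as a pure function of the feature values would make user-configurable profiles straightforward: a profile becomes a set of ranges and pairings rather than new rendering code.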

Impact

Technical exploration of real-time sensory signal processing. Audio feature extraction pipeline proven discriminative for different audio types. Demonstrates feasibility of cross-sensory mapping at interactive frame rates.

Constraints

Real-time processing must stay under 30 ms latency per frame. Visual rendering engine not yet implemented. Browser audio APIs add buffering latency that the Web Audio API alone cannot fully eliminate.

Trade-offs

Web deployment for accessibility over native performance. Canvas first over WebGL to prioritize rapid iteration on the mapping system, with a planned upgrade to WebGL when visual complexity demands it. Signal analysis focus rather than final visual polish at this stage.

Links

GitHub and demo coming soon