TRAMBA: A Hybrid Transformer and Mamba Architecture for Practical Audio and Bone Conduction Speech Super Resolution and Enhancement on Mobile and Wearable Platforms

Abstract

We present an audio super-resolution model that processes speech and music signals using neural networks. The model is optimized with both time-domain and frequency-domain loss functions. We explore different reconstruction strategies that draw on a range of perceptual and adversarial losses. Our approach focuses on enhancing both low- and high-frequency audio content to produce high-quality outputs. The results show improvements in speech and music quality across various tasks.
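As a rough illustration of what optimizing with both time-domain and frequency-domain losses can look like, the sketch below combines a waveform L1 term with an L1 term on short-time spectral magnitudes. It is not the model's actual objective: the function names, the frame length, the hop size, and the weight `alpha` are all assumptions chosen for the example, and the perceptual and adversarial terms mentioned above are left out.

```python
import numpy as np

def stft_magnitude(x, frame_len=256, hop=64):
    """Magnitude spectrogram from Hann-windowed frames (illustrative settings)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def combined_loss(pred, target, alpha=0.5):
    """Weighted sum of a time-domain L1 term and a spectral-magnitude L1 term.

    alpha is a hypothetical mixing weight, not a value taken from the paper.
    """
    time_loss = np.mean(np.abs(pred - target))
    freq_loss = np.mean(np.abs(stft_magnitude(pred) - stft_magnitude(target)))
    return alpha * time_loss + (1.0 - alpha) * freq_loss

# Toy usage: a clean tone versus a noisy copy of it.
t = np.arange(1024) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)
noisy = clean + 0.1 * np.random.default_rng(0).standard_normal(clean.shape)
loss_value = combined_loss(noisy, clean)
```

The time-domain term penalizes sample-level mismatch, while the spectral term penalizes errors in how energy is distributed across frequency, which is one common reason to pair the two.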

Example 1: IMU Signal

Low-resolution IMU Signal

Ground Truth

TFiLM

TRAMBA

Example 2: IMU Signal

Low-resolution IMU Signal

Ground Truth

TFiLM

TRAMBA

Example 3: BCM Signal

Low-resolution BCM Signal

Ground Truth

TFiLM

TRAMBA

Example 4: BCM Signal

Low-resolution BCM Signal

Ground Truth

TFiLM

TRAMBA