Audio Processing
Overview
Deeptrain's audio processing module bridges the gap between acoustic data and Large Language Models. It allows AI agents to "hear" and understand spoken content, environmental sounds, or recorded meetings by converting audio into high-fidelity text transcriptions and searchable vector embeddings.
By leveraging the Transcribe API, you can feed raw audio data into your RAG (Retrieval-Augmented Generation) pipeline, enabling agents to answer questions based on podcasts, lectures, or voice memos.
Supported Audio Formats
Deeptrain supports a wide range of standard audio formats, including:
- Lossless: `.wav`, `.flac`
- Lossy: `.mp3`, `.m4a`, `.aac`, `.ogg`
The Transcribe API
The Transcribe API is the primary interface for converting audio signals into actionable text data. It handles noise reduction, speaker diarization, and timestamping to ensure the resulting text is contextually rich for LLM consumption.
Usage Example
```python
from deeptrain import AudioProcessor

# Initialize the processor
audio_tool = AudioProcessor(api_key="your_api_key")

# Transcribe a local file or a remote URL
result = audio_tool.transcribe(
    source="path/to/meeting_record.mp3",
    language="en",
    diarization=True
)

print(f"Transcription: {result['text']}")
print(f"Speakers identified: {result['speaker_count']}")
```
API Reference
| Parameter | Type | Description |
| :--- | :--- | :--- |
| `source` | str | Path to a local audio file or a valid URL. |
| `language` | str | (Optional) ISO 639-1 language code. Defaults to auto-detection. |
| `diarization` | bool | Whether to distinguish between different speakers. |
| `timestamp` | bool | If true, returns the start/end time for every transcribed segment. |

Returns: A `TranscriptionResult` object containing the full text, metadata, and optional speaker segments.
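The examples above access the result with dict-style keys (`result['text']`). As a minimal sketch of what you might do with diarized, timestamped output, the snippet below assembles segments into a readable speaker-labeled transcript. Note that the segment field names (`start`, `end`, `speaker`, `text`) are assumptions for illustration, not the documented `TranscriptionResult` schema.

```python
# Hypothetical segment layout -- field names are assumptions,
# not the documented TranscriptionResult schema.
segments = [
    {"start": 0.0, "end": 4.2, "speaker": "S1", "text": "Welcome to the meeting."},
    {"start": 4.2, "end": 9.8, "speaker": "S2", "text": "Thanks, let's review the agenda."},
]

def format_transcript(segments):
    """Render diarized segments as '[mm:ss] SPEAKER: text' lines."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['speaker']}: {seg['text']}")
    return "\n".join(lines)

print(format_transcript(segments))
```

A transcript flattened this way can be passed straight to an LLM prompt, preserving who said what and when.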
Integrating Audio into AI Memory
Beyond simple transcription, Deeptrain allows you to index audio content directly into your localized embedding database. This enables your AI agent to perform semantic searches across hours of audio data without needing to re-process the files.
Vectorizing Audio Content
```python
from deeptrain import MultiModalConnector

connector = MultiModalConnector()

# Process audio and store in the localized embedding database
connector.ingest_audio(
    file_path="podcasts/episode_01.wav",
    metadata={"category": "education", "id": 101}
)

# Query the audio content using natural language
response = connector.query("What was discussed regarding transformer architectures in the podcast?")
```
Key Capabilities
- Model-Agnostic Integration: Use the transcribed output with any of the 200+ supported LLMs, including GPT-4, Claude, or Llama 3.
- Context Extension: Circumvent context window limits by storing transcribed audio in a vector store and retrieving only the relevant snippets during inference.
- Live Stream Processing: Deeptrain uses an internal buffering mechanism to handle long-form audio streams, keeping memory usage bounded while transcribing large files.
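The context-extension pattern above can be sketched in a few lines: split a transcript into chunks, score each chunk against the query, and pass only the best match to the model. This toy version uses keyword overlap purely for illustration; Deeptrain's actual retrieval is embedding-based, and the function names here are hypothetical.

```python
# Toy sketch of chunk-and-retrieve: store transcript chunks, retrieve only
# the most relevant one at inference time. Keyword overlap stands in for
# Deeptrain's embedding-based similarity search.
def chunk_text(text, max_words=50):
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def retrieve(chunks, query, top_k=1):
    query_terms = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(query_terms & set(c.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

transcript = (
    "Today we discussed transformer architectures and attention heads. "
    "Later the conversation moved to release planning and budget."
)
chunks = chunk_text(transcript, max_words=8)
print(retrieve(chunks, "transformer architectures"))
```

Only the retrieved snippet enters the prompt, so hours of transcribed audio never have to fit in the model's context window at once.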
Configuration
To optimize audio processing, you can configure the following global settings in your `config.yaml` or environment variables:
- `AUDIO_CHUNK_SIZE`: Determines the segment length for processing large files (default: 30s).
- `SAMPLING_RATE`: Sets the target frequency for audio normalization (default: 16000 Hz).
- `ENABLE_NOISE_CANCELLATION`: Boolean flag to toggle pre-processing filters for clearer transcriptions.
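As a sketch, a `config.yaml` using these settings might look like the following. The flat key layout and value formats (e.g. `30s` vs. a bare integer) are assumptions; check your deployment's schema for the exact shape.

```yaml
# Hypothetical config.yaml sketch -- keys mirror the documented settings;
# nesting and value formats are assumptions for illustration.
AUDIO_CHUNK_SIZE: 30s              # segment length for large files
SAMPLING_RATE: 16000               # Hz, target for normalization
ENABLE_NOISE_CANCELLATION: true    # pre-processing filters on/off
```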