Audio Processing
Overview
Deeptrain's audio processing module bridges the gap between acoustic data and Large Language Models. It allows AI agents to "hear" and understand spoken content, environmental sounds, or recorded meetings by converting audio into high-fidelity text transcriptions and searchable vector embeddings.
By leveraging the Transcribe API, you can feed raw audio data into your RAG (Retrieval-Augmented Generation) pipeline, enabling agents to answer questions based on podcasts, lectures, or voice memos.
Supported Audio Formats
Deeptrain supports a wide range of standard audio formats, including:
- Lossless: `.wav`, `.flac`
- Lossy: `.mp3`, `.m4a`, `.aac`, `.ogg`
The Transcribe API
The Transcribe API is the primary interface for converting audio signals into actionable text data. It handles noise reduction, speaker diarization, and timestamping to ensure the resulting text is contextually rich for LLM consumption.
Usage Example
```python
from deeptrain import AudioProcessor

# Initialize the processor
audio_tool = AudioProcessor(api_key="your_api_key")

# Transcribe a local file or a remote URL
result = audio_tool.transcribe(
    source="path/to/meeting_record.mp3",
    language="en",
    diarization=True
)

print(f"Transcription: {result['text']}")
print(f"Speakers identified: {result['speaker_count']}")
```
API Reference
| Parameter | Type | Description |
| :--- | :--- | :--- |
| `source` | str | Path to a local audio file or a valid URL. |
| `language` | str | (Optional) ISO 639-1 language code. Defaults to auto-detection. |
| `diarization` | bool | Whether to distinguish between different speakers. |
| `timestamp` | bool | If true, returns the start/end time for every transcribed segment. |

Returns: A `TranscriptionResult` object containing the full text, metadata, and optional speaker segments.
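The examples above access the result with dict-style keys (`result['text']`). As a minimal sketch of what you might do with diarized, timestamped output, the snippet below assembles segments into a readable speaker-labeled transcript. Note that the segment field names (`start`, `end`, `speaker`, `text`) are assumptions for illustration, not the documented `TranscriptionResult` schema.

```python
# Hypothetical segment layout -- field names are assumptions,
# not the documented TranscriptionResult schema.
segments = [
    {"start": 0.0, "end": 4.2, "speaker": "S1", "text": "Welcome to the meeting."},
    {"start": 4.2, "end": 9.8, "speaker": "S2", "text": "Thanks, let's review the agenda."},
]

def format_transcript(segments):
    """Render diarized segments as '[mm:ss] SPEAKER: text' lines."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['speaker']}: {seg['text']}")
    return "\n".join(lines)

print(format_transcript(segments))
```

A transcript flattened this way can be passed straight to an LLM prompt, preserving who said what and when.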
Integrating Audio into AI Memory
Beyond simple transcription, Deeptrain allows you to index audio content directly into your localized embedding database. This enables your AI agent to perform semantic searches across hours of audio data without needing to re-process the files.
Vectorizing Audio Content
```python
from deeptrain import MultiModalConnector

connector = MultiModalConnector()

# Process audio and store in the localized embedding database
connector.ingest_audio(
    file_path="podcasts/episode_01.wav",
    metadata={"category": "education", "id": 101}
)

# Query the audio content using natural language
response = connector.query("What was discussed regarding transformer architectures in the podcast?")
```
Key Capabilities
- Model-Agnostic Integration: Use the transcribed output with any of the 200+ supported LLMs, including GPT-4, Claude, or Llama 3.
- Context Extension: Circumvent context window limits by storing transcribed audio in a vector store and retrieving only the relevant snippets during inference.
- Live Stream Processing: Deeptrain uses an internal buffering mechanism to handle long-form audio streams, keeping memory usage bounded while transcribing large files.
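The context-extension pattern above can be sketched in a few lines: split a transcript into chunks, score each chunk against the query, and pass only the best match to the model. This toy version uses keyword overlap purely for illustration; Deeptrain's actual retrieval is embedding-based, and the function names here are hypothetical.

```python
# Toy sketch of chunk-and-retrieve: store transcript chunks, retrieve only
# the most relevant one at inference time. Keyword overlap stands in for
# Deeptrain's embedding-based similarity search.
def chunk_text(text, max_words=50):
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def retrieve(chunks, query, top_k=1):
    query_terms = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(query_terms & set(c.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

transcript = (
    "Today we discussed transformer architectures and attention heads. "
    "Later the conversation moved to release planning and budget."
)
chunks = chunk_text(transcript, max_words=8)
print(retrieve(chunks, "transformer architectures"))
```

Only the retrieved snippet enters the prompt, so hours of transcribed audio never have to fit in the model's context window at once.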
Configuration
To optimize audio processing, you can configure the following global settings in your `config.yaml` or environment variables:
- `AUDIO_CHUNK_SIZE`: Determines the segment length for processing large files (default: 30s).
- `SAMPLING_RATE`: Sets the target frequency for audio normalization (default: 16000 Hz).
- `ENABLE_NOISE_CANCELLATION`: Boolean flag to toggle pre-processing filters for clearer transcriptions.
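As a sketch, a `config.yaml` using these settings might look like the following. The flat key layout and value formats (e.g. `30s` vs. a bare integer) are assumptions; check your deployment's schema for the exact shape.

```yaml
# Hypothetical config.yaml sketch -- keys mirror the documented settings;
# nesting and value formats are assumptions for illustration.
AUDIO_CHUNK_SIZE: 30s              # segment length for large files
SAMPLING_RATE: 16000               # Hz, target for normalization
ENABLE_NOISE_CANCELLATION: true    # pre-processing filters on/off
```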