Audio Intelligence
Deeptrain’s Audio Intelligence module bridges the gap between raw auditory data and text-based Large Language Models. By converting spoken content into high-fidelity text and structured metadata, you can integrate podcasts, meeting recordings, and voice commands directly into your AI agent’s knowledge base.
Overview
The Audio Intelligence suite focuses on two primary workflows:
- Transcription: Converting audio files or streams into text.
- Contextual Injection: Feeding transcribed data into Deeptrain’s localized embedding database to enhance RAG (Retrieval-Augmented Generation) pipelines.
The Transcribe API
The Transcribe API is the primary interface for processing audio content. It supports both local files and remote sources (e.g., S3 buckets, public URLs).
Method: deeptrain.audio.transcribe()
Inputs:
| Parameter | Type | Description |
| :--- | :--- | :--- |
| source | string | The path to a local file or a valid URL pointing to a supported audio format (see Supported Formats and Limits below). |
| model | string | (Optional) The transcription model to use (default: base). |
| language | string | (Optional) ISO 639-1 language code for optimized accuracy. |
| timestamps | boolean | If true, returns word-level or segment-level timestamps. |
Output: TranscriptionResponse
{
text: string; // The full transcribed text
confidence: number; // Accuracy score (0.0 to 1.0)
duration: number; // Duration of the audio in seconds
segments: Array<{ // Included only if timestamps: true
start: number;
end: number;
text: string;
}>;
}
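When timestamps is true, the segments array can be rendered into a readable, time-coded transcript. A minimal sketch, assuming the segment shape shown in the schema above (the format_segments helper is illustrative, not part of the SDK):

```python
def format_segments(segments):
    """Render each TranscriptionResponse segment as a time-coded line."""
    return "\n".join(
        f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}"
        for seg in segments
    )

# Example segments matching the schema above
segments = [
    {"start": 0.0, "end": 2.4, "text": "Welcome to the call."},
    {"start": 2.4, "end": 5.1, "text": "Let's review the numbers."},
]
print(format_segments(segments))
# [0.0s - 2.4s] Welcome to the call.
# [2.4s - 5.1s] Let's review the numbers.
```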
Usage Examples
Processing a Local Audio File
To train an agent on a local recording, use the transcribe method and pass the resulting text to your agent's memory.
import deeptrain
# Transcribe the audio file
result = deeptrain.audio.transcribe(
source="./recordings/q3_earnings_call.mp3",
language="en",
timestamps=True
)
# Integrate into your AI Agent's context
agent = deeptrain.Agent(model="gpt-4")
agent.learn(result.text)
print(f"Processed {result.duration} seconds of audio.")
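Because the response includes a confidence score (0.0 to 1.0), you may want to gate what the agent learns from noisy recordings. A minimal sketch with an illustrative threshold (the 0.85 cutoff is an assumption, not a Deeptrain default):

```python
MIN_CONFIDENCE = 0.85  # illustrative threshold; tune for your use case

def should_ingest(result):
    """Return True if the transcription is accurate enough to learn from."""
    return result["confidence"] >= MIN_CONFIDENCE

# Example: dicts mirroring the TranscriptionResponse confidence field
print(should_ingest({"confidence": 0.93}))  # True
print(should_ingest({"confidence": 0.60}))  # False
```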
Real-time Content Sourcing
You can pull audio content from hosted platforms to provide real-time updates to your agents.
# Sourcing from a remote URL
remote_audio = "https://example.com/podcasts/latest_tech_trends.wav"
result = deeptrain.audio.transcribe(source=remote_audio)
# The transcribed text is now searchable via the localized embedding database
deeptrain.vector_db.upsert(
content=result.text,
metadata={"source": remote_audio, "type": "audio"}
)
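Long transcripts are typically split into overlapping chunks before embedding so that retrieval returns focused passages rather than an entire episode. A minimal word-based chunker sketch (the sizes are illustrative; Deeptrain's embedding database may chunk differently internally):

```python
def chunk_text(text, max_words=200, overlap=40):
    """Split text into overlapping word-based chunks for embedding."""
    words = text.split()
    step = max_words - overlap  # stride between chunk starts
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + max_words]))
        if i + max_words >= len(words):
            break  # last chunk already covers the tail
    return chunks

# Each chunk can then be upserted individually with shared metadata:
# for chunk in chunk_text(result.text):
#     deeptrain.vector_db.upsert(content=chunk, metadata={...})
```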
Supported Formats and Limits
- Supported File Types: .mp3, .wav, .m4a, .ogg, .flac
- Max File Size: 25 MB via the standard API; for larger files, use the deeptrain.audio.stream_process internal utility, which handles chunking automatically.
- Multilingual Support: Over 50 languages. If no language is specified, the system attempts auto-detection during the first 30 seconds of audio.
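A simple client-side guard can route uploads to the correct entry point based on the 25 MB limit. A sketch, assuming the limit is measured in binary megabytes (an assumption; check your account's exact quota):

```python
import os

MAX_STANDARD_BYTES = 25 * 1024 * 1024  # 25 MB standard API limit (assumed binary MB)

def pick_endpoint(size_bytes):
    """Choose transcribe() for small files, stream_process for large ones."""
    return "stream_process" if size_bytes > MAX_STANDARD_BYTES else "transcribe"

# For a real file, pass its size on disk:
# pick_endpoint(os.path.getsize("./recordings/q3_earnings_call.mp3"))
print(pick_endpoint(10 * 1024 * 1024))   # transcribe
print(pick_endpoint(100 * 1024 * 1024))  # stream_process
```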
Advanced Configuration: Speaker Diarization
For interviews or meetings with multiple participants, you can enable speaker diarization to attribute text to specific individuals.
result = deeptrain.audio.transcribe(
source="meeting_minutes.mp3",
diarize=True  # Enable speaker identification
)
# Output includes speaker labels
# [Speaker 0]: Hello everyone.
# [Speaker 1]: Hi, let's start the review.
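The labeled lines can be regrouped per speaker for downstream use (e.g., attributing action items to participants). A minimal parser sketch, assuming the [Speaker N]: label format shown above (the actual diarized response structure may differ):

```python
import re
from collections import defaultdict

SPEAKER_RE = re.compile(r"^\[Speaker (\d+)\]:\s*(.*)$")

def group_by_speaker(lines):
    """Collect diarized transcript lines into a dict keyed by speaker id."""
    grouped = defaultdict(list)
    for line in lines:
        match = SPEAKER_RE.match(line)
        if match:
            grouped[int(match.group(1))].append(match.group(2))
    return dict(grouped)

transcript = [
    "[Speaker 0]: Hello everyone.",
    "[Speaker 1]: Hi, let's start the review.",
]
print(group_by_speaker(transcript))
# {0: ['Hello everyone.'], 1: ["Hi, let's start the review."]}
```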
By leveraging these audio capabilities, your AI agents move beyond static text files, allowing them to "listen" to and learn from the vast amount of information stored in voice and audio formats.