# Transcribe API

## Overview
The Transcribe API is Deeptrain's core gateway for converting raw multi-modal content into machine-readable text. It allows you to ingest audio and video from various sources—including local files, self-hosted servers, and external platforms like YouTube or Vimeo—and transform them into structured training data for your LLMs and AI agents.
By leveraging this API, you can bridge the gap between non-textual media and your localized embedding database, enabling your models to "listen" to and "watch" content to improve their contextual awareness.
## API Specification

**Endpoint:** `POST /v1/transcribe`
The Transcribe API accepts either a direct file upload or a source URL. It processes the media asynchronously or synchronously depending on your configuration and returns the transcribed text along with metadata.
### Request Parameters
| Parameter | Type | Required | Description |
| :--- | :--- | :--- | :--- |
| `source_url` | String | No* | The URL of the video or audio file (YouTube, Vimeo, S3, etc.). |
| `file` | Multipart | No* | The local binary file to be uploaded. |
| `language` | String | No | ISO 639-1 code (e.g., `en`, `fr`). Defaults to auto-detection. |
| `output_format` | String | No | The desired output structure (`text`, `json`, `srt`). Defaults to `text`. |
| `training_sync` | Boolean | No | If `true`, the result is automatically indexed into your Deeptrain embedding database. |
\*Either `source_url` or `file` must be provided.
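The either/or rule on `source_url` and `file` can be enforced client-side before any request is sent. The sketch below is illustrative; the helper name is not part of the API or SDK:

```python
def build_transcribe_payload(source_url=None, file=None, language=None,
                             output_format="text", training_sync=False):
    """Assemble a request payload for POST /v1/transcribe.

    Exactly one media source -- source_url or file -- must be given,
    mirroring the API's own validation rule.
    """
    if (source_url is None) == (file is None):
        raise ValueError("Provide exactly one of source_url or file")
    payload = {"output_format": output_format, "training_sync": training_sync}
    if language is not None:
        payload["language"] = language  # ISO 639-1 code, e.g. "en"
    if source_url is not None:
        payload["source_url"] = source_url
    else:
        payload["file"] = file
    return payload
```

Failing fast locally avoids a round trip that the server would reject anyway.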
### Supported Formats

- **Audio:** MP3, WAV, M4A, FLAC, AAC
- **Video:** MP4, MOV, AVI, MKV, WEBM
- **External platforms:** YouTube, Vimeo, and direct links to self-hosted media files
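A simple pre-upload check against the supported file extensions can catch unsupported media before it leaves your machine. This helper is illustrative, not part of the SDK:

```python
# File extensions accepted per the Supported Formats list above.
SUPPORTED_EXTENSIONS = {
    # Audio
    ".mp3", ".wav", ".m4a", ".flac", ".aac",
    # Video
    ".mp4", ".mov", ".avi", ".mkv", ".webm",
}

def is_supported_media(filename):
    """Return True if the filename's extension is in the supported set."""
    dot = filename.rfind(".")
    return dot != -1 and filename[dot:].lower() in SUPPORTED_EXTENSIONS
```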
## Usage Examples

### Transcribing a YouTube Video for Training
To ingest a public video and automatically sync it to your AI agent's knowledge base, use the `source_url` parameter and set `training_sync` to `true`.
```shell
curl -X POST https://api.deeptrain.ai/v1/transcribe \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "source_url": "https://www.youtube.com/watch?v=example",
    "language": "en",
    "training_sync": true
  }'
```
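The same request can be built from plain Python with the standard library, with no SDK required. The sketch below only constructs the request (it does not send it); the endpoint and bearer-token scheme mirror the curl example, and `YOUR_API_KEY` is a placeholder:

```python
import json
import urllib.request

body = {
    "source_url": "https://www.youtube.com/watch?v=example",
    "language": "en",
    "training_sync": True,
}
req = urllib.request.Request(
    "https://api.deeptrain.ai/v1/transcribe",
    data=json.dumps(body).encode("utf-8"),
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    method="POST",
)
# response = urllib.request.urlopen(req)  # uncomment to actually send
```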
### Transcribing a Local Audio File (Python SDK)
If you are using the Deeptrain Python client, you can pass a local file path directly.
```python
from deeptrain import DeeptrainClient

client = DeeptrainClient(api_key="YOUR_API_KEY")

result = client.transcribe(
    file_path="./meeting_notes.mp3",
    language="en",
    training_sync=False,
)
print(f"Transcription: {result['text']}")
```
## Response Schema
Upon a successful request, the API returns a JSON object containing the transcribed content and processing metadata.
```json
{
  "id": "trans_8472910",
  "status": "completed",
  "data": {
    "text": "Welcome to the Deeptrain multi-modal tutorial...",
    "duration": 124.5,
    "language": "en",
    "word_count": 450
  },
  "training_status": {
    "synced": true,
    "vector_id": "vec_001923"
  }
}
```
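A response shaped like the schema above can be unpacked defensively before use. The field names follow the documented schema; the helper itself is an illustrative sketch, not an SDK function:

```python
def summarize_transcription(resp):
    """Extract the key fields from a /v1/transcribe response dict."""
    if resp.get("status") != "completed":
        raise RuntimeError(
            f"Transcription {resp.get('id')} not finished: {resp.get('status')}"
        )
    data = resp["data"]
    training = resp.get("training_status", {})
    return {
        "id": resp["id"],
        "text": data["text"],
        "duration_sec": data["duration"],
        "language": data["language"],
        # vector_id is only meaningful when the result was synced
        "vector_id": training.get("vector_id") if training.get("synced") else None,
    }
```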
## Best Practices
- **Batching:** For large video libraries, use the asynchronous processing flag to avoid connection timeouts.
- **Contextual accuracy:** For technical content (e.g., coding tutorials), set the `language` parameter explicitly to improve transcription accuracy for domain-specific terminology.
- **Data privacy:** When using local files, Deeptrain processes the data through secure, encrypted buffers before indexing it into your localized embedding database.
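When processing asynchronously, the client needs to poll for completion. The section above does not specify a status endpoint, so the fetcher is injected as a callable in this sketch; everything here is illustrative:

```python
import time

def wait_for_transcription(fetch_status, job_id, poll_interval=5.0, timeout=600.0):
    """Poll an asynchronous transcription job until it finishes.

    fetch_status(job_id) is any callable returning the job's current
    response dict (e.g. a GET on a job-status endpoint of your choosing).
    Returns the final response, or raises TimeoutError on expiry.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = fetch_status(job_id)
        if resp.get("status") in ("completed", "failed"):
            return resp
        time.sleep(poll_interval)
    raise TimeoutError(f"Job {job_id} did not finish within {timeout}s")
```

Injecting the fetcher keeps the polling loop testable and independent of any particular HTTP client.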