Real-time voice moderation is available to customers on custom plans. If you’re interested in using it, please contact us.

How it works

Voice moderation analyzes live call audio as it happens. You open a streaming connection and send call audio; the speech is transcribed, and each finalized utterance is moderated by your enabled text policies (toxicity, hate, PII, wordlists, guidelines, and the rest) with no extra configuration. You receive a moderation result for each utterance once it is finalized. Unlike audio file moderation, which analyzes a complete recording after the fact, voice moderation works on a live stream and returns its verdicts during the call.

Conversations

A voice call is a conversation: a single live session with a start and an end, where every utterance belongs to the same thread. This lets you review an entire call as one unit instead of a series of disconnected messages.
  • Bring your own id. Supply a conversationId to link the call to a record in your own system. If you don’t, one is generated for you and returned in session.started. Either way, every utterance in the call shares it.
  • Filter by type. Voice utterances are tagged with the voice content type, so you can separate them from messages, posts, and other content.
Real-time voice is in early access and the streaming interface may still change. Coordinate with us before building a production integration so we can confirm the current contract and your account’s limits.

Connecting

Open a WebSocket connection to the streaming endpoint, authenticating with your API key on the upgrade request and requesting the moderationapi.v1 subprotocol.
wss://voice.moderationapi.com/v1/stream
Authorization: Bearer <your_api_key>
Sec-WebSocket-Protocol: moderationapi.v1
A missing or malformed key closes the connection with code 4401.

Start the session

Send a start frame as the first message. It declares the conversation, the audio format, and the tracks you’ll stream (for example a caller and an agent), each with an optional author id.
{
  "event": "start",
  "conversationId": "your-call-id",
  "channel": "your-channel-key",
  "mediaFormat": { "encoding": "audio/x-mulaw", "sampleRate": 8000 },
  "tracks": [
    { "name": "inbound", "authorId": "caller-123" },
    { "name": "outbound", "authorId": "agent-456" }
  ],
  "emitPartials": false,
  "metadata": { "crmTicket": "T-9912", "region": "eu" }
}
  • conversationId — optional. Omit it to have one generated and returned in session.started.
  • channel — optional. Selects which channel’s policy configuration applies.
  • tracks — stream one or both tracks. Send both inbound and outbound to moderate the full call with each side attributed to its own author, or just one track (for example only inbound) if that’s all you have access to. Audio for any track you don’t declare is ignored.
  • mediaFormat.encoding accepts audio/x-mulaw (PCMU), audio/x-alaw (PCMA), linear PCM (audio/l16, linear16), or common encoded containers (wav, mp3, ogg, flac). mediaFormat.sampleRate may be 8000–48000 Hz. Audio is passed through without resampling. The spoken language is detected automatically.
  • emitPartials — optional. Set true to also receive interim, non-final transcripts.
  • metadata — optional, arbitrary JSON attached to the conversation. Put anything you want to associate with the call here (your own ids, tags, context); it’s stored on the conversation and not interpreted by moderation.
The server replies with session.started:
{ "v": 1, "event": "session.started", "conversationId": "your-call-id", "sessionId": "", "tracks": ["inbound", "outbound"] }
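The start frame can be assembled and sanity-checked before it is sent. The sketch below shows one way to do that. buildStartFrame is a hypothetical helper, not part of the API, and its checks only mirror the ranges and optional fields described above.

```javascript
// Hypothetical helper (not part of the API): builds a start frame and rejects
// values outside the documented ranges before anything goes over the wire.
function buildStartFrame({ conversationId, channel, encoding, sampleRate, tracks, emitPartials = false, metadata }) {
  if (!Number.isFinite(sampleRate) || sampleRate < 8000 || sampleRate > 48000) {
    throw new RangeError(`sampleRate must be 8000-48000 Hz, got ${sampleRate}`);
  }
  if (!Array.isArray(tracks) || tracks.length === 0) {
    throw new TypeError("declare at least one track");
  }
  const frame = {
    event: "start",
    mediaFormat: { encoding, sampleRate },
    // authorId is optional per track, so it is only included when supplied.
    tracks: tracks.map(({ name, authorId }) => (authorId ? { name, authorId } : { name })),
    emitPartials,
  };
  // Optional top-level fields are left out entirely when not supplied.
  if (conversationId) frame.conversationId = conversationId;
  if (channel) frame.channel = channel;
  if (metadata) frame.metadata = metadata;
  return frame;
}

const startFrame = buildStartFrame({
  encoding: "audio/x-mulaw",
  sampleRate: 8000,
  tracks: [{ name: "inbound", authorId: "caller-123" }],
});
console.log(JSON.stringify(startFrame));
```

Because conversationId is omitted here, the id to store would be the one returned in session.started.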

Stream audio

Send media frames as audio arrives. Each frame carries audio for a single track, with the chunk base64-encoded in payload:
{ "event": "media", "media": { "track": "inbound", "payload": "<base64-encoded audio>" } }
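If your audio source hands you raw bytes rather than base64 chunks, the framing can be sketched as below. toMediaFrames is a hypothetical helper, and the 160-byte chunk size (20 ms of 8 kHz mu-law at one byte per sample) is an assumption, not a documented requirement.

```javascript
// Hypothetical helper: splits a Buffer of raw audio into serialized media
// frames for one track. chunkBytes = 160 is 20 ms of 8 kHz mu-law audio;
// the chunk size is an assumption, not something the API is known to require.
function toMediaFrames(audio, track, chunkBytes = 160) {
  const frames = [];
  for (let offset = 0; offset < audio.length; offset += chunkBytes) {
    const chunk = audio.subarray(offset, offset + chunkBytes);
    frames.push(JSON.stringify({ event: "media", media: { track, payload: chunk.toString("base64") } }));
  }
  return frames;
}

// 400 bytes of mu-law silence (0xff) become chunks of 160, 160 and 80 bytes.
const frames = toMediaFrames(Buffer.alloc(400, 0xff), "inbound");
console.log(frames.length); // 3
```

Each string in frames can be passed to ws.send as-is.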

End the session

Send a stop frame to end the call gracefully (or simply disconnect). The server drains any in-flight utterances, emits session.ended, and closes.
{ "event": "stop" }
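Because the server drains in-flight utterances before it emits session.ended, a client that wants the final summary has to keep listening after it sends stop. A minimal sketch, assuming a ws-style socket with on, off, and send; stopAndWait and the 5-second timeout are hypothetical choices, not part of the API:

```javascript
// Hypothetical helper: sends the stop frame, then resolves with the
// session.ended message, or rejects if it does not arrive within timeoutMs.
// The default timeout is an assumption; tune it to your own calls.
function stopAndWait(ws, timeoutMs = 5000) {
  return new Promise((resolve, reject) => {
    const onMessage = (raw) => {
      const msg = JSON.parse(raw.toString());
      // Late utterance.final frames still arrive here while the server drains.
      if (msg.event !== "session.ended") return;
      cleanup();
      resolve(msg);
    };
    const timer = setTimeout(() => {
      cleanup();
      reject(new Error("timed out waiting for session.ended"));
    }, timeoutMs);
    function cleanup() {
      clearTimeout(timer);
      ws.off("message", onMessage);
    }
    ws.on("message", onMessage);
    ws.send(JSON.stringify({ event: "stop" }));
  });
}
```

Any other message listeners on the socket keep running, so utterances that finish during the drain are still handled by your normal verdict logic.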

Using Twilio or another telephony provider

Telephony providers like Twilio stream call audio but can’t consume the moderation verdicts the gateway streams back, so they don’t connect to the gateway directly. Instead, run a thin bridge in your own backend:
  1. Accept the provider’s media stream (for Twilio, its connected / start / media messages).
  2. Open this WebSocket and map those onto the start and media frames above—pass your call id as conversationId and the caller/agent identifiers as each track’s authorId.
  3. Relay the utterance.final verdicts back to your application to act on them.
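The mapping in step 2 can be sketched as a pure function. The Twilio field names used here (start.callSid, start.tracks, start.mediaFormat, media.track, media.payload) are assumptions based on Twilio's Media Streams message format and should be checked against your provider's documentation. mapTwilioMessage and the authors lookup are hypothetical.

```javascript
// Maps one Twilio Media Streams message to the frame to send on this
// WebSocket, or null when nothing needs forwarding.
// Assumption: Twilio field names as described in the lead-in.
// `authors` is a hypothetical lookup from track name to your own author ids.
function mapTwilioMessage(msg, authors = {}) {
  switch (msg.event) {
    case "start":
      return {
        event: "start",
        conversationId: msg.start.callSid, // your call id
        mediaFormat: {
          encoding: msg.start.mediaFormat.encoding,
          sampleRate: msg.start.mediaFormat.sampleRate,
        },
        tracks: msg.start.tracks.map((name) =>
          authors[name] ? { name, authorId: authors[name] } : { name },
        ),
      };
    case "media":
      // The provider's payload is already base64, so it is forwarded untouched.
      return { event: "media", media: { track: msg.media.track, payload: msg.media.payload } };
    case "stop":
      return { event: "stop" };
    default:
      return null; // "connected" and anything unrecognized
  }
}

const twilioStart = {
  event: "start",
  start: {
    callSid: "CA123",
    tracks: ["inbound"],
    mediaFormat: { encoding: "audio/x-mulaw", sampleRate: 8000, channels: 1 },
  },
};
console.log(mapTwilioMessage(twilioStart, { inbound: "caller-123" }));
```

Keeping the mapping free of socket code makes the bridge easy to test: the surrounding handler only parses each provider message, calls the function, and sends any non-null result.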

Events you receive

Every message the server sends carries "v": 1 and an event field.
Event              When
session.started    After your start frame is accepted.
utterance.partial  Interim transcript (only if emitPartials was true).
utterance.final    A finalized utterance, with its moderation result.
warning            Non-fatal condition (e.g. a transient transcription hiccup).
session.error      Fatal error; the connection closes.
session.ended      The call ended; includes summary stats.
The key event is utterance.final—the transcribed text plus the standard moderation result (evaluation, recommendation, and policies), in the same shape as every other moderation response:
{
  "v": 1,
  "event": "utterance.final",
  "conversationId": "your-call-id",
  "contentId": "",
  "track": "inbound",
  "authorId": "caller-123",
  "text": "transcribed speech for this utterance",
  "startMs": 0,
  "endMs": 2000,
  "sttConfidence": 0.95,
  "evaluation": { "flagged": false },
  "recommendation": { "action": "allow" },
  "policies": []
}
Use the recommendation.action (allow, review, or reject) to decide what to do—see Acting on responses. When the call ends you receive session.ended with a summary:
{
  "v": 1,
  "event": "session.ended",
  "conversationId": "your-call-id",
  "sessionId": "",
  "stats": { "durationMs": 125000, "utterances": 42, "actions": { "allow": 39, "review": 2, "reject": 1 } }
}
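One way to use these two events together is to keep a running tally of recommendation.action values and compare it with stats.actions when the call ends. This is a sketch: createTally is a hypothetical helper, and it assumes that stats.actions counts the same utterance.final verdicts your client received.

```javascript
// Hypothetical bookkeeping: counts recommendation.action per call and keeps
// the utterances that need follow-up.
// Assumption: stats.actions in session.ended counts these same verdicts.
function createTally() {
  const actions = { allow: 0, review: 0, reject: 0 };
  const flagged = [];
  return {
    actions,
    flagged,
    handle(msg) {
      if (msg.event !== "utterance.final") return; // ignore other events
      const action = msg.recommendation.action;
      actions[action] = (actions[action] ?? 0) + 1;
      if (action !== "allow") flagged.push({ track: msg.track, text: msg.text, action });
    },
    // True when the local counts agree with the summary in session.ended.
    matches(ended) {
      return Object.keys(actions).every((key) => actions[key] === (ended.stats.actions[key] ?? 0));
    },
  };
}

const tally = createTally();
tally.handle({ event: "utterance.final", track: "inbound", text: "hello", recommendation: { action: "allow" } });
tally.handle({ event: "utterance.final", track: "inbound", text: "example", recommendation: { action: "review" } });
console.log(tally.actions, tally.flagged.length);
```

A mismatch reported by matches could indicate that your client missed messages, for example after a dropped connection.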

Close codes

Code  Meaning
1000  Normal close
1011  Server error
4400  Bad request (e.g. a malformed start)
4401  Authentication failed
4403  Not authorized for voice
4429  Concurrency limit reached; retry later
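A client can turn these codes into a reconnect decision. In the sketch below, classifyClose and backoffMs are hypothetical helpers. Treating 1011 as retryable, and the backoff numbers themselves, are assumptions rather than documented policy. Only 4429 is described above as worth retrying.

```javascript
// Classifies a close code from the table above.
// Assumption: 1011 is treated as retryable; only 4429 is documented as such.
function classifyClose(code) {
  switch (code) {
    case 1000: return { retry: false, reason: "normal close" };
    case 1011: return { retry: true, reason: "server error" };
    case 4400: return { retry: false, reason: "bad request" };
    case 4401: return { retry: false, reason: "authentication failed" };
    case 4403: return { retry: false, reason: "not authorized for voice" };
    case 4429: return { retry: true, reason: "concurrency limit reached" };
    default: return { retry: false, reason: `unexpected close code ${code}` };
  }
}

// Exponential backoff capped at 30 s: 1 s, 2 s, 4 s, ... (values are assumptions).
function backoffMs(attempt, baseMs = 1000, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

console.log(classifyClose(4429), backoffMs(2));
```

With the ws library, the close event passes the code as its first argument, so ws.on("close", (code) => ...) can call classifyClose(code) and schedule a reconnect with setTimeout when retry is true.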

Example

A minimal Node.js client that opens a session, streams audio from your telephony source, and acts on each verdict:
import WebSocket from "ws";

const ws = new WebSocket("wss://voice.moderationapi.com/v1/stream", "moderationapi.v1", {
  headers: { Authorization: `Bearer ${process.env.MODERATION_API_KEY}` },
});

ws.on("open", () => {
  ws.send(
    JSON.stringify({
      event: "start",
      conversationId: "call-abc-123",
      channel: "support-calls",
      mediaFormat: { encoding: "audio/x-mulaw", sampleRate: 8000 },
      tracks: [{ name: "inbound", authorId: "caller-123" }],
      emitPartials: false,
      metadata: { crmTicket: "T-9912" },
    }),
  );
});

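// Forward audio as it arrives from your telephony source (base64-encoded chunks).
// Chunks that arrive before the socket is open (or after it closes) are dropped,
// because ws.send throws while the socket is still connecting.
function sendAudio(base64Chunk) {
  if (ws.readyState !== WebSocket.OPEN) return;
  ws.send(JSON.stringify({ event: "media", media: { track: "inbound", payload: base64Chunk } }));
}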

ws.on("message", (raw) => {
  const msg = JSON.parse(raw.toString());
  switch (msg.event) {
    case "session.started":
      console.log("session started:", msg.conversationId);
      break;
    case "utterance.final":
      console.log(`[${msg.track}] ${msg.text} -> ${msg.recommendation.action}`);
      if (msg.recommendation.action === "reject") {
        // act on it — flag the call, alert an agent, etc.
      }
      break;
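    case "session.ended":
      console.log("session ended:", msg.stats);
      break;
    case "warning":
      // non-fatal; the session continues
      console.warn("warning:", msg);
      break;
    case "session.error":
      // fatal; the server closes the connection after this
      console.error("session error:", msg);
      break;
  }
});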

// When the call ends, close the session gracefully.
function endCall() {
  ws.send(JSON.stringify({ event: "stop" }));
}

Limits

Constraint         Value
Language           Detected automatically
Mode               Observe and report: verdicts are returned, the live call is not interrupted
Max call duration  1 hour
Concurrent calls   Per-account limit; contact us to raise it