forked from ag2ai/ag2
Merge pull request ag2ai#415 from ag2ai/387-docs-add-blogpost-for-realtimeagent-over-rtc

387 docs add blogpost for realtimeagent over rtc

Showing 4 changed files with 364 additions and 0 deletions.
.../blog/2025-01-09-RealtimeAgent-over-WebRTC/img/webrtc_communication_diagram.png (binary image, 3 additions & 0 deletions; preview not available)
...ite/blog/2025-01-09-RealtimeAgent-over-WebRTC/img/webrtc_connection_diagram.png (binary image, 3 additions & 0 deletions; preview not available)
website/blog/2025-01-09-RealtimeAgent-over-WebRTC/index.mdx (357 additions & 0 deletions)
---
title: Real-Time Voice Interactions over WebRTC
authors:
- marklysze
- sternakt
- davorrunje
- davorinrusevljan
tags: [Realtime API, Voice Agents, AI Tools, WebRTC]
---

<div class="blog-authors">
  <p class="authors">Authors:</p>
  <CardGroup cols={2}>
    <Card href="https://github.com/marklysze">
      <div class="col card">
        <div class="img-placeholder">
          <img noZoom src="https://github.com/marklysze.png" />
        </div>
        <div>
          <p class="name">Mark Sze</p>
          <p>Software Engineer at AG2.ai</p>
        </div>
      </div>
    </Card>
    <Card href="https://github.com/sternakt">
      <div class="col card">
        <div class="img-placeholder">
          <img noZoom src="https://github.com/sternakt.png" />
        </div>
        <div>
          <p class="name">Tvrtko Sternak</p>
          <p>Machine Learning Engineer at Airt</p>
        </div>
      </div>
    </Card>
    <Card href="https://github.com/davorrunje">
      <div class="col card">
        <div class="img-placeholder">
          <img noZoom src="https://github.com/davorrunje.png" />
        </div>
        <div>
          <p class="name">Davor Runje</p>
          <p>CTO at Airt</p>
        </div>
      </div>
    </Card>
    <Card href="https://github.com/davorinrusevljan">
      <div class="col card">
        <div class="img-placeholder">
          <img noZoom src="https://github.com/davorinrusevljan.png" />
        </div>
        <div>
          <p class="name">Davorin Ruševljan</p>
          <p>Developer</p>
        </div>
      </div>
    </Card>
  </CardGroup>
</div>

![Realtime agent communication over WebRTC](img/realtime_agent_websocket.png)

**TL;DR:**
- Build a real-time voice application using [WebRTC](https://webrtc.org/) and connect it with the [**`RealtimeAgent`**](/docs/reference/agentchat/realtime_agent/realtime_agent). [Demo implementation](https://github.com/ag2ai/realtime-agent-over-webrtc).
- **Optimized for Real-Time Interactions**: Experience seamless voice communication with minimal latency and enhanced reliability.

# **Realtime Voice Applications with WebRTC**

In our [previous blog post](/blog/2025-01-08-RealtimeAgent-over-websocket), we introduced the [**`WebSocketAudioAdapter`**](/docs/reference/agentchat/realtime_agent/websocket_audio_adapter#websocketaudioadapter), a simple way to stream real-time audio using [WebSockets](https://fastapi.tiangolo.com/advanced/websockets/). While effective, WebSockets can face quality and reliability challenges in high-latency or variable-network scenarios. Enter [WebRTC](https://webrtc.org/).

Today, we’re excited to showcase our integration with the [OpenAI Realtime API over WebRTC](https://platform.openai.com/docs/guides/realtime-webrtc), leveraging WebRTC’s peer-to-peer communication capabilities to provide a robust, low-latency, high-quality audio streaming experience directly from the browser.

## **Why WebRTC?**
[WebRTC](https://webrtc.org/) (Web Real-Time Communication) is a powerful technology for enabling direct peer-to-peer communication between browsers and servers. It was built with real-time audio, video, and data transfer in mind, making it an ideal choice for real-time voice applications. Here are some key benefits:

### **1. Low Latency**
[WebRTC's](https://webrtc.org/) peer-to-peer design minimizes latency, ensuring natural, fluid conversations.

### **2. Adaptive Quality**
[WebRTC](https://webrtc.org/) dynamically adjusts audio quality based on network conditions, maintaining a seamless user experience even in suboptimal environments.

### **3. Secure by Design**
With encryption (DTLS and SRTP) baked into its architecture, [WebRTC](https://webrtc.org/) ensures secure communication between peers.

### **4. Widely Supported**
[WebRTC](https://webrtc.org/) is supported by all major modern browsers, making it highly accessible for end users.

## **How It Works**

This example demonstrates using [WebRTC](https://webrtc.org/) to establish low-latency, real-time interactions with the [OpenAI Realtime API](https://platform.openai.com/docs/guides/realtime-webrtc) from a web browser. Here's how it works:

![Realtime agent communication over WebRTC](img/webrtc_communication_diagram.png)

1. **Request an Ephemeral API Key**
   - The browser connects to your backend over [WebSockets](https://fastapi.tiangolo.com/advanced/websockets/) and requests a short-lived API key.
   - The same [WebSockets](https://fastapi.tiangolo.com/advanced/websockets/) connection exchanges configuration details, such as the ephemeral key and model information, and handles the signaling that bootstraps the [WebRTC](https://webrtc.org/) session.

2. **Generate an Ephemeral API Key**
   - Your backend generates an ephemeral key via the OpenAI REST API and returns it to the browser (see the sketch after this list). These keys expire after one minute to enhance security.

3. **Initialize the WebRTC Connection**
   - **Audio Streaming**: The browser captures microphone input and streams it to OpenAI while playing audio responses via an `<audio>` element.
   - **DataChannel**: A `DataChannel` is established to send and receive events (e.g., function calls).
   - **Session Handshake**: The browser creates an SDP offer, sends it to OpenAI with the ephemeral key, and sets the remote SDP answer to finalize the connection.
   - The audio stream and events flow in real time, enabling interactive, low-latency conversations.

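To make step 2 concrete, here is a minimal backend sketch for minting the ephemeral key. It follows OpenAI's documented `POST /v1/realtime/sessions` REST endpoint; the helper name, the use of `httpx`, and the model and voice values are illustrative choices rather than code from the demo repository:

```python
import os

import httpx


async def create_ephemeral_key(model: str = "gpt-4o-realtime-preview") -> dict:
    """Mint a short-lived Realtime session key for the browser."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.openai.com/v1/realtime/sessions",
            headers={
                "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
                "Content-Type": "application/json",
            },
            json={"model": model, "voice": "alloy"},
        )
        response.raise_for_status()
        # The response includes client_secret.value, which the browser
        # uses as its Bearer token for the SDP handshake.
        return response.json()
```

The returned JSON contains `client_secret.value`, which is exactly what the browser code later reads as `data.client_secret.value` when it starts the SDP handshake.
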
## **Example: Build a Voice-Enabled Language Translator**
Let’s walk through a practical example of using [WebRTC](https://webrtc.org/) to create a voice-enabled language translator.
<Note>You can find the full example [here](https://github.com/ag2ai/realtime-agent-over-webrtc/tree/main).</Note>

### **1. Clone the Repository**
Start by cloning the example project from GitHub:
```bash
git clone https://github.com/ag2ai/realtime-agent-over-webrtc.git
cd realtime-agent-over-webrtc
```

### **2. Set Up Environment Variables**
Create an `OAI_CONFIG_LIST` file based on the provided `OAI_CONFIG_LIST_sample`:
```bash
cp OAI_CONFIG_LIST_sample OAI_CONFIG_LIST
```
In the `OAI_CONFIG_LIST` file, update the `api_key` with your OpenAI API key.

<Warning>
Supported key format

Currently, WebRTC can be used only with API keys that begin with:

```
sk-proj
```

Other keys may result in an internal server error (500) on the OpenAI server. For more details, see [this issue](https://community.openai.com/t/realtime-api-create-sessions-results-in-500-internal-server-error/1060964/5).

</Warning>

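For reference, an `OAI_CONFIG_LIST` is a JSON list of model configurations, along these lines (a minimal sketch; the exact fields and model name in `OAI_CONFIG_LIST_sample` may differ):

```json
[
    {
        "model": "gpt-4o-realtime-preview",
        "api_key": "sk-proj-..."
    }
]
```
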
### (Optional) Create and Use a Virtual Environment
To avoid cluttering your global Python environment:
```bash
python3 -m venv env
source env/bin/activate
```

### **3. Install Dependencies**
Install the required Python packages:
```bash
pip install -r requirements.txt
```

### **4. Start the Server**
Run the application with Uvicorn:
```bash
uvicorn realtime_over_webrtc.main:app --port 5050
```
When the server starts, you should see:
```bash
INFO: Started server process [12345]
INFO: Uvicorn running on http://0.0.0.0:5050 (Press CTRL+C to quit)
```

### **5. Open the Application**
Navigate to [**localhost:5050/start-chat**](http://localhost:5050/start-chat) in your browser. The application will request microphone permissions to enable real-time voice interaction.

### **6. Start Speaking**
To get started, simply speak into your microphone and ask a question. For example, you can say:

**"What's the weather like in Rome?"**

This initial question will activate the agent, and it will respond, showcasing its ability to understand and interact with you in real time.

## **Code review**

### WebRTC connection

A lot of the [WebRTC](https://webrtc.org/) connection logic happens in [website_files/static/WebRTC.js](https://github.com/ag2ai/realtime-agent-over-webrtc/blob/main/realtime_over_webrtc/website_files/static/WebRTC.js), so let’s take a look at the code there first.

#### **WebSocket Initialization**
The [WebSocket](https://fastapi.tiangolo.com/advanced/websockets/) is responsible for exchanging initialization data and signaling messages.
```javascript
// `ws` and the DataChannel `dc` live at module scope in WebRTC.js,
// so both the signaling handler here and openRTC below can use them.
ws = new WebSocket(webSocketUrl);

ws.onmessage = async event => {
    const message = JSON.parse(event.data);
    console.info("Received Message from AG2 backend", message);
    if (message.type === "ag2.init") {
        await openRTC(message.config); // Starts the WebRTC connection
        return;
    }
    if (dc) {
        dc.send(JSON.stringify(message)); // Sends data via DataChannel
    } else {
        console.log("DC not ready yet", message);
    }
};
```

#### **WebRTC Setup**
This block configures the [WebRTC](https://webrtc.org/) connection, adds audio tracks, and initializes the `DataChannel`.
```javascript
async function openRTC(data) {
    const EPHEMERAL_KEY = data.client_secret.value;

    // Create the peer connection (added here so the snippet is complete;
    // the demo may declare `pc` elsewhere in WebRTC.js)
    const pc = new RTCPeerConnection();

    // Set up to play remote audio
    const audioEl = document.createElement("audio");
    audioEl.autoplay = true;
    pc.ontrack = e => audioEl.srcObject = e.streams[0];

    // Add microphone input as local audio track
    const ms = await navigator.mediaDevices.getUserMedia({ audio: true });
    pc.addTrack(ms.getTracks()[0]);

    // Create a DataChannel
    dc = pc.createDataChannel("oai-events");
    dc.addEventListener("message", e => {
        const message = JSON.parse(e.data);
        if (message.type.includes("function")) {
            ws.send(e.data); // Forward function messages to WebSocket
        }
    });

    // Create and send an SDP offer
    const offer = await pc.createOffer();
    await pc.setLocalDescription(offer);

    // Send the offer to OpenAI
    const baseUrl = "https://api.openai.com/v1/realtime";
    const sdpResponse = await fetch(`${baseUrl}?model=${data.model}`, {
        method: "POST",
        body: offer.sdp,
        headers: {
            Authorization: `Bearer ${EPHEMERAL_KEY}`,
            "Content-Type": "application/sdp"
        },
    });

    // Set the remote SDP answer
    const answer = { type: "answer", sdp: await sdpResponse.text() };
    await pc.setRemoteDescription(answer);
    console.log("Connected to OpenAI WebRTC");
}
```

### Server implementation

This server implementation uses [FastAPI](https://fastapi.tiangolo.com/) to set up a [WebRTC](https://webrtc.org/) and [WebSockets](https://fastapi.tiangolo.com/advanced/websockets/) interaction, allowing clients to communicate with a chatbot powered by OpenAI's Realtime API. The server provides endpoints for a simple chat interface and real-time audio communication.

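The snippets below assume imports roughly like these at the top of `main.py` (a sketch inferred from the symbols used; check the repository for the exact import paths, in particular for `RealtimeAgent`):

```python
from logging import getLogger
from pathlib import Path
from typing import Annotated

from fastapi import FastAPI, Request, WebSocket
from fastapi.responses import HTMLResponse, JSONResponse
from fastapi.staticfiles import StaticFiles
from fastapi.templating import Jinja2Templates

from autogen.agentchat.realtime_agent import RealtimeAgent
```
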
#### Create an app using FastAPI

First, initialize a [FastAPI](https://fastapi.tiangolo.com/) app instance to handle HTTP requests and [WebSocket](https://fastapi.tiangolo.com/advanced/websockets/) connections.

```python
app = FastAPI()
```

This creates an app instance that will be used to manage both regular HTTP requests and real-time [WebSocket](https://fastapi.tiangolo.com/advanced/websockets/) interactions.

#### Define the root endpoint for status

Next, define a root endpoint to verify that the server is running.

```python
@app.get("/", response_class=JSONResponse)
async def index_page():
    return {"message": "WebRTC AG2 Server is running!"}
```

When accessed, this endpoint responds with a simple status message indicating that the [WebRTC](https://webrtc.org/) server is up and running.

#### Set up static files and templates

Mount a directory for static files (e.g., CSS, JavaScript) and configure templates for rendering HTML.

```python
website_files_path = Path(__file__).parent / "website_files"

app.mount(
    "/static", StaticFiles(directory=website_files_path / "static"), name="static"
)

templates = Jinja2Templates(directory=website_files_path / "templates")
```

This ensures that static assets (like styling or scripts) can be served and that HTML templates can be rendered for dynamic responses.

#### Serve the chat interface page

Create an endpoint to serve the HTML page for the chat interface.

```python
@app.get("/start-chat/", response_class=HTMLResponse)
async def start_chat(request: Request):
    """Endpoint to return the HTML page for audio chat."""
    port = request.url.port
    return templates.TemplateResponse("chat.html", {"request": request, "port": port})
```

This endpoint serves the `chat.html` page and provides the port number in the template, which is used for [WebSockets](https://fastapi.tiangolo.com/advanced/websockets/) connections.

#### Handle WebSocket connections for media streaming

Set up a [WebSocket](https://fastapi.tiangolo.com/advanced/websockets/) endpoint to handle real-time interactions, including receiving audio streams and responding with OpenAI's model output.

```python
@app.websocket("/session")
async def handle_media_stream(websocket: WebSocket):
    """Handle WebSocket connections providing audio stream and OpenAI."""
    await websocket.accept()

    logger = getLogger("uvicorn.error")

    realtime_agent = RealtimeAgent(
        name="Weather Bot",
        system_message="Hello there! I am an AI voice assistant powered by Autogen and the OpenAI Realtime API. You can ask me about weather, jokes, or anything you can imagine. Start by saying 'How can I help you'?",
        llm_config=realtime_llm_config,
        websocket=websocket,
        logger=logger,
    )
```

This [WebSocket](https://fastapi.tiangolo.com/advanced/websockets/) endpoint establishes a connection and creates a [**`RealtimeAgent`**](/docs/reference/agentchat/realtime_agent/realtime_agent) that will manage interactions with OpenAI’s Realtime API. It also includes logging for monitoring the process.

#### Register and implement real-time functions

Define custom real-time functions that can be called from the client side, such as fetching weather data.

```python
@realtime_agent.register_realtime_function(
    name="get_weather", description="Get the current weather"
)
def get_weather(location: Annotated[str, "city"]) -> str:
    logger.info(f"Checking the weather: {location}")
    return (
        "The weather is cloudy." if location == "Rome" else "The weather is sunny."
    )
```

Here, a weather-related function is registered with the [**`RealtimeAgent`**](/docs/reference/agentchat/realtime_agent/realtime_agent). It responds with a simple weather message based on the input city.

#### Run the RealtimeAgent

Finally, run the [**`RealtimeAgent`**](/docs/reference/agentchat/realtime_agent/realtime_agent) to start handling the [WebSocket](https://fastapi.tiangolo.com/advanced/websockets/) interactions.

```python
await realtime_agent.run()
```

This starts the agent's event loop, which listens for incoming messages and responds accordingly.

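To see how these pieces fit together, here is the `/session` endpoint condensed from the fragments above (the system message is abbreviated; the full version lives in the demo repository):

```python
@app.websocket("/session")
async def handle_media_stream(websocket: WebSocket):
    """Handle WebSocket connections providing audio stream and OpenAI."""
    await websocket.accept()
    logger = getLogger("uvicorn.error")

    # Create the agent that manages the OpenAI Realtime API session
    realtime_agent = RealtimeAgent(
        name="Weather Bot",
        system_message="Hello there! I am an AI voice assistant...",  # as above
        llm_config=realtime_llm_config,
        websocket=websocket,
        logger=logger,
    )

    # Register tools before starting the event loop
    @realtime_agent.register_realtime_function(
        name="get_weather", description="Get the current weather"
    )
    def get_weather(location: Annotated[str, "city"]) -> str:
        logger.info(f"Checking the weather: {location}")
        return "The weather is cloudy." if location == "Rome" else "The weather is sunny."

    # Runs for the lifetime of the session
    await realtime_agent.run()
```
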
## **Conclusion**
This new integration of the [OpenAI Realtime API with WebRTC](https://platform.openai.com/docs/guides/realtime-webrtc) unlocks the full potential of [WebRTC](https://webrtc.org/) for real-time voice applications. With its low latency, adaptive quality, and secure communication, it’s the perfect tool for building interactive, voice-enabled applications.

Try it today and take your voice applications to the next level!