Merge pull request ag2ai#415 from ag2ai/387-docs-add-blogpost-for-rea…

…ltimeagent-over-rtc 387 docs add blogpost for realtimeagent over rtc
harishmohanraj · Jan 9, 2025 · b8bc478 · b8bc478
2 parents 677c309 + 4910f4f
commit b8bc478
Show file tree

Hide file tree

Showing 4 changed files with 364 additions and 0 deletions.
diff --git a/.../blog/2025-01-09-RealtimeAgent-over-WebRTC/img/webrtc_communication_diagram.png b/.../blog/2025-01-09-RealtimeAgent-over-WebRTC/img/webrtc_communication_diagram.png
diff --git a/...ite/blog/2025-01-09-RealtimeAgent-over-WebRTC/img/webrtc_connection_diagram.png b/...ite/blog/2025-01-09-RealtimeAgent-over-WebRTC/img/webrtc_connection_diagram.png
diff --git a/website/blog/2025-01-09-RealtimeAgent-over-WebRTC/index.mdx b/website/blog/2025-01-09-RealtimeAgent-over-WebRTC/index.mdx
@@ -0,0 +1,357 @@
+---
+title: Real-Time Voice Interactions over WebRTC
+authors:
+  - marklysze
+  - sternakt
+  - davorrunje
+  - davorinrusevljan
+tags: [Realtime API, Voice Agents, AI Tools, WebRTC]
+
+---
+
+<div class="blog-authors">
+   <p class="authors">Authors:</p>
+   <CardGroup cols={2}>
+      <Card href="https://github.com/marklysze">
+         <div class="col card">
+         <div class="img-placeholder">
+            <img noZoom src="https://github.com/marklysze.png" />
+         </div>
+         <div>
+            <p class="name">Mark Sze</p>
+            <p>Software Engineer at AG2.ai</p>
+         </div>
+      </div>
+      </Card>
+      <Card href="https://github.com/sternakt">
+         <div class="col card">
+         <div class="img-placeholder">
+            <img noZoom src="https://github.com/sternakt.png" />
+         </div>
+         <div>
+            <p class="name">Tvrtko Sternak</p>
+            <p>Machine Learning Engineer at Airt</p>
+         </div>
+      </div>
+      </Card>
+      <Card href="https://github.com/davorrunje">
+         <div class="col card">
+         <div class="img-placeholder">
+            <img noZoom src="https://github.com/davorrunje.png" />
+         </div>
+         <div>
+            <p class="name">Davor Runje</p>
+            <p>CTO at Airt</p>
+         </div>
+      </div>
+      </Card>
+      <Card href="https://github.com/davorinrusevljan">
+         <div class="col card">
+         <div class="img-placeholder">
+            <img noZoom src="https://github.com/davorinrusevljan.png" />
+         </div>
+         <div>
+            <p class="name">Davorin Ruševljan</p>
+            <p>Developer</p>
+         </div>
+      </div>
+      </Card>
+   </CardGroup>
+</div>
+
+![Realtime agent communication over WebRTC](img/webrtc_communication_diagram.png)
+
+**TL;DR:**
+- Build a real-time voice application using [WebRTC](https://webrtc.org/) and connect it with the [**`RealtimeAgent`**](/docs/reference/agentchat/realtime_agent/realtime_agent). [Demo implementation](https://github.com/ag2ai/realtime-agent-over-webrtc).
+- **Optimized for Real-Time Interactions**: Experience seamless voice communication with minimal latency and enhanced reliability.
+
+# **Realtime Voice Applications with WebRTC**
+
+In our [previous blog post](/blog/2025-01-08-RealtimeAgent-over-websocket), we introduced the [**`WebSocketAudioAdapter`**](/docs/reference/agentchat/realtime_agent/websocket_audio_adapter#websocketaudioadapter), a simple way to stream real-time audio using [WebSockets](https://fastapi.tiangolo.com/advanced/websockets/). While effective, [WebSockets](https://fastapi.tiangolo.com/advanced/websockets/) can face challenges with quality and reliability in high-latency or network-variable scenarios. Enter [WebRTC](https://webrtc.org/).
+
+Today, we’re excited to showcase the integration with [OpenAI Realtime API with WebRTC](https://platform.openai.com/docs/guides/realtime-webrtc), leveraging WebRTC’s peer-to-peer communication capabilities to provide a robust, low-latency, high-quality audio streaming experience directly from the browser.
+
+## **Why WebRTC?**
+[WebRTC](https://webrtc.org/) (Web Real-Time Communication) is a powerful technology for enabling direct peer-to-peer communication between browsers and servers. It was built with real-time audio, video, and data transfer in mind, making it an ideal choice for real-time voice applications. Here are some key benefits:
+
+### **1. Low Latency**
+[WebRTC's](https://webrtc.org/) peer-to-peer design minimizes latency, ensuring natural, fluid conversations.
+
+### **2. Adaptive Quality**
+[WebRTC](https://webrtc.org/) dynamically adjusts audio quality based on network conditions, maintaining a seamless user experience even in suboptimal environments.
+
+### **3. Secure by Design**
+With encryption (DTLS and SRTP) baked into its architecture, [WebRTC](https://webrtc.org/) ensures secure communication between peers.
+
+### **4. Widely Supported**
+[WebRTC](https://webrtc.org/) is supported by all major modern browsers, making it highly accessible for end users.
+
+## **How It Works**
+
+This example demonstrates using [WebRTC](https://webrtc.org/) to establish low-latency, real-time interactions with [OpenAI Realtime API with WebRTC](https://platform.openai.com/docs/guides/realtime-webrtc) from a web browser. Here's how it works:
+
+![Realtime agent communication over WebRTC](img/webrtc_connection_diagram.png)
+
+1. **Request an Ephemeral API Key**
+   - The browser connects to your backend via [WebSockets](https://fastapi.tiangolo.com/advanced/websockets/) to exchange configuration details, such as the ephemeral key and model information.
+   - [WebSockets](https://fastapi.tiangolo.com/advanced/websockets/) handle signaling to bootstrap the [WebRTC](https://webrtc.org/) session.
+   - The browser requests a short-lived API key from your server.
+
+2. **Generate an Ephemeral API Key**
+   - Your backend generates an ephemeral key via the OpenAI REST API and returns it. These keys expire after one minute to enhance security.
+
+3. **Initialize the WebRTC Connection**
+   - **Audio Streaming**: The browser captures microphone input and streams it to OpenAI while playing audio responses via an `<audio>` element.
+   - **DataChannel**: A `DataChannel` is established to send and receive events (e.g., function calls).
+   - **Session Handshake**: The browser creates an SDP offer, sends it to OpenAI with the ephemeral key, and sets the remote SDP answer to finalize the connection.
+   - The audio stream and events flow in real time, enabling interactive, low-latency conversations.
+
+## **Example: Build a Voice-Enabled Language Translator**
+Let’s walk through a practical example of using [WebRTC](https://webrtc.org/) to create a voice-enabled language translator.
+<Note>You can find the full example [here](https://github.com/ag2ai/realtime-agent-over-webrtc/tree/main).</Note>
+
+### **1. Clone the Repository**
+Start by cloning the example project from GitHub:
+```bash
+git clone https://github.com/ag2ai/realtime-agent-over-webrtc.git
+cd realtime-agent-over-webrtc
+```
+
+### **2. Set Up Environment Variables**
+Create a `OAI_CONFIG_LIST` file based on the provided `OAI_CONFIG_LIST_sample`:
+```bash
+cp OAI_CONFIG_LIST_sample OAI_CONFIG_LIST
+```
+In the `OAI_CONFIG_LIST` file, update the `api_key` with your OpenAI API key.
+
+<Warning>
+Supported key format
+
+Currently WebRTC can be used only by API keys the begin with:
+
+```
+sk-proj
+```
+
+Other keys may result internal server error  (500) on OpenAI server. For more details see [this issue](https://community.openai.com/t/realtime-api-create-sessions-results-in-500-internal-server-error/1060964/5)
+
+</Warning>
+
+### (Optional) Create and Use a Virtual Environment
+To avoid cluttering your global Python environment:
+```bash
+python3 -m venv env
+source env/bin/activate
+```
+
+### **3. Install Dependencies**
+Install the required Python packages:
+```bash
+pip install -r requirements.txt
+```
+
+### **4. Start the Server**
+Run the application with Uvicorn:
+```bash
+uvicorn realtime_over_webrtc.main:app --port 5050
+```
+When the server starts, you should see:
+```bash
+INFO:     Started server process [12345]
+INFO:     Uvicorn running on http://0.0.0.0:5050 (Press CTRL+C to quit)
+```
+
+### **5. Open the Application**
+Navigate to [**localhost:5050/start-chat**](http://localhost:5050/start-chat) in your browser. The application will request microphone permissions to enable real-time voice interaction.
+
+### **6. Start Speaking**
+To get started, simply speak into your microphone and ask a question. For example, you can say:
+
+**"What's the weather like in Rome?"**
+
+This initial question will activate the agent, and it will respond, showcasing its ability to understand and interact with you in real time.
+
+## **Code review**
+
+### WebRTC connection
+
+A lot of the [WebRTC](https://webrtc.org/) connection logic happens in the [website_files/static
+/WebRTC.js](https://github.com/ag2ai/realtime-agent-over-webrtc/blob/main/realtime_over_webrtc/website_files/static/WebRTC.js), so lets take a look at the code there first.
+
+#### **WebSocket Initialization**
+The [WebSocket](https://fastapi.tiangolo.com/advanced/websockets/) is responsible for exchanging initialization data and signaling messages.
+```javascript
+ws = new WebSocket(webSocketUrl);
+
+ws.onmessage = async event => {
+    const message = JSON.parse(event.data);
+    console.info("Received Message from AG2 backend", message);
+    if (message.type === "ag2.init") {
+        await openRTC(message.config); // Starts the WebRTC connection
+        return;
+    }
+    if (dc) {
+        dc.send(JSON.stringify(message)); // Sends data via DataChannel
+    } else {
+        console.log("DC not ready yet", message);
+    }
+};
+```
+
+#### **WebRTC Setup**
+This block configures the [WebRTC](https://webrtc.org/) connection, adds audio tracks, and initializes the `DataChannel`.
+```javascript
+async function openRTC(data) {
+    const EPHEMERAL_KEY = data.client_secret.value;
+
+    // Set up to play remote audio
+    const audioEl = document.createElement("audio");
+    audioEl.autoplay = true;
+    pc.ontrack = e => audioEl.srcObject = e.streams[0];
+
+    // Add microphone input as local audio track
+    const ms = await navigator.mediaDevices.getUserMedia({ audio: true });
+    pc.addTrack(ms.getTracks()[0]);
+
+    // Create a DataChannel
+    dc = pc.createDataChannel("oai-events");
+    dc.addEventListener("message", e => {
+        const message = JSON.parse(e.data);
+        if (message.type.includes("function")) {
+            ws.send(e.data); // Forward function messages to WebSocket
+        }
+    });
+
+    // Create and send an SDP offer
+    const offer = await pc.createOffer();
+    await pc.setLocalDescription(offer);
+
+    // Send the offer to OpenAI
+    const baseUrl = "https://api.openai.com/v1/realtime";
+    const sdpResponse = await fetch(`${baseUrl}?model=${data.model}`, {
+        method: "POST",
+        body: offer.sdp,
+        headers: {
+            Authorization: `Bearer ${EPHEMERAL_KEY}`,
+            "Content-Type": "application/sdp"
+        },
+    });
+
+    // Set the remote SDP answer
+    const answer = { type: "answer", sdp: await sdpResponse.text() };
+    await pc.setRemoteDescription(answer);
+    console.log("Connected to OpenAI WebRTC");
+}
+```
+
+### Server implementation
+
+This server implementation uses [FastAPI](https://fastapi.tiangolo.com/) to set up a [WebRTC](https://webrtc.org/) and [WebSockets](https://fastapi.tiangolo.com/advanced/websockets/) interaction, allowing clients to communicate with a chatbot powered by OpenAI's Realtime API. The server provides endpoints for a simple chat interface and real-time audio communication.
+
+#### Create an app using FastAPI
+
+First, initialize a [FastAPI](https://fastapi.tiangolo.com/) app instance to handle HTTP requests and [WebSocket](https://fastapi.tiangolo.com/advanced/websockets/) connections.
+
+```python
+app = FastAPI()
+```
+
+This creates an app instance that will be used to manage both regular HTTP requests and real-time [WebSocket](https://fastapi.tiangolo.com/advanced/websockets/) interactions.
+
+#### Define the root endpoint for status
+
+Next, define a root endpoint to verify that the server is running.
+
+```python
+@app.get("/", response_class=JSONResponse)
+async def index_page():
+    return {"message": "WebRTC AG2 Server is running!"}
+```
+
+When accessed, this endpoint responds with a simple status message indicating that the [WebRTC](https://webrtc.org/) server is up and running.
+
+#### Set up static files and templates
+
+Mount a directory for static files (e.g., CSS, JavaScript) and configure templates for rendering HTML.
+
+```python
+website_files_path = Path(__file__).parent / "website_files"
+
+app.mount(
+    "/static", StaticFiles(directory=website_files_path / "static"), name="static"
+)
+
+templates = Jinja2Templates(directory=website_files_path / "templates")
+```
+
+This ensures that static assets (like styling or scripts) can be served and that HTML templates can be rendered for dynamic responses.
+
+#### Serve the chat interface page
+
+Create an endpoint to serve the HTML page for the chat interface.
+
+```python
+@app.get("/start-chat/", response_class=HTMLResponse)
+async def start_chat(request: Request):
+    """Endpoint to return the HTML page for audio chat."""
+    port = request.url.port
+    return templates.TemplateResponse("chat.html", {"request": request, "port": port})
+```
+
+This endpoint serves the `chat.html` page and provides the port number in the template, which is used for [WebSockets](https://fastapi.tiangolo.com/advanced/websockets/) connections.
+
+#### Handle WebSocket connections for media streaming
+
+Set up a [WebSocket](https://fastapi.tiangolo.com/advanced/websockets/) endpoint to handle real-time interactions, including receiving audio streams and responding with OpenAI's model output.
+
+```python
+@app.websocket("/session")
+async def handle_media_stream(websocket: WebSocket):
+    """Handle WebSocket connections providing audio stream and OpenAI."""
+    await websocket.accept()
+
+    logger = getLogger("uvicorn.error")
+
+    realtime_agent = RealtimeAgent(
+        name="Weather Bot",
+        system_message="Hello there! I am an AI voice assistant powered by Autogen and the OpenAI Realtime API. You can ask me about weather, jokes, or anything you can imagine. Start by saying 'How can I help you'?",
+        llm_config=realtime_llm_config,
+        websocket=websocket,
+        logger=logger,
+    )
+```
+
+This [WebSocket](https://fastapi.tiangolo.com/advanced/websockets/) endpoint establishes a connection and creates a [**`RealtimeAgent`**](/docs/reference/agentchat/realtime_agent/realtime_agent) that will manage interactions with OpenAI’s Realtime API. It also includes logging for monitoring the process.
+
+#### Register and implement real-time functions
+
+Define custom real-time functions that can be called from the client side, such as fetching weather data.
+
+```python
+    @realtime_agent.register_realtime_function(
+        name="get_weather", description="Get the current weather"
+    )
+    def get_weather(location: Annotated[str, "city"]) -> str:
+        logger.info(f"Checking the weather: {location}")
+        return (
+            "The weather is cloudy." if location == "Rome" else "The weather is sunny."
+        )
+```
+
+Here, a weather-related function is registered with the [**`RealtimeAgent`**](/docs/reference/agentchat/realtime_agent/realtime_agent). It responds with a simple weather message based on the input city.
+
+#### Run the RealtimeAgent
+
+Finally, run the [**`RealtimeAgent`**](/docs/reference/agentchat/realtime_agent/realtime_agent) to start handling the [WebSocket](https://fastapi.tiangolo.com/advanced/websockets/) interactions.
+
+```python
+    await realtime_agent.run()
+```
+
+This starts the agent's event loop, which listens for incoming messages and responds accordingly.
+
+
+## **Conclusion**
+New integration of [OpenAI Realtime API with WebRTC](https://platform.openai.com/docs/guides/realtime-webrtc) unlocks the full potential of [WebRTC](https://webrtc.org/) for real-time voice applications. With its low latency, adaptive quality, and secure communication, it’s the perfect tool for building interactive, voice-enabled applications.
+
+Try it today and take your voice applications to the next level!
diff --git a/website/mint.json b/website/mint.json
@@ -505,6 +505,7 @@
         {
           "group": "Recent posts",
           "pages": [
+            "blog/2025-01-09-RealtimeAgent-over-WebRTC/index",
             "blog/2025-01-08-RealtimeAgent-over-websocket/index",
             "blog/2025-01-07-Tools-Dependency-Injection/index",
             "blog/2024-12-20-RealtimeAgent/index",