Voice vs chat

The same Swiggy tool, different response contracts. Design for TTS and rich cards separately.

A search_restaurants response that works well in Claude's chat UI (a long list with rich cards, ratings, and distances) is a disaster on a car's voice assistant: it reads out 18 restaurant names while the user is trying to change lanes.

Voice and chat surfaces want different things from the same tool. Your agent's job is to shape the answer for its surface.

When to assume voice

You're on voice if any of these are true:

  • The client is a car, TV, smart speaker, or ambient surface.
  • The response will be TTS-rendered without a screen.
  • The user spoke instead of typing.
  • Your framework's surface metadata says voice (some frameworks expose ctx.surface === "voice").

Otherwise, assume chat: there is a visible screen, the user can scan results, and widgets help.
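
The checklist above can be sketched as a small detection function. This is only an illustration: the RequestContext shape and its field names (clientKind, hasScreen, inputMode) are hypothetical, and your framework will expose this information differently, if at all.

```typescript
type Surface = "voice" | "chat";

// Hypothetical request context; none of these field names come from a real framework.
interface RequestContext {
  surface?: Surface; // explicit surface metadata, when the framework provides it
  clientKind?: "car" | "tv" | "smart-speaker" | "ambient" | "phone" | "desktop";
  hasScreen?: boolean; // false when the response is TTS-rendered without a screen
  inputMode?: "speech" | "text";
}

const VOICE_CLIENTS = new Set<string>(["car", "tv", "smart-speaker", "ambient"]);

// Returns "voice" if any of the voice signals is present, otherwise "chat".
function detectSurface(ctx: RequestContext): Surface {
  if (ctx.surface === "voice") return "voice";
  if (ctx.clientKind !== undefined && VOICE_CLIENTS.has(ctx.clientKind)) return "voice";
  if (ctx.hasScreen === false) return "voice";
  if (ctx.inputMode === "speech") return "voice";
  return "chat";
}
```

Because any single signal is enough, a spoken request on a phone with a screen is still treated as voice here. If that is too aggressive for your product, let explicit surface metadata override the inferred signals.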

Voice response contract

Your system prompt should include something like:

Chat response contract

You are on a chat surface (Claude, ChatGPT, or similar). Your responses will
be rendered with markdown and may include widgets.

- Present up to 8 restaurants in a search result as a concise list with
  name, cuisine, rating, distance, delivery time.
- Use the widget returned alongside search_restaurants (restaurant-card) when
  your client supports rendering it.
- Show cart items as a markdown table with item, quantity, price, total.
- Still confirm before place_food_order, but in text ("Confirm order? Reply
  yes to place.").

Sample voice exchange

User:   Order biryani
Agent:  I found two options near your home address. The closest is Biryani
        House, about 30 minutes, forty-nine rupees off with a welcome
        coupon. Want that one?
User:   Yes
Agent:  One chicken biryani from Biryani House. Total four hundred and sixty
        nine rupees, cash on delivery. Confirm to place?
User:   Confirm
Agent:  Placed. You'll see delivery updates in the Swiggy app.

Sample chat exchange

**User**: Order biryani
 
**Agent**:
 
I found these near your home:
 
| Restaurant       | Rating | Distance | Delivery |
|------------------|--------|----------|----------|
| Biryani House    | 4.5 ★  | 2.1 km   | 30 min   |
| Paradise Biryani | 4.3 ★  | 3.8 km   | 40 min   |
| Kebab Factory    | 4.4 ★  | 5.2 km   | 45 min   |
 
[Restaurant card widget for Biryani House]
 
Which one?

Tool-selection differences

The same tool calls for different handling on each surface, and some tools suit voice especially well:

| Tool | Voice | Chat |
|------|-------|------|
| your_go_to_items (Instamart reorder) | Perfect - "reorder your usual?" one-shot | Also good, but search is fine on screen |
| search_menu with many results | Compress to top 3 | Show up to 10 |
| fetch_food_coupons | Read top 1 | Show whole list |
| track_food_order | Say ETA only | Show full timeline |
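
One way to apply the numeric limits from this table is a per-tool lookup that trims results before the model sees them. The sketch below is an assumption about how you might structure this in your own agent code. RESULT_LIMITS and trimForSurface are hypothetical names, not part of any Swiggy API, and tools without a listed limit pass through untouched.

```typescript
type Surface = "voice" | "chat";

// Limits taken from the table above; Infinity means "show the whole list".
const RESULT_LIMITS: Record<string, Record<Surface, number>> = {
  search_menu: { voice: 3, chat: 10 },
  fetch_food_coupons: { voice: 1, chat: Infinity },
};

// Keeps only the first N results for the given tool and surface.
// Assumes the tool already returns results in ranked order.
function trimForSurface<T>(tool: string, surface: Surface, results: T[]): T[] {
  const perTool = RESULT_LIMITS[tool];
  const limit = perTool ? perTool[surface] : Infinity;
  return results.slice(0, limit);
}
```

Trimming in code rather than only in the prompt means a voice surface cannot read out a long list even if the model ignores its instructions.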

What Swiggy does for you

Tool responses include fields optimized for both surfaces:

  • shortDescription (voice-friendly, 1 sentence)
  • longDescription (chat-friendly, includes structured data)
  • deliveryTimeSpoken (e.g. "about 30 minutes") vs deliveryTimeRange (e.g. "25-35 MIN")

Use the right field for your surface.
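
As a rough sketch, field selection can be a single function keyed on the surface. The four field names come from the list above. Grouping them on one RestaurantResult object, and the describe helper itself, are assumptions made for illustration, so check the actual tool response shape before relying on this.

```typescript
type Surface = "voice" | "chat";

// Field names are from the list above; the object they live on is assumed.
interface RestaurantResult {
  shortDescription: string; // voice-friendly, one sentence
  longDescription: string; // chat-friendly, includes structured data
  deliveryTimeSpoken: string; // e.g. "about 30 minutes"
  deliveryTimeRange: string; // e.g. "25-35 MIN"
}

// Picks the voice or chat variant of each field for the current surface.
function describe(r: RestaurantResult, surface: Surface): string {
  if (surface === "voice") {
    return `${r.shortDescription} Delivery in ${r.deliveryTimeSpoken}.`;
  }
  return `${r.longDescription} (${r.deliveryTimeRange})`;
}
```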

Guardrails common to both

  • Never autonomously place an order without user confirmation. Surfaces differ in the shape of the confirmation, not its necessity.
  • Always surface distance for far restaurants (>5 km on Food, >10 km on Dineout).
  • Respect the ₹1000 cart cap on Food; warn the user before they add an item that would push the cart over the cap.
  • Never expose raw IDs, tokens, or internal codes, whether spoken aloud or shown on screen.
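
The numeric guardrails above are easy to enforce in code instead of leaving them to the prompt alone. The following is a minimal sketch: all function and constant names are hypothetical, and it assumes a cart total of exactly ₹1000 is still allowed, which the guardrail does not spell out.

```typescript
type Vertical = "food" | "dineout";

// Thresholds from the guardrails above.
const FAR_KM: Record<Vertical, number> = { food: 5, dineout: 10 };
const FOOD_CART_CAP_RUPEES = 1000;

// True when the restaurant is far enough that its distance must be surfaced.
function mustMentionDistance(vertical: Vertical, distanceKm: number): boolean {
  return distanceKm > FAR_KM[vertical];
}

// True when adding the item would push a Food cart over the cap.
// Assumption: a total equal to the cap is allowed.
function wouldExceedCartCap(cartTotalRupees: number, itemPriceRupees: number): boolean {
  return cartTotalRupees + itemPriceRupees > FOOD_CART_CAP_RUPEES;
}

// Call this immediately before place_food_order on every surface.
function assertConfirmed(userConfirmed: boolean): void {
  if (!userConfirmed) {
    throw new Error("Refusing to place order without explicit user confirmation");
  }
}
```

Running wouldExceedCartCap before each add-to-cart call lets the agent warn the user up front rather than fail at checkout.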