
Input tokens are one of the biggest drivers of LLM API costs. In long conversations, especially those containing high-resolution images or large files, resending the entire history (say, 50 messages) on every request wastes money and increases time to first token (TTFT). LLM Router’s Context Optimization addresses this by analyzing the user’s latest message against the conversation history. If the user changes topics, or if earlier heavy media is no longer relevant to the current question, the router automatically prunes the context before sending it to the model provider.

1. How Chat Optimization Works

When a request arrives, our internal Gateway AI generates a chat_score (from 0.0 to 1.0). This score represents how heavily the user’s current message relies on the past messages.
  • Score 1.0 (High Dependency): “Fix the error in the second file you sent.” (Needs full history).
  • Score 0.1 (Low Dependency): “Completely new topic: Write a Haiku.” (Needs zero history).
You define a Threshold Score using the chatHistoryOptimization.score setting. If the internal chat_score is LESS than your configured threshold, LLM Router activates the optimization engine to strip out old, irrelevant messages and compress long text blocks.
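The threshold rule above can be sketched in a few lines. This is only an illustration of the comparison as described, not LLM Router's internal code: the `shouldOptimizeHistory` function and the `ChatHistoryOptimization` interface are hypothetical names, and the `chatScore` value is assumed to come from the gateway.

```typescript
// Hypothetical shape of the chatHistoryOptimization setting (assumed for illustration).
interface ChatHistoryOptimization {
  enabled: boolean;
  score: number; // threshold between 0.0 and 1.0
}

// Returns true when history optimization would be triggered under the rule
// described above: the internal chat_score is strictly less than the threshold.
function shouldOptimizeHistory(
  chatScore: number,
  settings: ChatHistoryOptimization,
): boolean {
  if (!settings.enabled) return false;
  return chatScore < settings.score;
}

const exampleSettings: ChatHistoryOptimization = { enabled: true, score: 0.6 };

console.log(shouldOptimizeHistory(0.1, exampleSettings)); // true: new topic, history can be pruned
console.log(shouldOptimizeHistory(1.0, exampleSettings)); // false: message depends on the full history
```

Under this reading, a higher threshold makes optimization more aggressive, because more scores fall below it.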

2. How Media Optimization Works

Multimodal inputs (like sending a UI mockup to Claude 3.5 Sonnet) are expensive in input tokens. Often, a user uploads an image early in a chat but later asks a purely text-based question. If you enable mediaOptimization, the router detects whether the current prompt actually requires the earlier images. If it doesn’t, LLM Router strips the image data out of the history array and replaces it with a short text placeholder, saving the corresponding “Vision” tokens.
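As a rough sketch of the stripping step, the function below replaces image parts in earlier messages with a text placeholder. It assumes the relevance decision has already been made elsewhere; `stripPastMedia` and the type names are hypothetical, and the placeholder string is the one shown in Scenario A of this page.

```typescript
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: { url: string } };
type ContentPart = TextPart | ImagePart;
type ChatMessage = {
  role: "system" | "user" | "assistant";
  content: string | ContentPart[];
};

const MEDIA_PLACEHOLDER = "[MEDIA_REMOVED_TO_SAVE_COST]";

// Replaces every image part in past messages with a short text placeholder.
// The latest message is left untouched so any media it carries still reaches the model.
// Assumption: the caller has already decided that past images are irrelevant.
function stripPastMedia(messages: ChatMessage[]): ChatMessage[] {
  return messages.map((message, index) => {
    const isLatest = index === messages.length - 1;
    if (isLatest || typeof message.content === "string") return message;
    return {
      ...message,
      content: message.content.map((part): ContentPart =>
        part.type === "image_url"
          ? { type: "text", text: MEDIA_PLACEHOLDER }
          : part,
      ),
    };
  });
}

const history: ChatMessage[] = [
  {
    role: "user",
    content: [
      { type: "text", text: "What color is this button?" },
      { type: "image_url", image_url: { url: "https://example.com/ui.png" } },
    ],
  },
  { role: "assistant", content: "The button is blue." },
  { role: "user", content: "Actually, ignore that. How do I install Python?" },
];

console.log(JSON.stringify(stripPastMedia(history), null, 2));
```

The sketch returns new message objects rather than mutating the input, so the original history stays available if a later turn refers back to the image.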

Configuration

You configure these behaviors inside the gateway object.
TypeScript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.llmrouter.app/v1",
  apiKey: process.env.LLM_ROUTER_API_KEY,
});

async function main() {
  const response = await client.chat.completions.create({
    model: "anthropic/claude-3-5-sonnet", // Or leave blank if using Tags
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "What color is this button?" },
          {
            type: "image_url",
            image_url: { url: "https://example.com/ui.png" },
          },
        ],
      },
      { role: "assistant", content: "The button is blue." },
      // ... later in the chat ...
      {
        role: "user",
        content: "Actually, ignore that. How do I install Python?",
      },
    ],

    // @ts-expect-error - Custom LLM Router extension
    gateway: {
      chatHistoryOptimization: {
        enabled: true, // Master switch for text history optimization
        score: 0.6, // The threshold that triggers optimization
      },
      mediaOptimization: true, // Enables stripping of irrelevant images/audio
    },
  });

  console.log(response.choices[0].message.content);
}
main().catch(console.error); // Surface API errors instead of an unhandled promise rejection

Visualizing Chat & Media Optimization

To understand how much this can save, the following scenarios show how LLM Router transforms your messages array before sending it to the upstream model.

Scenario A: The Multi-Modal Topic Shift (Media Stripping)

  1. The Setup: A user uploads a high-res UI mockup (~2,000 tokens).
  2. The Shift: After 2 turns, the user asks, “Build a landing page for a car rental business called…”
  3. The Analysis: LLM Router calculates a chat_score of 0.1 (Total Topic Change) and determines the image is no longer needed.
Before Optimization (What you sent) Cost: ~2,500 Input Tokens
[
  {
    "role": "user",
    "content": [
      { "type": "text", "text": "What color is this button?" },
      { "type": "image_url", "image_url": { "url": "https://..." } } // EXPENSIVE!
    ]
  },
  { "role": "assistant", "content": "The button is blue." },

  // --- THE TOPIC SHIFT ---
  {
    "role": "user",
    "content": "Build a landing page for a car rental business called…"
  }
]
After Optimization (What LLM Router sent to Anthropic) Cost: ~60 Input Tokens (Saved 97%)
[
  // 1. The image is STRIPPED and replaced with a cheap text placeholder
  {
    "role": "user",
    "content": [
      { "type": "text", "text": "What color is this button?" },
      { "type": "text", "text": "[MEDIA_REMOVED_TO_SAVE_COST]" } // CHEAP!
    ]
  },
  { "role": "assistant", "content": "The button is blue." },

  // 2. The new request remains intact
  {
    "role": "user",
    "content": "Build a landing page for a car rental business called…"
  }
]
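The savings figure for Scenario A follows directly from the two token counts. A minimal check of that arithmetic, using a hypothetical `percentSaved` helper:

```typescript
// Percentage of input tokens saved when a request shrinks from `before` to `after`.
function percentSaved(before: number, after: number): number {
  if (before <= 0) throw new RangeError("before must be a positive token count");
  return ((before - after) / before) * 100;
}

// Scenario A: ~2,500 input tokens before optimization, ~60 after.
console.log(percentSaved(2500, 60).toFixed(1)); // "97.6"
```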

Scenario B: The Text Topic Shift

  1. The Setup: A user pastes a massive 5,000-line error log to debug a Python script.
  2. The Shift: After five turns, the user says, “Build a landing page for a car rental business called…”
Before Optimization: Cost: ~6,000 Input Tokens
[
  { "role": "system", "content": "You are a helpful coding assistant." },
  {
    "role": "user",
    "content": "Here is my 5,000 line Python error log: [MASSIVE_WALL_OF_TEXT...]"
  },
  {
    "role": "assistant",
    "content": "It looks like a SyntaxError on line 42. Try fixing the indentation."
  },

  // --- THE TOPIC SHIFT ---
  {
    "role": "user",
    "content": "Build a landing page for a car rental business called…"
  }
]
After Optimization: Cost: ~40 Input Tokens (Saved 99%)
[
  // 1. System Prompt is ALWAYS preserved
  { "role": "system", "content": "You are a helpful coding assistant." },

  // 2. The irrelevant deep history is PRUNED entirely

  // 3. The immediate context (the last 1-2 messages) is kept
  {
    "role": "user",
    "content": "Build a landing page for a car rental business called…"
  }
]
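The pruning shown in Scenario B can be approximated as "keep every system message, plus the last few non-system messages". The sketch below is an assumption about the shape of that step, not the router's real algorithm; `pruneHistory` and its `keepLast` parameter are hypothetical names.

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

// Keeps all system messages, then only the last `keepLast` non-system messages.
// Assumption: system messages sit at the start of the conversation, so moving
// them to the front of the result does not change their order.
function pruneHistory(messages: Message[], keepLast = 1): Message[] {
  const systemMessages = messages.filter((m) => m.role === "system");
  const others = messages.filter((m) => m.role !== "system");
  // slice(-0) would return the whole array, so handle keepLast <= 0 explicitly.
  const recent = keepLast > 0 ? others.slice(-keepLast) : [];
  return [...systemMessages, ...recent];
}

const scenarioB: Message[] = [
  { role: "system", content: "You are a helpful coding assistant." },
  { role: "user", content: "Here is my 5,000 line Python error log: [MASSIVE_WALL_OF_TEXT...]" },
  { role: "assistant", content: "It looks like a SyntaxError on line 42. Try fixing the indentation." },
  { role: "user", content: "Build a landing page for a car rental business called…" },
];

console.log(pruneHistory(scenarioB)); // system prompt plus the latest user message
```

Raising `keepLast` to 2 or 3 would retain a little immediate context at the cost of a few more input tokens.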

Scenario C: The Long Conversation

If the chat_score is borderline (e.g., 0.5), indicating the user is still on the same topic but the old messages are getting too long, LLM Router performs Middle-Out Compression on the older text blocks.
Before:
{ "role": "user", "content": "[... 2,000 lines of setup code ...]" }
After Compression:
{
  "role": "user",
  "content": "import react from 'react';\nconst App = () => {\n\n... [Middle 18,500 characters truncated to save context] ...\n\n  return <div />;\n}"
}
LLM Router keeps the top 30% and bottom 30% of the old message and drops the middle 40%, preserving the opening and closing context while cutting the token cost of the block.
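A minimal sketch of middle-out truncation, assuming the 30% figures apply to character counts (the page does not say whether the router measures characters, lines, or tokens). The `middleOutCompress` function and its `keepPercent` and `minLength` parameters are hypothetical; the marker text mirrors the example above.

```typescript
// Keeps the first and last `keepPercent` percent of `text` (by characters)
// and replaces the middle with a marker stating how much was removed.
function middleOutCompress(
  text: string,
  keepPercent = 30,
  minLength = 1000, // assumed cutoff: shorter texts are returned unchanged
): string {
  if (keepPercent <= 0 || keepPercent >= 50) {
    throw new RangeError("keepPercent must be greater than 0 and less than 50");
  }
  if (text.length < minLength) return text;

  const edgeLength = Math.floor((text.length * keepPercent) / 100);
  const removed = text.length - 2 * edgeLength;
  const head = text.slice(0, edgeLength);
  const tail = text.slice(text.length - edgeLength);
  const marker = `... [Middle ${removed.toLocaleString("en-US")} characters truncated to save context] ...`;

  return `${head}\n\n${marker}\n\n${tail}`;
}

// Usage: a 1,000-character block keeps 300 characters at each end.
const longBlock = "a".repeat(300) + "b".repeat(400) + "c".repeat(300);
console.log(middleOutCompress(longBlock));
```

Using an integer percentage instead of a fraction like 0.3 keeps the length calculation in whole numbers and avoids floating-point rounding at the boundaries.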