@tomfuertes · Created February 23, 2026
Observability for Polyglot (Sofia AI Tutor) - Prompt Tuning Focus

What's already there

The backend has solid cost and operational observability:

  • message_logs table - tokens in/out, model, cost, timestamp per call
  • daily_costs table - aggregated spend with kill switch at $100/day
  • logger.geminiCall() - logs slow requests (>1s) to console
  • Sentry - error monitoring with 20% perf sample rate

The gap: none of this captures what Sofia actually said or why a response felt wrong. You can see a call cost $0.0004 and took 1.2s - but you can't see that she opened with "¡Hola! ¡Muy bien!" when the prompt explicitly forbids that.


Why this matters for prompt tuning

The current prompt is in src/config/constants.ts → DEFAULT_SYSTEM_PROMPT. It has real specificity:

NEVER open with exclamatory praise like "¡Perfecto!", "¡Muy bien!" or greetings like "¡Hola!"
NEVER write "haha", "jaja", "lol"
No asterisks, markdown, or formatting. Plain text only.

These are clearly rules that got added because Sofia broke them in production. Without traces, the feedback loop is:

  1. User complains Sofia sounds weird
  2. You guess what happened
  3. You patch the prompt
  4. You wait for more complaints

With traces, it's:

  1. User complains
  2. You pull the exact conversation in Langfuse
  3. You see the rule that broke and when
  4. You fix it and A/B the new prompt version
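As a complement to traces, the banned-opener rules quoted above can also be checked mechanically against each response. A minimal sketch — the function name and exact rule list are hypothetical, derived from the prompt excerpt:

```typescript
// Hypothetical regression check for the style rules in DEFAULT_SYSTEM_PROMPT.
// Returns the list of rules a response violates (empty array = clean).
const BANNED_OPENERS = ['¡Perfecto!', '¡Muy bien!', '¡Hola!'];
const BANNED_TOKENS = ['haha', 'jaja', 'lol'];

function violatedStyleRules(response: string): string[] {
  const violations: string[] = [];
  const trimmed = response.trimStart();
  if (BANNED_OPENERS.some((opener) => trimmed.startsWith(opener))) {
    violations.push('exclamatory-opener');
  }
  const lower = response.toLowerCase();
  if (BANNED_TOKENS.some((token) => lower.includes(token))) {
    violations.push('banned-laughter-token');
  }
  if (/[*_#]/.test(response)) {
    violations.push('markdown-formatting');
  }
  return violations;
}
```

Run it over sampled responses, or wire it up as an automated score on each trace, to catch regressions before users complain.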

Recommended: Add Langfuse

Langfuse is purpose-built for this: the free tier is generous, it can be self-hosted, and the integration is ~20 lines.


Install

npm install langfuse

Env vars to add (.env / Vercel dashboard)

LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_BASE_URL=https://cloud.langfuse.com

Integration point

The entire integration goes in src/services/geminiService.ts, wrapping the existing chat() method. No other files need to change.

import { Langfuse } from 'langfuse';

// Add to constructor or as a module-level singleton
const langfuse = new Langfuse({
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  baseUrl: process.env.LANGFUSE_BASE_URL,
});

// In chat() / chatStream() / chatWithAudio(), wrap the fetch call:
async chat(messages, systemPrompt, userId) {
  const trace = langfuse.trace({
    name: 'sofia-chat',
    userId,
    metadata: { model: GEMINI_CONFIG.model },
  });

  const generation = trace.generation({
    name: 'gemini-completion',
    model: GEMINI_CONFIG.model,
    // No dedicated systemPrompt field; prepend it to the input messages
    input: [{ role: 'system', content: systemPrompt }, ...messages],
  });

  try {
    // ... existing fetch logic ...

    generation.end({
      output: result.message,
      usage: {
        input: result.tokensInput,
        output: result.tokensOutput,
      },
    });

    // On Vercel, flush before the serverless function exits or events are dropped
    await langfuse.flushAsync();
    return result;
  } catch (error) {
    generation.end({
      level: 'ERROR',
      statusMessage: error instanceof Error ? error.message : String(error),
    });
    await langfuse.flushAsync();
    throw error;
  }
}

Add feedback scores (optional but high value)

If the app gets a thumbs-down button, wire it to:

langfuse.score({
  traceId: traceId, // pass this back in the API response
  name: 'user-feedback',
  value: -1, // -1 bad, 1 good
});

This lets you filter Langfuse to "only show bad responses" and see exactly what prompted them.
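Wiring that up server-side could look like the sketch below. The route shape, request type, and `ScoreClient` interface are assumptions; the payload matches the `langfuse.score()` call above. The client is injected so the handler stays testable:

```typescript
// Hypothetical feedback handler (e.g. in an api/feedback.ts route).
// The real Langfuse client satisfies this minimal interface via score().
interface ScoreClient {
  score(s: { traceId: string; name: string; value: number }): void;
}

type FeedbackRequest = { traceId?: string; good?: boolean };

// Validates the request body and records a -1/1 user-feedback score.
// Returns false (HTTP 400 territory) on a malformed request.
function recordFeedback(client: ScoreClient, body: FeedbackRequest): boolean {
  if (!body.traceId || typeof body.good !== 'boolean') return false;
  client.score({
    traceId: body.traceId,
    name: 'user-feedback',
    value: body.good ? 1 : -1,
  });
  return true;
}
```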


Minimum viable alternative (no new dependency)

If adding Langfuse feels like too much right now, the smallest useful change is logging message content to a separate Supabase table:

create table conversation_samples (
  id uuid primary key default gen_random_uuid(),
  user_id text,
  session_id text,
  system_prompt text,
  user_message text,
  assistant_response text,
  tokens_input int,
  tokens_output int,
  created_at timestamptz default now()
);

Then in api/chat.ts after a successful response, insert a row. No PII beyond what's already in message_logs. This gives you a queryable history to grep when something breaks.
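The insert can stay small — a sketch, where the row-builder helper is hypothetical and the `.from(...).insert(...)` call shape is standard supabase-js v2:

```typescript
// Hypothetical helper: shape one row for the conversation_samples table above.
type ConversationSample = {
  user_id: string;
  session_id: string;
  system_prompt: string;
  user_message: string;
  assistant_response: string;
  tokens_input: number;
  tokens_output: number;
};

function buildSample(
  userId: string,
  sessionId: string,
  systemPrompt: string,
  userMessage: string,
  result: { message: string; tokensInput: number; tokensOutput: number },
): ConversationSample {
  return {
    user_id: userId,
    session_id: sessionId,
    system_prompt: systemPrompt,
    user_message: userMessage,
    assistant_response: result.message,
    tokens_input: result.tokensInput,
    tokens_output: result.tokensOutput,
  };
}

// In api/chat.ts, after a successful response (supabase-js v2):
// await supabase.from('conversation_samples').insert(buildSample(/* ...args */));
```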


What to look at first in Langfuse

Once integrated, these views will be immediately useful:

  1. Traces → filter by proficiency level - does "beginner" vs "intermediate" change response quality?
  2. Generations → sort by latency - are there prompt/message combos that consistently run slow?
  3. Prompt versions - migrate DEFAULT_SYSTEM_PROMPT out of code and into Langfuse's prompt manager. Then you can iterate without deploys.
  4. Sessions - group by userId to see full conversation arcs, not just individual turns.
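Item 3 is the highest-leverage change. Langfuse-managed prompts are templates with `{{variable}}` placeholders, fetched via `langfuse.getPrompt(name)` and filled with `prompt.compile(vars)`. While migrating, a local stand-in for the compile step (hypothetical helper) lets DEFAULT_SYSTEM_PROMPT gain placeholders without changing the call site:

```typescript
// Local stand-in for Langfuse-style {{variable}} prompt templating.
// Unknown placeholders are left visible so missing variables are easy to spot.
function compilePrompt(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, key: string) =>
    key in vars ? vars[key] : match,
  );
}
```

After the migration, the same call site would swap to the SDK: `(await langfuse.getPrompt('sofia-system')).compile(vars)` (prompt name hypothetical).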

The audio path needs extra attention

chatWithAudio() in geminiService.ts has the most complex prompt - it injects an 8-point instruction list as the user's message to force verbatim transcription and pronunciation feedback. This makes it the most likely source of weird Sofia behavior. Traces here will show whether Gemini is actually following those instructions or hallucinating transcriptions.
