Skip to content

Instantly share code, notes, and snippets.

@vgrichina
Last active January 7, 2026 00:09
Show Gist options
  • Select an option

  • Save vgrichina/54c0620ca16063a88fa44d8bf10dc94e to your computer and use it in GitHub Desktop.

Select an option

Save vgrichina/54c0620ca16063a88fa44d8bf10dc94e to your computer and use it in GitHub Desktop.
Analyze images using Gemini

Image Analysis with Bounding Boxes

Get precise bounding boxes for image elements using Gemini 3 Flash. Complements Claude's vision which can see but can't provide exact coordinates.

API Call

# Convert image to base64
IMAGE_BASE64=$(base64 -i image.png)

# Call Gemini via OpenRouter
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-3-flash-preview",
    "messages": [{
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Analyze this screenshot and identify ALL prominent UI elements. Return JSON array in this format:\n\n[{\"box_2d\": [6, 246, 386, 526], \"label\": \"Apple logo\"}, {\"box_2d\": [234, 649, 650, 863], \"label\": \"Search button\"}]\n\nRules:\n- box_2d format: [y_min, x_min, y_max, x_max] using 0-1000 normalized coordinates\n- label: describe the element\n- Return ONLY the JSON array, no markdown blocks"
        },
        {
          "type": "image_url",
          "image_url": {"url": "data:image/png;base64,'"$IMAGE_BASE64"'"}
        }
      ]
    }],
    "temperature": 0.1
  }'

Response Format

IMPORTANT: Gemini's native format uses box_2d with [y_min, x_min, y_max, x_max] order (Y before X).

[
  {
    "box_2d": [6, 246, 386, 526],
    "label": "Apple logo"
  },
  {
    "box_2d": [234, 649, 650, 863],
    "label": "Search button"
  }
]

Converting Coordinates

Gemini uses a 0-1000 normalized coordinate space for BOTH X and Y axes, regardless of image aspect ratio.

// Get actual image dimensions
// sips -g pixelWidth -g pixelHeight image.png
const actualWidth = 1906;
const actualHeight = 1360;

// Scale factors: actual pixels per normalized unit
const scaleX = actualWidth / 1000;   // 1906 / 1000 = 1.906
const scaleY = actualHeight / 1000;  // 1360 / 1000 = 1.360

// Convert box_2d to actual pixels
// box_2d format: [y_min, x_min, y_max, x_max]
const [y_min, x_min, y_max, x_max] = box_2d;

const actualX = x_min * scaleX;
const actualY = y_min * scaleY;
const actualWidth = (x_max - x_min) * scaleX;
const actualHeight = (y_max - y_min) * scaleY;

// Example: box_2d [100, 500, 200, 700]
// Means: top=100, left=500, bottom=200, right=700 in 0-1000 space
// Actual pixels: {x: 953, y: 136, width: 381.2, height: 136}

Use Cases

  • Game Dev: Sprite bounds, interactive objects, UI elements
  • Automation: Button/menu coordinates for testing
  • Data Extraction: Chart points, diagram nodes
  • Computer Vision: Object detection with locations

Notes

  • Model: Use google/gemini-3-flash-preview exactly as written
  • Auth: Requires OPENROUTER_API_KEY
  • Cost: ~$0.0002 per image
  • Coordinate Space: Always 0-1000 for both X and Y axes
  • Format: Native Gemini format is box_2d: [y_min, x_min, y_max, x_max] (Y before X!)
  • Scaling: Convert to pixels by multiplying by (actualDimension / 1000)
  • Actual Dimensions: Get with sips -g pixelWidth -g pixelHeight image.png
  • Reference: See Google Vertex AI docs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment