vgrichina/gemini-image-analysis.md

## gemini-image-analysis.md

      
Image Analysis with Bounding Boxes

Get precise bounding boxes for image elements using Gemini 3 Flash. This complements Claude's vision, which can describe what is in an image but cannot provide exact coordinates.
API Call

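# Convert image to base64 (macOS syntax; on Linux use: base64 -w 0 image.png,
# since GNU base64 wraps its output at 76 columns by default and the newlines break the JSON)
IMAGE_BASE64=$(base64 -i image.png)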

# Call Gemini via OpenRouter
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-3-flash-preview",
    "messages": [{
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Analyze this screenshot and identify ALL prominent UI elements. Return JSON array in this format:\n\n[{\"box_2d\": [6, 246, 386, 526], \"label\": \"Apple logo\"}, {\"box_2d\": [234, 649, 650, 863], \"label\": \"Search button\"}]\n\nRules:\n- box_2d format: [y_min, x_min, y_max, x_max] using 0-1000 normalized coordinates\n- label: describe the element\n- Return ONLY the JSON array, no markdown blocks"
        },
        {
          "type": "image_url",
          "image_url": {"url": "data:image/png;base64,'"$IMAGE_BASE64"'"}
        }
      ]
    }],
    "temperature": 0.1
  }'
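
The same request can also be issued from Node.js, which avoids shell quoting and command-line length limits for large images. This is a minimal sketch, assuming Node 18+ (global `fetch`); `buildRequestBody` and `analyzeImage` are hypothetical helper names, and reading the reply from `choices[0].message.content` assumes OpenRouter's OpenAI-compatible response shape.

```javascript
// Build the chat-completions payload for one PNG image.
// `buildRequestBody` is an illustrative helper name, not an SDK function.
function buildRequestBody(imageBase64, prompt) {
  return {
    model: "google/gemini-3-flash-preview",
    messages: [{
      role: "user",
      content: [
        { type: "text", text: prompt },
        { type: "image_url", image_url: { url: `data:image/png;base64,${imageBase64}` } },
      ],
    }],
    temperature: 0.1,
  };
}

// Send the request and return the model's raw text reply.
// Assumes Node 18+ and OPENROUTER_API_KEY set in the environment.
async function analyzeImage(imagePath, prompt) {
  const { readFile } = await import("node:fs/promises");
  const imageBase64 = (await readFile(imagePath)).toString("base64");

  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(buildRequestBody(imageBase64, prompt)),
  });
  if (!res.ok) throw new Error(`OpenRouter request failed: HTTP ${res.status}`);

  // Assumption: OpenAI-compatible response shape.
  const data = await res.json();
  return data.choices[0].message.content;
}
```

Calling `analyzeImage("image.png", prompt)` with the prompt text from the curl example should return the JSON array as a string.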
Response Format

IMPORTANT: Gemini's native format uses box_2d with [y_min, x_min, y_max, x_max] order (Y before X).
[
  {
    "box_2d": [6, 246, 386, 526],
    "label": "Apple logo"
  },
  {
    "box_2d": [234, 649, 650, 863],
    "label": "Search button"
  }
]
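
The model returns this array as text, and it may still wrap it in a Markdown code fence despite the prompt. The sketch below shows one way to parse and sanity-check the reply; `parseBoxes` is a hypothetical helper name, and the validation rules (four finite numbers in 0-1000, min not greater than max, string label) are assumptions derived from the format described above.

```javascript
// Parse the model's text reply into validated box objects.
// `parseBoxes` is an illustrative helper name.
function parseBoxes(replyText) {
  // Strip a surrounding Markdown code fence (three backticks, optional "json")
  // in case the model added one anyway.
  const cleaned = replyText
    .trim()
    .replace(/^`{3}(?:json)?\s*/i, "")
    .replace(/\s*`{3}$/, "");

  const parsed = JSON.parse(cleaned);
  if (!Array.isArray(parsed)) throw new Error("Expected a JSON array of boxes");

  // Keep only entries that match the expected shape.
  return parsed.filter((item) => {
    const box = item && item.box_2d;
    return (
      Array.isArray(box) &&
      box.length === 4 &&
      box.every((n) => Number.isFinite(n) && n >= 0 && n <= 1000) &&
      box[0] <= box[2] && // y_min <= y_max
      box[1] <= box[3] && // x_min <= x_max
      typeof item.label === "string"
    );
  });
}
```

Malformed entries are dropped rather than raising an error, so one bad box does not discard the rest of the result.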
Converting Coordinates

Gemini uses a 0-1000 normalized coordinate space for BOTH X and Y axes, regardless of image aspect ratio.
// Get actual image dimensions, e.g. on macOS:
// sips -g pixelWidth -g pixelHeight image.png
const imageWidth = 1906;
const imageHeight = 1360;

// Scale factors: actual pixels per normalized unit
const scaleX = imageWidth / 1000;   // 1906 / 1000 = 1.906
const scaleY = imageHeight / 1000;  // 1360 / 1000 = 1.360

// Example box from the model, in [y_min, x_min, y_max, x_max] order
const box_2d = [100, 500, 200, 700];
const [y_min, x_min, y_max, x_max] = box_2d;

// Convert to a pixel-space rectangle
// (named x/y/width/height so they do not clash with the image dimensions above)
const x = x_min * scaleX;
const y = y_min * scaleY;
const width = (x_max - x_min) * scaleX;
const height = (y_max - y_min) * scaleY;

// Example: box_2d [100, 500, 200, 700]
// Means: top=100, left=500, bottom=200, right=700 in 0-1000 space
// Actual pixels: {x: 953, y: 136, width: 381.2, height: 136}
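
The conversion can be wrapped in a small reusable function. This is a sketch rather than a definitive implementation; `toPixelRect` is a hypothetical name, and it multiplies before dividing so that the integer inputs stay exact for as long as possible.

```javascript
// Convert a Gemini box_2d ([y_min, x_min, y_max, x_max], 0-1000 space)
// into a pixel rectangle. `toPixelRect` is an illustrative helper name.
function toPixelRect(box2d, imageWidth, imageHeight) {
  const [yMin, xMin, yMax, xMax] = box2d;
  return {
    x: (xMin * imageWidth) / 1000,
    y: (yMin * imageHeight) / 1000,
    width: ((xMax - xMin) * imageWidth) / 1000,
    height: ((yMax - yMin) * imageHeight) / 1000,
  };
}

// Same example as above: a 1906x1360 image and box_2d [100, 500, 200, 700].
const rect = toPixelRect([100, 500, 200, 700], 1906, 1360);
console.log(rect); // x: 953, y: 136, width: 381.2, height: 136
```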
Use Cases


Game Dev: Sprite bounds, interactive objects, UI elements
Automation: Button/menu coordinates for testing
Data Extraction: Chart points, diagram nodes
Computer Vision: Object detection with locations
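
For the automation case, a typical step is turning a labelled box into a click target. The sketch below picks the first box whose label contains a search term and returns the pixel center of that box; `findClickPoint` is a hypothetical helper name, and the substring match on `label` is an assumption about how elements would be selected.

```javascript
// Find the first box whose label contains `query` (case-insensitive) and
// return the pixel coordinates of its center, or null if nothing matches.
// `findClickPoint` is an illustrative helper name.
function findClickPoint(boxes, query, imageWidth, imageHeight) {
  const needle = query.toLowerCase();
  const match = boxes.find((b) => b.label.toLowerCase().includes(needle));
  if (!match) return null;

  // box_2d order is [y_min, x_min, y_max, x_max] in 0-1000 space.
  const [yMin, xMin, yMax, xMax] = match.box_2d;
  return {
    x: Math.round((((xMin + xMax) / 2) * imageWidth) / 1000),
    y: Math.round((((yMin + yMax) / 2) * imageHeight) / 1000),
  };
}

const boxes = [
  { box_2d: [6, 246, 386, 526], label: "Apple logo" },
  { box_2d: [234, 649, 650, 863], label: "Search button" },
];
const point = findClickPoint(boxes, "search", 1906, 1360);
console.log(point); // x: 1441, y: 601
```

The result is in screenshot pixels; on HiDPI displays these may need scaling to logical screen coordinates before they are passed to a click API.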

Notes


Model: Use google/gemini-3-flash-preview exactly as written
Auth: Requires OPENROUTER_API_KEY
Cost: ~$0.0002 per image
Coordinate Space: Always 0-1000 for both X and Y axes
Format: Native Gemini format is box_2d: [y_min, x_min, y_max, x_max] (Y before X!)
Scaling: Convert to pixels by multiplying by (actualDimension / 1000)
Actual Dimensions: Get with sips -g pixelWidth -g pixelHeight image.png
Reference: See Google Vertex AI docs