Get precise bounding boxes for image elements using Gemini 3 Flash. This complements Claude's vision, which can describe what is in an image but cannot return exact coordinates.
# Convert image to base64 (macOS syntax; on Linux use: base64 -w 0 image.png)
IMAGE_BASE64=$(base64 -i image.png)
# Call Gemini via OpenRouter
curl https://openrouter.ai/api/v1/chat/completions \
-H "Authorization: Bearer $OPENROUTER_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemini-3-flash-preview",
"messages": [{
"role": "user",
"content": [
{
"type": "text",
"text": "Analyze this screenshot and identify ALL prominent UI elements. Return JSON array in this format:\n\n[{\"box_2d\": [6, 246, 386, 526], \"label\": \"Apple logo\"}, {\"box_2d\": [234, 649, 650, 863], \"label\": \"Search button\"}]\n\nRules:\n- box_2d format: [y_min, x_min, y_max, x_max] using 0-1000 normalized coordinates\n- label: describe the element\n- Return ONLY the JSON array, no markdown blocks"
},
{
"type": "image_url",
"image_url": {"url": "data:image/png;base64,'"$IMAGE_BASE64"'"}
}
]
}],
"temperature": 0.1
}'

IMPORTANT: Gemini's native format uses box_2d with [y_min, x_min, y_max, x_max] order (Y before X).
[
{
"box_2d": [6, 246, 386, 526],
"label": "Apple logo"
},
{
"box_2d": [234, 649, 650, 863],
"label": "Search button"
}
]

Gemini uses a 0-1000 normalized coordinate space for BOTH X and Y axes, regardless of image aspect ratio.
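
Before converting coordinates, it can help to check a reply against the format and range described above. The following is a minimal sketch, not part of the original workflow: `parseBoxes` is a hypothetical helper name, and it assumes the model's reply is available as a plain string. It strips a stray markdown fence defensively, in case the model adds one despite the prompt.

```javascript
// Hypothetical helper: parse the model's text reply into validated boxes.
// Assumes the reply is the JSON array requested in the prompt.
function parseBoxes(replyText) {
  // Remove a leading/trailing markdown fence if the model added one anyway.
  const cleaned = replyText
    .trim()
    .replace(/^`{3}(?:json)?\s*/i, "")
    .replace(/\s*`{3}$/, "");

  const parsed = JSON.parse(cleaned);
  if (!Array.isArray(parsed)) {
    throw new Error("Expected a JSON array of detections");
  }

  return parsed.map((item, index) => {
    const box = item.box_2d;
    const isValidBox =
      Array.isArray(box) &&
      box.length === 4 &&
      box.every((v) => Number.isFinite(v) && v >= 0 && v <= 1000);
    if (!isValidBox) {
      throw new Error("Detection " + index + " has an invalid box_2d");
    }

    // Native Gemini order: Y before X.
    const [yMin, xMin, yMax, xMax] = box;
    if (yMin > yMax || xMin > xMax) {
      throw new Error("Detection " + index + " has min greater than max");
    }
    return { label: String(item.label ?? ""), yMin, xMin, yMax, xMax };
  });
}
```

Naming the fields (`yMin`, `xMin`, ...) right after parsing makes it harder to mix up the Y-before-X order later on.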
// Get actual image dimensions, e.g. with:
// sips -g pixelWidth -g pixelHeight image.png
const imageWidth = 1906;
const imageHeight = 1360;
// Scale factors: actual pixels per normalized unit
const scaleX = imageWidth / 1000; // 1906 / 1000 = 1.906
const scaleY = imageHeight / 1000; // 1360 / 1000 = 1.360
// Convert box_2d to actual pixels
// box_2d format: [y_min, x_min, y_max, x_max]
const box_2d = [100, 500, 200, 700]; // sample box; replace with a box from the response
const [y_min, x_min, y_max, x_max] = box_2d;
const actualX = x_min * scaleX;
const actualY = y_min * scaleY;
const actualWidth = (x_max - x_min) * scaleX;
const actualHeight = (y_max - y_min) * scaleY;
// Example: box_2d [100, 500, 200, 700]
// Means: top=100, left=500, bottom=200, right=700 in 0-1000 space
// Actual pixels: {x: 953, y: 136, width: 381.2, height: 136}

- Game Dev: Sprite bounds, interactive objects, UI elements
- Automation: Button/menu coordinates for testing
- Data Extraction: Chart points, diagram nodes
- Computer Vision: Object detection with locations
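
For the automation use case, a test usually needs a single click point rather than a rectangle. The sketch below shows one way to derive it. `boxCenterInPixels` is a hypothetical helper name, and rounding to whole pixels is an assumption that suits most click APIs.

```javascript
// Hypothetical helper: center of a box_2d in actual image pixels.
// box2d is [y_min, x_min, y_max, x_max] in the 0-1000 normalized space.
function boxCenterInPixels(box2d, imageWidth, imageHeight) {
  const [yMin, xMin, yMax, xMax] = box2d; // Y before X
  const centerX = ((xMin + xMax) / 2) * (imageWidth / 1000);
  const centerY = ((yMin + yMax) / 2) * (imageHeight / 1000);
  return { x: Math.round(centerX), y: Math.round(centerY) };
}

// "Search button" box from the sample response, on a 1906x1360 screenshot.
const clickPoint = boxCenterInPixels([234, 649, 650, 863], 1906, 1360);
console.log(clickPoint); // { x: 1441, y: 601 }
```

If the screenshot was captured on a high-DPI display, the click API may expect logical points rather than image pixels, so an extra division by the display scale factor could be needed.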
- Model: Use `google/gemini-3-flash-preview` exactly as written
- Auth: Requires `OPENROUTER_API_KEY`
- Cost: ~$0.0002 per image
- Coordinate Space: Always 0-1000 for both X and Y axes
- Format: Native Gemini format is `box_2d: [y_min, x_min, y_max, x_max]` (Y before X!)
- Scaling: Convert to pixels by multiplying by `(actualDimension / 1000)`
- Actual Dimensions: Get with `sips -g pixelWidth -g pixelHeight image.png`
- Reference: See Google Vertex AI docs
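
The same request can also be assembled in JavaScript, which avoids the shell quoting around the base64 string. This is a rough sketch under several assumptions: `buildRequest` and `detectElements` are hypothetical names, `fetch` is the global available in Node 18+, and the reply is assumed to follow the OpenAI-compatible shape (`choices[0].message.content`).

```javascript
const OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions";
const MODEL_ID = "google/gemini-3-flash-preview"; // use exactly as written

// Hypothetical helper: build the fetch arguments for one image.
function buildRequest(imageBase64, promptText, apiKey) {
  return {
    url: OPENROUTER_URL,
    options: {
      method: "POST",
      headers: {
        Authorization: "Bearer " + apiKey,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: MODEL_ID,
        messages: [
          {
            role: "user",
            content: [
              { type: "text", text: promptText },
              {
                type: "image_url",
                image_url: { url: "data:image/png;base64," + imageBase64 },
              },
            ],
          },
        ],
        temperature: 0.1,
      }),
    },
  };
}

// Hypothetical wrapper: send the request and return the model's text reply.
// Assumes OPENROUTER_API_KEY is set in the environment.
async function detectElements(imageBase64, promptText) {
  const { url, options } = buildRequest(
    imageBase64,
    promptText,
    process.env.OPENROUTER_API_KEY
  );
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error("OpenRouter request failed with status " + response.status);
  }
  const data = await response.json();
  return data.choices[0].message.content; // assumed OpenAI-compatible shape
}
```

Keeping `buildRequest` free of network calls makes the payload easy to inspect or unit test before spending API credits.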