@esafwan
Last active May 17, 2025 20:11
Mistral OCR
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "1Fzjyb9daweA"
},
"source": [
"# OCR Cookbook\n",
"\n",
"---\n",
"\n",
"## Enable Document Understanding for Any Model with OCR\n",
"\n",
"Optical Character Recognition (OCR) transforms text-based documents and images into pure text outputs and markdown. By leveraging this feature, you can enable any Large Language Model (LLM) to reliably understand documents efficiently and cost-effectively.\n",
"\n",
"In this guide, we will demonstrate how to use OCR with our models to discuss any text-based document, whether it's a PDF, photo, or screenshot, via URLs.\n",
"\n",
"---\n",
"\n",
"### 2 Methods\n",
"We will explore two methods. One will leverage [Tool Usage](https://docs.mistral.ai/capabilities/function_calling/) to open any URL on demand by the user. The second approach will make use of our built-in feature that leverages OCR, we will extract the URLs with regex and call our models with this feature.\n",
"\n",
"- [Tool Usage](#scrollTo=KvEQoe7Y9-um)\n",
"- [Built-In](#scrollTo=nKJoY5asORZq)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KvEQoe7Y9-um"
},
"source": [
"## Tool Usage\n",
"The first method we will explore will leverage tool usage.\n",
"\n",
"To achieve this, we will first send our question, which may or may not include URLs pointing to documents that we want to perform OCR on. Mistral Small will then decide, using the `open_urls` tool ( extracting the URLs directly ), whether it needs to perform OCR on any URL or if it can directly answer the question."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Sf84okJJmm7M"
},
"source": [
"### Setup\n",
"First, let's install `mistralai`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "X1EBW_a6gRUD",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "3d1c72f7-7eb5-44b0-c898-ce3d45d2dbcb"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Collecting mistralai\n",
" Downloading mistralai-1.5.0-py3-none-any.whl.metadata (29 kB)\n",
"Collecting eval-type-backport>=0.2.0 (from mistralai)\n",
" Downloading eval_type_backport-0.2.2-py3-none-any.whl.metadata (2.2 kB)\n",
"Requirement already satisfied: httpx>=0.27.0 in /usr/local/lib/python3.11/dist-packages (from mistralai) (0.28.1)\n",
"Collecting jsonpath-python>=1.0.6 (from mistralai)\n",
" Downloading jsonpath_python-1.0.6-py3-none-any.whl.metadata (12 kB)\n",
"Requirement already satisfied: pydantic>=2.9.0 in /usr/local/lib/python3.11/dist-packages (from mistralai) (2.10.6)\n",
"Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from mistralai) (2.8.2)\n",
"Collecting typing-inspect>=0.9.0 (from mistralai)\n",
" Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)\n",
"Requirement already satisfied: anyio in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->mistralai) (3.7.1)\n",
"Requirement already satisfied: certifi in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->mistralai) (2025.1.31)\n",
"Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->mistralai) (1.0.7)\n",
"Requirement already satisfied: idna in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->mistralai) (3.10)\n",
"Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.11/dist-packages (from httpcore==1.*->httpx>=0.27.0->mistralai) (0.14.0)\n",
"Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.9.0->mistralai) (0.7.0)\n",
"Requirement already satisfied: pydantic-core==2.27.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.9.0->mistralai) (2.27.2)\n",
"Requirement already satisfied: typing-extensions>=4.12.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.9.0->mistralai) (4.12.2)\n",
"Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.8.2->mistralai) (1.17.0)\n",
"Collecting mypy-extensions>=0.3.0 (from typing-inspect>=0.9.0->mistralai)\n",
" Downloading mypy_extensions-1.0.0-py3-none-any.whl.metadata (1.1 kB)\n",
"Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.11/dist-packages (from anyio->httpx>=0.27.0->mistralai) (1.3.1)\n",
"Downloading mistralai-1.5.0-py3-none-any.whl (271 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m271.6/271.6 kB\u001b[0m \u001b[31m14.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hDownloading eval_type_backport-0.2.2-py3-none-any.whl (5.8 kB)\n",
"Downloading jsonpath_python-1.0.6-py3-none-any.whl (7.6 kB)\n",
"Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB)\n",
"Downloading mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)\n",
"Installing collected packages: mypy-extensions, jsonpath-python, eval-type-backport, typing-inspect, mistralai\n",
"Successfully installed eval-type-backport-0.2.2 jsonpath-python-1.0.6 mistralai-1.5.0 mypy-extensions-1.0.0 typing-inspect-0.9.0\n"
]
}
],
"source": [
"!pip install mistralai"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nTpiGWkpmvSb"
},
"source": [
"We can now set up our client. You can create an API key on our [Plateforme](https://console.mistral.ai/api-keys/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "AwG2kwfTlbW1"
},
"outputs": [],
"source": [
"from mistralai import Mistral\n",
"\n",
"api_key = \"API_KEY\"\n",
"client = Mistral(api_key=api_key)\n",
"text_model = \"mistral-small-latest\"\n",
"ocr_model = \"mistral-ocr-latest\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "F35KDRN-nEMv"
},
"source": [
"### System and Tool\n",
"For the model to be aware of its purpose and what it can do, it's important to provide a clear system prompt with instructions and explanations of any tools it may have access to.\n",
"\n",
"Let's define a system prompt and the tools it will have access to, in this case, `open_urls`.\n",
"\n",
"*Note: `open_urls` can easily be customized with other resources and models ( for summarization, for example ) and many other features. In this demo, we are going for a simpler approach.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "zzgxk6qTgGU9"
},
"outputs": [],
"source": [
"system = \"\"\"You are an AI Assistant with document understanding via URLs. You will be provided with URLs, and you must answer any questions related to those documents.\n",
"\n",
"# OPEN URLS INSTRUCTIONS\n",
"You can open URLs by using the `open_urls` tool. It will open webpages and apply OCR to them, retrieving the contents. Use those contents to answer the user.\n",
"Only URLs pointing to PDFs and images are supported; you may encounter an error if they are not; provide that information to the user if required.\"\"\""
]
},
{
"cell_type": "code",
"source": [
"def _perform_ocr(url: str) -> str:\n",
" try: # Apply OCR to the PDF URL\n",
" response = client.ocr.process(\n",
" model=ocr_model,\n",
" document={\n",
" \"type\": \"document_url\",\n",
" \"document_url\": url\n",
" }\n",
" )\n",
" except Exception:\n",
" try: # IF PDF OCR fails, try Image OCR\n",
" response = client.ocr.process(\n",
" model=ocr_model,\n",
" document={\n",
" \"type\": \"image_url\",\n",
" \"image_url\": url\n",
" }\n",
" )\n",
" except Exception as e:\n",
" return e # Return the error to the model if it fails, otherwise return the contents\n",
" return \"\\n\\n\".join([f\"### Page {i+1}\\n{response.pages[i].markdown}\" for i in range(len(response.pages))])"
],
"metadata": {
"id": "SxP7DlEHWXnK"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"def open_urls(urls: list) -> str:\n",
" contents = \"# Documents\"\n",
" for url in urls:\n",
" contents += f\"\\n\\n## URL: {url}\\n{_perform_ocr(url)}\"\n",
" return contents"
],
"metadata": {
"id": "s9PgX9fqWY1m"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "BagY4xg0nSSg"
},
"source": [
"We also have to define the Tool Schema that will be provided to our API and model.\n",
"\n",
"By following the [documentation](https://docs.mistral.ai/capabilities/function_calling/), we can create something like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "hpBKzNOfliQr"
},
"outputs": [],
"source": [
"tools = [\n",
" {\n",
" \"type\": \"function\",\n",
" \"function\": {\n",
" \"name\": \"open_urls\",\n",
" \"description\": \"Open URLs websites (PDFs and Images) and perform OCR on them.\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"urls\": {\n",
" \"type\": \"array\",\n",
" \"description\": \"The URLs list.\",\n",
" }\n",
" },\n",
" \"required\": [\"urls\"],\n",
" },\n",
" },\n",
" },\n",
"]"
]
},
{
"cell_type": "code",
"source": [
"names_to_functions = {\n",
" 'open_urls': open_urls\n",
"}"
],
"metadata": {
"id": "DqalxqIWWVL1"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "W6bE08lPngrm"
},
"source": [
"### Test\n",
"Everything is ready; we can quickly create a while loop to chat with our model directly in the console.\n",
"\n",
"The model will use `open_urls` each time URLs are mentioned. If they are PDFs or photos, it will perform OCR and provide the raw text contents to the model, which will then use them to answer the user.\n",
"\n",
"#### Example Prompts ( PDF & Image )\n",
"- Could you summarize what this research paper talks about? https://arxiv.org/pdf/2410.07073\n",
"- What is written here: https://jeroen.github.io/images/testocr.png"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 371
},
"id": "pVeVmWn_ljRo",
"outputId": "7fe77386-cb5d-43f4-8bae-ea7b41ddb01a"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Assistant > The research paper titled \"Pixtral 12B\" introduces a 12-billion-parameter multimodal language model designed to understand both natural images and documents. The model is trained on a large-scale dataset of interleaved image and text documents, enabling it to perform multi-turn, multi-image conversations. Pixtral 12B is built on a transformer architecture and includes a new vision encoder, PixtralViT, which allows it to process images at their native resolution and aspect ratio. This flexibility is achieved through a novel RoPE-2D implementation, which supports variable image sizes and aspect ratios without the need for interpolation.\n",
"\n",
"The model's performance is evaluated on various multimodal benchmarks, where it outperforms other open-source models of similar sizes, such as Qwen-2-VL 7B and Llama-3.2 11B. Pixtral 12B also matches or exceeds the performance of much larger models like Llama-3.2 90B and closed-source models like Claude-3 Haiku and Gemini-1.5 Flash 8B. The paper introduces a new benchmark, MM-MT-Bench, designed to evaluate multimodal models in practical scenarios, and provides detailed analysis and code for standardized evaluation protocols.\n",
"\n",
"The architecture of Pixtral 12B consists of a multimodal decoder and a vision encoder. The vision encoder, PixtralViT, is trained from scratch and includes several key features such as break tokens, gating in the feedforward layer, sequence packing, and RoPE-2D for relative position encoding. The model is evaluated under various prompts and metrics, demonstrating its robustness and flexibility in handling different types of multimodal tasks.\n",
"\n",
"The paper also discusses the importance of standardized evaluation protocols and the impact of prompt design on model performance. It highlights that Pixtral 12B performs well under both 'Explicit' and 'Naive' prompts, with only minor regressions on specific benchmarks. The model's performance is further analyzed under flexible parsing constraints, showing that it benefits very little from relaxed metrics and continues to lead even when flexible parsing is accounted for.\n",
"\n",
"In summary, Pixtral 12B is a state-of-the-art multimodal model that excels in both text-only and multimodal tasks. Its novel architecture, flexibility in processing images, and strong performance across various benchmarks make it a versatile tool for complex multimodal applications. The model is released under the Apache 2.0 license, making it accessible for further research and development.\n",
"Assistant > The text written on the image is:\n",
"\n",
"\"This is a lot of 12 point text to test the ocr code and see if it works on all types of file format. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox.\"\n",
"Assistant > You're welcome! If you have any more questions or need further assistance, feel free to ask.\n"
]
}
],
"source": [
"import json\n",
"\n",
"messages = [{\"role\": \"system\", \"content\": system}]\n",
"while True:\n",
" # Insert user input, quit if desired\n",
" user_input = input(\"User > \")\n",
" if user_input == \"quit\":\n",
" break\n",
" messages.append({\"role\": \"user\", \"content\": user_input})\n",
"\n",
" # Loop Mistral Small tool use until no tool called\n",
" while True:\n",
" response = client.chat.complete(\n",
" model = text_model,\n",
" messages = messages,\n",
" temperature = 0,\n",
" tools = tools\n",
" )\n",
" messages.append({\"role\":\"assistant\", \"content\": response.choices[0].message.content, \"tool_calls\": response.choices[0].message.tool_calls})\n",
"\n",
" # If tool called, run tool and continue, else break loop and reply\n",
" if response.choices[0].message.tool_calls:\n",
" tool_call = response.choices[0].message.tool_calls[0]\n",
" function_name = tool_call.function.name\n",
" function_params = json.loads(tool_call.function.arguments)\n",
" function_result = names_to_functions[function_name](**function_params)\n",
" messages.append({\"role\":\"tool\", \"name\":function_name, \"content\":function_result, \"tool_call_id\":tool_call.id})\n",
" else:\n",
" break\n",
"\n",
" print(\"Assistant >\", response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"source": [
"## Built-In\n",
"Mistral provides a built-in feature that leverages OCR with all models. By providing a URL pointing to a document, you can extract text data that will be provided to the model.\n",
"\n",
"Following, there is a simple, quick, example of how to make use of this feature by extracting PDF URLs with regex and uploading them as a `document_url`."
],
"metadata": {
"id": "nKJoY5asORZq"
}
},
{
"cell_type": "markdown",
"source": [
"### System and Regex\n",
"Let's define a simple system prompt, since there is no tool call required for this demo we can be fairly straightforward."
],
"metadata": {
"id": "T7CvWtw9jfR7"
}
},
{
"cell_type": "code",
"source": [
"system = \"You are an AI Assistant with document understanding via URLs. You may be provided with URLs, followed by their corresponding OCR.\""
],
"metadata": {
"id": "Mkmw1FyGQpl3"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"To extract the URLs, we will use regex to extract any URL pattern from the user query.\n",
"\n",
"*Note: We will assume there will only be PDF files for simplicity.*"
],
"metadata": {
"id": "35yYt9asjoIa"
}
},
{
"cell_type": "code",
"source": [
"import re\n",
"\n",
"def extract_urls(text: str) -> list:\n",
" url_pattern = r'\\b((?:https?|ftp)://(?:www\\.)?[^\\s/$.?#].[^\\s]*)\\b'\n",
" urls = re.findall(url_pattern, text)\n",
" return urls"
],
"metadata": {
"id": "vLMw8Z8fOT19"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Test\n",
"We can now try it out, we setup so that for each query all urls are extracted and added to the query properly.\n",
"\n",
"#### Example Prompts ( PDFs )\n",
"- Could you summarize what this research paper talks about? https://arxiv.org/pdf/2410.07073\n",
"- Explain this architecture: https://arxiv.org/abs/2401.04088"
],
"metadata": {
"id": "gsRD_4mJjz7-"
}
},
{
"cell_type": "code",
"source": [
"import json\n",
"\n",
"messages = [{\"role\": \"system\", \"content\": system}]\n",
"while True:\n",
" user_input = input(\"User > \")\n",
" if user_input.lower() == \"quit\":\n",
" break\n",
"\n",
" # Extract URLs from the user input, assuming they are always PDFs\n",
" document_urls = extract_urls(user_input)\n",
" user_message_content = [{\"type\": \"text\", \"text\": user_input}]\n",
" for url in document_urls:\n",
" user_message_content.append({\"type\": \"document_url\", \"document_url\": url})\n",
" messages.append({\"role\": \"user\", \"content\": user_message_content})\n",
"\n",
" # Send the messages to the model and get a response\n",
" response = client.chat.complete(\n",
" model=text_model,\n",
" messages=messages,\n",
" temperature=0\n",
" )\n",
" messages.append({\"role\": \"assistant\", \"content\": response.choices[0].message.content})\n",
"\n",
" print(\"Assistant >\", response.choices[0].message.content)\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Eell9TZ7Oapq",
"outputId": "4ea0043c-4411-43f7-a036-77d3a11cee8f"
},
"execution_count": null,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"User > hi\n",
"Assistant > Hello! How can I assist you today? If you have any documents or URLs you'd like me to help with, feel free to share them.\n",
"User > quit\n"
]
}
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
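
The tool-usage loop in the notebook above runs the tool the model asks for and feeds the result back as a `tool` message. The sketch below isolates that dispatch step so it can be tried without an API key. `run_tool_calls`, `fake_open_urls`, and the `SimpleNamespace` stand-ins for response objects are hypothetical names introduced for illustration; it also assumes the arguments may arrive either as a JSON string or as an already-parsed dict.

```python
import json
from types import SimpleNamespace


def run_tool_calls(tool_calls, names_to_functions):
    """Run each requested tool and return one `tool` message per call.

    Hypothetical helper: `tool_calls` is expected to look like the
    `message.tool_calls` list used in the notebook loop.
    """
    tool_messages = []
    for call in tool_calls or []:
        name = call.function.name
        func = names_to_functions.get(name)
        if func is None:
            # Report unknown tools as text so the model can react to it
            result = f"Error: unknown tool '{name}'"
        else:
            try:
                raw_args = call.function.arguments
                # Assumption: arguments are a JSON string or already a dict
                args = raw_args if isinstance(raw_args, dict) else json.loads(raw_args)
                result = str(func(**args))
            except Exception as exc:
                result = f"Error while running '{name}': {exc}"
        tool_messages.append(
            {"role": "tool", "name": name, "content": result, "tool_call_id": call.id}
        )
    return tool_messages


def fake_open_urls(urls):
    """Stand-in for `open_urls` that performs no network calls."""
    return "# Documents\n" + "\n".join(urls)


# Fake tool calls shaped like the objects the notebook loop reads
calls = [
    SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(
            name="open_urls",
            arguments='{"urls": ["https://example.com/a.pdf"]}',
        ),
    ),
    SimpleNamespace(
        id="call_2",
        function=SimpleNamespace(name="missing_tool", arguments="{}"),
    ),
]

tool_messages = run_tool_calls(calls, {"open_urls": fake_open_urls})
for message in tool_messages:
    print(message["tool_call_id"], "->", message["content"])
```

Returning errors as message content, rather than raising, keeps the conversation going and lets the model explain the failure to the user.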

Mistral Generic API Calls:

curl --location "https://api.mistral.ai/v1/embeddings" \
     --header 'Content-Type: application/json' \
     --header 'Accept: application/json' \
     --header "Authorization: Bearer $MISTRAL_API_KEY" \
     --data '{
    "model": "mistral-embed",
    "input": ["Embed this sentence.", "As well as this one."]
  }'
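
For readers who prefer Python to curl, here is a minimal sketch that builds the same embeddings request with the standard library only. The helper name `build_embeddings_request` and the placeholder key are assumptions for illustration; the request is constructed but not sent.

```python
import json
import urllib.request

EMBEDDINGS_URL = "https://api.mistral.ai/v1/embeddings"


def build_embeddings_request(api_key, texts, model="mistral-embed"):
    """Build (but do not send) the POST request shown in the curl call."""
    body = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        EMBEDDINGS_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# "YOUR_API_KEY" is a placeholder, not a real key
request = build_embeddings_request(
    "YOUR_API_KEY", ["Embed this sentence.", "As well as this one."]
)
print(request.get_method(), request.full_url)

# To actually send it (requires a valid key and network access):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```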

Mistral OCR:

OCR and Document Understanding

Document OCR processor

The Document OCR (Optical Character Recognition) processor, powered by our latest OCR model mistral-ocr-latest, enables you to extract text and structured content from PDF documents.

Key features:

  • Extracts text content while maintaining document structure and hierarchy
  • Preserves formatting like headers, paragraphs, lists and tables
  • Returns results in markdown format for easy parsing and rendering
  • Handles complex layouts including multi-column text and mixed content
  • Processes documents at scale with high accuracy
  • Supports multiple document formats including PDF, images, and uploaded documents

The OCR processor returns both the extracted text content and metadata about the document structure, making it easy to work with the recognized content programmatically.

curl https://api.mistral.ai/v1/ocr \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
  -d '{
    "model": "mistral-ocr-latest",
    "document": {
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    "include_image_base64": true
  }' -o ocr_output.json

Or via base64:

curl https://api.mistral.ai/v1/ocr \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
  -d '{
    "model": "mistral-ocr-latest",
    "document": {
        "type": "document_url",
        "document_url": "data:application/pdf;base64,<base64_pdf>"
    },
    "include_image_base64": true
  }' -o ocr_output.json
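
The `<base64_pdf>` placeholder above has to be filled with the base64-encoded bytes of the PDF. A minimal sketch of producing that data URI in Python follows; the helper names `pdf_bytes_to_data_uri` and `build_ocr_document` are hypothetical, and the sample bytes stand in for a real file read with `open(path, "rb").read()`.

```python
import base64


def pdf_bytes_to_data_uri(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as a `data:application/pdf;base64,...` URI."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return f"data:application/pdf;base64,{encoded}"


def build_ocr_document(pdf_bytes: bytes) -> dict:
    """Build the `document` field used in the OCR request body."""
    return {"type": "document_url", "document_url": pdf_bytes_to_data_uri(pdf_bytes)}


# Placeholder bytes; in practice, read them from a local PDF file
sample_pdf = b"%PDF-1.4 demo"
document = build_ocr_document(sample_pdf)
print(document["document_url"][:40])
```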

Output Example:

{
    "pages": [
        {
            "index": 1,
            "markdown": "# d data from the target distribution, that is comparatively abundant, to predict model performance. Note that in this work, our focus is not to improve performance on the target but, rather, to estimate the accuracy on the target for a given classifier.\n\n[^0]\n[^0]:    * Work done in part while Saurabh Garg was interning at Google\n    ${ }^{1}$ Code is available at https://github.com/saurabhgarg1996/ATC_code.",
            "images": [],
            "dimensions": {
                "dpi": 200,
                "height": 2200,
                "width": 1700
            }
        },
        {
            "index": 2,
            "markdown": "![img-0.jpeg](img-0.jpeg)\n\nFigure 1: Illustration of our proposed method ATC. Left: using source domain validation data, we identify a threshold on a score (e.g. negative entropy) computed on model confidence such that fraction of examples abovey, our work takes a step forward in positively answering the question raised in Deng \\& Zheng (2021); Deng et al. (2021) about a practical strategy to select a threshold that enables accuracy prediction with thresholded model confidence.",
            "images": [
                {
                    "id": "img-0.jpeg",
                    "top_left_x": 292,
                    "top_left_y": 217,
                    "bottom_right_x": 1405,
                    "bottom_right_y": 649,
                    "image_base64": "..."
                }
            ],
            "dimensions": {
                "dpi": 200,
                "height": 2200,
                "width": 1700
            }
        },
        {
            "index": 3,
            "markdown": "ATC is simple to implement with existing frameworks, compatible with arbitrary model classes, and dominates other contemporary methods. Across several model architecturless, in our work, we only assume access to labeled data from the source domain presuming no access to labeled target domains or information about how to simulate them.",
            "images": [],
            "dimensions": {
                "dpi": 200,
                "height": 2200,
                "width": 1700
            }
        },
        {
            "index": 4,
            "markdown": "Moreover, unlike the parallel work of Deng et al. (2021), we do not focus on methods that alter the training on source data to aid accuracy prediction on the target data. Chen et al. (2021b) propose an importance re-weighting based approach that leverages (additional) information about the axis along which distribution is shifting in formwhere we use FCN. Across all datasets, we observe that ATC achieves superior performance (lower MAE is better). For GDE post T and pre T estimates match since TS doesn't alter the argmax prediction. Results reported by aggregating MAE numbers over 4 different seeds. Values in parenthesis (i.e., $(\\cdot)$ ) denote standard deviation values.",
            "images": [],
            "dimensions": {
                "dpi": 200,
                "height": 2200,
                "width": 1700
            }
        },
        {
            "index": 5,
            "markdown": "| Dataset | Shift | IM |  | AC |  | DOC |  | GDE | ATC-MC (Ours) |  | ATC-NE (Ours) |  |\n| :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |\n|  |  | Pre T | Post T | Pre T | Post T | Pre T | Post T | Post T | Pre T | Post T | Pre T | Post T |\n| CIFAR10 | Natural | 7.14 | 6.20 | 10.25 | 7.06 | 7.68 | 6.35 | 5.74 | 4.02 | 3.85 | 3.76 | 3.38 |\n|  |  | (0.14) | (0.11) | (0.31) | (0.33) | (0.28) | (0.27) | (0.25) | (0.38) | (0.30) | (0.33) | (0.32) |\n|  | Synthetic | 12.62 | 10.75 | 16.50 | 11.91 | 13.93 | 11.20 | 7.97 | 5.66 | 5.03 | 4.87 | 3.63 |\n|  |  | (0.76) | (0.71) | (0.28) | (0.24) | (0.29) | (0.28) | (0.13) | (0.64) | (0.71) | (0.71) | (0.62) |\n| CIFAR100 | Synthetic | 12.77 | 12.34 | 16.89 | 12.73 | 11.18 | 9.63 | 12.00 | 5.61 | 5.55 | 5.65 | 5.76 |\n|  |  | (0.43) | (0.68) | (0.20) | (2.59) | (0.35) | (1.25) | (0.48) | (0.51) | (0.55) | (0.35) | (0.27) |\n| ImageNet200 | Natural | 12.63 | 7.99 | 23.08 | 7.22 | 15.40 | 6.33 | 5.00 | 4.60 | 1.80 | 4.06 | 1.38 |\n|  |  | (0.59) | (0.47) | (0.31) | (0.22) | (0.42) | (0.24) | (0.36) | (0.63) | (0.17) | (0.69) | (0.29) |\n|  | Synthetic | 20.17 | 11.74 | 33.69 | 9.51 | 25.49 | 8.61 | 4.19 | 5.37 | 2.78 | 4.53 | 3.58 |\n|  |  | (0.74) | (0.80) | (0.73) | (0.51) | (0.66) | (0.50) | (0.14) | (0.88) | (0.23) | (0.79) | (0.33) |\n| ImageNet | Natural | 8.09 | 6.42 | 21.66 | 5.91 | 8.53 | 5.21 | 5.90 | 3.93 | 1.89 | 2.45 | 0.73 |\n|  |  | (0.25) | (0.28) | (0.38) | (0.22) | (0.26) | (0.25) | (0.44) | (0.26) | (0.21) | (0.16) | (0.10) |\n|  | Synthetic | 13.93 | 9.90 | 28.05 | 7.56 | 13.82 | 6.19 | 6.70 | 3.33 | 2.55 | 2.12 | 5.06 |\n|  |  | (0.14) | (0.23) | (0.39) | (0.13) | (0.31) | (0.07) | (0.52) | (0.25) | (0.25) | (0.31) | (0.27) |\n| FMoW-WILDS | Natural | 5.15 | 3.55 | 34.64 | 5.03 | 5.58 | 3.46 | 5.08 | 2.59 | 2.33 | 2.52 | 2.22 |\n|  |  | (0.19) | (0.41) | (0.22) | (0.29) | (0.17) | (0.37) | (0.46) | (0.32) | (0.28) | (0.25) | (0.30) |\n| 
RxRx1-WILDS | Natural | 6.17 | 6.11 | 21.05 | 5.21 | 6.54 | 6.27 | 6.82 | 5.30 | 5.20 | 5.19 | 5.63 |\n|  |  | (0.20) | (0.24) | (0.31) | (0.18) | (0.21) | (0.20) | (0.31) | (0.30) | (0.44) | (0.43) | (0.55) |\n| Entity-13 | Same | 18.32 | 14.38 | 27.79 | 13.56 | 20.50 | 13.22 | 16.09 | 9.35 | 7.50 | 7.80 | 6.94 |\n|  |  | (0.29) | (0.53) | (1.18) | (0.58) | (0.47) | (0.58) | (0.84) | (0.79) | (0.65) | (0.62) | (0.71) |\n|  | Novel | 28.82 | 24.03 | 38.97 | 22.96 | 31.66 | 22.61 | 25.26 | 17.11 | 13.96 | 14.75 | 9.94 |\n|  |  | (0.30) | (0.55) | (1.32) | (0.59) | (0.54) | (0.58) | (1.08) | (0.93) | (0.64) | (0.78) |  |\n| Entity-30 | Same | 16.91 | 14.61 | 26.84 | 14.37 | 18.60 | 13.11 | 13.74 | 8.54 | 7.94 | 7.77 | 8.04 |\n|  |  | (1.33) | (1.11) | (2.15) | (1.34) | (1.69) | (1.30) | (1.07) | (1.47) | (1.38) | (1.44) | (1.51) |\n|  | Novel | 28.66 | 25.83 | 39.21 | 25.03 | 30.95 | 23.73 | 23.15 | 15.57 | 13.24 | 12.44 | 11.05 |\n|  |  | (1.16) | (0.88) | (2.03) | (1.11) | (1.64) | (1.11) | (0.51) | (1.44) | (1.15) | (1.26) | (1.13) |\n| NonLIVING-26 | Same | 17.43 | 15.95 | 27.70 | 15.40 | 18.06 | 14.58 | 16.99 | 10.79 | 10.13 | 10.05 | 10.29 |\n|  |  | (0.90) | (0.86) | (0.90) | (0.69) | (1.00) | (0.78) | (1.25) | (0.62) | (0.32) | (0.46) | (0.79) |\n|  | Novel | 29.51 | 27.75 | 40.02 | 26.77 | 30.36 | 25.93 | 27.70 | 19.64 | 17.75 | 16.90 | 15.69 |\n|  |  | (0.86) | (0.82) | (0.76) | (0.82) | (0.95) | (0.80) | (1.42) | (0.68) | (0.53) | (0.60) | (0.83) |\n| LIVING-17 | Same | 14.28 | 12.21 | 23.46 | 11.16 | 15.22 | 10.78 | 10.49 | 4.92 | 4.23 | 4.19 | 4.73 |\n|  |  | (0.96) | (0.93) | (1.16) | (0.90) | (0.96) | (0.99) | (0.97) | (0.57) | (0.42) | (0.35) | (0.24) |\n|  | Novel | 28.91 | 26.35 | 38.62 | 24.91 | 30.32 | 24.52 | 22.49 | 15.42 | 13.02 | 12.29 | 10.34 |\n|  |  | (0.66) | (0.73) | (1.01) | (0.61) | (0.59) | (0.74) | (0.85) | (0.59) | (0.53) | (0.73) | (0.62) |\n\nTable 4: Mean Absolute estimation Error (MAE) results for different datasets in our setup 
grouped by the nature of shift for ResNet model. 'Same' refers to same subpopulation shifts and 'Novel' refers novel subpopulation shifts. We include details about the target sets considered in each shift in Table 2. Post T denotes use of TS calibration on source. Across all datasets, we observe that ATC achieves superior performance (lower MAE is better). For GDE post T and pre T estimates match since TS doesn't alter the argmax prediction. Results reported by aggregating MAE numbers over 4 different seeds. Values in parenthesis (i.e., $(\\cdot)$ ) denote standard deviation values.",
            "images": [],
            "dimensions": {
                "dpi": 200,
                "height": 2200,
                "width": 1700
            }
        }
    ],
    "model": "mistral-ocr-2503-completion",
    "usage_info": {
        "pages_processed": 29,
        "doc_size_bytes": null
    }
}
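
A response shaped like the example above can be post-processed with a few lines of Python. The sketch below joins the page markdown in index order and decodes any embedded images. The function names are hypothetical, the sample data is made up, and the handling of a `data:...;base64,` prefix on `image_base64` is an assumption, since the example elides the actual value.

```python
import base64


def pages_to_markdown(ocr_result: dict) -> str:
    """Concatenate page markdown in page-index order."""
    pages = sorted(ocr_result.get("pages", []), key=lambda page: page["index"])
    return "\n\n".join(page["markdown"] for page in pages)


def decode_images(ocr_result: dict) -> dict:
    """Map each image id to its decoded bytes."""
    images = {}
    for page in ocr_result.get("pages", []):
        for image in page.get("images", []):
            payload = image.get("image_base64")
            if not payload:
                continue
            # Assumption: the payload may carry a "data:<mime>;base64," prefix
            if payload.startswith("data:") and "," in payload:
                payload = payload.split(",", 1)[1]
            images[image["id"]] = base64.b64decode(payload)
    return images


# Made-up sample mirroring the structure of the output example
sample_result = {
    "pages": [
        {
            "index": 2,
            "markdown": "![img-0.jpeg](img-0.jpeg)\n\nFigure 1",
            "images": [
                {
                    "id": "img-0.jpeg",
                    "image_base64": "data:image/jpeg;base64,"
                    + base64.b64encode(b"fake-bytes").decode("ascii"),
                }
            ],
        },
        {"index": 1, "markdown": "# Title", "images": []},
    ]
}

markdown = pages_to_markdown(sample_result)
images = decode_images(sample_result)
print(markdown)
print(sorted(images))
```

Because the markdown references images by id (for example `img-0.jpeg`), writing each decoded image to a file with that name next to the markdown keeps the links valid.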

OCR with uploaded PDF

You can also upload a PDF file and get the OCR results from the uploaded PDF.

Upload a file:

curl https://api.mistral.ai/v1/files \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -F purpose="ocr" \
  -F file="@uploaded_file.pdf"

Retrieve File:

curl -X GET "https://api.mistral.ai/v1/files/$id" \
     -H "Accept: application/json" \
     -H "Authorization: Bearer $MISTRAL_API_KEY"

id='00edaf84-95b0-45db-8f83-f71138491f23' object='file' size_bytes=3749788 created_at=1741023462 filename='uploaded_file.pdf' purpose='ocr' sample_type='ocr_input' source='upload' deleted=False num_lines=None

Get signed URL:

curl -X GET "https://api.mistral.ai/v1/files/$id/url?expiry=24" \
     -H "Accept: application/json" \
     -H "Authorization: Bearer $MISTRAL_API_KEY"

Get OCR results:

curl https://api.mistral.ai/v1/ocr \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
  -d '{
    "model": "mistral-ocr-latest",
    "document": {
        "type": "document_url",
        "document_url": "<signed_url>"
    },
    "include_image_base64": true
  }' -o ocr_output.json
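
The upload, signed URL, and OCR steps above can be chained in Python. The sketch below takes the client as a parameter and assumes it exposes `files.upload`, `files.get_signed_url`, and `ocr.process` in the style of the `mistralai` SDK used in the notebook; treat those call shapes as assumptions to check against the SDK version you have installed. The stub classes are hypothetical and exist only so the flow can be exercised without network access.

```python
from types import SimpleNamespace


def ocr_uploaded_pdf(client, file_name, content, model="mistral-ocr-latest"):
    """Upload a PDF, request a signed URL for it, then run OCR on that URL."""
    uploaded = client.files.upload(
        file={"file_name": file_name, "content": content},
        purpose="ocr",
    )
    signed = client.files.get_signed_url(file_id=uploaded.id)
    return client.ocr.process(
        model=model,
        document={"type": "document_url", "document_url": signed.url},
        include_image_base64=True,
    )


class _StubFiles:
    """Hypothetical stand-in for the client's file endpoints."""

    def upload(self, file, purpose):
        self.uploaded = (file["file_name"], purpose)
        return SimpleNamespace(id="file-123")

    def get_signed_url(self, file_id):
        return SimpleNamespace(url=f"https://signed.example/{file_id}")


class _StubOcr:
    """Hypothetical stand-in that echoes the request it receives."""

    def process(self, model, document, include_image_base64=False):
        return SimpleNamespace(model=model, document=document)


stub_client = SimpleNamespace(files=_StubFiles(), ocr=_StubOcr())
result = ocr_uploaded_pdf(stub_client, "uploaded_file.pdf", b"%PDF-1.4 demo")
print(result.document["document_url"])
```

With a real client, pass the object created by `Mistral(api_key=...)` and the bytes of your PDF instead of the stub and the placeholder content.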

Document understanding

The Document understanding capability combines OCR with large language model capabilities to enable natural language interaction with document content. This allows you to extract information and insights from documents by asking questions in natural language.

The workflow consists of two main steps:

Document Processing:

OCR extracts text, structure, and formatting, creating a machine-readable version of the document.

Language Model Understanding:

The extracted document content is analyzed by a large language model. You can ask questions or request information in natural language. The model understands context and relationships within the document and can provide relevant answers based on the document content.

Key capabilities:

  • Question answering about specific document content
  • Information extraction and summarization
  • Document analysis and insights
  • Multi-document queries and comparisons
  • Context-aware responses that consider the full document

Common use cases:

  • Analyzing research papers and technical documents
  • Extracting information from business documents
  • Processing legal documents and contracts
  • Building document Q&A applications
  • Automating document-based workflows

The example below shows how to interact with a PDF document using natural language:

curl https://api.mistral.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
  -d '{
    "model": "mistral-small-latest",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "what is the last sentence in the document"
          },
          {
            "type": "document_url",
            "document_url": "https://arxiv.org/pdf/1805.04770"
          }
        ]
      }
    ],
    "document_image_limit": 8,
    "document_page_limit": 64
  }'
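
Since multi-document queries are listed among the capabilities, it can help to generate the request body programmatically. The sketch below builds the same JSON payload as the curl call for one or more document URLs; `build_document_chat_payload` is a hypothetical helper, and the defaults simply mirror the values shown above.

```python
import json


def build_document_chat_payload(
    question,
    document_urls,
    model="mistral-small-latest",
    image_limit=8,
    page_limit=64,
):
    """Build a chat-completions body with one text part and N document parts."""
    content = [{"type": "text", "text": question}]
    content += [{"type": "document_url", "document_url": url} for url in document_urls]
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "document_image_limit": image_limit,
        "document_page_limit": page_limit,
    }


payload = build_document_chat_payload(
    "what is the last sentence in the document",
    ["https://arxiv.org/pdf/1805.04770"],
)
print(json.dumps(payload, indent=2))
```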

FAQ

Are there any limits regarding the OCR API?

Yes. Uploaded document files must not exceed 50 MB in size and should be no longer than 1,000 pages.
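
A quick pre-flight check against these limits can save a failed upload. The sketch below is a hypothetical helper; it assumes "50 MB" means 50 × 1024 × 1024 bytes, which the FAQ does not specify, and it expects the caller to supply the page count if known.

```python
# Assumption: "50 MB" is interpreted as 50 * 1024 * 1024 bytes
MAX_FILE_BYTES = 50 * 1024 * 1024
MAX_PAGES = 1000


def check_ocr_limits(size_bytes, page_count=None):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append(
            f"File is {size_bytes} bytes; the limit is {MAX_FILE_BYTES} bytes."
        )
    if page_count is not None and page_count > MAX_PAGES:
        problems.append(f"Document has {page_count} pages; the limit is {MAX_PAGES}.")
    return problems


# Values taken from the retrieve-file and usage examples in these notes
print(check_ocr_limits(3749788, page_count=29))
```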
