@davidmezzetti
Created November 8, 2025 11:23
from txtai.pipeline import Textractor

# Use the Docling backend and split the extracted text by section
textractor = Textractor(sections=True, backend="docling")

# Extract the BERT paper (PDF)
textractor("https://arxiv.org/pdf/1810.04805")
# PDF converted to Markdown, split on Markdown sections
# ['## BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding...
# '## Abstract\nWe introduce a new language representation model called BERT...
# "## 1 Introduction\nLanguage model pre-training has been shown to be effective...
# '## 2 Related Work\nThere is a long history of pre-training general language representations...
# ...
# ]

# Extract a web page
textractor("https://github.com/neuml/txtai")
# HTML converted to Markdown, split on Markdown sections
# ['**GitHub - neuml/txtai: 💡 All-in-one open-source AI framework for semantic search...
# '**All-in-one AI framework** \ntxtai is an all-in-one AI framework for semantic search...
# '## Why txtai?\nNew vector databases, LLM frameworks and everything in between are sprouting...
# '## Use Cases\nThe following sections introduce common txtai use cases. A comprehensive set of...'
# ...
# ]
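
The outputs above are lists of Markdown strings, one per section. To illustrate what splitting on Markdown sections means, here is a minimal standalone sketch. It is not txtai's implementation: `split_markdown_sections` is a hypothetical helper, and it assumes sections start at ATX headings (`#` through `######`) and that the text contains no fenced code with `#` at the start of a line.

```python
import re


def split_markdown_sections(markdown: str) -> list[str]:
    """Split Markdown text into sections, starting a new one at each heading.

    Hypothetical helper for illustration only; not part of txtai.
    """
    sections, current = [], []
    for line in markdown.splitlines():
        # A heading line closes the section collected so far
        if re.match(r"#{1,6} ", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)

    # Flush the final section
    if current:
        sections.append("\n".join(current).strip())

    # Drop sections that are empty after stripping whitespace
    return [section for section in sections if section]


sample = (
    "## Abstract\n"
    "We introduce BERT.\n"
    "\n"
    "## 1 Introduction\n"
    "Language model pre-training helps.\n"
)

sections = split_markdown_sections(sample)
for section in sections:
    print(repr(section))
```

Each returned string keeps its heading as the first line, which matches the shape of the section lists shown in the comments above.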