@davidmezzetti
Last active December 2, 2025 16:17
from txtai import Agent

# Define tools
tools = [
    "websearch",  # Runs a websearch using default engine
    "webview",    # Loads a web page
]

# Define LLM
model = "Qwen/Qwen3-4B-Instruct-2507"

# Create Agent
agent = Agent(model=model, tools=tools, max_steps=100)

# Create the paper
print(agent("""
Search for data on txtai and then write an Arxiv style research paper about it.
Write the paper as latex.""", maxlength=50000))

# Output saved as txtai.tex
# Converted to PDF using `pdflatex txtai.tex`
\documentclass{article}
% Required packages
\usepackage{geometry}
\usepackage{setspace}
\usepackage{titlesec}
\usepackage{hyperref}
\usepackage{enumitem}
\usepackage{fancyhdr}
\usepackage{parskip}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{float}
\usepackage{caption}
\usepackage{multicol}
\usepackage{url}
\usepackage{listings}
\usepackage{color}
% Set page margins
\geometry{a4paper, margin=1in}
\setlength{\parindent}{0pt}
\setlength{\parskip}{1em}
\onehalfspacing
% Title formatting
\titleformat{\section}{\large\bfseries\uppercase}{}{0em}{}
\titleformat{\subsection}{\normalsize\bfseries}{}{0em}{}
\titleformat{\subsubsection}{\normalsize\itshape}{}{0em}{}
% Header and footer
\fancyhf{}
\fancyhead[LE,RO]{\thepage}
\fancyhead[LO,RE]{\footnotesize\textbf{txtai: An All-in-One AI Framework for Semantic Search and LLM Workflows}}
\fancyfoot[C]{\footnotesize\thepage}
\renewcommand{\headrulewidth}{0pt}
\renewcommand{\footrulewidth}{0pt}
% Hyperlinks
\hypersetup{
    colorlinks=true,
    linkcolor=black,    % color of internal links
    filecolor=magenta,  % color of file links
    urlcolor=cyan,      % color of external links (URLs)
    pdftitle={txtai: An All-in-One AI Framework for Semantic Search and LLM Workflows},
    pdfauthor={NeuML LLC},
    pdfsubject={AI Framework for Semantic Search and Language Model Workflows},
    pdfkeywords={txtai, semantic search, LLM, embeddings, AI framework}
}
% Code formatting
\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.92}
\lstdefinestyle{mystyle}{
    backgroundcolor=\color{backcolour},
    commentstyle=\color{codegreen},
    keywordstyle=\color{magenta},
    numberstyle=\tiny\color{codegray},
    stringstyle=\color{codepurple},
    basicstyle=\ttfamily\small,
    breakatwhitespace=false,
    breaklines=true,
    captionpos=b,
    keepspaces=true,
    numbers=left,
    numbersep=5pt,
    showspaces=false,
    showstringspaces=false,
    showtabs=false,
    tabsize=2,
    frame=single,
    framerule=0.5pt,
    rulecolor=\color{codegray}
}
\lstset{style=mystyle}
\title{txtai: An All-in-One AI Framework for Semantic Search and LLM Workflows}
\author{NeuML LLC}
\date{\today}
\begin{document}
\maketitle
\begin{abstract}
txtai is an open-source, all-in-one AI framework designed to simplify semantic search, large language model (LLM) orchestration, and language model workflows. Built with Python 3.10+, txtai integrates Hugging Face Transformers, Sentence Transformers, and FastAPI to provide a unified platform for vector search, multimodal indexing, and AI-powered pipelines. This paper presents a comprehensive overview of txtai's architecture, core components, use cases, and practical implementations. We explore its capabilities in semantic search, retrieval-augmented generation (RAG), agent-based problem solving, and multimodal processing, highlighting its versatility and efficiency in real-world AI applications.
\end{abstract}
\section{Introduction}
The rapid advancement of artificial intelligence has led to an explosion in the volume and complexity of unstructured data. Traditional keyword-based search systems are increasingly inadequate for retrieving semantically relevant information. To address this, semantic search technologies have emerged as a critical tool for understanding and navigating complex data landscapes.
txtai is designed to serve as a comprehensive solution for semantic search and AI workflows. It provides a unified framework that combines vector indexing, graph analysis, relational databases, and language model pipelines into a single, extensible platform. By leveraging state-of-the-art embedding models and scalable indexing techniques, txtai enables developers and researchers to build intelligent applications that can understand context, reason over data, and generate human-like responses.
This paper examines the architecture and functionality of txtai, focusing on its core components: embeddings databases, pipelines, agents, and workflows. We provide detailed examples of how these components can be combined to solve real-world problems, from document retrieval to autonomous agent-based reasoning.
\section{Architecture}
The core of txtai is its embeddings database, which serves as a unified repository for semantic data. This database is a composite structure that integrates vector indexes (both sparse and dense), graph networks, and relational databases. This architectural design enables hybrid search, combining the contextual recall of semantic similarity with the precision of exact keyword matching.
\subsection{Embeddings Database}
The embeddings database is the engine that powers semantic search. Data is transformed into high-dimensional vectors through embedding models, where semantically similar concepts produce similar vector representations. These vectors are indexed using approximate nearest neighbor (ANN) search algorithms, enabling fast retrieval of relevant content.
Key features of the embeddings database include:
\begin{itemize}[leftmargin=*,noitemsep]
\item Support for multiple embedding models (e.g., all-MiniLM-L6-v2, BERT-based models)
\item Configurable storage options (content, metadata, binary objects)
\item Hybrid search capabilities (semantic + keyword)
\item Support for SQL queries and filtering
\end{itemize}
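The content storage, hybrid search, and SQL capabilities listed above can be illustrated with a short sketch. The configuration below stores document content alongside vectors, enables combined sparse and dense scoring, and filters results with a SQL query; the model name and metadata fields are illustrative choices rather than requirements.
\begin{lstlisting}[language=Python, caption={Sketch of a hybrid embeddings database with SQL filtering}]
from txtai import Embeddings

# Store content alongside vectors and enable hybrid (sparse + dense) scoring
embeddings = Embeddings(
    path="sentence-transformers/all-MiniLM-L6-v2",
    content=True,
    hybrid=True
)

# Metadata fields such as "category" are illustrative
embeddings.index([
    {"text": "US tops 5 million confirmed virus cases", "category": "health"},
    {"text": "Maine man wins $1M from $25 lottery ticket", "category": "misc"}
])

# Semantic query
print(embeddings.search("public health", 1))

# SQL query combining semantic similarity with metadata filtering
print(embeddings.search(
    "SELECT id, text, score FROM txtai WHERE similar('public health') AND category='health'"
))
\end{lstlisting}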
\subsection{Pipeline System}
txtai provides a flexible pipeline system that allows users to chain together various processing steps. Each pipeline is a callable object that can be configured with specific parameters and models. The pipeline system supports a wide range of operations, including text processing, image captioning, audio transcription, and translation.
\begin{itemize}[leftmargin=*,noitemsep]
\item Text processing (summarization, translation, entity extraction)
\item Audio processing (transcription, speech-to-text, text-to-speech)
\item Image processing (captioning, object detection, hashing)
\item Document processing (text extraction, segmentation, tabular data handling)
\end{itemize}
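For instance, the document processing operations above can be sketched with the Textractor pipeline, which extracts and segments text from files; the file path below is a placeholder.
\begin{lstlisting}[language=Python, caption={Sketch of a document text extraction pipeline}]
from txtai.pipeline import Textractor

# Extract text and split it into paragraphs
textractor = Textractor(paragraphs=True)

# "document.pdf" is a placeholder for a local file
for paragraph in textractor("document.pdf"):
    print(paragraph)
\end{lstlisting}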
\subsection{Agent Framework}
Agents are intelligent components that can autonomously solve complex problems by iteratively calling tools and making decisions. txtai's agent framework enables the creation of autonomous systems that can reason over multiple data sources, perform multi-step reasoning, and generate comprehensive responses.
\begin{itemize}[leftmargin=*,noitemsep]
\item Multi-tool orchestration
\item Iterative problem solving
\item Context-aware decision making
\item Agent teams for complex tasks
\end{itemize}
\subsection{Workflow Engine}
Workflows provide a structured way to combine multiple pipelines and agents into a single, composable process. Workflows can be defined in Python or configuration files (YAML), allowing for both programmatic and declarative approaches to workflow design.
\begin{itemize}[leftmargin=*,noitemsep]
\item Batch processing of large datasets
\item Scheduling of recurring tasks
\item Conditional execution paths
\item Stream processing for real-time applications
\end{itemize}
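As a sketch of the scheduling capability listed above, the example below binds a minimal workflow to a cron expression; the task, elements, cron string, and iteration count are illustrative.
\begin{lstlisting}[language=Python, caption={Sketch of a scheduled workflow}]
from txtai.workflow import Task, Workflow

# Minimal workflow: a single task that lowercases a batch of elements
workflow = Workflow([Task(lambda x: [y.lower() for y in x])])

# Run the workflow once
print(list(workflow(["Hello", "World"])))

# Re-run the workflow every hour over the same elements
# (cron expression and iteration count are illustrative)
workflow.schedule("0 * * * *", ["Hello", "World"], iterations=1)
\end{lstlisting}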
\section{Core Components}
\subsection{Embeddings}
Embeddings are the foundation of semantic search in txtai. The framework supports a variety of pre-trained embedding models from Hugging Face, including sentence-transformers and BERT-based models. These models convert text into dense vectors that capture semantic meaning.
\begin{lstlisting}[language=Python, caption={Example of building and searching an embeddings index}]
from txtai import Embeddings

embeddings = Embeddings(path="sentence-transformers/nli-mpnet-base-v2")

data = [
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed, " +
    "forming a Manhattan-sized iceberg",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
    "The National Park Service warns against sacrificing slower friends " +
    "in a bear attack",
    "Maine man wins $1M from $25 lottery ticket",
    "Make huge profits without work, earn up to $100,000 a day"
]

embeddings.index(data)

print(f"{'Query':20} Best Match")
print("-" * 50)

for query in ("feel good story", "climate change", "public health story", "war",
              "wildlife", "asia", "lucky", "dishonest junk"):
    uid = embeddings.search(query, 1)[0][0]
    print(f"{query:20} {data[uid]}")
\end{lstlisting}
\subsection{Pipelines}
Pipelines in txtai are modular, reusable processing units that can be combined to create complex workflows. Each pipeline is designed to perform a specific task, such as summarization, translation, or entity recognition.
\begin{lstlisting}[language=Python, caption={Example of using a text summarization pipeline}]
from txtai.pipeline import Summary
summary = Summary()
text = "The National Park Service has issued a warning about bear attacks in Maine. The agency advises hikers to be cautious and to avoid carrying food in bear-prone areas."
summary_result = summary(text)
print(summary_result)
\end{lstlisting}
\subsection{Agents}
Agents are intelligent components that can autonomously solve problems by making decisions and calling tools. The agent framework in txtai enables the creation of autonomous systems that can reason over data and generate responses.
\begin{lstlisting}[language=Python, caption={Example of a basic agent}]
from datetime import datetime

from txtai import Agent

wikipedia = {
    "name": "wikipedia",
    "description": "Searches a Wikipedia database",
    "provider": "huggingface-hub",
    "container": "neuml/txtai-wikipedia"
}

arxiv = {
    "name": "arxiv",
    "description": "Searches a database of scientific papers",
    "provider": "huggingface-hub",
    "container": "neuml/txtai-arxiv"
}

def today() -> str:
    return datetime.today().isoformat()

agent = Agent(
    model="Qwen/Qwen3-4B-Instruct-2507",
    tools=[today, wikipedia, arxiv, "websearch"],
    max_steps=10,
)
\end{lstlisting}
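Once constructed, the agent is invoked with a natural language instruction and iteratively selects tools until it can answer; the prompt below is illustrative.
\begin{lstlisting}[language=Python, caption={Running the agent}]
# Illustrative prompt; the agent decides which tools to call and in what order
answer = agent("""
Find recent research on vector search and summarize the key trends
in a short paragraph. Cite the sources you used.
""")

print(answer)
\end{lstlisting}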
\subsection{Workflows}
Workflows provide a structured way to combine multiple processing steps into a single, composable process. They are particularly useful for handling large datasets and complex data transformations.
\begin{lstlisting}[language=Python, caption={Example of a workflow for audio processing}]
from txtai import Embeddings
from txtai.pipeline import Transcription, Translation
from txtai.workflow import FileTask, Task, Workflow

embeddings = Embeddings({
    "path": "sentence-transformers/paraphrase-MiniLM-L3-v2",
    "content": True
})

transcribe = Transcription()
translate = Translation()

tasks = [
    FileTask(transcribe, r"\.wav$"),
    Task(lambda x: translate(x, "fr"))
]

data = [
    "US_tops_5_million.wav",
    "Canadas_last_fully.wav",
    "Beijing_mobilises.wav",
    "The_National_Park.wav",
    "Maine_man_wins_1_mil.wav",
    "Make_huge_profits.wav"
]

workflow = Workflow(tasks)
embeddings.index((uid, text, None) for uid, text in enumerate(workflow(data)))
embeddings.search("wildlife", 1)
\end{lstlisting}
\section{Use Cases}
\subsection{Semantic Search}
txtai excels at semantic search applications where users need to find documents or data that are semantically similar to a query, rather than matching exact keywords.
\begin{itemize}[leftmargin=*,noitemsep]
\item Document retrieval
\item Product search
\item Content recommendation
\item Knowledge base querying
\end{itemize}
\subsection{Retrieval-Augmented Generation (RAG)}
RAG combines the strengths of retrieval and generation to create more accurate and context-aware responses. In txtai, RAG pipelines can be built by combining vector search with language model generation.
\begin{itemize}[leftmargin=*,noitemsep]
\item Answering complex questions with context from documents
\item Summarizing long-form content
\item Generating responses based on multiple sources
\end{itemize}
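A minimal RAG loop can be sketched with the components introduced earlier: an embeddings index supplies context and an LLM pipeline generates the answer. The model names and prompt template below are illustrative choices, not requirements.
\begin{lstlisting}[language=Python, caption={Sketch of a retrieval-augmented generation loop}]
from txtai import Embeddings, LLM

# Index a small document collection with stored content
embeddings = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2", content=True)
embeddings.index([
    "txtai is an all-in-one AI framework for semantic search and LLM workflows",
    "txtai combines embeddings databases, pipelines, agents and workflows"
])

# LLM pipeline; the model is an illustrative choice
llm = LLM("Qwen/Qwen3-4B-Instruct-2507")

def rag(question):
    # Retrieve the most relevant documents as context
    context = "\n".join(x["text"] for x in embeddings.search(question, 3))

    # Illustrative prompt template
    return llm(f"Answer the question using only the context below.\n"
               f"Question: {question}\nContext: {context}")

print(rag("What is txtai?"))
\end{lstlisting}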
\subsection{Agent-Based Problem Solving}
The agent framework enables the creation of autonomous systems that can solve complex problems by iteratively calling tools and making decisions.
\begin{itemize}[leftmargin=*,noitemsep]
\item Multi-step reasoning
\item Cross-domain information gathering
\item Autonomous research and reporting
\end{itemize}
\subsection{Multimodal Processing}
txtai supports multimodal processing, allowing users to work with text, images, audio, and video data in a unified framework.
\begin{itemize}[leftmargin=*,noitemsep]
\item Image captioning and object detection
\item Audio transcription and translation
\item Video analysis and summarization
\end{itemize}
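As an example of the image branch, the sketch below generates a caption for an image with the Caption pipeline; the image path is a placeholder.
\begin{lstlisting}[language=Python, caption={Sketch of an image captioning pipeline}]
from txtai.pipeline import Caption

# Generates a natural language description of an image
caption = Caption()

# "image.jpg" is a placeholder for a local image file
print(caption("image.jpg"))
\end{lstlisting}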
\section{Performance and Scalability}
txtai is designed with scalability and performance in mind. The framework supports distributed indexing and can scale across multiple nodes to handle large datasets. The use of efficient vector search algorithms and optimized data structures ensures fast retrieval times even with large index sizes.
\begin{itemize}[leftmargin=*,noitemsep]
\item Support for distributed embeddings clusters
\item Efficient batch processing
\item Scales to very large datasets
\item Optimized for real-time applications
\end{itemize}
\section{Conclusion}
txtai represents a significant advancement in the field of AI frameworks for semantic search and language model workflows. By providing a unified platform that integrates embeddings, pipelines, agents, and workflows, txtai enables developers and researchers to build intelligent applications with greater ease and efficiency.
The framework's flexibility, extensibility, and performance make it well-suited for a wide range of applications, from simple document retrieval to complex autonomous problem solving. As AI continues to evolve, frameworks like txtai will play a crucial role in enabling the next generation of intelligent systems that can understand, reason, and act in complex environments.
\section{Future Directions}
Future developments for txtai are expected to include:
\begin{itemize}[leftmargin=*,noitemsep]
\item Enhanced support for vision-language models
\item Improved agent autonomy and reasoning capabilities
\item Better integration with external APIs and services
\item Support for more advanced training and fine-tuning techniques
\end{itemize}
\end{document}