#!/usr/bin/env python3
"""
pdf_to_bibtex.py - Academic PDF to BibTeX Converter
====================================================
Automatically extract bibliographic metadata from academic PDFs using Grobid
and CrossRef, generating a clean BibTeX file with consistent AuthorYYYY citation keys.

INPUT FORMAT
------------
Expects a directory of PDFs named in AuthorYYYY format (or close variations):
Literature/
├── Adams2013.pdf → Adams2013
├── Aston-Jones2005.pdf → AstonJones2005
├── Feyaerts&Henriksen2021.pdf → FeyaertsHenriksen2021
└── Cross-Disorder Group... 2019.pdf → CrossDisorder2019
The year must be a 4-digit number at the end (before .pdf). The script uses
the filename to generate citation keys and to skip already-processed files.

FEATURES
--------
• Extracts metadata from PDFs using Grobid (runs via Docker)
• Enriches via DOI content negotiation (direct BibTeX from doi.org)
• Falls back to CrossRef API title search when no DOI available
• Generates BibTeX with AuthorYYYY keys (e.g., Chesney2014, AstonJones2005)
• Smart key disambiguation: AuthorYYYY → AuthorSecondAuthorYYYY → AuthorYYYYa
• Incremental processing: skips PDFs already in the .bib file
• Optional: rename PDFs to match their citation keys
• Optional: extract full TEI XML for each document (structured full-text)

REQUIREMENTS
------------
• Docker (for Grobid)
• Python 3.7+
• requests library: pip install requests
Grobid is pulled automatically on first run (~2GB Docker image).
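
The container the script manages is roughly equivalent to running:

    docker run -d --name grobid-pdf-extractor -p 8070:8070 lfoppiano/grobid:0.8.1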

USAGE
-----
Basic - process all PDFs and create references.bib:

    python pdf_to_bibtex.py /path/to/papers/

Specify output file:

    python pdf_to_bibtex.py /path/to/papers/ my_references.bib

With all options:

    python pdf_to_bibtex.py /path/to/papers/ refs.bib --rename-files --extract-tei

OPTIONS
-------
--no-crossref    Skip DOI lookup and CrossRef title search (use only Grobid data)
--rename-files   Rename PDFs to AuthorYYYY.pdf based on extracted metadata
--extract-tei    Save full TEI XML as AuthorYYYY.tei.xml (Grobid's structured output)
--stop-grobid    Stop the Grobid Docker container after processing

EXAMPLE OUTPUT
--------------
$ python pdf_to_bibtex.py Literature/
Found 69 PDF files
CrossRef enrichment: enabled
Found 42 existing entries in references.bib
 Skipping (already in bib): Adams2013.pdf
 Skipping (already in bib): Bastos2012.pdf

Processing 27 new files...
✓ Grobid container 'grobid-pdf-extractor' is already running

[1/27] Processing: NewPaper2024.pdf
 → Fetching BibTeX from DOI: 10.1038/s41586-024-07051-0...
 ✓ DOI: got BibTeX directly
 ✓ Key: Smith2024
 Title: A new theory of consciousness
 Authors: Smith, John, Doe, Jane et al.
 Year: 2024 | Nature, 625(7994), pp. 112--118

============================================================
✓ Successfully updated references.bib
 New entries added: 27
 CrossRef enriched: 25
 Total entries in file: 69

GENERATED BIBTEX FORMAT
-----------------------
@article{Smith2024,
 author = {Smith, John and Doe, Jane and Johnson, Bob},
 title = {A new theory of consciousness},
 journal = {Nature},
 year = {2024},
 volume = {625},
 number = {7994},
 pages = {112--118},
 doi = {10.1038/s41586-024-07051-0}
}

HOW IT WORKS
------------
1. Grobid extracts title, authors, DOI from PDF header/first page
2. If DOI found → fetch BibTeX directly from doi.org (content negotiation)
3. If no DOI → search CrossRef by title, use best match
4. Merge Grobid + enriched data (DOI/CrossRef preferred for year, volume, pages, etc.)
5. Generate citation key from filename (if AuthorYYYY pattern) or from metadata
6. Append new entries to existing .bib file (won't duplicate)
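
For example, step 2 amounts to a content-negotiation request like:

    curl -L -H "Accept: application/x-bibtex" https://doi.org/10.1038/s41586-024-07051-0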

FILE NAMING CONVENTION
----------------------
The script handles various filename formats:
• AuthorYYYY.pdf → AuthorYYYY (e.g., Chesney2014.pdf → Chesney2014)
• Author-AuthorYYYY.pdf → AuthorAuthorYYYY (e.g., Aston-Jones2005.pdf → AstonJones2005)
• Author&AuthorYYYY.pdf → AuthorAuthorYYYY (e.g., Feyaerts&Henriksen2021.pdf → FeyaertsHenriksen2021)
• "Name With Spaces YYYY.pdf" → FirstWord+YYYY (e.g., "Cross-Disorder Group... 2019.pdf" → CrossDisorder2019)
The script will:
• Use the filename to derive the citation key
• Skip files whose key already exists in the .bib
• With --rename-files, rename PDFs to clean AuthorYYYY.pdf format
For papers with the same first author and year, disambiguation is automatic:
• Smith2024.pdf → Smith2024
• Smith2024-other.pdf → SmithJones2024 (uses second author from filename or metadata)
• Smith2024-third.pdf → Smith2024a (alphabetical suffix)

TEI XML OUTPUT (--extract-tei)
------------------------------
Grobid can produce structured TEI XML with:
• Full text segmented into sections
• Parsed references with links
• Figures and tables identified
• Author affiliations and emails
Useful for text mining, citation analysis, or building a local search index.
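
The output has roughly this shape (heavily truncated):

    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>...bibliographic metadata...</teiHeader>
      <text>
        <body>...sections and paragraphs...</body>
        <back>...parsed references...</back>
      </text>
    </TEI>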

NOTES
-----
• Grobid container keeps running after the script (for faster subsequent runs)
• Use --stop-grobid to stop it when done
• DOI lookup uses content negotiation (doi.org) which is fast and reliable
• CrossRef is only used as fallback for title search when no DOI is available
• Some PDFs (scans, unusual layouts) may yield incomplete metadata
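
Stopping the container manually is equivalent to:

    docker stop grobid-pdf-extractor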

AUTHOR
------
Generated with Claude. Feel free to modify and redistribute.

LICENSE
-------
MIT License - do whatever you want with it.
"""

import sys
import re
import time
import subprocess
import argparse
from pathlib import Path
from typing import Optional, Dict, Any, Tuple

import requests

# Grobid configuration
GROBID_IMAGE = "lfoppiano/grobid:0.8.1"
GROBID_CONTAINER_NAME = "grobid-pdf-extractor"
GROBID_PORT = 8070
GROBID_URL = f"http://localhost:{GROBID_PORT}"

# CrossRef configuration (used for title search fallback)
CROSSREF_API = "https://api.crossref.org"

# Be polite - identify ourselves (CrossRef asks for this)
CROSSREF_HEADERS = {
    "User-Agent": "pdf-to-bibtex/1.0 (https://github.com/user/pdf-to-bibtex; mailto:user@example.com)"
}


def load_existing_bibtex(filepath: Path) -> Tuple[set, str]:
    """Load existing BibTeX file and return set of keys and the content."""
    if not filepath.exists():
        return set(), ""
    content = filepath.read_text(encoding="utf-8")
    # Extract all keys from @type{key, patterns
    keys = set(re.findall(r'@\w+\s*\{\s*([^,\s]+)\s*,', content))
    return keys, content


def start_grobid_container() -> bool:
    """Start the Grobid Docker container if not already running."""
    result = subprocess.run(
        ["docker", "ps", "-q", "-f", f"name={GROBID_CONTAINER_NAME}"],
        capture_output=True,
        text=True
    )
    if result.stdout.strip():
        print(f"✓ Grobid container '{GROBID_CONTAINER_NAME}' is already running")
        return True
    result = subprocess.run(
        ["docker", "ps", "-aq", "-f", f"name={GROBID_CONTAINER_NAME}"],
        capture_output=True,
        text=True
    )
    if result.stdout.strip():
        print(f"Starting existing Grobid container '{GROBID_CONTAINER_NAME}'...")
        subprocess.run(["docker", "start", GROBID_CONTAINER_NAME], check=True)
    else:
        print(f"Pulling and starting Grobid container ({GROBID_IMAGE})...")
        subprocess.run([
            "docker", "run", "-d",
            "--name", GROBID_CONTAINER_NAME,
            "-p", f"{GROBID_PORT}:8070",
            GROBID_IMAGE
        ], check=True)
    print("Waiting for Grobid to initialize (this may take a minute)...")
    max_attempts = 60
    for attempt in range(max_attempts):
        try:
            response = requests.get(f"{GROBID_URL}/api/isalive", timeout=5)
            if response.status_code == 200:
                print("✓ Grobid is ready")
                return True
        except requests.exceptions.RequestException:
            pass
        time.sleep(2)
        if (attempt + 1) % 10 == 0:
            print(f" Still waiting... ({attempt + 1}/{max_attempts})")
    print("✗ Grobid failed to start in time")
    return False


def process_pdf_with_grobid(pdf_path: Path) -> Optional[str]:
    """Send a PDF to Grobid and get BibTeX response."""
    url = f"{GROBID_URL}/api/processHeaderDocument"
    try:
        with open(pdf_path, "rb") as pdf_file:
            files = {"input": (pdf_path.name, pdf_file, "application/pdf")}
            response = requests.post(url, files=files, timeout=120)
        if response.status_code == 200:
            return response.text
        else:
            print(f" Warning: Grobid returned status {response.status_code}")
            return None
    except Exception as e:
        print(f" Error processing with Grobid: {e}")
        return None


def process_pdf_fulltext_tei(pdf_path: Path) -> Optional[str]:
    """Send a PDF to Grobid and get full TEI XML response."""
    url = f"{GROBID_URL}/api/processFulltextDocument"
    try:
        with open(pdf_path, "rb") as pdf_file:
            files = {"input": (pdf_path.name, pdf_file, "application/pdf")}
            response = requests.post(url, files=files, timeout=300)  # Longer timeout for full processing
        if response.status_code == 200:
            return response.text
        else:
            print(f" Warning: Grobid TEI returned status {response.status_code}")
            return None
    except Exception as e:
        print(f" Error getting TEI from Grobid: {e}")
        return None


def parse_bibtex(bibtex_str: str) -> Dict[str, str]:
    """Parse a BibTeX entry into a dictionary."""
    fields = {}
    # Extract entry type
    type_match = re.match(r'@(\w+)\s*\{', bibtex_str)
    if type_match:
        fields['_type'] = type_match.group(1).lower()
    # Extract fields - handle nested braces and quoted values
    # (the alternation allows one level of brace nesting, e.g. {A {B} C})
    for match in re.finditer(r'(\w+)\s*=\s*[{"]((?:[^{}"]|(?:\{[^{}]*\}))*)[}"]', bibtex_str):
        key = match.group(1).lower()
        value = match.group(2).strip()
        if value:
            fields[key] = value
    return fields


def query_doi_bibtex(doi: str) -> Optional[str]:
    """Query DOI directly for BibTeX using content negotiation."""
    try:
        # Clean DOI
        doi = doi.strip()
        if doi.startswith("http"):
            doi = re.sub(r'https?://(dx\.)?doi\.org/', '', doi)
        url = f"https://doi.org/{doi}"
        headers = {"Accept": "application/x-bibtex"}
        response = requests.get(url, headers=headers, timeout=30, allow_redirects=True)
        if response.status_code == 200 and response.text.strip().startswith("@"):
            # Ensure proper UTF-8 decoding
            response.encoding = 'utf-8'
            text = response.text
            # Normalize various dash types to standard BibTeX double-dash
            text = text.replace('–', '--')  # en-dash
            text = text.replace('—', '--')  # em-dash
            text = text.replace('−', '-')   # minus sign
            return text
        return None
    except Exception as e:
        print(f" DOI lookup failed: {e}")
        return None


def query_crossref_by_title(title: str) -> Optional[Dict[str, Any]]:
    """Query CrossRef API by title search."""
    try:
        # Clean title
        title = re.sub(r'[^\w\s]', ' ', title)
        title = ' '.join(title.split()[:15])  # First 15 words
        url = f"{CROSSREF_API}/works"
        params = {
            "query.title": title,
            "rows": 1,
            "select": "DOI,title,author,published-print,published-online,container-title,volume,issue,page,publisher,type"
        }
        response = requests.get(url, params=params, headers=CROSSREF_HEADERS, timeout=30)
        if response.status_code == 200:
            items = response.json().get("message", {}).get("items", [])
            if items:
                return items[0]
        return None
    except Exception as e:
        print(f" CrossRef title search failed: {e}")
        return None


def crossref_to_bibtex_fields(cr: Dict[str, Any]) -> Dict[str, str]:
    """Convert CrossRef API response to BibTeX fields."""
    fields = {}
    # Title
    if "title" in cr and cr["title"]:
        fields["title"] = cr["title"][0] if isinstance(cr["title"], list) else cr["title"]
    # Authors
    if "author" in cr and cr["author"]:
        authors = []
        for a in cr["author"]:
            if "family" in a:
                name = a["family"]
                if "given" in a:
                    name = f"{a['family']}, {a['given']}"
                authors.append(name)
        if authors:
            fields["author"] = " and ".join(authors)
    # Year - try multiple date fields
    year = None
    for date_field in ["published-print", "published-online", "issued", "created"]:
        if date_field in cr and cr[date_field]:
            date_parts = cr[date_field].get("date-parts", [[]])
            if date_parts and date_parts[0]:
                year = str(date_parts[0][0])
                break
    if year:
        fields["year"] = year
    # Journal
    if "container-title" in cr and cr["container-title"]:
        fields["journal"] = cr["container-title"][0] if isinstance(cr["container-title"], list) else cr["container-title"]
    # Volume
    if "volume" in cr:
        fields["volume"] = str(cr["volume"])
    # Issue/Number
    if "issue" in cr:
        fields["number"] = str(cr["issue"])
    # Pages
    if "page" in cr:
        pages = cr["page"]
        # Normalize various dash types to standard BibTeX double-dash
        pages = pages.replace('–', '--')  # en-dash
        pages = pages.replace('—', '--')  # em-dash
        pages = pages.replace('−', '-')   # minus sign
        pages = pages.replace("-", "--")  # regular hyphen
        # Avoid quadruple dashes from double-replacement
        while '----' in pages:
            pages = pages.replace('----', '--')
        fields["pages"] = pages
    # DOI
    if "DOI" in cr:
        fields["doi"] = cr["DOI"]
    # Publisher
    if "publisher" in cr:
        fields["publisher"] = cr["publisher"]
    # Determine entry type
    cr_type = cr.get("type", "")
    if cr_type == "journal-article":
        fields["_type"] = "article"
    elif cr_type in ["book", "monograph"]:
        fields["_type"] = "book"
    elif cr_type == "proceedings-article":
        fields["_type"] = "inproceedings"
    elif cr_type == "book-chapter":
        fields["_type"] = "incollection"
    else:
        fields["_type"] = "article" if fields.get("journal") else "misc"
    return fields


def merge_metadata(grobid: Dict[str, str], crossref: Dict[str, str], filename: str) -> Dict[str, str]:
    """Merge Grobid and CrossRef metadata, preferring CrossRef for most fields."""
    merged = {}
    # CrossRef is more reliable for structured data;
    # Grobid might have better title extraction in some cases
    crossref_preferred = ["year", "volume", "number", "pages", "journal", "publisher", "doi"]
    # Start with Grobid data
    for key, value in grobid.items():
        if value:
            merged[key] = value
    # Override/add from CrossRef
    for key, value in crossref.items():
        if value:
            if key in crossref_preferred or key not in merged:
                merged[key] = value
    # Use entry type from CrossRef if available
    if "_type" in crossref:
        merged["_type"] = crossref["_type"]
    # Fallback: extract year from filename if still missing
    if "year" not in merged or not merged["year"]:
        year_match = re.search(r'(\d{4})', filename)
        if year_match:
            merged["year"] = year_match.group(1)
    return merged


def generate_bibtex_key(filename: str, metadata: Dict[str, str], used_keys: set) -> str:
    """
    Generate a unique BibTeX key in AuthorYYYY format.

    Disambiguation strategy:
    1. AuthorYYYY
    2. AuthorSecondAuthorYYYY (if duplicate and second author exists)
    3. AuthorYYYYa, AuthorYYYYb, etc. (if still duplicate)
    """
    filename_stem = Path(filename).stem
    first_author = None
    second_author_from_filename = None
    year = None
    # Pattern 1: Standard AuthorYYYY or Author-AuthorYYYY
    match = re.match(r"^([A-Za-z-]+)(\d{4})", filename_stem)
    if match:
        first_author = match.group(1).replace("-", "")
        year = match.group(2)
    # Pattern 2: Author&Author2YYYY (e.g., Feyaerts&Henriksen2021)
    if not first_author:
        match = re.match(r"^([A-Za-z-]+)&([A-Za-z-]+)(\d{4})", filename_stem)
        if match:
            first_author = match.group(1).replace("-", "")
            second_author_from_filename = match.group(2).replace("-", "")
            year = match.group(3)
    # Pattern 3: "Some Name With Spaces YYYY" (e.g., "Cross-Disorder Group... 2019")
    if not first_author:
        match = re.match(r"^(.+?)\s+(\d{4})", filename_stem)
        if match:
            name_part = match.group(1)
            year = match.group(2)
            first_word_match = re.match(r"^([A-Za-z-]+)", name_part)
            if first_word_match:
                first_author = first_word_match.group(1).replace("-", "")
    # Fall back to metadata if filename didn't match
    if not first_author and "author" in metadata:
        authors_str = metadata["author"]
        first_author_full = re.split(r'\s+and\s+', authors_str)[0]
        if "," in first_author_full:
            first_author = first_author_full.split(",")[0].strip()
        else:
            parts = first_author_full.split()
            first_author = parts[0] if parts else ""
        first_author = re.sub(r"[^A-Za-z]", "", first_author)
    if not year:
        year = metadata.get("year", "XXXX")
    if not first_author:
        first_author = re.sub(r"[^A-Za-z]", "", filename_stem) or "Unknown"
    # Try basic key first
    base_key = f"{first_author}{year}"
    if base_key not in used_keys:
        return base_key
    # Try with second author from filename if available (e.g., Feyaerts&Henriksen2021)
    if second_author_from_filename:
        two_author_key = f"{first_author}{second_author_from_filename}{year}"
        if two_author_key not in used_keys:
            return two_author_key
    # Try with second author from metadata
    if "author" in metadata:
        authors = re.split(r'\s+and\s+', metadata["author"])
        if len(authors) >= 2:
            second_author_full = authors[1]
            if "," in second_author_full:
                second_author = second_author_full.split(",")[0].strip()
            else:
                parts = second_author_full.split()
                second_author = parts[-1] if parts else ""  # Last name
            second_author = re.sub(r"[^A-Za-z]", "", second_author)
            if second_author:
                two_author_key = f"{first_author}{second_author}{year}"
                if two_author_key not in used_keys:
                    return two_author_key
    # Fall back to alphabetical suffix
    suffix = ord('a')
    while f"{base_key}{chr(suffix)}" in used_keys:
        suffix += 1
        if suffix > ord('z'):
            suffix = ord('a')
            base_key = f"{base_key}z"  # Extremely unlikely edge case
    return f"{base_key}{chr(suffix)}"


def escape_bibtex(text: str) -> str:
    """Escape special characters for BibTeX."""
    if not text:
        return ""
    # Only escape & (common in titles); leave already-escaped \& alone
    text = re.sub(r"(?<!\\)&", r"\\&", text)
    return text


def metadata_to_bibtex(key: str, metadata: Dict[str, str]) -> str:
    """Convert metadata dictionary to a BibTeX entry string."""
    entry_type = metadata.get("_type", "misc")
    lines = [f"@{entry_type}{{{key},"]
    # Order fields nicely
    field_order = ["author", "title", "journal", "booktitle", "year", "volume", "number", "pages", "publisher", "doi"]
    added_fields = set()
    for field in field_order:
        if field in metadata and metadata[field]:
            value = escape_bibtex(metadata[field])
            lines.append(f" {field} = {{{value}}},")
            added_fields.add(field)
    # Add any remaining fields
    for field, value in metadata.items():
        if field not in added_fields and not field.startswith("_") and value:
            value = escape_bibtex(value)
            lines.append(f" {field} = {{{value}}},")
    # Remove trailing comma from last field
    if lines[-1].endswith(","):
        lines[-1] = lines[-1][:-1]
    lines.append("}")
    return "\n".join(lines)


def process_single_pdf(pdf_path: Path, used_keys: set, use_crossref: bool = True,
                       extract_tei: bool = False) -> Optional[tuple]:
    """Process a single PDF and return (key, bibtex_entry, metadata, tei_xml)."""
    # Step 1: Get initial data from Grobid
    grobid_bibtex = process_pdf_with_grobid(pdf_path)
    if not grobid_bibtex:
        return None
    grobid_fields = parse_bibtex(grobid_bibtex)
    # Step 2: Try to get better metadata
    enriched_fields = {}
    if use_crossref:
        doi = grobid_fields.get("doi", "")
        title = grobid_fields.get("title", "")
        # Try DOI content negotiation first (direct BibTeX from doi.org)
        if doi:
            print(f" → Fetching BibTeX from DOI: {doi[:50]}...")
            doi_bibtex = query_doi_bibtex(doi)
            if doi_bibtex:
                enriched_fields = parse_bibtex(doi_bibtex)
                print(f" ✓ DOI: got BibTeX directly")
        # Fall back to CrossRef title search if no DOI or DOI lookup failed
        if not enriched_fields and title:
            print(f" → Querying CrossRef by title...")
            cr_data = query_crossref_by_title(title)
            if cr_data:
                enriched_fields = crossref_to_bibtex_fields(cr_data)
                print(f" ✓ CrossRef: found via title search")
    # Step 3: Merge metadata
    final_metadata = merge_metadata(grobid_fields, enriched_fields, pdf_path.name)
    # Step 4: Generate key (with disambiguation)
    key = generate_bibtex_key(pdf_path.name, final_metadata, used_keys)
    # Step 5: Generate BibTeX
    bibtex_entry = metadata_to_bibtex(key, final_metadata)
    # Step 6: Get full TEI if requested
    tei_xml = None
    if extract_tei:
        print(f" → Extracting full TEI...")
        tei_xml = process_pdf_fulltext_tei(pdf_path)
        if tei_xml:
            print(f" ✓ TEI extracted")
        else:
            print(f" ⚠ TEI extraction failed")
    return key, bibtex_entry, final_metadata, tei_xml


def generate_new_filename(key: str, metadata: Dict[str, str]) -> str:
    """Generate a new filename based on the BibTeX key."""
    # Use the key directly as the filename (it's already in AuthorYYYY format)
    return f"{key}.pdf"


def rename_pdf_file(old_path: Path, new_filename: str) -> Optional[Path]:
    """Rename a PDF file, handling conflicts."""
    new_path = old_path.parent / new_filename
    if new_path == old_path:
        return old_path  # No change needed
    if new_path.exists():
        print(f" ⚠ Cannot rename: {new_filename} already exists")
        return None
    try:
        old_path.rename(new_path)
        return new_path
    except OSError as e:
        print(f" ⚠ Rename failed: {e}")
        return None


def get_expected_key_from_filename(filename: str) -> Optional[str]:
    """Extract expected BibTeX key from filename if it matches AuthorYYYY pattern."""
    filename_stem = Path(filename).stem
    # Pattern 1: Standard AuthorYYYY or Author-Author2YYYY (e.g., Chesney2014, Aston-Jones2005)
    match = re.match(r"^([A-Za-z-]+)(\d{4})([a-z])?$", filename_stem)
    if match:
        author = match.group(1).replace("-", "")
        year = match.group(2)
        suffix = match.group(3) or ""
        return f"{author}{year}{suffix}"
    # Pattern 2: Author&Author2YYYY (e.g., Feyaerts&Henriksen2021)
    match = re.match(r"^([A-Za-z-]+)&([A-Za-z-]+)(\d{4})([a-z])?$", filename_stem)
    if match:
        author1 = match.group(1).replace("-", "")
        author2 = match.group(2).replace("-", "")
        year = match.group(3)
        suffix = match.group(4) or ""
        return f"{author1}{author2}{year}{suffix}"
    # Pattern 3: "Some Name With Spaces YYYY" (e.g., "Cross-Disorder Group... 2019")
    match = re.match(r"^(.+?)\s+(\d{4})([a-z])?$", filename_stem)
    if match:
        name_part = match.group(1)
        year = match.group(2)
        suffix = match.group(3) or ""
        # Extract first word (or hyphenated word) as the key base
        first_word_match = re.match(r"^([A-Za-z-]+)", name_part)
        if first_word_match:
            author = first_word_match.group(1).replace("-", "")
            return f"{author}{year}{suffix}"
    return None


def process_directory(input_dir: Path, output_file: Path, use_crossref: bool = True,
                      rename_files: bool = False, extract_tei: bool = False) -> None:
    """Process all PDFs in a directory and generate a BibTeX file."""
    pdf_files = sorted(input_dir.glob("*.pdf"))
    if not pdf_files:
        print(f"No PDF files found in {input_dir}")
        return
    print(f"Found {len(pdf_files)} PDF files")
    if use_crossref:
        print("CrossRef enrichment: enabled")
    if rename_files:
        print("File renaming: enabled")
    if extract_tei:
        print("TEI extraction: enabled")
    # Load existing BibTeX entries
    existing_keys, existing_content = load_existing_bibtex(output_file)
    if existing_keys:
        print(f"Found {len(existing_keys)} existing entries in {output_file}")
    # Track all used keys (existing + new)
    used_keys = existing_keys.copy()
    # Determine which files to skip
    files_to_process = []
    for pdf_path in pdf_files:
        expected_key = get_expected_key_from_filename(pdf_path.name)
        if expected_key and expected_key in existing_keys:
            print(f" Skipping (already in bib): {pdf_path.name}")
        else:
            files_to_process.append(pdf_path)
    if not files_to_process:
        print("\nNo new files to process.")
        return
    print(f"\nProcessing {len(files_to_process)} new files...")
    if not start_grobid_container():
        print("Failed to start Grobid. Exiting.")
        sys.exit(1)
    new_entries = []
    stats = {"success": 0, "crossref_enriched": 0, "failed": 0, "renamed": 0, "tei_saved": 0}
    for i, pdf_path in enumerate(files_to_process, 1):
        print(f"\n[{i}/{len(files_to_process)}] Processing: {pdf_path.name}")
        result = process_single_pdf(pdf_path, used_keys, use_crossref=use_crossref,
                                    extract_tei=extract_tei)
        if not result:
            print(f" ✗ Failed to process")
            stats["failed"] += 1
            continue
        key, bibtex_entry, metadata, tei_xml = result
        # Add key to used set
        used_keys.add(key)
        new_entries.append(bibtex_entry)
        stats["success"] += 1
        # Check if CrossRef enriched
        if metadata.get("volume") or metadata.get("pages") or metadata.get("number"):
            stats["crossref_enriched"] += 1
        # Display result
        print(f" ✓ Key: {key}")
        if metadata.get("title"):
            title = metadata["title"]
            print(f" Title: {title[:60]}{'...' if len(title) > 60 else ''}")
        if metadata.get("author"):
            authors = metadata["author"].split(" and ")
            author_str = ", ".join(authors[:2]) + (" et al." if len(authors) > 2 else "")
            print(f" Authors: {author_str}")
        if metadata.get("year"):
            extra = []
            if metadata.get("journal"):
                extra.append(metadata["journal"][:30])
            if metadata.get("volume"):
                vol = metadata["volume"]
                if metadata.get("number"):
                    vol += f"({metadata['number']})"
                extra.append(vol)
            if metadata.get("pages"):
                extra.append(f"pp. {metadata['pages']}")
            print(f" Year: {metadata['year']}" + (f" | {', '.join(extra)}" if extra else ""))
        # Save TEI XML if extracted
        if tei_xml:
            tei_filename = f"{key}.tei.xml"
            tei_path = pdf_path.parent / tei_filename
            try:
                tei_path.write_text(tei_xml, encoding="utf-8")
                print(f" TEI saved: {tei_filename}")
                stats["tei_saved"] += 1
            except OSError as e:
                print(f" ⚠ Failed to save TEI: {e}")
        # Rename file if requested
        if rename_files:
            new_filename = generate_new_filename(key, metadata)
            if new_filename != pdf_path.name:
                new_path = rename_pdf_file(pdf_path, new_filename)
                if new_path:
                    print(f" Renamed: {pdf_path.name} → {new_filename}")
                    stats["renamed"] += 1
        # Be nice to CrossRef API
        if use_crossref:
            time.sleep(0.1)
    # Write BibTeX file (append new entries or create new); skip if nothing succeeded
    if new_entries:
        if existing_content:
            # Append to existing file
            with open(output_file, "a", encoding="utf-8") as f:
                f.write("\n\n")
                f.write(f"% Added {len(new_entries)} entries on {time.strftime('%Y-%m-%d %H:%M:%S')}\n\n")
                f.write("\n\n".join(new_entries))
                f.write("\n")
        else:
            # Create new file
            with open(output_file, "w", encoding="utf-8") as f:
                f.write("% BibTeX file generated by pdf_to_bibtex.py\n")
                f.write("% Using Grobid + DOI content negotiation for metadata enrichment\n")
                f.write(f"% Generated from: {input_dir}\n")
                f.write(f"% Date: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")
                f.write(f"% Total entries: {len(new_entries)}\n\n")
                f.write("\n\n".join(new_entries))
                f.write("\n")
    print(f"\n{'='*60}")
    print(f"✓ Successfully updated {output_file}")
    print(f" New entries added: {stats['success']}")
    print(f" CrossRef enriched: {stats['crossref_enriched']}")
    print(f" Total entries in file: {len(existing_keys) + stats['success']}")
    if extract_tei:
        print(f" TEI files saved: {stats['tei_saved']}")
    if rename_files:
        print(f" Files renamed: {stats['renamed']}")
    if stats["failed"]:
        print(f" Failed: {stats['failed']}")


def main():
    parser = argparse.ArgumentParser(
        description="Extract bibliographic metadata from PDFs using Grobid + DOI lookup"
    )
    parser.add_argument(
        "input_dir",
        type=Path,
        help="Directory containing PDF files"
    )
    parser.add_argument(
        "output",
        type=Path,
        nargs="?",
        default=Path("references.bib"),
        help="Output BibTeX file (default: references.bib)"
    )
    parser.add_argument(
        "--no-crossref",
        action="store_true",
        help="Skip DOI lookup and CrossRef title search (use only Grobid data)"
    )
    parser.add_argument(
        "--rename-files",
        action="store_true",
        help="Rename PDF files to AuthorYYYY.pdf format based on extracted metadata"
    )
    parser.add_argument(
        "--extract-tei",
        action="store_true",
        help="Extract full TEI XML and save as AuthorYYYY.tei.xml alongside PDFs"
    )
    parser.add_argument(
        "--stop-grobid",
        action="store_true",
        help="Stop the Grobid container after processing"
    )
    args = parser.parse_args()
    if not args.input_dir.exists():
        print(f"Error: Input directory '{args.input_dir}' does not exist")
        sys.exit(1)
    if not args.input_dir.is_dir():
        print(f"Error: '{args.input_dir}' is not a directory")
        sys.exit(1)
    process_directory(
        args.input_dir,
        args.output,
        use_crossref=not args.no_crossref,
        rename_files=args.rename_files,
        extract_tei=args.extract_tei
    )
    if args.stop_grobid:
        print("\nStopping Grobid container...")
        subprocess.run(["docker", "stop", GROBID_CONTAINER_NAME], check=False)
        print("✓ Grobid container stopped")


if __name__ == "__main__":
    main()