pdf_to_bibtex.py - Academic PDF to BibTeX Converter

#!/usr/bin/env python3
"""
pdf_to_bibtex.py - Academic PDF to BibTeX Converter
====================================================

Automatically extract bibliographic metadata from academic PDFs using Grobid
and CrossRef, generating a clean BibTeX file with consistent AuthorYYYY citation keys.

INPUT FORMAT
------------
Expects a directory of PDFs named in AuthorYYYY format (or close variations):

    Literature/
    ├── Adams2013.pdf                    → Adams2013
    ├── Aston-Jones2005.pdf              → AstonJones2005
    ├── Feyaerts&Henriksen2021.pdf       → FeyaertsHenriksen2021
    └── Cross-Disorder Group... 2019.pdf → CrossDisorder2019

The year must be a 4-digit number at the end (before .pdf). The script uses
the filename to generate citation keys and to skip already-processed files.

FEATURES
--------
• Extracts metadata from PDFs using Grobid (runs via Docker)
• Enriches via DOI content negotiation (direct BibTeX from doi.org)
• Falls back to CrossRef API title search when no DOI available
• Generates BibTeX with AuthorYYYY keys (e.g., Chesney2014, AstonJones2005)
• Smart key disambiguation: AuthorYYYY → AuthorSecondAuthorYYYY → AuthorYYYYa
• Incremental processing: skips PDFs already in the .bib file
• Optional: rename PDFs to match their citation keys
• Optional: extract full TEI XML for each document (structured full-text)

REQUIREMENTS
------------
• Docker (for Grobid)
• Python 3.7+
• requests library: pip install requests

Grobid is pulled automatically on first run (~2GB Docker image).

USAGE
-----
Basic - process all PDFs and create references.bib:

    python pdf_to_bibtex.py /path/to/papers/

Specify output file:

    python pdf_to_bibtex.py /path/to/papers/ my_references.bib

With all options:

    python pdf_to_bibtex.py /path/to/papers/ refs.bib --rename-files --extract-tei

OPTIONS
-------
--no-crossref    Skip DOI lookup and CrossRef title search (use only Grobid data)
--rename-files   Rename PDFs to AuthorYYYY.pdf based on extracted metadata
--extract-tei    Save full TEI XML as AuthorYYYY.tei.xml (Grobid's structured output)
--stop-grobid    Stop the Grobid Docker container after processing

EXAMPLE OUTPUT
--------------
$ python pdf_to_bibtex.py Literature/

Found 69 PDF files
CrossRef enrichment: enabled
Found 42 existing entries in references.bib
  Skipping (already in bib): Adams2013.pdf
  Skipping (already in bib): Bastos2012.pdf

Processing 27 new files...
✓ Grobid container 'grobid-pdf-extractor' is already running

[1/27] Processing: NewPaper2024.pdf
  → Fetching BibTeX from DOI: 10.1038/s41586-024-07051-0...
  ✓ DOI: got BibTeX directly
  ✓ Key: Smith2024
    Title: A new theory of consciousness
    Authors: Smith, John, Doe, Jane et al.
    Year: 2024 | Nature, 625(7994), pp. 112--118

============================================================
✓ Successfully updated references.bib
  New entries added: 27
  CrossRef enriched: 25
  Total entries in file: 69

GENERATED BIBTEX FORMAT
-----------------------
@article{Smith2024,
  author = {Smith, John and Doe, Jane and Johnson, Bob},
  title = {A new theory of consciousness},
  journal = {Nature},
  year = {2024},
  volume = {625},
  number = {7994},
  pages = {112--118},
  doi = {10.1038/s41586-024-07051-0}
}

HOW IT WORKS
------------
1. Grobid extracts title, authors, DOI from PDF header/first page
2. If DOI found → fetch BibTeX directly from doi.org (content negotiation)
3. If no DOI → search CrossRef by title, use best match
4. Merge Grobid + enriched data (DOI/CrossRef preferred for year, volume, pages, etc.)
5. Generate citation key from filename (if AuthorYYYY pattern) or from metadata
6. Append new entries to existing .bib file (won't duplicate)
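
For reference, steps 2 and 3 correspond roughly to the following requests
(illustrative only; <DOI> and <TITLE> are placeholders, not literal values):

    curl -LH "Accept: application/x-bibtex" "https://doi.org/<DOI>"
    curl "https://api.crossref.org/works?query.title=<TITLE>&rows=1"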

FILE NAMING CONVENTION
----------------------
The script handles various filename formats:

• AuthorYYYY.pdf → AuthorYYYY (e.g., Chesney2014.pdf → Chesney2014)
• Author-AuthorYYYY.pdf → AuthorAuthorYYYY (e.g., Aston-Jones2005.pdf → AstonJones2005)
• Author&AuthorYYYY.pdf → AuthorAuthorYYYY (e.g., Feyaerts&Henriksen2021.pdf → FeyaertsHenriksen2021)
• "Name With Spaces YYYY.pdf" → FirstWord+YYYY (e.g., "Cross-Disorder Group... 2019.pdf" → CrossDisorder2019)

The script will:
• Use the filename to derive the citation key
• Skip files whose key already exists in the .bib
• With --rename-files, rename PDFs to clean AuthorYYYY.pdf format

For papers with the same first author and year, disambiguation is automatic:
• Smith2024.pdf       → Smith2024
• Smith2024-other.pdf → SmithJones2024 (uses second author from filename or metadata)
• Smith2024-third.pdf → Smith2024a (alphabetical suffix)

TEI XML OUTPUT (--extract-tei)
------------------------------
Grobid can produce structured TEI XML with:
• Full text segmented into sections
• Parsed references with links
• Figures and tables identified
• Author affiliations and emails

Useful for text mining, citation analysis, or building a local search index.

NOTES
-----
• Grobid container keeps running after the script (for faster subsequent runs)
• Use --stop-grobid to stop it when done
• DOI lookup uses content negotiation (doi.org) which is fast and reliable
• CrossRef is only used as fallback for title search when no DOI is available
• Some PDFs (scans, unusual layouts) may yield incomplete metadata

AUTHOR
------
Generated with Claude. Feel free to modify and redistribute.

LICENSE
-------
MIT License - do whatever you want with it.
"""

import os
import sys
import re
import time
import subprocess
import argparse
from pathlib import Path
from typing import Optional, Dict, Any, Tuple

import requests

# Grobid configuration
GROBID_IMAGE = "lfoppiano/grobid:0.8.1"
GROBID_CONTAINER_NAME = "grobid-pdf-extractor"
GROBID_PORT = 8070
GROBID_URL = f"http://localhost:{GROBID_PORT}"

# CrossRef configuration (used for title search fallback)
CROSSREF_API = "https://api.crossref.org"
# Be polite - identify ourselves (CrossRef asks for this)
CROSSREF_HEADERS = {
    "User-Agent": "pdf-to-bibtex/1.0 (https://github.com/user/pdf-to-bibtex; mailto:user@example.com)"
}


def load_existing_bibtex(filepath: Path) -> Tuple[set, str]:
    """Load existing BibTeX file and return set of keys and the content."""
    if not filepath.exists():
        return set(), ""
    content = filepath.read_text(encoding="utf-8")
    # Extract all keys from @type{key, patterns
    keys = set(re.findall(r'@\w+\s*\{\s*([^,\s]+)\s*,', content))
    return keys, content
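
# Illustration (not part of the pipeline): a file containing '@article{Smith2024,'
# yields the key set {'Smith2024'}; the regex above also tolerates spacing such as
# '@article { Smith2024 ,'.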


def start_grobid_container() -> bool:
    """Start the Grobid Docker container if not already running."""
    result = subprocess.run(
        ["docker", "ps", "-q", "-f", f"name={GROBID_CONTAINER_NAME}"],
        capture_output=True,
        text=True
    )
    if result.stdout.strip():
        print(f"✓ Grobid container '{GROBID_CONTAINER_NAME}' is already running")
        return True

    result = subprocess.run(
        ["docker", "ps", "-aq", "-f", f"name={GROBID_CONTAINER_NAME}"],
        capture_output=True,
        text=True
    )
    if result.stdout.strip():
        print(f"Starting existing Grobid container '{GROBID_CONTAINER_NAME}'...")
        subprocess.run(["docker", "start", GROBID_CONTAINER_NAME], check=True)
    else:
        print(f"Pulling and starting Grobid container ({GROBID_IMAGE})...")
        subprocess.run([
            "docker", "run", "-d",
            "--name", GROBID_CONTAINER_NAME,
            "-p", f"{GROBID_PORT}:8070",
            GROBID_IMAGE
        ], check=True)

    print("Waiting for Grobid to initialize (this may take a minute)...")
    max_attempts = 60
    for attempt in range(max_attempts):
        try:
            response = requests.get(f"{GROBID_URL}/api/isalive", timeout=5)
            if response.status_code == 200:
                print("✓ Grobid is ready")
                return True
        except requests.exceptions.RequestException:
            pass
        time.sleep(2)
        if (attempt + 1) % 10 == 0:
            print(f"  Still waiting... ({attempt + 1}/{max_attempts})")

    print("✗ Grobid failed to start in time")
    return False
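
# For reference, the startup above is equivalent to running (illustrative):
#   docker run -d --name grobid-pdf-extractor -p 8070:8070 lfoppiano/grobid:0.8.1
# and then polling http://localhost:8070/api/isalive until it responds with HTTP 200.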


def process_pdf_with_grobid(pdf_path: Path) -> Optional[str]:
    """Send a PDF to Grobid and get BibTeX response."""
    url = f"{GROBID_URL}/api/processHeaderDocument"
    # Request BibTeX explicitly via content negotiation (Grobid's default output is TEI XML)
    headers = {"Accept": "application/x-bibtex"}
    try:
        with open(pdf_path, "rb") as pdf_file:
            files = {"input": (pdf_path.name, pdf_file, "application/pdf")}
            response = requests.post(url, files=files, headers=headers, timeout=120)
            if response.status_code == 200:
                return response.text
            else:
                print(f"  Warning: Grobid returned status {response.status_code}")
                return None
    except Exception as e:
        print(f"  Error processing with Grobid: {e}")
        return None
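
# Equivalent manual request (illustrative, with paper.pdf as a placeholder file):
#   curl -H "Accept: application/x-bibtex" -F input=@paper.pdf \
#        http://localhost:8070/api/processHeaderDocument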


def process_pdf_fulltext_tei(pdf_path: Path) -> Optional[str]:
    """Send a PDF to Grobid and get full TEI XML response."""
    url = f"{GROBID_URL}/api/processFulltextDocument"
    try:
        with open(pdf_path, "rb") as pdf_file:
            files = {"input": (pdf_path.name, pdf_file, "application/pdf")}
            response = requests.post(url, files=files, timeout=300)  # Longer timeout for full processing
            if response.status_code == 200:
                return response.text
            else:
                print(f"  Warning: Grobid TEI returned status {response.status_code}")
                return None
    except Exception as e:
        print(f"  Error getting TEI from Grobid: {e}")
        return None


def parse_bibtex(bibtex_str: str) -> Dict[str, str]:
    """Parse a BibTeX entry into a dictionary."""
    fields = {}
    # Extract entry type
    type_match = re.match(r'@(\w+)\s*\{', bibtex_str)
    if type_match:
        fields['_type'] = type_match.group(1).lower()
    # Extract fields - handle nested braces and quoted values
    for match in re.finditer(r'(\w+)\s*=\s*[{"]((?:[^{}"]|(?:\{[^{}]*\}))*)[}"]', bibtex_str):
        key = match.group(1).lower()
        value = match.group(2).strip()
        if value:
            fields[key] = value
    return fields
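
# Illustration: parse_bibtex('@article{x, title = {A title}, year = {2024}}')
# returns {'_type': 'article', 'title': 'A title', 'year': '2024'};
# the citation key itself ('x') is intentionally not extracted here.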


def query_doi_bibtex(doi: str) -> Optional[str]:
    """Query DOI directly for BibTeX using content negotiation."""
    try:
        # Clean DOI
        doi = doi.strip()
        if doi.startswith("http"):
            doi = re.sub(r'https?://(dx\.)?doi\.org/', '', doi)
        url = f"https://doi.org/{doi}"
        headers = {"Accept": "application/x-bibtex"}
        response = requests.get(url, headers=headers, timeout=30, allow_redirects=True)
        if response.status_code == 200 and response.text.strip().startswith("@"):
            # Ensure proper UTF-8 decoding
            response.encoding = 'utf-8'
            text = response.text
            # Normalize various dash types to standard BibTeX double-dash
            text = text.replace('–', '--')  # en-dash
            text = text.replace('—', '--')  # em-dash
            text = text.replace('−', '-')   # minus sign
            return text
        return None
    except Exception as e:
        print(f"  DOI lookup failed: {e}")
        return None


def query_crossref_by_title(title: str) -> Optional[Dict[str, Any]]:
    """Query CrossRef API by title search."""
    try:
        # Clean title
        title = re.sub(r'[^\w\s]', ' ', title)
        title = ' '.join(title.split()[:15])  # First 15 words
        url = f"{CROSSREF_API}/works"
        params = {
            "query.title": title,
            "rows": 1,
            "select": "DOI,title,author,published-print,published-online,container-title,volume,issue,page,publisher,type"
        }
        response = requests.get(url, params=params, headers=CROSSREF_HEADERS, timeout=30)
        if response.status_code == 200:
            items = response.json().get("message", {}).get("items", [])
            if items:
                return items[0]
        return None
    except Exception as e:
        print(f"  CrossRef title search failed: {e}")
        return None
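
# Illustration: the request built above is roughly
#   GET https://api.crossref.org/works?query.title=<cleaned+title>&rows=1&select=DOI,title,...
# and the first (best-scoring) item of message.items is returned, or None if nothing matched.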


def crossref_to_bibtex_fields(cr: Dict[str, Any]) -> Dict[str, str]:
    """Convert CrossRef API response to BibTeX fields."""
    fields = {}

    # Title
    if "title" in cr and cr["title"]:
        fields["title"] = cr["title"][0] if isinstance(cr["title"], list) else cr["title"]

    # Authors
    if "author" in cr and cr["author"]:
        authors = []
        for a in cr["author"]:
            if "family" in a:
                name = a["family"]
                if "given" in a:
                    name = f"{a['family']}, {a['given']}"
                authors.append(name)
        if authors:
            fields["author"] = " and ".join(authors)

    # Year - try multiple date fields
    year = None
    for date_field in ["published-print", "published-online", "issued", "created"]:
        if date_field in cr and cr[date_field]:
            date_parts = cr[date_field].get("date-parts", [[]])
            if date_parts and date_parts[0]:
                year = str(date_parts[0][0])
                break
    if year:
        fields["year"] = year

    # Journal
    if "container-title" in cr and cr["container-title"]:
        fields["journal"] = cr["container-title"][0] if isinstance(cr["container-title"], list) else cr["container-title"]

    # Volume
    if "volume" in cr:
        fields["volume"] = str(cr["volume"])

    # Issue/Number
    if "issue" in cr:
        fields["number"] = str(cr["issue"])

    # Pages
    if "page" in cr:
        pages = cr["page"]
        # Normalize various dash types to standard BibTeX double-dash
        pages = pages.replace('–', '--')  # en-dash
        pages = pages.replace('—', '--')  # em-dash
        pages = pages.replace('−', '-')   # minus sign
        pages = pages.replace("-", "--")  # regular hyphen
        # Avoid quadruple dashes from double-replacement
        while '----' in pages:
            pages = pages.replace('----', '--')
        fields["pages"] = pages

    # DOI
    if "DOI" in cr:
        fields["doi"] = cr["DOI"]

    # Publisher
    if "publisher" in cr:
        fields["publisher"] = cr["publisher"]

    # Determine entry type
    cr_type = cr.get("type", "")
    if cr_type == "journal-article":
        fields["_type"] = "article"
    elif cr_type in ["book", "monograph"]:
        fields["_type"] = "book"
    elif cr_type == "proceedings-article":
        fields["_type"] = "inproceedings"
    elif cr_type == "book-chapter":
        fields["_type"] = "incollection"
    else:
        fields["_type"] = "article" if fields.get("journal") else "misc"

    return fields
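
# Illustration with a minimal (hypothetical) CrossRef item:
#   {'title': ['A new theory of consciousness'],
#    'author': [{'family': 'Smith', 'given': 'John'}],
#    'issued': {'date-parts': [[2024]]}, 'type': 'journal-article', 'DOI': '10.xxxx/example'}
# maps to:
#   {'title': 'A new theory of consciousness', 'author': 'Smith, John',
#    'year': '2024', 'doi': '10.xxxx/example', '_type': 'article'}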


def merge_metadata(grobid: Dict[str, str], crossref: Dict[str, str], filename: str) -> Dict[str, str]:
    """Merge Grobid and CrossRef metadata, preferring CrossRef for most fields."""
    merged = {}
    # CrossRef is more reliable for structured data
    crossref_preferred = ["year", "volume", "number", "pages", "journal", "publisher", "doi"]
    # Grobid might have better title extraction in some cases
    grobid_preferred = []

    # Start with Grobid data
    for key, value in grobid.items():
        if value:
            merged[key] = value

    # Override/add from CrossRef
    for key, value in crossref.items():
        if value:
            if key in crossref_preferred or key not in merged:
                merged[key] = value

    # Use entry type from CrossRef if available
    if "_type" in crossref:
        merged["_type"] = crossref["_type"]

    # Fallback: extract year from filename if still missing
    if "year" not in merged or not merged["year"]:
        year_match = re.search(r'(\d{4})', filename)
        if year_match:
            merged["year"] = year_match.group(1)

    return merged


def generate_bibtex_key(filename: str, metadata: Dict[str, str], used_keys: set) -> str:
    """
    Generate a unique BibTeX key in AuthorYYYY format.

    Disambiguation strategy:
    1. AuthorYYYY
    2. AuthorSecondAuthorYYYY (if duplicate and second author exists)
    3. AuthorYYYYa, AuthorYYYYb, etc. (if still duplicate)
    """
    filename_stem = Path(filename).stem
    first_author = None
    second_author_from_filename = None
    year = None

    # Pattern 1: Standard AuthorYYYY or Author-AuthorYYYY
    match = re.match(r"^([A-Za-z-]+)(\d{4})", filename_stem)
    if match:
        first_author = match.group(1).replace("-", "")
        year = match.group(2)

    # Pattern 2: Author&Author2YYYY (e.g., Feyaerts&Henriksen2021)
    if not first_author:
        match = re.match(r"^([A-Za-z-]+)&([A-Za-z-]+)(\d{4})", filename_stem)
        if match:
            first_author = match.group(1).replace("-", "")
            second_author_from_filename = match.group(2).replace("-", "")
            year = match.group(3)

    # Pattern 3: "Some Name With Spaces YYYY" (e.g., "Cross-Disorder Group... 2019")
    if not first_author:
        match = re.match(r"^(.+?)\s+(\d{4})", filename_stem)
        if match:
            name_part = match.group(1)
            year = match.group(2)
            first_word_match = re.match(r"^([A-Za-z-]+)", name_part)
            if first_word_match:
                first_author = first_word_match.group(1).replace("-", "")

    # Fall back to metadata if filename didn't match
    if not first_author and "author" in metadata:
        authors_str = metadata["author"]
        first_author_full = re.split(r'\s+and\s+', authors_str)[0]
        if "," in first_author_full:
            first_author = first_author_full.split(",")[0].strip()
        else:
            parts = first_author_full.split()
            first_author = parts[0] if parts else ""
        first_author = re.sub(r"[^A-Za-z]", "", first_author)

    if not year:
        year = metadata.get("year", "XXXX")
    if not first_author:
        first_author = re.sub(r"[^A-Za-z]", "", filename_stem) or "Unknown"

    # Try basic key first
    base_key = f"{first_author}{year}"
    if base_key not in used_keys:
        return base_key

    # Try with second author from filename if available (e.g., Feyaerts&Henriksen2021)
    if second_author_from_filename:
        two_author_key = f"{first_author}{second_author_from_filename}{year}"
        if two_author_key not in used_keys:
            return two_author_key

    # Try with second author from metadata
    if "author" in metadata:
        authors = re.split(r'\s+and\s+', metadata["author"])
        if len(authors) >= 2:
            second_author_full = authors[1]
            if "," in second_author_full:
                second_author = second_author_full.split(",")[0].strip()
            else:
                parts = second_author_full.split()
                second_author = parts[-1] if parts else ""  # Last name
            second_author = re.sub(r"[^A-Za-z]", "", second_author)
            if second_author:
                two_author_key = f"{first_author}{second_author}{year}"
                if two_author_key not in used_keys:
                    return two_author_key

    # Fall back to alphabetical suffix
    suffix = ord('a')
    while f"{base_key}{chr(suffix)}" in used_keys:
        suffix += 1
        if suffix > ord('z'):
            suffix = ord('a')
            base_key = f"{base_key}z"  # Extremely unlikely edge case
    return f"{base_key}{chr(suffix)}"


def escape_bibtex(text: str) -> str:
    """Escape special characters for BibTeX."""
    if not text:
        return ""
    # Only escape & which is common in titles
    text = text.replace("&", "\\&")
    return text


def metadata_to_bibtex(key: str, metadata: Dict[str, str]) -> str:
    """Convert metadata dictionary to a BibTeX entry string."""
    entry_type = metadata.get("_type", "misc")
    lines = [f"@{entry_type}{{{key},"]

    # Order fields nicely
    field_order = ["author", "title", "journal", "booktitle", "year", "volume", "number", "pages", "publisher", "doi"]
    added_fields = set()
    for field in field_order:
        if field in metadata and metadata[field]:
            value = escape_bibtex(metadata[field])
            lines.append(f"  {field} = {{{value}}},")
            added_fields.add(field)

    # Add any remaining fields
    for field, value in metadata.items():
        if field not in added_fields and not field.startswith("_") and value:
            value = escape_bibtex(value)
            lines.append(f"  {field} = {{{value}}},")

    # Remove trailing comma from last field
    if lines[-1].endswith(","):
        lines[-1] = lines[-1][:-1]
    lines.append("}")
    return "\n".join(lines)


def process_single_pdf(pdf_path: Path, used_keys: set, use_crossref: bool = True,
                       extract_tei: bool = False) -> Optional[tuple]:
    """Process a single PDF and return (key, bibtex_entry, metadata, tei_xml)."""
    # Step 1: Get initial data from Grobid
    grobid_bibtex = process_pdf_with_grobid(pdf_path)
    if not grobid_bibtex:
        return None
    grobid_fields = parse_bibtex(grobid_bibtex)

    # Step 2: Try to get better metadata
    enriched_fields = {}
    if use_crossref:
        doi = grobid_fields.get("doi", "")
        title = grobid_fields.get("title", "")

        # Try DOI content negotiation first (direct BibTeX from doi.org)
        if doi:
            print(f"  → Fetching BibTeX from DOI: {doi[:50]}...")
            doi_bibtex = query_doi_bibtex(doi)
            if doi_bibtex:
                enriched_fields = parse_bibtex(doi_bibtex)
                print(f"  ✓ DOI: got BibTeX directly")

        # Fall back to CrossRef title search if no DOI or DOI lookup failed
        if not enriched_fields and title:
            print(f"  → Querying CrossRef by title...")
            cr_data = query_crossref_by_title(title)
            if cr_data:
                enriched_fields = crossref_to_bibtex_fields(cr_data)
                print(f"  ✓ CrossRef: found via title search")

    # Step 3: Merge metadata
    final_metadata = merge_metadata(grobid_fields, enriched_fields, pdf_path.name)

    # Step 4: Generate key (with disambiguation)
    key = generate_bibtex_key(pdf_path.name, final_metadata, used_keys)

    # Step 5: Generate BibTeX
    bibtex_entry = metadata_to_bibtex(key, final_metadata)

    # Step 6: Get full TEI if requested
    tei_xml = None
    if extract_tei:
        print(f"  → Extracting full TEI...")
        tei_xml = process_pdf_fulltext_tei(pdf_path)
        if tei_xml:
            print(f"  ✓ TEI extracted")
        else:
            print(f"  ⚠ TEI extraction failed")

    return key, bibtex_entry, final_metadata, tei_xml


def generate_new_filename(key: str, metadata: Dict[str, str]) -> str:
    """Generate a new filename based on the BibTeX key."""
    # Use the key directly as the filename (it's already in AuthorYYYY format)
    return f"{key}.pdf"


def rename_pdf_file(old_path: Path, new_filename: str) -> Optional[Path]:
    """Rename a PDF file, handling conflicts."""
    new_path = old_path.parent / new_filename
    if new_path == old_path:
        return old_path  # No change needed
    if new_path.exists():
        print(f"  ⚠ Cannot rename: {new_filename} already exists")
        return None
    try:
        old_path.rename(new_path)
        return new_path
    except OSError as e:
        print(f"  ⚠ Rename failed: {e}")
        return None


def get_expected_key_from_filename(filename: str) -> Optional[str]:
    """Extract expected BibTeX key from filename if it matches AuthorYYYY pattern."""
    filename_stem = Path(filename).stem

    # Pattern 1: Standard AuthorYYYY or Author-Author2YYYY (e.g., Chesney2014, Aston-Jones2005)
    match = re.match(r"^([A-Za-z-]+)(\d{4})([a-z])?$", filename_stem)
    if match:
        author = match.group(1).replace("-", "")
        year = match.group(2)
        suffix = match.group(3) or ""
        return f"{author}{year}{suffix}"

    # Pattern 2: Author&Author2YYYY (e.g., Feyaerts&Henriksen2021)
    match = re.match(r"^([A-Za-z-]+)&([A-Za-z-]+)(\d{4})([a-z])?$", filename_stem)
    if match:
        author1 = match.group(1).replace("-", "")
        author2 = match.group(2).replace("-", "")
        year = match.group(3)
        suffix = match.group(4) or ""
        return f"{author1}{author2}{year}{suffix}"

    # Pattern 3: "Some Name With Spaces YYYY" (e.g., "Cross-Disorder Group... 2019")
    match = re.match(r"^(.+?)\s+(\d{4})([a-z])?$", filename_stem)
    if match:
        name_part = match.group(1)
        year = match.group(2)
        suffix = match.group(3) or ""
        # Extract first word (or hyphenated word) as the key base
        first_word_match = re.match(r"^([A-Za-z-]+)", name_part)
        if first_word_match:
            author = first_word_match.group(1).replace("-", "")
            return f"{author}{year}{suffix}"

    return None
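
# Illustration: 'Aston-Jones2005.pdf' → 'AstonJones2005',
# 'Feyaerts&Henriksen2021.pdf' → 'FeyaertsHenriksen2021',
# while a hypothetical 'NewPaper 2024 draft.pdf' → None (no trailing year before .pdf).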


def process_directory(input_dir: Path, output_file: Path, use_crossref: bool = True,
                      rename_files: bool = False, extract_tei: bool = False) -> None:
    """Process all PDFs in a directory and generate a BibTeX file."""
    pdf_files = sorted(input_dir.glob("*.pdf"))
    if not pdf_files:
        print(f"No PDF files found in {input_dir}")
        return

    print(f"Found {len(pdf_files)} PDF files")
    if use_crossref:
        print("CrossRef enrichment: enabled")
    if rename_files:
        print("File renaming: enabled")
    if extract_tei:
        print("TEI extraction: enabled")

    # Load existing BibTeX entries
    existing_keys, existing_content = load_existing_bibtex(output_file)
    if existing_keys:
        print(f"Found {len(existing_keys)} existing entries in {output_file}")

    # Track all used keys (existing + new)
    used_keys = existing_keys.copy()

    # Determine which files to skip
    files_to_process = []
    for pdf_path in pdf_files:
        expected_key = get_expected_key_from_filename(pdf_path.name)
        if expected_key and expected_key in existing_keys:
            print(f"  Skipping (already in bib): {pdf_path.name}")
        else:
            files_to_process.append(pdf_path)

    if not files_to_process:
        print("\nNo new files to process.")
        return

    print(f"\nProcessing {len(files_to_process)} new files...")

    if not start_grobid_container():
        print("Failed to start Grobid. Exiting.")
        sys.exit(1)

    new_entries = []
    stats = {"success": 0, "crossref_enriched": 0, "failed": 0, "renamed": 0, "tei_saved": 0}

    for i, pdf_path in enumerate(files_to_process, 1):
        print(f"\n[{i}/{len(files_to_process)}] Processing: {pdf_path.name}")

        result = process_single_pdf(pdf_path, used_keys, use_crossref=use_crossref,
                                    extract_tei=extract_tei)
        if not result:
            print(f"  ✗ Failed to process")
            stats["failed"] += 1
            continue

        key, bibtex_entry, metadata, tei_xml = result

        # Add key to used set
        used_keys.add(key)
        new_entries.append(bibtex_entry)
        stats["success"] += 1

        # Check if CrossRef enriched
        if metadata.get("volume") or metadata.get("pages") or metadata.get("number"):
            stats["crossref_enriched"] += 1

        # Display result
        print(f"  ✓ Key: {key}")
        if metadata.get("title"):
            title = metadata["title"]
            print(f"    Title: {title[:60]}{'...' if len(title) > 60 else ''}")
        if metadata.get("author"):
            authors = metadata["author"].split(" and ")
            author_str = ", ".join(authors[:2]) + (" et al." if len(authors) > 2 else "")
            print(f"    Authors: {author_str}")
        if metadata.get("year"):
            extra = []
            if metadata.get("journal"):
                extra.append(metadata["journal"][:30])
            if metadata.get("volume"):
                vol = metadata["volume"]
                if metadata.get("number"):
                    vol += f"({metadata['number']})"
                extra.append(vol)
            if metadata.get("pages"):
                extra.append(f"pp. {metadata['pages']}")
            print(f"    Year: {metadata['year']}" + (f" | {', '.join(extra)}" if extra else ""))

        # Save TEI XML if extracted
        if tei_xml:
            tei_filename = f"{key}.tei.xml"
            tei_path = pdf_path.parent / tei_filename
            try:
                tei_path.write_text(tei_xml, encoding="utf-8")
                print(f"    TEI saved: {tei_filename}")
                stats["tei_saved"] += 1
            except OSError as e:
                print(f"  ⚠ Failed to save TEI: {e}")

        # Rename file if requested
        if rename_files:
            new_filename = generate_new_filename(key, metadata)
            if new_filename != pdf_path.name:
                new_path = rename_pdf_file(pdf_path, new_filename)
                if new_path:
                    print(f"    Renamed: {pdf_path.name} → {new_filename}")
                    stats["renamed"] += 1

        # Be nice to CrossRef API
        if use_crossref:
            time.sleep(0.1)

    # Write BibTeX file (append new entries or create new)
    if existing_content:
        # Append to existing file
        with open(output_file, "a", encoding="utf-8") as f:
            f.write("\n\n")
            f.write(f"% Added {len(new_entries)} entries on {time.strftime('%Y-%m-%d %H:%M:%S')}\n\n")
            f.write("\n\n".join(new_entries))
            f.write("\n")
    else:
        # Create new file
        with open(output_file, "w", encoding="utf-8") as f:
            f.write("% BibTeX file generated by pdf_to_bibtex.py\n")
            f.write("% Using Grobid + DOI content negotiation for metadata enrichment\n")
            f.write(f"% Generated from: {input_dir}\n")
            f.write(f"% Date: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")
            f.write(f"% Total entries: {len(new_entries)}\n\n")
            f.write("\n\n".join(new_entries))
            f.write("\n")

    print(f"\n{'='*60}")
    print(f"✓ Successfully updated {output_file}")
    print(f"  New entries added: {stats['success']}")
    print(f"  CrossRef enriched: {stats['crossref_enriched']}")
    print(f"  Total entries in file: {len(existing_keys) + stats['success']}")
    if extract_tei:
        print(f"  TEI files saved: {stats['tei_saved']}")
    if rename_files:
        print(f"  Files renamed: {stats['renamed']}")
    if stats["failed"]:
        print(f"  Failed: {stats['failed']}")


def main():
    parser = argparse.ArgumentParser(
        description="Extract bibliographic metadata from PDFs using Grobid + DOI lookup"
    )
    parser.add_argument(
        "input_dir",
        type=Path,
        help="Directory containing PDF files"
    )
    parser.add_argument(
        "output",
        type=Path,
        nargs="?",
        default=Path("references.bib"),
        help="Output BibTeX file (default: references.bib)"
    )
    parser.add_argument(
        "--no-crossref",
        action="store_true",
        help="Skip DOI lookup and CrossRef title search (use only Grobid data)"
    )
    parser.add_argument(
        "--rename-files",
        action="store_true",
        help="Rename PDF files to AuthorYYYY.pdf format based on extracted metadata"
    )
    parser.add_argument(
        "--extract-tei",
        action="store_true",
        help="Extract full TEI XML and save as AuthorYYYY.tei.xml alongside PDFs"
    )
    parser.add_argument(
        "--stop-grobid",
        action="store_true",
        help="Stop the Grobid container after processing"
    )
    args = parser.parse_args()

    if not args.input_dir.exists():
        print(f"Error: Input directory '{args.input_dir}' does not exist")
        sys.exit(1)
    if not args.input_dir.is_dir():
        print(f"Error: '{args.input_dir}' is not a directory")
        sys.exit(1)

    process_directory(
        args.input_dir,
        args.output,
        use_crossref=not args.no_crossref,
        rename_files=args.rename_files,
        extract_tei=args.extract_tei
    )

    if args.stop_grobid:
        print("\nStopping Grobid container...")
        subprocess.run(["docker", "stop", GROBID_CONTAINER_NAME], check=False)
        print("✓ Grobid container stopped")


if __name__ == "__main__":
    main()