pdf_to_bibtex.py - Academic PDF to BibTeX Converter

#!/usr/bin/env python3
"""
pdf_to_bibtex.py - Academic PDF to BibTeX Converter
====================================================

Automatically extract bibliographic metadata from academic PDFs using Grobid
and CrossRef, generating a clean BibTeX file with consistent AuthorYYYY citation keys.

INPUT FORMAT
------------
Expects a directory of PDFs named in AuthorYYYY format (or close variations):

    Literature/
    ├── Adams2013.pdf                    → Adams2013
    ├── Aston-Jones2005.pdf              → AstonJones2005
    ├── Feyaerts&Henriksen2021.pdf       → FeyaertsHenriksen2021
    └── Cross-Disorder Group... 2019.pdf → CrossDisorder2019

The year must be a 4-digit number at the end (before .pdf). The script uses
the filename to generate citation keys and to skip already-processed files.

FEATURES
--------
• Extracts metadata from PDFs using Grobid (runs via Docker)
• Enriches via DOI content negotiation (direct BibTeX from doi.org)
• Falls back to CrossRef API title search when no DOI available
• Generates BibTeX with AuthorYYYY keys (e.g., Chesney2014, AstonJones2005)
• Smart key disambiguation: AuthorYYYY → AuthorSecondAuthorYYYY → AuthorYYYYa
• Incremental processing: skips PDFs already in the .bib file
• Optional: rename PDFs to match their citation keys
• Optional: extract full TEI XML for each document (structured full-text)

REQUIREMENTS
------------
• Docker (for Grobid)
• Python 3.7+
• requests library: pip install requests

Grobid is pulled automatically on first run (~2GB Docker image).

USAGE
-----
Basic - process all PDFs and create references.bib:

    python pdf_to_bibtex.py /path/to/papers/

Specify output file:

    python pdf_to_bibtex.py /path/to/papers/ my_references.bib

With all options:

    python pdf_to_bibtex.py /path/to/papers/ refs.bib --rename-files --extract-tei

OPTIONS
-------
--no-crossref    Skip DOI lookup and CrossRef title search (use only Grobid data)
--rename-files   Rename PDFs to AuthorYYYY.pdf based on extracted metadata
--extract-tei    Save full TEI XML as AuthorYYYY.tei.xml (Grobid's structured output)
--stop-grobid    Stop the Grobid Docker container after processing

EXAMPLE OUTPUT
--------------
$ python pdf_to_bibtex.py Literature/

Found 69 PDF files
CrossRef enrichment: enabled
Found 42 existing entries in references.bib
  Skipping (already in bib): Adams2013.pdf
  Skipping (already in bib): Bastos2012.pdf

Processing 27 new files...
✓ Grobid container 'grobid-pdf-extractor' is already running

[1/27] Processing: NewPaper2024.pdf
  → Fetching BibTeX from DOI: 10.1038/s41586-024-07051-0...
  ✓ DOI: got BibTeX directly
  ✓ Key: Smith2024
    Title: A new theory of consciousness
    Authors: Smith, John, Doe, Jane et al.
    Year: 2024 | Nature, 625(7994), pp. 112--118

============================================================
✓ Successfully updated references.bib
  New entries added: 27
  CrossRef enriched: 25
  Total entries in file: 69

GENERATED BIBTEX FORMAT
-----------------------
@article{Smith2024,
  author = {Smith, John and Doe, Jane and Johnson, Bob},
  title = {A new theory of consciousness},
  journal = {Nature},
  year = {2024},
  volume = {625},
  number = {7994},
  pages = {112--118},
  doi = {10.1038/s41586-024-07051-0}
}

HOW IT WORKS
------------
1. Grobid extracts title, authors, DOI from PDF header/first page
2. If DOI found → fetch BibTeX directly from doi.org (content negotiation)
3. If no DOI → search CrossRef by title, use best match
4. Merge Grobid + enriched data (DOI/CrossRef preferred for year, volume, pages, etc.)
5. Generate citation key from filename (if AuthorYYYY pattern) or from metadata
6. Append new entries to existing .bib file (won't duplicate)
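
For reference, steps 2 and 3 correspond roughly to the following requests
(illustrative only; <DOI> and <TITLE> are placeholders, not literal values):

    curl -LH "Accept: application/x-bibtex" "https://doi.org/<DOI>"
    curl "https://api.crossref.org/works?query.title=<TITLE>&rows=1"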

FILE NAMING CONVENTION
----------------------
The script handles various filename formats:

• AuthorYYYY.pdf → AuthorYYYY (e.g., Chesney2014.pdf → Chesney2014)
• Author-AuthorYYYY.pdf → AuthorAuthorYYYY (e.g., Aston-Jones2005.pdf → AstonJones2005)
• Author&AuthorYYYY.pdf → AuthorAuthorYYYY (e.g., Feyaerts&Henriksen2021.pdf → FeyaertsHenriksen2021)
• "Name With Spaces YYYY.pdf" → FirstWord+YYYY (e.g., "Cross-Disorder Group... 2019.pdf" → CrossDisorder2019)

The script will:
• Use the filename to derive the citation key
• Skip files whose key already exists in the .bib
• With --rename-files, rename PDFs to clean AuthorYYYY.pdf format

For papers with the same first author and year, disambiguation is automatic:
• Smith2024.pdf       → Smith2024
• Smith2024-other.pdf → SmithJones2024 (uses second author from filename or metadata)
• Smith2024-third.pdf → Smith2024a (alphabetical suffix)

TEI XML OUTPUT (--extract-tei)
------------------------------
Grobid can produce structured TEI XML with:
• Full text segmented into sections
• Parsed references with links
• Figures and tables identified
• Author affiliations and emails

Useful for text mining, citation analysis, or building a local search index.

NOTES
-----
• Grobid container keeps running after the script (for faster subsequent runs)
• Use --stop-grobid to stop it when done
• DOI lookup uses content negotiation (doi.org) which is fast and reliable
• CrossRef is only used as fallback for title search when no DOI is available
• Some PDFs (scans, unusual layouts) may yield incomplete metadata

AUTHOR
------
Generated with Claude. Feel free to modify and redistribute.

LICENSE
-------
MIT License - do whatever you want with it.
"""

import os
import sys
import re
import time
import subprocess
import argparse
from pathlib import Path
from typing import Optional, Dict, Any, Tuple

import requests

# Grobid configuration
GROBID_IMAGE = "lfoppiano/grobid:0.8.1"
GROBID_CONTAINER_NAME = "grobid-pdf-extractor"
GROBID_PORT = 8070
GROBID_URL = f"http://localhost:{GROBID_PORT}"

# CrossRef configuration (used for title search fallback)
CROSSREF_API = "https://api.crossref.org"
# Be polite - identify ourselves (CrossRef asks for this)
CROSSREF_HEADERS = {
    "User-Agent": "pdf-to-bibtex/1.0 (https://github.com/user/pdf-to-bibtex; mailto:user@example.com)"
}


def load_existing_bibtex(filepath: Path) -> Tuple[set, str]:
    """Load existing BibTeX file and return set of keys and the content."""
    if not filepath.exists():
        return set(), ""
    content = filepath.read_text(encoding="utf-8")
    # Extract all keys from @type{key, patterns
    keys = set(re.findall(r'@\w+\s*\{\s*([^,\s]+)\s*,', content))
    return keys, content
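
# Illustration (not part of the pipeline): a file containing '@article{Smith2024,'
# yields the key set {'Smith2024'}; the regex above also tolerates spacing such as
# '@article { Smith2024 ,'.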


def start_grobid_container() -> bool:
    """Start the Grobid Docker container if not already running."""
    result = subprocess.run(
        ["docker", "ps", "-q", "-f", f"name={GROBID_CONTAINER_NAME}"],
        capture_output=True,
        text=True
    )
    if result.stdout.strip():
        print(f"✓ Grobid container '{GROBID_CONTAINER_NAME}' is already running")
        return True

    result = subprocess.run(
        ["docker", "ps", "-aq", "-f", f"name={GROBID_CONTAINER_NAME}"],
        capture_output=True,
        text=True
    )
    if result.stdout.strip():
        print(f"Starting existing Grobid container '{GROBID_CONTAINER_NAME}'...")
        subprocess.run(["docker", "start", GROBID_CONTAINER_NAME], check=True)
    else:
        print(f"Pulling and starting Grobid container ({GROBID_IMAGE})...")
        subprocess.run([
            "docker", "run", "-d",
            "--name", GROBID_CONTAINER_NAME,
            "-p", f"{GROBID_PORT}:8070",
            GROBID_IMAGE
        ], check=True)

    print("Waiting for Grobid to initialize (this may take a minute)...")
    max_attempts = 60
    for attempt in range(max_attempts):
        try:
            response = requests.get(f"{GROBID_URL}/api/isalive", timeout=5)
            if response.status_code == 200:
                print("✓ Grobid is ready")
                return True
        except requests.exceptions.RequestException:
            pass
        time.sleep(2)
        if (attempt + 1) % 10 == 0:
            print(f"  Still waiting... ({attempt + 1}/{max_attempts})")

    print("✗ Grobid failed to start in time")
    return False
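
# For reference, the startup above is equivalent to running (illustrative):
#   docker run -d --name grobid-pdf-extractor -p 8070:8070 lfoppiano/grobid:0.8.1
# and then polling http://localhost:8070/api/isalive until it responds with HTTP 200.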


def process_pdf_with_grobid(pdf_path: Path) -> Optional[str]:
    """Send a PDF to Grobid and get BibTeX response."""
    url = f"{GROBID_URL}/api/processHeaderDocument"
    # Request BibTeX explicitly via content negotiation (Grobid's default output is TEI XML)
    headers = {"Accept": "application/x-bibtex"}
    try:
        with open(pdf_path, "rb") as pdf_file:
            files = {"input": (pdf_path.name, pdf_file, "application/pdf")}
            response = requests.post(url, files=files, headers=headers, timeout=120)
            if response.status_code == 200:
                return response.text
            else:
                print(f"  Warning: Grobid returned status {response.status_code}")
                return None
    except Exception as e:
        print(f"  Error processing with Grobid: {e}")
        return None
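
# Equivalent manual request (illustrative, with paper.pdf as a placeholder file):
#   curl -H "Accept: application/x-bibtex" -F input=@paper.pdf \
#        http://localhost:8070/api/processHeaderDocument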


def process_pdf_fulltext_tei(pdf_path: Path) -> Optional[str]:
    """Send a PDF to Grobid and get full TEI XML response."""
    url = f"{GROBID_URL}/api/processFulltextDocument"
    try:
        with open(pdf_path, "rb") as pdf_file:
            files = {"input": (pdf_path.name, pdf_file, "application/pdf")}
            response = requests.post(url, files=files, timeout=300)  # Longer timeout for full processing
            if response.status_code == 200:
                return response.text
            else:
                print(f"  Warning: Grobid TEI returned status {response.status_code}")
                return None
    except Exception as e:
        print(f"  Error getting TEI from Grobid: {e}")
        return None


def parse_bibtex(bibtex_str: str) -> Dict[str, str]:
    """Parse a BibTeX entry into a dictionary."""
    fields = {}
    # Extract entry type
    type_match = re.match(r'@(\w+)\s*\{', bibtex_str)
    if type_match:
        fields['_type'] = type_match.group(1).lower()
    # Extract fields - handle nested braces and quoted values
    for match in re.finditer(r'(\w+)\s*=\s*[{"]((?:[^{}"]|(?:\{[^{}]*\}))*)[}"]', bibtex_str):
        key = match.group(1).lower()
        value = match.group(2).strip()
        if value:
            fields[key] = value
    return fields
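
# Illustration: parse_bibtex('@article{x, title = {A title}, year = {2024}}')
# returns {'_type': 'article', 'title': 'A title', 'year': '2024'};
# the citation key itself ('x') is intentionally not extracted here.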


def query_doi_bibtex(doi: str) -> Optional[str]:
    """Query DOI directly for BibTeX using content negotiation."""
    try:
        # Clean DOI
        doi = doi.strip()
        if doi.startswith("http"):
            doi = re.sub(r'https?://(dx\.)?doi\.org/', '', doi)
        url = f"https://doi.org/{doi}"
        headers = {"Accept": "application/x-bibtex"}
        response = requests.get(url, headers=headers, timeout=30, allow_redirects=True)
        if response.status_code == 200 and response.text.strip().startswith("@"):
            # Ensure proper UTF-8 decoding
            response.encoding = 'utf-8'
            text = response.text
            # Normalize various dash types to standard BibTeX double-dash
            text = text.replace('–', '--')  # en-dash
            text = text.replace('—', '--')  # em-dash
            text = text.replace('−', '-')   # minus sign
            return text
        return None
    except Exception as e:
        print(f"  DOI lookup failed: {e}")
        return None


def query_crossref_by_title(title: str) -> Optional[Dict[str, Any]]:
    """Query CrossRef API by title search."""
    try:
        # Clean title
        title = re.sub(r'[^\w\s]', ' ', title)
        title = ' '.join(title.split()[:15])  # First 15 words
        url = f"{CROSSREF_API}/works"
        params = {
            "query.title": title,
            "rows": 1,
            "select": "DOI,title,author,published-print,published-online,container-title,volume,issue,page,publisher,type"
        }
        response = requests.get(url, params=params, headers=CROSSREF_HEADERS, timeout=30)
        if response.status_code == 200:
            items = response.json().get("message", {}).get("items", [])
            if items:
                return items[0]
        return None
    except Exception as e:
        print(f"  CrossRef title search failed: {e}")
        return None
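
# Illustration: the request built above is roughly
#   GET https://api.crossref.org/works?query.title=<cleaned+title>&rows=1&select=DOI,title,...
# and the first (best-scoring) item of message.items is returned, or None if nothing matched.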


def crossref_to_bibtex_fields(cr: Dict[str, Any]) -> Dict[str, str]:
    """Convert CrossRef API response to BibTeX fields."""
    fields = {}

    # Title
    if "title" in cr and cr["title"]:
        fields["title"] = cr["title"][0] if isinstance(cr["title"], list) else cr["title"]

    # Authors
    if "author" in cr and cr["author"]:
        authors = []
        for a in cr["author"]:
            if "family" in a:
                name = a["family"]
                if "given" in a:
                    name = f"{a['family']}, {a['given']}"
                authors.append(name)
        if authors:
            fields["author"] = " and ".join(authors)

    # Year - try multiple date fields
    year = None
    for date_field in ["published-print", "published-online", "issued", "created"]:
        if date_field in cr and cr[date_field]:
            date_parts = cr[date_field].get("date-parts", [[]])
            if date_parts and date_parts[0]:
                year = str(date_parts[0][0])
                break
    if year:
        fields["year"] = year

    # Journal
    if "container-title" in cr and cr["container-title"]:
        fields["journal"] = cr["container-title"][0] if isinstance(cr["container-title"], list) else cr["container-title"]

    # Volume
    if "volume" in cr:
        fields["volume"] = str(cr["volume"])

    # Issue/Number
    if "issue" in cr:
        fields["number"] = str(cr["issue"])

    # Pages
    if "page" in cr:
        pages = cr["page"]
        # Normalize various dash types to standard BibTeX double-dash
        pages = pages.replace('–', '--')  # en-dash
        pages = pages.replace('—', '--')  # em-dash
        pages = pages.replace('−', '-')   # minus sign
        pages = pages.replace("-", "--")  # regular hyphen
        # Avoid quadruple dashes from double-replacement
        while '----' in pages:
            pages = pages.replace('----', '--')
        fields["pages"] = pages

    # DOI
    if "DOI" in cr:
        fields["doi"] = cr["DOI"]

    # Publisher
    if "publisher" in cr:
        fields["publisher"] = cr["publisher"]

    # Determine entry type
    cr_type = cr.get("type", "")
    if cr_type == "journal-article":
        fields["_type"] = "article"
    elif cr_type in ["book", "monograph"]:
        fields["_type"] = "book"
    elif cr_type == "proceedings-article":
        fields["_type"] = "inproceedings"
    elif cr_type == "book-chapter":
        fields["_type"] = "incollection"
    else:
        fields["_type"] = "article" if fields.get("journal") else "misc"

    return fields
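
# Illustration with a minimal (hypothetical) CrossRef item:
#   {'title': ['A new theory of consciousness'],
#    'author': [{'family': 'Smith', 'given': 'John'}],
#    'issued': {'date-parts': [[2024]]}, 'type': 'journal-article', 'DOI': '10.xxxx/example'}
# maps to:
#   {'title': 'A new theory of consciousness', 'author': 'Smith, John',
#    'year': '2024', 'doi': '10.xxxx/example', '_type': 'article'}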


def merge_metadata(grobid: Dict[str, str], crossref: Dict[str, str], filename: str) -> Dict[str, str]:
    """Merge Grobid and CrossRef metadata, preferring CrossRef for most fields."""
    merged = {}
    # CrossRef is more reliable for structured data
    crossref_preferred = ["year", "volume", "number", "pages", "journal", "publisher", "doi"]
    # Grobid might have better title extraction in some cases
    grobid_preferred = []

    # Start with Grobid data
    for key, value in grobid.items():
        if value:
            merged[key] = value

    # Override/add from CrossRef
    for key, value in crossref.items():
        if value:
            if key in crossref_preferred or key not in merged:
                merged[key] = value

    # Use entry type from CrossRef if available
    if "_type" in crossref:
        merged["_type"] = crossref["_type"]

    # Fallback: extract year from filename if still missing
    if "year" not in merged or not merged["year"]:
        year_match = re.search(r'(\d{4})', filename)
        if year_match:
            merged["year"] = year_match.group(1)

    return merged


def generate_bibtex_key(filename: str, metadata: Dict[str, str], used_keys: set) -> str:
    """
    Generate a unique BibTeX key in AuthorYYYY format.

    Disambiguation strategy:
    1. AuthorYYYY
    2. AuthorSecondAuthorYYYY (if duplicate and second author exists)
    3. AuthorYYYYa, AuthorYYYYb, etc. (if still duplicate)
    """
    filename_stem = Path(filename).stem
    first_author = None
    second_author_from_filename = None
    year = None

    # Pattern 1: Standard AuthorYYYY or Author-AuthorYYYY
    match = re.match(r"^([A-Za-z-]+)(\d{4})", filename_stem)
    if match:
        first_author = match.group(1).replace("-", "")
        year = match.group(2)

    # Pattern 2: Author&Author2YYYY (e.g., Feyaerts&Henriksen2021)
    if not first_author:
        match = re.match(r"^([A-Za-z-]+)&([A-Za-z-]+)(\d{4})", filename_stem)
        if match:
            first_author = match.group(1).replace("-", "")
            second_author_from_filename = match.group(2).replace("-", "")
            year = match.group(3)

    # Pattern 3: "Some Name With Spaces YYYY" (e.g., "Cross-Disorder Group... 2019")
    if not first_author:
        match = re.match(r"^(.+?)\s+(\d{4})", filename_stem)
        if match:
            name_part = match.group(1)
            year = match.group(2)
            first_word_match = re.match(r"^([A-Za-z-]+)", name_part)
            if first_word_match:
                first_author = first_word_match.group(1).replace("-", "")

    # Fall back to metadata if filename didn't match
    if not first_author and "author" in metadata:
        authors_str = metadata["author"]
        first_author_full = re.split(r'\s+and\s+', authors_str)[0]
        if "," in first_author_full:
            first_author = first_author_full.split(",")[0].strip()
        else:
            parts = first_author_full.split()
            first_author = parts[0] if parts else ""
        first_author = re.sub(r"[^A-Za-z]", "", first_author)

    if not year:
        year = metadata.get("year", "XXXX")
    if not first_author:
        first_author = re.sub(r"[^A-Za-z]", "", filename_stem) or "Unknown"

    # Try basic key first
    base_key = f"{first_author}{year}"
    if base_key not in used_keys:
        return base_key

    # Try with second author from filename if available (e.g., Feyaerts&Henriksen2021)
    if second_author_from_filename:
        two_author_key = f"{first_author}{second_author_from_filename}{year}"
        if two_author_key not in used_keys:
            return two_author_key

    # Try with second author from metadata
    if "author" in metadata:
        authors = re.split(r'\s+and\s+', metadata["author"])
        if len(authors) >= 2:
            second_author_full = authors[1]
            if "," in second_author_full:
                second_author = second_author_full.split(",")[0].strip()
            else:
                parts = second_author_full.split()
                second_author = parts[-1] if parts else ""  # Last name
            second_author = re.sub(r"[^A-Za-z]", "", second_author)
            if second_author:
                two_author_key = f"{first_author}{second_author}{year}"
                if two_author_key not in used_keys:
                    return two_author_key

    # Fall back to alphabetical suffix
    suffix = ord('a')
    while f"{base_key}{chr(suffix)}" in used_keys:
        suffix += 1
        if suffix > ord('z'):
            suffix = ord('a')
            base_key = f"{base_key}z"  # Extremely unlikely edge case
    return f"{base_key}{chr(suffix)}"


def escape_bibtex(text: str) -> str:
    """Escape special characters for BibTeX."""
    if not text:
        return ""
    # Only escape & which is common in titles
    text = text.replace("&", "\\&")
    return text


def metadata_to_bibtex(key: str, metadata: Dict[str, str]) -> str:
    """Convert metadata dictionary to a BibTeX entry string."""
    entry_type = metadata.get("_type", "misc")
    lines = [f"@{entry_type}{{{key},"]

    # Order fields nicely
    field_order = ["author", "title", "journal", "booktitle", "year", "volume", "number", "pages", "publisher", "doi"]
    added_fields = set()
    for field in field_order:
        if field in metadata and metadata[field]:
            value = escape_bibtex(metadata[field])
            lines.append(f"  {field} = {{{value}}},")
            added_fields.add(field)

    # Add any remaining fields
    for field, value in metadata.items():
        if field not in added_fields and not field.startswith("_") and value:
            value = escape_bibtex(value)
            lines.append(f"  {field} = {{{value}}},")

    # Remove trailing comma from last field
    if lines[-1].endswith(","):
        lines[-1] = lines[-1][:-1]
    lines.append("}")
    return "\n".join(lines)


def process_single_pdf(pdf_path: Path, used_keys: set, use_crossref: bool = True,
                       extract_tei: bool = False) -> Optional[tuple]:
    """Process a single PDF and return (key, bibtex_entry, metadata, tei_xml)."""
    # Step 1: Get initial data from Grobid
    grobid_bibtex = process_pdf_with_grobid(pdf_path)
    if not grobid_bibtex:
        return None
    grobid_fields = parse_bibtex(grobid_bibtex)

    # Step 2: Try to get better metadata
    enriched_fields = {}
    if use_crossref:
        doi = grobid_fields.get("doi", "")
        title = grobid_fields.get("title", "")

        # Try DOI content negotiation first (direct BibTeX from doi.org)
        if doi:
            print(f"  → Fetching BibTeX from DOI: {doi[:50]}...")
            doi_bibtex = query_doi_bibtex(doi)
            if doi_bibtex:
                enriched_fields = parse_bibtex(doi_bibtex)
                print(f"  ✓ DOI: got BibTeX directly")

        # Fall back to CrossRef title search if no DOI or DOI lookup failed
        if not enriched_fields and title:
            print(f"  → Querying CrossRef by title...")
            cr_data = query_crossref_by_title(title)
            if cr_data:
                enriched_fields = crossref_to_bibtex_fields(cr_data)
                print(f"  ✓ CrossRef: found via title search")

    # Step 3: Merge metadata
    final_metadata = merge_metadata(grobid_fields, enriched_fields, pdf_path.name)

    # Step 4: Generate key (with disambiguation)
    key = generate_bibtex_key(pdf_path.name, final_metadata, used_keys)

    # Step 5: Generate BibTeX
    bibtex_entry = metadata_to_bibtex(key, final_metadata)

    # Step 6: Get full TEI if requested
    tei_xml = None
    if extract_tei:
        print(f"  → Extracting full TEI...")
        tei_xml = process_pdf_fulltext_tei(pdf_path)
        if tei_xml:
            print(f"  ✓ TEI extracted")
        else:
            print(f"  ⚠ TEI extraction failed")

    return key, bibtex_entry, final_metadata, tei_xml


def generate_new_filename(key: str, metadata: Dict[str, str]) -> str:
    """Generate a new filename based on the BibTeX key."""
    # Use the key directly as the filename (it's already in AuthorYYYY format)
    return f"{key}.pdf"


def rename_pdf_file(old_path: Path, new_filename: str) -> Optional[Path]:
    """Rename a PDF file, handling conflicts."""
    new_path = old_path.parent / new_filename
    if new_path == old_path:
        return old_path  # No change needed
    if new_path.exists():
        print(f"  ⚠ Cannot rename: {new_filename} already exists")
        return None
    try:
        old_path.rename(new_path)
        return new_path
    except OSError as e:
        print(f"  ⚠ Rename failed: {e}")
        return None


def get_expected_key_from_filename(filename: str) -> Optional[str]:
    """Extract expected BibTeX key from filename if it matches AuthorYYYY pattern."""
    filename_stem = Path(filename).stem

    # Pattern 1: Standard AuthorYYYY or Author-Author2YYYY (e.g., Chesney2014, Aston-Jones2005)
    match = re.match(r"^([A-Za-z-]+)(\d{4})([a-z])?$", filename_stem)
    if match:
        author = match.group(1).replace("-", "")
        year = match.group(2)
        suffix = match.group(3) or ""
        return f"{author}{year}{suffix}"

    # Pattern 2: Author&Author2YYYY (e.g., Feyaerts&Henriksen2021)
    match = re.match(r"^([A-Za-z-]+)&([A-Za-z-]+)(\d{4})([a-z])?$", filename_stem)
    if match:
        author1 = match.group(1).replace("-", "")
        author2 = match.group(2).replace("-", "")
        year = match.group(3)
        suffix = match.group(4) or ""
        return f"{author1}{author2}{year}{suffix}"

    # Pattern 3: "Some Name With Spaces YYYY" (e.g., "Cross-Disorder Group... 2019")
    match = re.match(r"^(.+?)\s+(\d{4})([a-z])?$", filename_stem)
    if match:
        name_part = match.group(1)
        year = match.group(2)
        suffix = match.group(3) or ""
        # Extract first word (or hyphenated word) as the key base
        first_word_match = re.match(r"^([A-Za-z-]+)", name_part)
        if first_word_match:
            author = first_word_match.group(1).replace("-", "")
            return f"{author}{year}{suffix}"

    return None
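
# Illustration: 'Aston-Jones2005.pdf' → 'AstonJones2005',
# 'Feyaerts&Henriksen2021.pdf' → 'FeyaertsHenriksen2021',
# while a hypothetical 'NewPaper 2024 draft.pdf' → None (no trailing year before .pdf).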


def process_directory(input_dir: Path, output_file: Path, use_crossref: bool = True,
                      rename_files: bool = False, extract_tei: bool = False) -> None:
    """Process all PDFs in a directory and generate a BibTeX file."""
    pdf_files = sorted(input_dir.glob("*.pdf"))
    if not pdf_files:
        print(f"No PDF files found in {input_dir}")
        return

    print(f"Found {len(pdf_files)} PDF files")
    if use_crossref:
        print("CrossRef enrichment: enabled")
    if rename_files:
        print("File renaming: enabled")
    if extract_tei:
        print("TEI extraction: enabled")

    # Load existing BibTeX entries
    existing_keys, existing_content = load_existing_bibtex(output_file)
    if existing_keys:
        print(f"Found {len(existing_keys)} existing entries in {output_file}")

    # Track all used keys (existing + new)
    used_keys = existing_keys.copy()

    # Determine which files to skip
    files_to_process = []
    for pdf_path in pdf_files:
        expected_key = get_expected_key_from_filename(pdf_path.name)
        if expected_key and expected_key in existing_keys:
            print(f"  Skipping (already in bib): {pdf_path.name}")
        else:
            files_to_process.append(pdf_path)

    if not files_to_process:
        print("\nNo new files to process.")
        return

    print(f"\nProcessing {len(files_to_process)} new files...")

    if not start_grobid_container():
        print("Failed to start Grobid. Exiting.")
        sys.exit(1)

    new_entries = []
    stats = {"success": 0, "crossref_enriched": 0, "failed": 0, "renamed": 0, "tei_saved": 0}

    for i, pdf_path in enumerate(files_to_process, 1):
        print(f"\n[{i}/{len(files_to_process)}] Processing: {pdf_path.name}")

        result = process_single_pdf(pdf_path, used_keys, use_crossref=use_crossref,
                                    extract_tei=extract_tei)
        if not result:
            print(f"  ✗ Failed to process")
            stats["failed"] += 1
            continue

        key, bibtex_entry, metadata, tei_xml = result

        # Add key to used set
        used_keys.add(key)
        new_entries.append(bibtex_entry)
        stats["success"] += 1

        # Check if CrossRef enriched
        if metadata.get("volume") or metadata.get("pages") or metadata.get("number"):
            stats["crossref_enriched"] += 1

        # Display result
        print(f"  ✓ Key: {key}")
        if metadata.get("title"):
            title = metadata["title"]
            print(f"    Title: {title[:60]}{'...' if len(title) > 60 else ''}")
        if metadata.get("author"):
            authors = metadata["author"].split(" and ")
            author_str = ", ".join(authors[:2]) + (" et al." if len(authors) > 2 else "")
            print(f"    Authors: {author_str}")
        if metadata.get("year"):
            extra = []
            if metadata.get("journal"):
                extra.append(metadata["journal"][:30])
            if metadata.get("volume"):
                vol = metadata["volume"]
                if metadata.get("number"):
                    vol += f"({metadata['number']})"
                extra.append(vol)
            if metadata.get("pages"):
                extra.append(f"pp. {metadata['pages']}")
            print(f"    Year: {metadata['year']}" + (f" | {', '.join(extra)}" if extra else ""))

        # Save TEI XML if extracted
        if tei_xml:
            tei_filename = f"{key}.tei.xml"
            tei_path = pdf_path.parent / tei_filename
            try:
                tei_path.write_text(tei_xml, encoding="utf-8")
                print(f"    TEI saved: {tei_filename}")
                stats["tei_saved"] += 1
            except OSError as e:
                print(f"  ⚠ Failed to save TEI: {e}")

        # Rename file if requested
        if rename_files:
            new_filename = generate_new_filename(key, metadata)
            if new_filename != pdf_path.name:
                new_path = rename_pdf_file(pdf_path, new_filename)
                if new_path:
                    print(f"    Renamed: {pdf_path.name} → {new_filename}")
                    stats["renamed"] += 1

        # Be nice to CrossRef API
        if use_crossref:
            time.sleep(0.1)

    # Write BibTeX file (append new entries or create new)
    if existing_content:
        # Append to existing file
        with open(output_file, "a", encoding="utf-8") as f:
            f.write("\n\n")
            f.write(f"% Added {len(new_entries)} entries on {time.strftime('%Y-%m-%d %H:%M:%S')}\n\n")
            f.write("\n\n".join(new_entries))
            f.write("\n")
    else:
        # Create new file
        with open(output_file, "w", encoding="utf-8") as f:
            f.write("% BibTeX file generated by pdf_to_bibtex.py\n")
            f.write("% Using Grobid + DOI content negotiation for metadata enrichment\n")
            f.write(f"% Generated from: {input_dir}\n")
            f.write(f"% Date: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")
            f.write(f"% Total entries: {len(new_entries)}\n\n")
            f.write("\n\n".join(new_entries))
            f.write("\n")

    print(f"\n{'='*60}")
    print(f"✓ Successfully updated {output_file}")
    print(f"  New entries added: {stats['success']}")
    print(f"  CrossRef enriched: {stats['crossref_enriched']}")
    print(f"  Total entries in file: {len(existing_keys) + stats['success']}")
    if extract_tei:
        print(f"  TEI files saved: {stats['tei_saved']}")
    if rename_files:
        print(f"  Files renamed: {stats['renamed']}")
    if stats["failed"]:
        print(f"  Failed: {stats['failed']}")


def main():
    parser = argparse.ArgumentParser(
        description="Extract bibliographic metadata from PDFs using Grobid + DOI lookup"
    )
    parser.add_argument(
        "input_dir",
        type=Path,
        help="Directory containing PDF files"
    )
    parser.add_argument(
        "output",
        type=Path,
        nargs="?",
        default=Path("references.bib"),
        help="Output BibTeX file (default: references.bib)"
    )
    parser.add_argument(
        "--no-crossref",
        action="store_true",
        help="Skip DOI lookup and CrossRef title search (use only Grobid data)"
    )
    parser.add_argument(
        "--rename-files",
        action="store_true",
        help="Rename PDF files to AuthorYYYY.pdf format based on extracted metadata"
    )
    parser.add_argument(
        "--extract-tei",
        action="store_true",
        help="Extract full TEI XML and save as AuthorYYYY.tei.xml alongside PDFs"
    )
    parser.add_argument(
        "--stop-grobid",
        action="store_true",
        help="Stop the Grobid container after processing"
    )
    args = parser.parse_args()

    if not args.input_dir.exists():
        print(f"Error: Input directory '{args.input_dir}' does not exist")
        sys.exit(1)
    if not args.input_dir.is_dir():
        print(f"Error: '{args.input_dir}' is not a directory")
        sys.exit(1)

    process_directory(
        args.input_dir,
        args.output,
        use_crossref=not args.no_crossref,
        rename_files=args.rename_files,
        extract_tei=args.extract_tei
    )

    if args.stop_grobid:
        print("\nStopping Grobid container...")
        subprocess.run(["docker", "stop", GROBID_CONTAINER_NAME], check=False)
        print("✓ Grobid container stopped")


if __name__ == "__main__":
    main()