@ethanabrooks
Created December 30, 2025 14:35
{
"cells": [
{
"cell_type": "markdown",
"id": "0a3a1d08",
"metadata": {
"papermill": {
"duration": 0.002916,
"end_time": "2025-12-30T14:29:54.412508",
"exception": false,
"start_time": "2025-12-30T14:29:54.409592",
"status": "completed"
},
"tags": []
},
"source": [
"# Web Content Extraction Tool Comparison\n",
"\n",
"Comparing tools for extracting readable text/markdown from web pages.\n",
"\n",
"## Tools Evaluated\n",
"\n",
"| Tool | Type | Notes |\n",
"| ------------------- | ---------- | ---------------------------------------- |\n",
"| trafilatura | Python | Purpose-built for web text extraction |\n",
"| newspaper3k | Python | News article extraction |\n",
"| readability-lxml | Python | Python port of Mozilla Readability |\n",
"| Mozilla Readability | JavaScript | Original Firefox Reader View library |\n",
"| Playwright | Python | Browser automation for JS-rendered pages |\n",
"| html2text | Python | HTML to Markdown converter |\n",
"| BeautifulSoup | Python | Manual extraction baseline |\n",
"| Parallel.ai | API | Commercial service (requires API key) |\n",
"| Exa | API | AI-native search and content extraction (requires API key) |\n"
]
},
{
"cell_type": "markdown",
"id": "025255de",
"metadata": {
"papermill": {
"duration": 0.002391,
"end_time": "2025-12-30T14:29:54.417642",
"exception": false,
"start_time": "2025-12-30T14:29:54.415251",
"status": "completed"
},
"tags": []
},
"source": [
"## Configuration\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "d880d862",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:29:54.424864Z",
"iopub.status.busy": "2025-12-30T14:29:54.424545Z",
"iopub.status.idle": "2025-12-30T14:29:54.428230Z",
"shell.execute_reply": "2025-12-30T14:29:54.427807Z"
},
"papermill": {
"duration": 0.007598,
"end_time": "2025-12-30T14:29:54.428810",
"exception": false,
"start_time": "2025-12-30T14:29:54.421212",
"status": "completed"
},
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"# Parameters - these can be overridden by papermill\n",
"TEST_URL = \"https://en.wikipedia.org/wiki/WBA_interim_middleweight_championship#List_of_interim_champions\"\n",
"MAX_CHARS = 3000 # Maximum characters to display per output"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "7b1ea96f",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:29:54.433161Z",
"iopub.status.busy": "2025-12-30T14:29:54.433051Z",
"iopub.status.idle": "2025-12-30T14:29:54.435019Z",
"shell.execute_reply": "2025-12-30T14:29:54.434570Z"
},
"papermill": {
"duration": 0.005079,
"end_time": "2025-12-30T14:29:54.435561",
"exception": false,
"start_time": "2025-12-30T14:29:54.430482",
"status": "completed"
},
"tags": [
"injected-parameters"
]
},
"outputs": [],
"source": [
"# Parameters\n",
"TEST_URL = \"https://amazon.com\"\n"
]
},
{
"cell_type": "markdown",
"id": "b24ee9dd",
"metadata": {
"papermill": {
"duration": 0.001454,
"end_time": "2025-12-30T14:29:54.438616",
"exception": false,
"start_time": "2025-12-30T14:29:54.437162",
"status": "completed"
},
"tags": []
},
"source": [
"## Setup\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "97de6407",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:29:54.443113Z",
"iopub.status.busy": "2025-12-30T14:29:54.442948Z",
"iopub.status.idle": "2025-12-30T14:29:54.478893Z",
"shell.execute_reply": "2025-12-30T14:29:54.478088Z"
},
"papermill": {
"duration": 0.03975,
"end_time": "2025-12-30T14:29:54.479847",
"exception": false,
"start_time": "2025-12-30T14:29:54.440097",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"import subprocess\n",
"import requests\n",
"from pathlib import Path"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ee68e648",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:29:54.484989Z",
"iopub.status.busy": "2025-12-30T14:29:54.484861Z",
"iopub.status.idle": "2025-12-30T14:29:54.680092Z",
"shell.execute_reply": "2025-12-30T14:29:54.679525Z"
},
"papermill": {
"duration": 0.198835,
"end_time": "2025-12-30T14:29:54.680698",
"exception": false,
"start_time": "2025-12-30T14:29:54.481863",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fetched 2,003 bytes\n"
]
}
],
"source": [
"headers = {\n",
"    \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36\",\n",
"    \"Accept\": \"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\",\n",
"    \"Accept-Language\": \"en-US,en;q=0.9\",\n",
"}\n",
"html_content = None\n",
"fetch_error = None\n",
"\n",
"try:\n",
"    response = requests.get(TEST_URL, headers=headers, timeout=30)\n",
"except requests.RequestException as e:\n",
"    # Record network-level failures (DNS, timeout, TLS) so later cells can skip cleanly\n",
"    fetch_error = f\"{type(e).__name__}: {e}\"\n",
"    print(f\"Fetch failed: {fetch_error}\")\n",
"else:\n",
"    if response.ok:\n",
"        html_content = response.text\n",
"        print(f\"Fetched {len(html_content):,} bytes\")\n",
"    else:\n",
"        fetch_error = f\"HTTP {response.status_code}: {response.reason}\"\n",
"        print(f\"Fetch failed: {fetch_error}\")"
]
},
{
"cell_type": "markdown",
"id": "20a00121",
"metadata": {
"papermill": {
"duration": 0.001886,
"end_time": "2025-12-30T14:29:54.684397",
"exception": false,
"start_time": "2025-12-30T14:29:54.682511",
"status": "completed"
},
"tags": []
},
"source": [
"## 1. Trafilatura\n",
"\n",
"[trafilatura](https://trafilatura.readthedocs.io/) - Purpose-built for web text extraction with native markdown output.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "77dc050c",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:29:54.689351Z",
"iopub.status.busy": "2025-12-30T14:29:54.689232Z",
"iopub.status.idle": "2025-12-30T14:29:54.854731Z",
"shell.execute_reply": "2025-12-30T14:29:54.854146Z"
},
"papermill": {
"duration": 0.168541,
"end_time": "2025-12-30T14:29:54.855473",
"exception": false,
"start_time": "2025-12-30T14:29:54.686932",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"JavaScript is disabled\n",
"In order to continue, we need to verify that you're not a robot. This requires JavaScript. Enable JavaScript and then reload the page.\n"
]
}
],
"source": [
"import trafilatura\n",
"\n",
"if html_content:\n",
"    trafilatura_text = trafilatura.extract(\n",
"        html_content,\n",
"        output_format=\"markdown\",\n",
"        include_tables=True,\n",
"        include_links=True,\n",
"        include_images=False,\n",
"    )\n",
"    print(trafilatura_text[:MAX_CHARS] if trafilatura_text else \"No content\")\n",
"else:\n",
"    trafilatura_text = None\n",
"    print(f\"Skipped: {fetch_error}\")"
]
},
{
"cell_type": "markdown",
"id": "37293fa6",
"metadata": {
"papermill": {
"duration": 0.002089,
"end_time": "2025-12-30T14:29:54.859550",
"exception": false,
"start_time": "2025-12-30T14:29:54.857461",
"status": "completed"
},
"tags": []
},
"source": [
"## 2. Newspaper3k\n",
"\n",
"[newspaper3k](https://newspaper.readthedocs.io/) - Designed for news article extraction.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f406ca17",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:29:54.864166Z",
"iopub.status.busy": "2025-12-30T14:29:54.863994Z",
"iopub.status.idle": "2025-12-30T14:29:54.976194Z",
"shell.execute_reply": "2025-12-30T14:29:54.975788Z"
},
"papermill": {
"duration": 0.115388,
"end_time": "2025-12-30T14:29:54.977123",
"exception": false,
"start_time": "2025-12-30T14:29:54.861735",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"No content\n"
]
}
],
"source": [
"from newspaper import Article\n",
"\n",
"article = Article(TEST_URL)\n",
"newspaper_error = None\n",
"\n",
"if html_content:\n",
"    article.set_html(html_content)\n",
"    try:\n",
"        article.parse()\n",
"    except ValueError as e:\n",
"        newspaper_error = f\"ValueError: {e}\"\n",
"\n",
"    if newspaper_error:\n",
"        print(f\"Newspaper3k error: {newspaper_error}\")\n",
"    else:\n",
"        print(article.text[:MAX_CHARS] if article.text else \"No content\")\n",
"else:\n",
"    print(f\"Skipped: {fetch_error}\")"
]
},
{
"cell_type": "markdown",
"id": "4580af76",
"metadata": {
"papermill": {
"duration": 0.008402,
"end_time": "2025-12-30T14:29:54.987938",
"exception": false,
"start_time": "2025-12-30T14:29:54.979536",
"status": "completed"
},
"tags": []
},
"source": [
"## 3. Readability-lxml\n",
"\n",
"[readability-lxml](https://github.com/buriy/python-readability) - Python port of Mozilla Readability.\n",
"Outputs HTML, so we pipe through html2text for markdown.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "a0111b20",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:29:54.997103Z",
"iopub.status.busy": "2025-12-30T14:29:54.996796Z",
"iopub.status.idle": "2025-12-30T14:29:55.013155Z",
"shell.execute_reply": "2025-12-30T14:29:55.012092Z"
},
"papermill": {
"duration": 0.021786,
"end_time": "2025-12-30T14:29:55.014473",
"exception": false,
"start_time": "2025-12-30T14:29:54.992687",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# JavaScript is disabled\n",
"\n",
"In order to continue, we need to verify that you're not a robot. This requires JavaScript. Enable JavaScript and then reload the page. \n",
"\n"
]
}
],
"source": [
"from readability import Document\n",
"import html2text\n",
"\n",
"h2t = html2text.HTML2Text()\n",
"h2t.ignore_links = False\n",
"h2t.ignore_images = True\n",
"h2t.body_width = 0\n",
"\n",
"readability_markdown = \"\"\n",
"readability_error = None\n",
"\n",
"if html_content:\n",
"    doc = Document(html_content)\n",
"    try:\n",
"        readable_html = doc.summary()\n",
"        readability_markdown = h2t.handle(readable_html)\n",
"    except Exception as e:\n",
"        readability_error = f\"{type(e).__name__}: {e}\"\n",
"\n",
"    if readability_error:\n",
"        print(f\"Readability-lxml error: {readability_error}\")\n",
"    else:\n",
"        print(readability_markdown[:MAX_CHARS])\n",
"else:\n",
"    print(f\"Skipped: {fetch_error}\")"
]
},
{
"cell_type": "markdown",
"id": "3ef93b51",
"metadata": {
"papermill": {
"duration": 0.003044,
"end_time": "2025-12-30T14:29:55.019314",
"exception": false,
"start_time": "2025-12-30T14:29:55.016270",
"status": "completed"
},
"tags": []
},
"source": [
"## 4. Mozilla Readability (JavaScript)\n",
"\n",
"[Mozilla Readability](https://github.com/mozilla/readability) - Original Firefox Reader View library, called via Node.js.\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "3dba3e69",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:29:55.029038Z",
"iopub.status.busy": "2025-12-30T14:29:55.028705Z",
"iopub.status.idle": "2025-12-30T14:29:55.851498Z",
"shell.execute_reply": "2025-12-30T14:29:55.850891Z"
},
"papermill": {
"duration": 0.828357,
"end_time": "2025-12-30T14:29:55.852055",
"exception": false,
"start_time": "2025-12-30T14:29:55.023698",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n"
]
}
],
"source": [
"script_path = Path(\"readability_extract.js\")\n",
"mozilla_markdown = \"\"\n",
"\n",
"if not html_content:\n",
"    print(f\"Skipped: {fetch_error}\")\n",
"elif not script_path.exists():\n",
"    print(\"readability_extract.js not found\")\n",
"else:\n",
"    result = subprocess.run(\n",
"        [\"node\", str(script_path)],\n",
"        input=html_content,\n",
"        capture_output=True,\n",
"        text=True,\n",
"        timeout=30,\n",
"    )\n",
"    if result.returncode == 0:\n",
"        mozilla_result = json.loads(result.stdout)\n",
"        mozilla_markdown = h2t.handle(mozilla_result.get(\"content\", \"\"))\n",
"        print(mozilla_markdown[:MAX_CHARS])\n",
"    else:\n",
"        print(f\"Error: {result.stderr}\")"
]
},
{
"cell_type": "markdown",
"id": "ceaffb93",
"metadata": {
"papermill": {
"duration": 0.001508,
"end_time": "2025-12-30T14:29:55.855351",
"exception": false,
"start_time": "2025-12-30T14:29:55.853843",
"status": "completed"
},
"tags": []
},
"source": [
"## 5. Playwright\n",
"\n",
"[Playwright](https://playwright.dev/) - Browser automation that renders JavaScript before extraction.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "5e9bfba5",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:29:55.859029Z",
"iopub.status.busy": "2025-12-30T14:29:55.858865Z",
"iopub.status.idle": "2025-12-30T14:30:00.353294Z",
"shell.execute_reply": "2025-12-30T14:30:00.352749Z"
},
"papermill": {
"duration": 4.497131,
"end_time": "2025-12-30T14:30:00.353886",
"exception": false,
"start_time": "2025-12-30T14:29:55.856755",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Skip to\n",
"Main content\n",
"Keyboard shortcuts\n",
"Search\n",
"opt\n",
"+\n",
"/\n",
"Cart\n",
"shift\n",
"+\n",
"opt\n",
"+\n",
"C\n",
"Home\n",
"shift\n",
"+\n",
"opt\n",
"+\n",
"H\n",
"Orders\n",
"shift\n",
"+\n",
"opt\n",
"+\n",
"O\n",
"Show/Hide shortcuts\n",
"shift\n",
"+\n",
"opt\n",
"+\n",
"Z\n",
"To move between items, use your keyboard's up or down arrows.\n",
".us\n",
"Delivering to Brooklyn 11231\n",
"Update location\n",
"All\n",
"Select the department you want to search in\n",
"All Departments\n",
"Alexa Skills\n",
"Amazon Autos\n",
"Amazon Devices\n",
"Amazon Fresh\n",
"Amazon Global Store\n",
"Amazon Haul\n",
"Amazon One Medical\n",
"Amazon Pharmacy\n",
"Amazon Resale\n",
"Appliances\n",
"Apps & Games\n",
"Arts, Crafts & Sewing\n",
"Audible Books & Originals\n",
"Automotive Parts & Accessories\n",
"Baby\n",
"Beauty & Personal Care\n",
"Books\n",
"CDs & Vinyl\n",
"Cell Phones & Accessories\n",
"Clothing, Shoes & Jewelry\n",
"Women's Clothing, Shoes & Jewelry\n",
"Men's Clothing, Shoes & Jewelry\n",
"Girl's Clothing, Shoes & Jewelry\n",
"Boy's Clothing, Shoes & Jewelry\n",
"Baby Clothing, Shoes & Jewelry\n",
"Collectibles & Fine Art\n",
"Computers\n",
"Credit and Payment Cards\n",
"Digital Music\n",
"Electronics\n",
"Garden & Outdoor\n",
"Gift Cards\n",
"Grocery & Gourmet Food\n",
"Handmade\n",
"Health, Household & Baby Care\n",
"Home & Business Services\n",
"Home & Kitchen\n",
"Industrial & Scientific\n",
"Just for Prime\n",
"Kindle Store\n",
"Luggage & Travel Gear\n",
"Luxury Stores\n",
"Magazine Subscriptions\n",
"Movies & TV\n",
"Musical Instruments\n",
"Office Products\n",
"Pet Supplies\n",
"Premium Beauty\n",
"Prime Video\n",
"Same-Day Store\n",
"Smart Home\n",
"Software\n",
"Sports & Outdoors\n",
"Subscribe & Save\n",
"Subscription Boxes\n",
"Tools & Home Improvement\n",
"Toys & Games\n",
"Under $10\n",
"Video Games\n",
"Whole Foods Market\n",
"Search Amazon\n",
"EN\n",
"Hello, sign in\n",
"Account & Lists\n",
"Returns\n",
"& Orders\n",
"0\n",
"Cart\n",
"Sign in\n",
"New customer?\n",
"Start here.\n",
"Your Lists\n",
"Create a List\n",
"Find a List or Registry\n",
"Your Account\n",
"Account\n",
"Orders\n",
"Keep Shopping For\n",
"Recommendations\n",
"Browsing History\n",
"Your Shopping preferences\n",
"Amazon Credit Cards\n",
"Watchlist\n",
"Video Purchases & Rentals\n",
"Kindle Unlimited\n",
"Content & Devices\n",
"Subscribe & Save Items\n",
"Memberships & Subscriptions\n",
"Prime Membership\n",
"Music Library\n",
"Start a Selling Account\n",
"Create Your Free Business Account\n",
"Customer Service\n",
"All\n",
"Amazon Haul\n",
"Medical Care\n",
"Amazon Basics\n",
"Best Sellers\n",
"Books\n",
"New Releases\n",
"Registry\n",
"Today's Deals\n",
"Gift Cards\n",
"Smart Home\n",
"Groceries\n",
"Prime\n",
"Pharmacy\n",
"Customer Service\n",
"Music\n",
"Amazon Home\n",
"Fashion\n",
"Toys & Games\n",
"Sports & Outdoors\n",
"Beauty & Personal Care\n",
"Automotive\n",
"Sell\n",
"Home Improvement\n",
"Computers\n",
"Kindle Books\n",
"Previous slide\n",
"Video Player is loading.\n",
"Play\n",
"Beginning of dialog window. Escape will cancel and close the window.\n",
"Text\n",
"Color\n",
"White\n",
"Black\n",
"Red\n",
"Green\n",
"Blue\n",
"Yellow\n",
"Magenta\n",
"Cyan\n",
"Transparency\n",
"Opaque\n",
"Semi-Transparent\n",
"Background\n",
"Color\n",
"Black\n",
"White\n",
"Red\n",
"Green\n",
"Blue\n",
"Yellow\n",
"Magenta\n",
"Cyan\n",
"Transparency\n",
"Opaque\n",
"Semi-Transparent\n",
"Transparent\n",
"Window\n",
"Color\n",
"Black\n",
"White\n",
"Red\n",
"Green\n",
"Blue\n",
"Yellow\n",
"Magenta\n",
"Cyan\n",
"Transparency\n",
"Transparent\n",
"Semi-Transparent\n",
"Opaque\n",
"Font Size\n",
"50%\n",
"75%\n",
"100%\n",
"125%\n",
"150%\n",
"175%\n",
"200%\n",
"300%\n",
"400%\n",
"Text Edge Style\n",
"None\n",
"Raised\n",
"Depressed\n",
"Uniform\n",
"Dropshadow\n",
"Font Family\n",
"Proportional Sans-Serif\n",
"Monospace Sans-Serif\n",
"Proportional Serif\n",
"Monospace Serif\n",
"Casual\n",
"Script\n",
"Small Caps\n",
"Reset\n",
"restore all settings to the default values\n",
"Done\n",
"Close Modal Dialog\n",
"End of dialog window.\n",
"Video Player is loading.\n",
"Play\n",
"\n"
]
}
],
"source": [
"import asyncio\n",
"from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeout\n",
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()\n",
"\n",
"\n",
"async def fetch_with_playwright(\n",
"    url: str, timeout: int = 30000\n",
") -> tuple[str | None, str | None]:\n",
"    \"\"\"Returns (html, error). One will be None.\"\"\"\n",
"    try:\n",
"        async with async_playwright() as p:\n",
"            browser = await p.chromium.launch(headless=True)\n",
"            page = await browser.new_page()\n",
"            response = await page.goto(url, wait_until=\"domcontentloaded\", timeout=timeout)\n",
"            await page.wait_for_timeout(3000)  # Let JS render\n",
"            html = await page.content()\n",
"            await browser.close()\n",
"            status = response.status if response else None\n",
"            if status and status >= 400:\n",
"                return None, f\"HTTP {status}\"\n",
"            return html, None\n",
"    except PlaywrightTimeout:\n",
"        return None, f\"Timeout after {timeout}ms\"\n",
"    except Exception as e:\n",
"        return None, f\"{type(e).__name__}: {e}\"\n",
"\n",
"\n",
"playwright_html = None\n",
"playwright_extracted = None\n",
"playwright_error = None\n",
"\n",
"loop = asyncio.get_event_loop()\n",
"result = loop.run_until_complete(fetch_with_playwright(TEST_URL))\n",
"playwright_html, playwright_error = result\n",
"\n",
"if playwright_html:\n",
"    playwright_extracted = trafilatura.extract(\n",
"        playwright_html,\n",
"        output_format=\"markdown\",\n",
"        include_tables=True,\n",
"        include_links=True,\n",
"        include_images=False,\n",
"    )\n",
"    print(\n",
"        playwright_extracted[:MAX_CHARS]\n",
"        if playwright_extracted\n",
"        else \"No content extracted from HTML\"\n",
"    )\n",
"else:\n",
"    print(f\"Playwright error: {playwright_error}\")"
]
},
{
"cell_type": "markdown",
"id": "6e522bf1",
"metadata": {
"papermill": {
"duration": 0.001522,
"end_time": "2025-12-30T14:30:00.357190",
"exception": false,
"start_time": "2025-12-30T14:30:00.355668",
"status": "completed"
},
"tags": []
},
"source": [
"## 6. Parallel.ai\n",
"\n",
"[Parallel.ai](https://docs.parallel.ai/) - Commercial API for web extraction using the Python SDK.\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "a66a4eb3",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:00.361194Z",
"iopub.status.busy": "2025-12-30T14:30:00.361057Z",
"iopub.status.idle": "2025-12-30T14:30:01.948563Z",
"shell.execute_reply": "2025-12-30T14:30:01.947699Z"
},
"papermill": {
"duration": 1.590986,
"end_time": "2025-12-30T14:30:01.949610",
"exception": false,
"start_time": "2025-12-30T14:30:00.358624",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"## Skip to\n",
"\n",
"* [Main content]()\n",
"\n",
"* * *\n",
"\n",
"## Keyboard shortcuts\n",
"\n",
"* [Search alt \\+ /](javascript:void\\(0\\))\n",
"* [Cart shift \\+ alt \\+ C](javascript:void\\(0\\))\n",
"* [Home shift \\+ alt \\+ H](javascript:void\\(0\\))\n",
"* [Orders shift \\+ alt \\+ O](javascript:void\\(0\\))\n",
"* Show/Hide shortcuts\n",
" \n",
" shift \\+ alt \\+ Z\n",
"\n",
"To move between items, use your keyboard's up or down arrows.\n",
"\n",
"[.us](/ref=nav_logo)\n",
"\n",
"Delivering to Secaucus 07094 Update location\n",
"\n",
"All\n",
"\n",
"Select the department you want to search in All Departments Alexa Skills Amazon Autos Amazon Devices Amazon Fresh Amazon Global Store Amazon Haul Amazon One Medical Amazon Pharmacy Amazon Resale Appliances Apps & Games Arts, Crafts & Sewing Audible Books & Originals Automotive Parts & Accessories Baby Beauty & Personal Care Books CDs & Vinyl Cell Phones & Accessories Clothing, Shoes & Jewelry Women's Clothing, Shoes & Jewelry Men's Clothing, Shoes & Jewelry Girl's Clothing, Shoes & Jewelry Boy's Clothing, Shoes & Jewelry Baby Clothing, Shoes & Jewelry Collectibles & Fine Art Computers Credit and Payment Cards Digital Music Electronics Garden & Outdoor Gift Cards Grocery & Gourmet Food Handmade Health, Household & Baby Care Home & Business Services Home & Kitchen Industrial & Scientific Just for Prime Kindle Store Luggage & Travel Gear Luxury Stores Magazine Subscriptions Movies & TV Musical Instruments Office Products Pet Supplies Premium Beauty Prime Video Same-Day Store Smart Home Software Sports & Outdoors Subscribe & Save Subscription Boxes Tools & Home Improvement Toys & Games Under $10 Video Games Whole Foods Market\n",
"\n",
"Search Amazon\n",
"\n",
"[EN](/customer-preferences/edit?ie=UTF8&preferencesReturnUrl=%2F&ref_=topnav_lang)\n",
"\n",
"[Hello, sign in Account & Lists](https://www.amazon.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.amazon.com%2F%3Fref_%3Dnav_ya_signin&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=usflex&openid.mode=checkid_setup&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0)\n",
"\n",
"[Returns & Orders](/gp/css/order-history?ref_=nav_orders_first) [0 Cart](/gp/cart/view.html?ref_=nav_cart)\n",
"\n",
"[_Previous slide_](#)\n",
"\n",
"1. [](/b/?_encoding=UTF8&node=212705961011&enabledRefinements=%5B%7B%22rid%22%3A%22category%22%2C%22value%22%3A%22213981608011%22%2C%22ridType%22%3A%22browse%22%2C%22type%22%3A%22browse%22%7D%2C%7B%22rid%22%3A%22p_n_availability%22%2C%22value%22%3A%222661600011%22%2C%22ridType%22%3A%22SEARCH_SHORT_ID%22%2C%22type%22%3A%22BROWSE_NODE%22%7D%2C%7B%22rid%22%3A%22p_n_condition-type%22%2C%22value%22%3A%226461716011%22%2C%22ridType%22%3A%22SEARCH_SHORT_ID%22%2C%22type%22%3A%22BROWSE_NODE%22%7D%5D&ref_=ny26_standard_top33_act_cta&pd_rd_w=FsBqc&content-id=amzn1.sym.093dc239-2394-40a0-99eb-8476961ac040&pf_rd_p=093dc239-2394-40a0-99eb-8476961ac040&pf_rd_r=VB53ZT3AT1FZ9RAQME7V&pd_rd_wg=Qxaoq&pd_rd_r=857388ef-aa3c-42b1-8442-ef113a1dfd5d)\n",
"2. [](/b/?_encoding=UTF8&n\n"
]
}
],
"source": [
"from parallel import Parallel\n",
"\n",
"parallel_result = None\n",
"parallel_error = None\n",
"\n",
"api_key = os.getenv(\"PARALLEL_API_KEY\")\n",
"if not api_key:\n",
"    parallel_error = \"PARALLEL_API_KEY not set\"\n",
"else:\n",
"    client = Parallel(api_key=api_key)\n",
"    extract = client.beta.extract(\n",
"        urls=[TEST_URL],\n",
"        objective=\"Extract the main content of this page\",\n",
"        excerpts=True,\n",
"        full_content=True,\n",
"    )\n",
"    parallel_result = extract.results\n",
"\n",
"if parallel_result:\n",
"    for result in parallel_result:\n",
"        if result.full_content:\n",
"            print(result.full_content[:MAX_CHARS])\n",
"else:\n",
"    # An empty result list leaves parallel_error unset, so report that case explicitly\n",
"    print(f\"Parallel.ai error: {parallel_error or 'no results returned'}\")"
]
},
{
"cell_type": "markdown",
"id": "fb8e4c91",
"metadata": {
"papermill": {
"duration": 0.003342,
"end_time": "2025-12-30T14:30:01.957057",
"exception": false,
"start_time": "2025-12-30T14:30:01.953715",
"status": "completed"
},
"tags": []
},
"source": [
"## 7. Exa\n",
"\n",
"[Exa](https://exa.ai/) - AI-native search and content extraction API.\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "fd2bc1d0",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:01.964409Z",
"iopub.status.busy": "2025-12-30T14:30:01.964241Z",
"iopub.status.idle": "2025-12-30T14:30:02.539596Z",
"shell.execute_reply": "2025-12-30T14:30:02.538987Z"
},
"papermill": {
"duration": 0.580186,
"end_time": "2025-12-30T14:30:02.540447",
"exception": false,
"start_time": "2025-12-30T14:30:01.960261",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Amazon.com. Spend less. Smile more.\n",
"![](http://fls-na.amazon.com/1/batch/1/OP/ATVPDKIKX0DER:131-1984468-8102452:86MZDBATS9FV26T5ZBKE$uedata=s:%2Frd%2Fuedata%3Fstaticb%26id%3D86MZDBATS9FV26T5ZBKE:0)![](https://m.media-amazon.com/images/G/01/gno/sprites/nav-sprite-global-1x-reorg-privacy._CB779528203_.png)\n",
"[.us](http://amazon.com/ref=nav_logo)\n",
"[\n",
"Delivering to Boardman 97818Update location\n",
"]()\n",
"All**\n",
"Select the department you want to search inAll DepartmentsAlexa SkillsAmazon AutosAmazon DevicesAmazon Global StoreAmazon HaulAmazon One MedicalAmazon PharmacyAmazon ResaleAppliancesApps & GamesArts, Crafts & SewingAudible Books & OriginalsAutomotive Parts & AccessoriesBabyBeauty & Personal CareBooksCDs & VinylCell Phones & AccessoriesClothing, Shoes & JewelryWomen's Clothing, Shoes & JewelryMen's Clothing, Shoes & JewelryGirl's Clothing, Shoes & JewelryBoy's Clothing, Shoes & JewelryBaby Clothing, Shoes & JewelryCollectibles & Fine ArtComputersCredit and Payment CardsDigital MusicElectronicsGarden & OutdoorGift CardsGrocery & Gourmet FoodHandmadeHealth, Household & Baby CareHome & Business ServicesHome & KitchenIndustrial & ScientificJust for PrimeKindle StoreLuggage & Travel GearLuxury StoresMagazine SubscriptionsMovies & TVMusical InstrumentsOffice ProductsPet SuppliesPremium BeautyPrime VideoSmart HomeSoftwareSports & OutdoorsSubscribe & SaveSubscription BoxesTools & Home ImprovementToys & GamesUnder $10Video Games\n",
"Search Amazon\n",
"[\n",
"EN\n",
"](http://amazon.com/customer-preferences/edit?ie=UTF8&preferencesReturnUrl=/&ref_=topnav_lang)\n",
"[\n",
"Hello, sign in\n",
"Account & Lists](https://www.amazon.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https://www.amazon.com/?_encoding=UTF8&ref_=nav_ya_signin&openid.identity=http://specs.openid.net/auth/2.0/identifier_select&openid.assoc_handle=usflex&openid.mode=checkid_setup&openid.claimed_id=http://specs.openid.net/auth/2.0/identifier_select&openid.ns=http://specs.openid.net/auth/2.0)\n",
"[Returns& Orders](http://amazon.com/gp/css/order-history?ref_=nav_orders_first)[\n",
"0\n",
"Cart\n",
"](http://amazon.com/gp/cart/view.html?ref_=nav_cart)\n",
"[**All](http://amazon.com/gp/site-directory?ref_=nav_em_js_disabled)\n",
"* [Amazon Haul](http://amazon.com/haul/store?ref_=nav_cs_hul_disb)\n",
"* [Medical Care](https://health.amazon.com/prime?ref_=nav_cs_all_health_ingress_onem_h)\n",
"* [Amazon Basics](http://amazon.com/Amazon_Basics?channel=discovbar&field-lbr_brands_browse-bin=AmazonBasics&ref_=nav_cs_amazonbasics)\n",
"* [Best Sellers](http://amazon.com/gp/bestsellers/?ref_=nav_cs_bestsellers)\n",
"* [Books](http://amazon.com/books-used-books-textbooks/b/?ie=UTF8&node=283155&ref_=nav_cs_books)\n",
"* [New Releases](http://amazon.com/gp/new-releases/?ref_=nav_cs_newreleases)\n",
"* [Registry](http://amazon.com/gp/browse.html?node=16115931011&ref_=nav_cs_registry)\n",
"* [Today's Deals](http://amazon.com/deals?ref_=nav_cs_gb)\n",
"* [Gift Cards](http://amazon.com/gift-cards/b/?ie=UTF8&node=2238192011&ref_=nav_cs_gc)\n",
"* [Smart Home](http://amazon.com/Smart-Home/b/?ie=UTF8&node=656314001\n"
]
}
],
"source": [
"from exa_py import Exa\n",
"\n",
"exa_result = None\n",
"exa_error = None\n",
"\n",
"exa_api_key = os.getenv(\"EXA_API_KEY\")\n",
"if not exa_api_key:\n",
"    exa_error = \"EXA_API_KEY not set\"\n",
"else:\n",
"    exa = Exa(exa_api_key)\n",
"    results = exa.get_contents(urls=[TEST_URL], text=True)\n",
"    if results.results:\n",
"        exa_result = results.results[0].text\n",
"\n",
"if exa_result:\n",
"    print(exa_result[:MAX_CHARS])\n",
"else:\n",
"    # An empty response leaves exa_error unset, so report that case explicitly\n",
"    print(f\"Exa error: {exa_error or 'no content returned'}\")"
]
},
{
"cell_type": "markdown",
"id": "b5c89c1a",
"metadata": {
"papermill": {
"duration": 0.002674,
"end_time": "2025-12-30T14:30:02.546808",
"exception": false,
"start_time": "2025-12-30T14:30:02.544134",
"status": "completed"
},
"tags": []
},
"source": [
"## 8. html2text (direct)\n",
"\n",
"[html2text](https://github.com/Alir3z4/html2text) - Converts HTML to Markdown without readability filtering.\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "298e1fb0",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:02.553271Z",
"iopub.status.busy": "2025-12-30T14:30:02.553095Z",
"iopub.status.idle": "2025-12-30T14:30:02.556168Z",
"shell.execute_reply": "2025-12-30T14:30:02.555630Z"
},
"papermill": {
"duration": 0.007356,
"end_time": "2025-12-30T14:30:02.556720",
"exception": false,
"start_time": "2025-12-30T14:30:02.549364",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# JavaScript is disabled\n",
"\n",
"In order to continue, we need to verify that you're not a robot. This requires JavaScript. Enable JavaScript and then reload the page. \n",
"\n"
]
}
],
"source": [
"if html_content:\n",
"    html2text_output = h2t.handle(html_content)\n",
"    print(html2text_output[:MAX_CHARS])\n",
"else:\n",
"    html2text_output = \"\"\n",
"    print(f\"Skipped: {fetch_error}\")"
]
},
{
"cell_type": "markdown",
"id": "f6584f23",
"metadata": {
"papermill": {
"duration": 0.003426,
"end_time": "2025-12-30T14:30:02.563019",
"exception": false,
"start_time": "2025-12-30T14:30:02.559593",
"status": "completed"
},
"tags": []
},
"source": [
"## 9. BeautifulSoup\n",
"\n",
"[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) - Manual text extraction baseline.\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "1a165cb6",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:02.568945Z",
"iopub.status.busy": "2025-12-30T14:30:02.568797Z",
"iopub.status.idle": "2025-12-30T14:30:02.572322Z",
"shell.execute_reply": "2025-12-30T14:30:02.571870Z"
},
"papermill": {
"duration": 0.007581,
"end_time": "2025-12-30T14:30:02.572769",
"exception": false,
"start_time": "2025-12-30T14:30:02.565188",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"JavaScript is disabled\n",
"In order to continue, we need to verify that you're not a robot.\n",
" This requires JavaScript. Enable JavaScript and then reload the page.\n"
]
}
],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"if html_content:\n",
"    soup = BeautifulSoup(html_content, \"lxml\")\n",
"    for el in soup([\"script\", \"style\", \"nav\", \"footer\", \"header\"]):\n",
"        el.decompose()\n",
"\n",
"    # \"mw-content-text\" is the MediaWiki (Wikipedia) article container;\n",
"    # other sites fall back to the text of the whole page\n",
"    content = soup.find(\"div\", {\"id\": \"mw-content-text\"})\n",
"    bs_text = (\n",
"        content.get_text(separator=\"\\n\", strip=True)\n",
"        if content\n",
"        else soup.get_text(separator=\"\\n\", strip=True)\n",
"    )\n",
"    print(bs_text[:MAX_CHARS])\n",
"else:\n",
"    bs_text = \"\"\n",
"    print(f\"Skipped: {fetch_error}\")"
]
},
{
"cell_type": "markdown",
"id": "899b203c",
"metadata": {
"papermill": {
"duration": 0.002172,
"end_time": "2025-12-30T14:30:02.577301",
"exception": false,
"start_time": "2025-12-30T14:30:02.575129",
"status": "completed"
},
"tags": []
},
"source": [
"## Summary\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "0888b242",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-30T14:30:02.582286Z",
"iopub.status.busy": "2025-12-30T14:30:02.582077Z",
"iopub.status.idle": "2025-12-30T14:30:02.585007Z",
"shell.execute_reply": "2025-12-30T14:30:02.584523Z"
},
"papermill": {
"duration": 0.006236,
"end_time": "2025-12-30T14:30:02.585523",
"exception": false,
"start_time": "2025-12-30T14:30:02.579287",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"parallel.ai : 93,063 chars\n",
"exa : 82,822 chars\n",
"playwright : 8,637 chars\n",
"beautifulsoup : 165 chars\n",
"readability-lxml : 162 chars\n",
"html2text : 162 chars\n",
"trafilatura : 157 chars\n",
"mozilla readability : 1 chars\n",
"newspaper3k : 0 chars\n"
]
}
],
"source": [
"results = {\n",
"    \"trafilatura\": len(trafilatura_text or \"\"),\n",
"    \"newspaper3k\": len(article.text or \"\")\n",
"    if html_content and not newspaper_error\n",
"    else 0,\n",
"    \"readability-lxml\": len(readability_markdown),\n",
"    \"mozilla readability\": len(mozilla_markdown),\n",
"    \"playwright\": len(playwright_extracted or \"\"),\n",
"    \"parallel.ai\": len(parallel_result[0].full_content or \"\") if parallel_result else 0,\n",
"    \"exa\": len(exa_result or \"\"),\n",
"    \"html2text\": len(html2text_output),\n",
"    \"beautifulsoup\": len(bs_text),\n",
"}\n",
"\n",
"if fetch_error:\n",
"    print(\n",
"        f\"Note: requests fetch failed ({fetch_error}); tools that depend on the requests-fetched HTML were skipped\\n\"\n",
"    )\n",
"\n",
"for name, length in sorted(results.items(), key=lambda x: -x[1]):\n",
"    print(f\"{name:25s}: {length:>8,} chars\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.9"
},
"papermill": {
"default_parameters": {},
"duration": 9.419267,
"end_time": "2025-12-30T14:30:03.006823",
"environment_variables": {},
"exception": null,
"input_path": "compare_extractors.ipynb",
"output_path": "amazon.ipynb",
"parameters": {
"TEST_URL": "https://amazon.com"
},
"start_time": "2025-12-30T14:29:53.587556",
"version": "2.6.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
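
Note on reading the Summary output: the tools that consumed the `requests`-fetched HTML (trafilatura, readability-lxml, html2text, BeautifulSoup) all report roughly 160 characters, which is the "JavaScript is disabled" bot-challenge page rather than Amazon content. Character counts alone do not show this. Below is a minimal sketch of how such a response could be flagged before comparing lengths. The helper name `looks_like_challenge_page`, the `CHALLENGE_MARKERS` tuple, and the 500-character threshold are assumptions made for illustration and are not part of the notebook; the marker phrases are copied from the outputs recorded in this run.

```python
from typing import Optional

# Hypothetical marker phrases, copied from the challenge page text recorded in this run.
CHALLENGE_MARKERS = (
    "javascript is disabled",
    "verify that you're not a robot",
)


def looks_like_challenge_page(text: Optional[str], max_chars: int = 500) -> bool:
    """Return True when text is short and contains a known challenge phrase.

    The max_chars threshold is an assumed heuristic: long extractions that
    merely mention one of the phrases are not flagged.
    """
    if not text:
        return False
    lowered = text.lower()
    return len(text) <= max_chars and any(m in lowered for m in CHALLENGE_MARKERS)
```

In a notebook like this one, the check could be applied to each extractor's text (for example `trafilatura_text` or `bs_text`) so that the summary marks those entries as blocked instead of ranking them by length.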