I'm now zeroing in on the best way to introduce TanStack Start development. Initially, I needed to pin down the core goal - building an application with this framework. I'm focusing on defining the key areas to address and how to best explain them to a user who is new to the development environment.
**Mapping Ecosystem Dependencies**
I've been mapping the critical dependencies within the TanStack Start ecosystem. Now, I'm focusing on clarifying which libraries are essential and which are merely recommended, particularly for different parts of the application. Highlighting the type-safe nature of libraries such as TanStack Query and Form will be vital. The choice between ORMs is another point of consideration, as well as outlining the role of Tailwind CSS and popular UI libraries.
**Prioritizing Library Grouping**
I'm now zeroing in on a structured library breakdown, emphasizing the "must-have" core libraries for TanStack Start. The aim is to showcase how TanStack Router forms the framework's foundation, and to highlight essential components such as TanStack Query for handling asynchronous state. I'm prioritizing the presentation of these core libraries before diving into recommended "T3 Stack" equivalent libraries like TanStack Form or Tailwind.
**Organizing Core Documentation**
I'm presently structuring the documentation guide, prioritizing TanStack Router's docs as the critical starting point. Following that, I plan to cover TanStack Start's basics and server functions. I'm also considering highlighting crucial concepts, such as file-based routing and type-safe server functions. My target is to explain the essential concepts in a structured and professional tone, while also mentioning the still-emerging nature of TanStack Start.
**Concluding Project Structure**
I'm now putting the finishing touches on the project structure. This entails producing a concise summary, categorizing the libraries by core functionality, UI, and backend. I'll include links to the official documentation and develop a step-by-step learning path. I'm also ensuring that Vite's foundational role is clearly emphasized throughout the guide. Lastly, I'm highlighting the essential aspects of the framework: the router as the central component, server functions as the backend glue, and making sure to cover the underlying use of Vite. I will then translate all of this into Japanese.
I've been focusing on the domain aspect of the MCP server, specifically gathering URLs for document searching. Currently, my thought process revolves around identifying and categorizing suitable data domains. I'm exploring different methods to source these URLs, focusing on relevance and coverage.
**Clarifying Library Documentation**
I'm now zeroing in on the specific documentation URLs required for the core TanStack libraries. I've re-evaluated my previous direction. My immediate task is gathering the accurate URLs for Start, Router, Query, and Form from the official tanstack.com domain. These URLs are crucial for text extraction; determining them is the next step.
**Refining Documentation URLs**
I'm making progress identifying documentation URLs for essential libraries. I've gathered base URLs, focusing on the /docs paths for optimal content extraction. I've added Vite, Zod, Drizzle ORM, and Tailwind CSS to the list. For TanStack, I'm noting the shared documentation structure, starting with Start.
**Prioritizing URL Acquisition**
I've just solidified my documentation URL list, focusing on key TanStack libraries. I've specifically targeted the /latest/docs structure to ensure the most current content. I am now adding TanStack Query. I'm also planning to incorporate a note regarding TanStack's use of versioning, which will be essential for the scraper to ensure it gets the latest documentation.
**Gathering Final Documentation**
I'm wrapping up the final documentation URL acquisition. I've added TanStack Form to the list, completing the TanStack core libraries. Vite, Zod, and Drizzle ORM are also included. I'm now focusing on a Markdown list format to present the documentation URLs. Next, I'll translate the list into Japanese.
I'm currently focused on pinpointing the essential documentation domains needed for the initial document search and text extraction functionality. I've compiled a preliminary list and am now cross-referencing to ensure completeness and relevance to the TanStack Start framework. My aim is to be thorough in identifying all pertinent resources.
**Refining Domain Scope**
I've incorporated the crucial feedback on missing elements, adding React and TypeScript to my indexing focus. My attention is now on the optimal documentation URLs for React and TypeScript, selecting key paths within their official docs to maximize efficiency for the Minimum Credible Product. I'm prioritizing direct relevance to TanStack Start's architecture and the project's foundational technologies. My goal is to streamline the document search.
**Incorporating TypeScript Details**
I realized my initial focus on React and TypeScript documentation overlooked crucial details about the importance of TypeScript's type system to TanStack Start. I've added specific paths like the TypeScript handbook, focusing on Concepts and Configuration. I also thought about including MDN, but given that generic JS/web API knowledge is usually presumed, I've kept it as optional.
**Generating Focused URLs**
I've finalized the core URLs for React and TypeScript, prioritizing /reference for React API details and the root TypeScript /docs for comprehensiveness. MDN's Web API docs are included as an optional resource. I've also established the best URLs for initial crawling. My current work now centers on organizing these resources into distinct groups based on relevance, crafting a clear format for presenting the final output to effectively serve the target user persona building an MCP server.
**Clarifying Domain Requirements**
I'm now revising my initial domain list, ensuring it directly addresses the request's core. Given the existing structure, I've simplified the URL set to focus exclusively on React and TypeScript documentation, streamlining the content for clarity and direct usability. For React, I'm using https://react.dev/ as the primary domain, and for TS, I've selected https://www.typescriptlang.org/docs/. This minimizes extraneous data while providing a straightforward, focused resource set. I recognize the value of MDN, especially for Web Standards, but will mention it as an extra.
AI:
Understood. Taking the foundational **React** and **TypeScript** into account, plus the fact that TanStack Start follows the standard Web APIs (Request/Response), here is a **"complete documentation list for a TanStack Start development MCP server"**, MDN included.
Please configure these as your scraping targets.
### 1. Foundation technologies
The language and library that form the framework's base. React in particular is essential for understanding Server Components and Suspense, and TypeScript for understanding the advanced type definitions.
I've been drilling down on the fundamental goal – crafting a prompt that will yield a valid JSON configuration. The primary focus is on how to communicate with Claude Code so it understands the request to construct a JSON file. Right now, I'm thinking about the most effective language to convey the specific structure and data required for scraping documentation, keeping in mind the MCP server context. I'm focusing on simplicity and clarity so that Claude doesn't get confused!
**Refining the JSON Schema**
I'm now focusing on a declarative JSON schema to drive content extraction. The core shift is from imperative code to JSON config, which outlines URLs, selectors, and cleanup steps for document scraping. My focus is on supporting React, TypeScript, MDN, TanStack (various parts), Zod, and Vite documentation sites. I'm also considering chrome-devtools-mcp as a potential tool to identify the correct CSS/XPath selectors.
**Evolving the Prompt Structure**
I'm now focusing on a more sophisticated prompt strategy. The aim is to make Claude Code the "Selector Engineer." It needs to analyze HTML, build a structured JSON config, and understand its purpose. I've drafted a JSON schema with "domain," "entry_point," "selectors," and "exclude" fields, and I'm refining the instructions for its role. I'm striving for a modular design, so I can adapt easily to other document scraping scenarios.
**Optimizing the Process**
My focus is now on streamlining the entire process, specifically the prompt design. I'm breaking down the task into smaller, manageable phases: defining the task, specifying target websites from the previous list, ensuring the use of chrome-devtools-mcp for DOM inspection, and clearly defining the desired JSON output. I'm also considering an iterative approach, either site-by-site or in manageable groups, to optimize efficiency. Furthermore, I'm addressing the nuances of the JSON format, ensuring it supports markdown conversion and handles element removal as specified.
**Fine-tuning JSON Structure**
I'm now zeroing in on the finer points of the JSON schema, particularly the selectors. I'm making sure it includes content_selector, which explicitly defines what to keep in the extracted content, and ignore_selectors, which specifies what to remove from within that content to ensure clean Markdown output. I'm focusing on providing very clear instructions so Claude understands how to use the browser tools to verify and refine these selectors.
**Defining Output Format**
The final piece of the prompt is now ready. I've polished the tone and integrated the "Collective Intelligence" context. Also, I've specifically instructed Claude to note the structural similarities of the TanStack sites to save time. I've now clarified the final output structure, which will include the prompt itself, followed by an explanation of its functionality. I've also added a JSON structure that makes the root_url and other associated elements clear.
**Clarifying Selector Refinement**
Now, I'm focusing on the "rules" field within the JSON output. I've realized I need to be more explicit with the selectors, ensuring main_content, title, and exclude are very precise. The example given provides a clear illustration, but I need to make sure the instructions are comprehensive enough so Claude understands what kind of values those fields should hold.
---
### Intent and key points of this prompt
1. **Forcing tool use (`chrome-devtools-mcp`)**:
    * Rather than answering from its knowledge base alone, the prompt forces the model to actually visit the page and inspect the DOM. Even when class names have changed, or the markup consists mostly of utility classes (Tailwind-style `div.flex.p-4`, etc.) that are hard to target, this pushes it to look for semantic selectors such as `article` or `main[role="main"]`.
2. **Emphasis on `exclude` (exclusion rules)**:
    * When having an LLM read documentation, the biggest enemy is noise: navigation menus, ads, "Was this article helpful?" buttons, and so on. Including rules that strip these out at the CSS-selector level dramatically improves the quality of the downstream Markdown conversion (see the sketch after this list).
3. **Similarity across the TanStack family**:
    * The TanStack sites most likely share the same documentation generator, but class names can differ slightly. Deliberately instructing the model to "verify each site individually" lets the AI decide which parts can be shared and which cannot.
4. **Specifying the JSON structure**:
    * The schema is defined up front so that the "MCP server" you want to build can read it easily. The generated JSON can therefore be saved directly as a configuration file.
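As a rough illustration of point 2, here is a minimal sketch of applying the `exclude` rules before the Markdown conversion. The `markdownify` library and the flat list of selectors are assumptions made for this example, not part of the original design:

```python
from bs4 import BeautifulSoup
from markdownify import markdownify  # assumed HTML-to-Markdown converter


def html_to_clean_markdown(html: str, exclude_selectors: list[str]) -> str:
    """Drop noisy elements matched by the exclude rules, then convert to Markdown."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in exclude_selectors:
        for tag in soup.select(selector):
            tag.decompose()  # remove navigation menus, ads, feedback widgets, ...
    return markdownify(str(soup), heading_style="ATX")
```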
(Response finish reason: STOP)
---
User:
Ah, what I actually want is a prompt that, given a URL, works out how the tool should gather information for that site. Rather than fetching everything in one go, the idea is that I hand it any documentation domain and it derives the settings for that domain.
The prompt for searching and the prompt for converting a specific document to Markdown should probably be separate, too.
---
<details>
<summary>AI's thought process</summary>
**Examining the Core Task**
I've been drilling down on the fundamental objective: crafting precise prompts for Claude Code. My focus is now on how best to utilize `chrome-devtools-mcp` to retrieve config data from document processing systems given a URL. I'm exploring different prompt strategies to ensure comprehensive data extraction.
**Developing Config Extraction Strategy**
I'm now focusing on generating the "Collective Intelligence JSON" config. The challenge is clear: turning a URL into structured data. My current thinking revolves around two distinct prompt types. Type A focuses on site navigation, identifying pages. Type B is all about extracting title and body, filtering out irrelevant content. The user's input highlights two distinct phases, Search/Discovery and Single Document config, that inform my prompt strategy.
**Formulating a Master Prompt**
I'm now zeroing in on a "Master Prompt" approach. Instead of separate modules, the idea is to handle a specific domain/URL with a single prompt. My focus is on delivering a streamlined process for the user. I've been considering this new structure, and how best to represent it to the user. This master prompt will need to cover title selection, main content identification, exclusion of irrelevant elements (ads, navigation), and potential navigation rules, if needed. I'm focusing on simplicity for the user.
**Designing Unified Prompts**
I'm now integrating the "Analyzer" prompt (Scenario 1) with the "Search/Navigation" aspect (Scenario 2). My focus is to create a single, versatile prompt that handles both data extraction and site traversal. I'm exploring how to embed logic for identifying search features (e.g., Algolia boxes) and crawling mechanisms (sitemaps, sidebars) within the same prompt. The goal is to provide a seamless, unified solution. The user will get a single prompt capable of addressing all needs.
**Synthesizing Prompt Architecture**
I'm now integrating the "Config Generator" and "Search Logic Analyzer" prompts into a cohesive system. My focus is on creating a two-prompt workflow. The first handles immediate content extraction, while the second dynamically analyzes the site for search features or alternative traversal methods. I'm exploring how best to leverage `search_document` and `read_document` functionalities for a smooth user experience.
**Defining Config Output Strategy**
I'm now focusing on the final output format. I'm prioritizing clarity and completeness, and moving towards defining the JSON schema that will act as the contract for the output. The current thought is to generate a comprehensive prompt, specifically tailored for scenarios requiring site navigation like this case, ensuring the prompt will handle potential fallback situations. The output must reliably get the sidebar link selector, along with code samples and the reason the design pattern is preferred.
**Refining JSON Output**
I'm now zeroing in on the JSON structure for the final output. The goal is to accommodate both scenarios: direct content extraction and site navigation. My plan is to expand the JSON to include separate sections for "selectors" (for direct content) and "navigation" (for finding related documents). This will ensure a robust and flexible output, catering to diverse use cases. I'm focusing on simplicity and clarity within this JSON schema.
</details>
AI:
I see — what you want is a prompt that acts as a **config generator**: **given an arbitrary URL, it produces the configuration (JSON) needed to scrape that site**.
As you suspected, the aspects to analyze differ according to the roles of the MCP tools (`search_document` and `read_document`).
1. **For Read (extraction)**: structural analysis of a single article: its title, body, and the noise to remove
2. **For Search (discovery)**: structural analysis of the whole site: sidebar structure, "next page" links, and the sitemap
I have prepared two specialized prompts to have Claude Code carry these out.
---
### Prerequisite: the JSON schema to output (our shared language)
We will have Claude Code target this format.
```json
{
  "domain": "target domain",
  "extraction": {
    "title": "CSS selector for the title element",
    "content": "CSS selector for the body wrapper",
    "exclude": ["array of selectors for elements to remove"]
  },
  "navigation": {
    "sidebar_items": "selector for the sidebar link elements",
    "pagination_next": "selector for the 'Next' button"
  }
}
```
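For reference, one way the MCP server might model this schema on its side; the `TypedDict` names are illustrative assumptions rather than part of the prompt contract:

```python
from typing import TypedDict


class ExtractionConfig(TypedDict):
    title: str          # CSS selector for the title element
    content: str        # CSS selector for the body wrapper
    exclude: list[str]  # selectors for elements to remove


class NavigationConfig(TypedDict):
    sidebar_items: str    # selector for sidebar link elements
    pagination_next: str  # selector for the "Next" button


class SiteConfig(TypedDict):
    domain: str
    extraction: ExtractionConfig
    navigation: NavigationConfig
```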
---
### How to use it (workflow overview)
The development flow looks like this.
1. You decide you want to add the TanStack Start documentation.
2. You give **Claude Code** `Prompt 1` + `https://tanstack.com/start/...`.
    * → It returns the `extraction` config (JSON).
3. You give **Claude Code** `Prompt 2` + the same URL.
    * → It returns the `navigation` config (JSON).
4. You merge the returned JSON and append it to your "collective intelligence" configuration file (config.json), as in the sketch below.
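A minimal sketch of step 4, assuming the two responses were saved to files; the file layout (a JSON array of site entries in config.json) is an assumption for illustration:

```python
import json
from pathlib import Path


def append_site_config(extraction_path: str, navigation_path: str,
                       config_path: str = "config.json") -> None:
    """Merge one site's extraction/navigation JSON into the shared config file."""
    extraction = json.loads(Path(extraction_path).read_text())
    navigation = json.loads(Path(navigation_path).read_text())

    entry = {
        "domain": extraction.get("domain") or navigation.get("domain"),
        "extraction": extraction.get("extraction", extraction),
        "navigation": navigation.get("navigation", navigation),
    }

    config_file = Path(config_path)
    sites = json.loads(config_file.read_text()) if config_file.exists() else []
    # Replace any existing entry for the same domain, then append the new one
    sites = [s for s in sites if s.get("domain") != entry["domain"]]
    sites.append(entry)
    config_file.write_text(json.dumps(sites, indent=2, ensure_ascii=False))
```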
#### Why split them?
* **Better accuracy**: making the model think about content extraction and link discovery at the same time tends to overflow the context window or make the DOM analysis sloppy.
* **Different uses**: implementing the `search_document` tool only needs the `navigation` information; the `extraction` information is only needed once the user actually says "read this one".
With this approach you can also adapt flexibly to future changes such as "search will be delegated to the Google Custom Search API, so the `navigation` config is no longer needed".
(Response finish reason: STOP)
---
User:
Also, we probably need information on how to handle pagination.
---
<details>
<summary>AI's thought process</summary>
**Refining the Approach**
I'm zeroing in on the best prompt strategy. Analyzing the request, I see the core need is extracting *pagination* details from docs, for Collective Intelligence setup. Now, I'm considering how to phrase the prompt so it's clear and effective, without being too verbose. I'm focusing on the user's specific goal and making sure the prompt reflects that concisely.
**Mapping Pagination Strategies**
I've been examining different pagination types prevalent in documentation, focusing on how a Chrome DevTools MCP server and LLM can crawl the content. I've narrowed it down to four main approaches: linear "next/previous" flow, list-based numbered pagination, infinite scrolling (unlikely), and most commonly sidebar navigation. The goal is to define specific prompts tailored to each to ensure full content extraction.
**Pinpointing Crawler Configuration**
I'm now identifying the crucial pieces of information the crawler needs: the "Next" button's CSS selector, the link's attribute (often `href`), and the overall pagination *strategy* (linear or list-based). I'm drafting prompts, starting with a basic attempt and then getting increasingly specific, to cover various scenarios. I'm iterating toward a robust "Config Generator" approach.
**Determining the Navigation Type**
I'm now prioritizing the critical distinction between "Next page in a list" and "Next chapter in a guide," essential for accurate content extraction. I'm focusing on crafting prompts that can reliably identify the "Next" button's function and whether it's part of a pagination sequence or linear document flow. This helps determine the correct crawler configuration. Furthermore, I've defined the necessary JSON output format for this crucial "pagination" information.
**Integrating Single-Page Detection**
I've realized a simple "next" button prompt isn't sufficient. Now I'm integrating single-page detection, as sites like Zod have all content on one page, while others, like TanStack, are multi-page. This requires the LLM to analyze the structure and determine if it's SPA/anchor-based or multi-page, for correct crawler configuration. I'm adding a prompt to verify the existence of a "Next" button too.
</details>
AI:
Exactly. Especially when the crawler reads the documentation page by page, a configuration that precisely identifies the "link to the next page" (pagination) is essential.
Documentation sites mostly use two kinds of paging patterns:
1. **Linear navigation (sequential)**: a link to the next chapter at the end of the article, such as "Next: Introduction" (most documentation works this way).
2. **List pagination**: "1, 2, 3... Next" style paging on index pages such as blogs (seen on changelogs and the like).
The following prompt has the model distinguish between these and generate the configuration.
---
### Prompt 3: analyzing `pagination` (paging / sequential reading)
This prompt extracts the rule for **"once this article has been read, where should the crawler go next?"**
**Input for Claude Code:**
````markdown
# Request: analyze the documentation's pagination (sequential navigation) structure

Target URL: [insert the URL you want to analyze here]

You are a web crawler architect.
Analyze the navigation elements used to move logically to the "next chapter" or "next page" after finishing the document currently being viewed.

## Analysis steps
1. Visit the given URL and inspect the DOM, focusing on the **area near the bottom of the page**.
2. Look for link elements with "Next" / "Previous" context.
   - The target is the footer navigation **immediately after the article body**, not the sidebar.

## Items to investigate
1. **Next Link Selector**: the CSS selector for the `a` tag that navigates to the next article.
   - Point at the `a` tag itself (the element carrying `href`), not a wrapper `div`.
2. **Title Attribute**: if the link contains an element holding the next article's title, its selector (e.g. `span.title`).
3. **Strategy**: decide whether this site "changes URL per page (multi_page)" or is "self-contained on one page with scroll/anchor navigation only (single_page)".
   - Hint: Zod is `single_page`, while TanStack Start and the like are `multi_page`.

## Output format
Output only the following JSON. No explanation is needed.
```json
{
  "pagination": {
    "type": "string (e.g. 'linear_next_button' | 'single_page_scroll')",
    "next_link": "string (e.g. a.pagination-nav__link--next)",
    "prev_link": "string (e.g. a.pagination-nav__link--prev)",
    "next_title_selector": "string (e.g. .pagination-nav__label)"
  }
}
```
````
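A rough sketch of how a crawler might consume this `pagination` config once it comes back; `requests`/`BeautifulSoup` and the loop bounds are implementation assumptions, not something the prompt prescribes:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl_linear_docs(start_url: str, pagination: dict, max_pages: int = 200) -> list[str]:
    """Follow the configured 'next' link and collect page URLs in reading order."""
    visited: list[str] = []
    url: str | None = start_url
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        if pagination.get("type") == "single_page_scroll":
            break  # everything lives on one page; nothing further to follow
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        selector = pagination.get("next_link")
        next_el = soup.select_one(selector) if selector else None
        href = next_el.get("href") if next_el else None
        url = urljoin(url, href) if href else None
    return visited
```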
I've been meticulously analyzing the user's input, particularly their observation that the selector should be an array and executed sequentially. This sparked a deep dive into how to best structure the JSON schema to accommodate this crucial sequential execution requirement. I'm exploring different approaches, aiming to devise a strategy that is both elegant and efficient.
**Developing Robust Scrapers**
I'm now fully immersed in building a robust MCP (Model Context Protocol) server designed to scrape documentation for TanStack Start and related technologies. My primary aim is to mitigate the "Selector Fragility" issue I pinpointed earlier, and I'm currently using a collective intelligence approach. Leveraging JSON configurations generated by large language models, like Claude Code, I'm working to enhance reliability by adopting an array-of-selectors strategy where failure of one selector triggers a retry with the next. This has led to a slight increase in scraping-logic complexity, but should result in a significant robustness gain.
**Refining Scraper Logic**
I've been meticulously refining the scraping logic to incorporate fallback and exclusion selector arrays. My focus is on ensuring the MCP server effectively processes these arrays, mirroring the behavior described. I'm thinking about how to frame this in a way that minimizes ambiguity. For instance, title/content extractions will use a "first match wins" approach and exclude extractions employ a "remove all matching" strategy. I've also started outlining the Python/JS pseudocode to handle selector array processing.
**Implementing Selector Priorities**
I'm now focusing on refining the proposed JSON schema. Simply transitioning from a string to an array for selectors isn't sufficient. To enhance clarity, I'm defining selector priorities with "Fallback" (first match wins) and "Aggregation" (apply all) strategies. I'm modifying prompts to elicit "Primary," "Secondary," and "Fallback" selectors, using the TanStack site as a concrete example to illustrate how selectors such as title might be specified using this method. I'm also planning to incorporate advice about distinguishing a "Fallback Strategy" from a "Union Strategy" when writing this guidance.
**Distinguishing Selector Behavior**
I'm currently honing in on the critical distinction between how we handle exclude selectors versus title/content selectors within the array structure. The array order now directly dictates priority: index 0 is the highest. But the exclude array needs distinct handling—it should apply all selectors listed, unlike title or content which use a "first match wins" approach. This difference is paramount for ensuring correct behavior within the scraper logic. I'm focusing on how to clearly articulate this difference in the schema and the related guidance.
---
### 3. Implementation sketch on the MCP server side (Python)
The logic that takes the config file and actually parses the DOM looks roughly like this.
The key is to use `find_first_valid` (take whichever selector matches first) for title/content and `remove_all` (delete every match) for exclusions.
```python
from bs4 import BeautifulSoup


def extract_content(html: str, config: dict):
    soup = BeautifulSoup(html, 'html.parser')

    # 1. Remove every unwanted element (union strategy):
    #    anything matching a selector in the exclude list is dropped unconditionally
    for selector in config['extraction'].get('exclude', []):
        for tag in soup.select(selector):
            tag.decompose()

    # 2. Get the title (fallback strategy):
    #    try the selectors from top to bottom and use the first match
    title = None
    for selector in config['extraction'].get('title', []):
        element = soup.select_one(selector)
        if element:
            title = element.get_text(strip=True)
            break  # found one, stop looking

    # 3. Get the body content (fallback strategy)
    content_html = None
    for selector in config['extraction'].get('content', []):
        element = soup.select_one(selector)
        if element:
            # Keep the element together with its inner HTML
            content_html = str(element)
            break

    return {
        "title": title or "No Title Found",
        "content": content_html or ""
    }
```
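For example, it could be called like this; the selector values below are hypothetical stand-ins for what the generated config would contain:

```python
sample_config = {
    "extraction": {
        "title": ["article h1", "h1"],                   # fallback: first match wins
        "content": ["main article", "main"],             # fallback: first match wins
        "exclude": ["nav", "aside", ".feedback-widget"]  # union: remove every match
    }
}

html = "<main><nav>menu</nav><article><h1>Quick Start</h1><p>Hello</p></article></main>"
print(extract_content(html, sample_config))
# {'title': 'Quick Start', 'content': '<article><h1>Quick Start</h1><p>Hello</p></article>'}
```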
## Rethinking the document extraction engine: collective intelligence vs. a generic algorithm
### 🤔 Was "collective intelligence" really the right answer?
Until now I had been settling on an approach of externalizing the document structure definitions into a JSON config file (the "collective intelligence") and extracting content with carefully chosen CSS selectors.
But then I remembered Mozilla Readability.[^1] It is the library at the core of Firefox's Reader View. Without any CSS-selector configuration, it heuristically auto-extracts the part of the page that "looks like the article body".
If that works, wouldn't it remove the need to keep painstakingly writing JSON like "on this site the title is `h1.title`..."?
### 🔬 Examining the Readability approach
Readability works by walking the DOM tree and scoring elements by text density, link density, class names (whether they contain `article`, `body`, `content`, and so on), and element type, then picking the most "article-like" element.
Python has ports such as `readability-lxml` and `breadability`, and in JavaScript (Node.js) the original `@mozilla/readability` is available. I'm not attached to a particular language, so switching to JS is fine if it yields better accuracy.
### Comparison: config-based vs. heuristic-based
### Challenges specific to technical documentation
Technical documentation can be structured differently from a typical blog post.
A `<nav>` sometimes ends up nested inside `<main>`, and Readability may not strip it out completely.
### 💡 Switching to a hybrid strategy
Relying entirely on Readability is risky, but going fully manual (config-based) is painful to maintain.
So why not use both?
I'm revising the base strategy to a "hybrid".
If an `extraction` rule exists, use it with top priority. That path is for when 100% precision is the goal, or for unusual sites where Readability fails. This way, when I want to add documentation for a minor library, I get the convenience of "just register the URL and Readability will handle it" without writing JSON each time, and I only write the JSON when I'm not happy with the accuracy. This greatly improves both the robustness and the usability of the system.
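A minimal sketch of that hybrid dispatch, assuming the `readability-lxml` port and reusing the `extract_content` function from the earlier snippet; it illustrates the idea rather than a finished design:

```python
from readability import Document  # readability-lxml port of Mozilla Readability


def extract_with_fallback(html: str, config: dict | None) -> dict:
    """Prefer hand-written extraction rules; otherwise let Readability guess."""
    if config and config.get("extraction"):
        # Config-based path: for sites where 100% precision matters
        return extract_content(html, config)
    # Heuristic path: no JSON needed, Readability picks the article-like element
    doc = Document(html)
    return {
        "title": doc.short_title(),
        "content": doc.summary(html_partial=True),
    }
```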
### 🔄 Updating the overall architecture of the search system
Based on the discussion so far, the MCP server logic is updated as follows.
**Phase 1: Navigation analysis**
This is the one part Readability cannot replace. The structure of "next page" links and sidebars still requires a config file (or analysis by Claude Code). No change here.
**Phase 2: Content extraction**
This is where things become flexible.
With this approach, the config files generated via `chrome-devtools-mcp` consist of a required `navigation` section and an optional `extraction` section (only when you want to raise accuracy). The "collective intelligence" can then be focused on the more advanced parts: sites where Readability does not work well, and pagination analysis. This makes sense.
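For instance, a newly registered site entry might then contain only the required `navigation` block; the domain and selector values below are hypothetical:

```python
minimal_site_entry = {
    "domain": "docs.example-lib.dev",      # hypothetical site registered "URL only"
    "navigation": {                        # required: Readability cannot infer this part
        "sidebar_items": "nav.sidebar a",
        "pagination_next": "a[rel='next']",
    },
    # "extraction": {...}  # optional: add only if Readability's output disappoints
}
```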
[^1]: The de facto standard algorithm for extracting the main text of a web page. Originally a bookmarklet developed by Arc90, it is now maintained by Mozilla. It strips away extraneous decoration and converts the page into an easy-to-read format.