@lukehinds
Created January 8, 2026 17:04
DeepFabric Dataset Tools

Utility scripts for analyzing, filtering, and cleaning synthetic datasets generated by DeepFabric.

Scripts

filter_tool_dataset.py

Generic quality filter for tool-calling datasets. Removes problematic patterns that can cause models to develop bad habits during training.

Features:

  • Auto-detection mode discovers issues from the data itself
  • Schema-agnostic: works with any tool-calling dataset (Blender, Kubernetes, GitHub, etc.)
  • Removes samples with template placeholders (e.g., {{objectName}})
  • Filters excessive same-tool calls
  • Detects and removes recovery/fallback patterns
  • Deduplicates similar samples
  • Balances over-represented tool distributions
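
As an illustration of the placeholder check, here is a minimal sketch. The regex mirrors the `{{objectName}}`-style tokens mentioned above; the actual script may use a different expression.

```python
import re

# Matches mustache-style template placeholders such as {{objectName}}.
PLACEHOLDER_RE = re.compile(r"\{\{[^}]+\}\}")

def has_placeholder(text: str) -> bool:
    """Return True if the text still contains an unfilled template token."""
    return bool(PLACEHOLDER_RE.search(text))
```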

Usage:

Analyzing a Dataset

Before filtering, analyze your dataset to understand what issues exist. Analyze mode scans the dataset and reports statistics about tool usage patterns: it identifies tools that appear too frequently as the first call (which can bias models), detects suspicious sequences that may indicate error-recovery behavior, and counts samples containing template placeholders or excessive repeated tool calls. This is a read-only operation that helps you decide which filtering to apply.

python tools/filter_tool_dataset.py --analyze input.jsonl
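
The first-tool statistic reported by analysis can be sketched as follows. The sample schema here (a `tool_calls` list of `{"name": ...}` dicts) is an assumption for illustration, not the script's actual format.

```python
from collections import Counter

def first_tool_frequencies(samples):
    """Return the share of samples that start with each tool.

    Assumes each sample is a dict whose "tool_calls" entry is a list of
    {"name": ...} dicts; the real dataset schema may differ.
    """
    counts = Counter(
        s["tool_calls"][0]["name"] for s in samples if s.get("tool_calls")
    )
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}
```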

Filtering with Auto-Detection

The simplest way to clean a dataset is to use auto-detection mode. The script will analyze your dataset, automatically identify problematic patterns based on statistical thresholds, and apply appropriate filters. This works well for most datasets without requiring you to know the specific tool names or patterns in advance.

python tools/filter_tool_dataset.py input.jsonl output.jsonl --auto
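
The first-call heuristic behind auto-detection might look roughly like this. The 0.15 default matches the documented `--first-tool-threshold`; the actual selection logic is an assumption.

```python
def flag_overused_first_tools(freqs, threshold=0.15):
    """Return tools whose first-call share exceeds the threshold.

    `freqs` maps tool name -> fraction of samples starting with that tool,
    as produced by an analysis pass over the dataset.
    """
    return sorted(tool for tool, share in freqs.items() if share > threshold)
```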

Filtering with Explicit Tool Balancing

When training data contains too many samples that start with the same tool, models learn to over-rely on that tool. For example, if 40% of your samples begin with get_scene_info, the model may learn to call it unnecessarily. Tool balancing downsamples over-represented tools to create a more even distribution. Use this when you know which specific tools are problematic in your dataset.

python tools/filter_tool_dataset.py input.jsonl output.jsonl \
    --balance-tools "get_scene_info,list_pods"
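
Downsampling to a target share can be sketched like this. It is a simplified take on the behavior described above, using the same assumed sample schema; the real script's sampling details may differ.

```python
import random

def downsample_first_tool(samples, tool, target=0.10, seed=0):
    """Downsample samples starting with `tool` to at most `target` of the set."""
    starts, rest = [], []
    for s in samples:
        calls = s.get("tool_calls") or []
        (starts if calls and calls[0]["name"] == tool else rest).append(s)
    # keep / (keep + len(rest)) <= target  =>  keep <= target * len(rest) / (1 - target)
    keep = min(len(starts), int(target * len(rest) / (1 - target)))
    return rest + random.Random(seed).sample(starts, keep)
```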

Filtering Recovery Patterns

Recovery patterns are tool sequences where one tool is consistently followed by another, often indicating error-handling or fallback behavior in the training data. For example, if get_object_info frequently fails and is followed by get_scene_info as a fallback, the model may learn this unhelpful pattern. Specify patterns as colon-separated pairs to remove samples containing these sequences.

python tools/filter_tool_dataset.py input.jsonl output.jsonl \
    --recovery-patterns "get_object_info:get_scene_info"
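
Detecting such a sequence in a sample's tool calls reduces to checking for adjacent pairs, which can be sketched as:

```python
def has_recovery_pattern(tool_sequence, pattern):
    """True if tool A is immediately followed by tool B in the sequence.

    `pattern` uses the same colon-separated "A:B" form as the CLI flag.
    """
    first, second = pattern.split(":")
    return any(
        a == first and b == second
        for a, b in zip(tool_sequence, tool_sequence[1:])
    )
```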

Using a Configuration File

For complex filtering rules or when you need to apply the same filters across multiple datasets, you can define your configuration in a YAML file. This is especially useful for team workflows where you want consistent, reproducible filtering.

python tools/filter_tool_dataset.py input.jsonl output.jsonl --config filter.yaml

Options:

| Option | Description |
| --- | --- |
| `--analyze` | Only analyze, don't filter |
| `--auto` | Auto-detect and apply patterns from data |
| `--config FILE` | YAML config file for domain-specific rules |
| `--balance-tools LIST` | Comma-separated tools to balance |
| `--recovery-patterns LIST` | Comma-separated A:B patterns to filter |
| `--max-same-tool N` | Max calls to same tool per sample (default: 3) |
| `--similarity-threshold N` | Deduplication threshold 0.0-1.0 (default: 0.85) |
| `--balance-target N` | Target percentage for balanced tools (default: 0.10) |
| `--first-tool-threshold N` | Threshold for flagging first tools (default: 0.15) |
| `--no-balance` | Skip tool balancing |
| `--keep-recovery` | Keep samples with recovery patterns |
| `--keep-broken` | Keep samples with broken/placeholder responses |

YAML Config Example:

# filter-config.yaml
broken_patterns:
  - "\\{\\{[^}]+\\}\\}"  # Template placeholders
  - "error.*not found"   # Error responses

recovery_sequences:
  - [get_object_info, get_scene_info]  # Blender-specific
  - [get_pod, list_pods]               # Kubernetes-specific

balance_tools:
  - get_scene_info
  - list_pods

max_same_tool_calls: 3
similarity_threshold: 0.85
balance_target_percentage: 0.10

dedupe_graph.py

When generating topic trees with LLMs, the same or very similar topics can appear multiple times in different branches. This creates redundancy in your dataset and can lead to repetitive training examples. This script detects and removes duplicate topics from JSON graph files, preserving the tree structure by merging children and updating all references.

Features:

  • Multiple matching strategies: exact hash, case-insensitive, fuzzy
  • Reports duplicate groups with statistics
  • Merges children when removing duplicates
  • Updates all parent/child references

Usage:

Detecting Duplicates

Before removing anything, run the script in report mode to see what duplicates exist. This shows you groups of duplicate topics, their node IDs, and the topic text. By default, it uses exact matching based on SHA256 hashes of the topic text.

python tools/dedupe_graph.py --input graph.json
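
Exact-match grouping by SHA256 can be sketched as below. The node schema (`id` and `topic` keys) is an assumption for illustration; the actual graph JSON may differ.

```python
import hashlib
from collections import defaultdict

def topic_hash(text):
    """SHA256 hex digest of the topic text, used as the exact-match key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def exact_duplicate_groups(nodes):
    """Group node ids whose topic text hashes identically."""
    groups = defaultdict(list)
    for node in nodes:
        groups[topic_hash(node["topic"])].append(node["id"])
    return [ids for ids in groups.values() if len(ids) > 1]
```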

Case-Insensitive Matching

Sometimes topics are duplicated with different capitalization (e.g., "Machine Learning" vs "machine learning"). Case-insensitive matching normalizes all text to lowercase before comparing, catching these variations.

python tools/dedupe_graph.py --input graph.json --strategy case-insensitive

Fuzzy Matching

For near-duplicates that aren't exact matches (e.g., "Introduction to Python" vs "Python Introduction"), fuzzy matching uses string similarity to find topics that are close enough to count as duplicates. The threshold (0.0 to 1.0) controls how similar topics must be; 0.85 means 85% similar.

python tools/dedupe_graph.py --input graph.json --strategy fuzzy --threshold 0.85
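
Since the strategies list below names `SequenceMatcher` as the similarity measure, the comparison can be sketched as:

```python
from difflib import SequenceMatcher

def fuzzy_similarity(a, b):
    """SequenceMatcher ratio in [0.0, 1.0]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()
```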

Removing Duplicates

Once you've reviewed the duplicates and are satisfied with what will be removed, run in remove mode. The script keeps the node with the lowest ID from each duplicate group, merges children from removed nodes into the kept node, and updates all parent/child references throughout the graph.

python tools/dedupe_graph.py --input graph.json --output deduped.json --mode remove
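
The keep-lowest-id merge described above can be sketched as follows. The `id -> {"children": [...]}` layout is an assumed simplification of the graph JSON, not the script's actual structure.

```python
def merge_duplicate_group(nodes, dup_ids):
    """Keep the lowest-id node in the group; fold the others' children into
    it, delete them, and remap remaining references to the kept node."""
    keep_id = min(dup_ids)
    removed = set(dup_ids) - {keep_id}
    keep = nodes[keep_id]
    for rid in removed:
        for child in nodes[rid]["children"]:
            if child not in keep["children"]:
                keep["children"].append(child)
        del nodes[rid]
    # Point surviving references at the kept node, dropping repeats in order.
    for node in nodes.values():
        node["children"] = list(dict.fromkeys(
            keep_id if c in removed else c for c in node["children"]
        ))
    return nodes
```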

Options:

| Option | Description |
| --- | --- |
| `--input, -i` | Input JSON graph file (required) |
| `--output, -o` | Output file for deduplicated graph |
| `--mode, -m` | `report` (default) or `remove` |
| `--strategy, -s` | `exact` (default), `case-insensitive`, or `fuzzy` |
| `--threshold, -t` | Similarity threshold for fuzzy matching (default: 0.9) |
| `--verbose, -v` | Show detailed output |

Strategies:

  • exact: Matches by SHA256 hash of topic text (uses pre-computed topic_hash if available)
  • case-insensitive: Normalizes to lowercase before hashing
  • fuzzy: Uses SequenceMatcher for similarity-based grouping

Common Workflows

Cleaning a Tool-Calling Dataset

# 1. Analyze the dataset first
python tools/filter_tool_dataset.py --analyze my-dataset.jsonl

# 2. Review the output and decide on filtering strategy

# 3. Apply filters (auto mode for convenience)
python tools/filter_tool_dataset.py my-dataset.jsonl my-dataset-clean.jsonl --auto

# Or with explicit settings based on analysis
python tools/filter_tool_dataset.py my-dataset.jsonl my-dataset-clean.jsonl \
    --balance-tools "get_scene_info" \
    --recovery-patterns "get_object_info:get_scene_info"

Deduplicating a Topic Graph

# 1. Check for duplicates
python tools/dedupe_graph.py --input topics.json --verbose

# 2. If duplicates found, remove them
python tools/dedupe_graph.py --input topics.json --output topics-deduped.json --mode remove