@lukehinds
Created January 8, 2026 17:04
DeepFabric Dataset Tools

Utility scripts for analyzing, filtering, and cleaning synthetic datasets generated by DeepFabric.

Scripts

filter_tool_dataset.py

Generic quality filter for tool-calling datasets. Removes problematic patterns that can cause models to develop bad habits during training.

Features:

  • Auto-detection mode discovers issues from the data itself
  • Schema-agnostic: works with any tool-calling dataset (Blender, Kubernetes, GitHub, etc.)
  • Removes samples with template placeholders (e.g., {{objectName}})
  • Filters excessive same-tool calls
  • Detects and removes recovery/fallback patterns
  • Deduplicates similar samples
  • Balances over-represented tool distributions
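
As an illustration of the placeholder check, here is a minimal sketch. The regex mirrors the `{{objectName}}`-style tokens mentioned above; the actual script may use a different expression.

```python
import re

# Matches mustache-style template placeholders such as {{objectName}}.
PLACEHOLDER_RE = re.compile(r"\{\{[^}]+\}\}")

def has_placeholder(text: str) -> bool:
    """Return True if the text still contains an unfilled template token."""
    return bool(PLACEHOLDER_RE.search(text))
```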

Usage:

Analyzing a Dataset

Before filtering, analyze your dataset to understand what issues exist. Analyze mode scans the dataset and reports statistics about tool usage patterns: it identifies tools that appear too frequently as the first call (which can bias models), detects suspicious sequences that may indicate error-recovery behavior, and counts samples containing template placeholders or excessive repeated tool calls. This is a read-only operation that helps you decide which filtering to apply.

python tools/filter_tool_dataset.py --analyze input.jsonl
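
The first-tool statistic reported by analysis can be sketched as follows. The sample schema here (a `tool_calls` list of `{"name": ...}` dicts) is an assumption for illustration, not the script's actual format.

```python
from collections import Counter

def first_tool_frequencies(samples):
    """Return the share of samples that start with each tool.

    Assumes each sample is a dict whose "tool_calls" entry is a list of
    {"name": ...} dicts; the real dataset schema may differ.
    """
    counts = Counter(
        s["tool_calls"][0]["name"] for s in samples if s.get("tool_calls")
    )
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}
```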

Filtering with Auto-Detection

The simplest way to clean a dataset is to use auto-detection mode. The script will analyze your dataset, automatically identify problematic patterns based on statistical thresholds, and apply appropriate filters. This works well for most datasets without requiring you to know the specific tool names or patterns in advance.

python tools/filter_tool_dataset.py input.jsonl output.jsonl --auto
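
The first-call heuristic behind auto-detection might look roughly like this. The 0.15 default matches the documented `--first-tool-threshold`; the actual selection logic is an assumption.

```python
def flag_overused_first_tools(freqs, threshold=0.15):
    """Return tools whose first-call share exceeds the threshold.

    `freqs` maps tool name -> fraction of samples starting with that tool,
    as produced by an analysis pass over the dataset.
    """
    return sorted(tool for tool, share in freqs.items() if share > threshold)
```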

Filtering with Explicit Tool Balancing

When training data contains too many samples that start with the same tool, models learn to over-rely on that tool. For example, if 40% of your samples begin with get_scene_info, the model may learn to call it unnecessarily. Tool balancing downsamples over-represented tools to create a more even distribution. Use this when you know which specific tools are problematic in your dataset.

python tools/filter_tool_dataset.py input.jsonl output.jsonl \
    --balance-tools "get_scene_info,list_pods"
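
Downsampling to a target share can be sketched like this. It is a simplified take on the behavior described above, using the same assumed sample schema; the real script's sampling details may differ.

```python
import random

def downsample_first_tool(samples, tool, target=0.10, seed=0):
    """Downsample samples starting with `tool` to at most `target` of the set."""
    starts, rest = [], []
    for s in samples:
        calls = s.get("tool_calls") or []
        (starts if calls and calls[0]["name"] == tool else rest).append(s)
    # keep / (keep + len(rest)) <= target  =>  keep <= target * len(rest) / (1 - target)
    keep = min(len(starts), int(target * len(rest) / (1 - target)))
    return rest + random.Random(seed).sample(starts, keep)
```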

Filtering Recovery Patterns

Recovery patterns are tool sequences where one tool is consistently followed by another, often indicating error-handling or fallback behavior in the training data. For example, if get_object_info frequently fails and is followed by get_scene_info as a fallback, the model may learn this unhelpful pattern. Specify patterns as colon-separated pairs to remove samples containing these sequences.

python tools/filter_tool_dataset.py input.jsonl output.jsonl \
    --recovery-patterns "get_object_info:get_scene_info"
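
Detecting such a sequence in a sample's tool calls reduces to checking for adjacent pairs, which can be sketched as:

```python
def has_recovery_pattern(tool_sequence, pattern):
    """True if tool A is immediately followed by tool B in the sequence.

    `pattern` uses the same colon-separated "A:B" form as the CLI flag.
    """
    first, second = pattern.split(":")
    return any(
        a == first and b == second
        for a, b in zip(tool_sequence, tool_sequence[1:])
    )
```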

Using a Configuration File

For complex filtering rules or when you need to apply the same filters across multiple datasets, you can define your configuration in a YAML file. This is especially useful for team workflows where you want consistent, reproducible filtering.

python tools/filter_tool_dataset.py input.jsonl output.jsonl --config filter.yaml

Options:

| Option | Description |
| --- | --- |
| `--analyze` | Only analyze, don't filter |
| `--auto` | Auto-detect and apply patterns from data |
| `--config FILE` | YAML config file for domain-specific rules |
| `--balance-tools LIST` | Comma-separated tools to balance |
| `--recovery-patterns LIST` | Comma-separated A:B patterns to filter |
| `--max-same-tool N` | Max calls to same tool per sample (default: 3) |
| `--similarity-threshold N` | Deduplication threshold 0.0-1.0 (default: 0.85) |
| `--balance-target N` | Target percentage for balanced tools (default: 0.10) |
| `--first-tool-threshold N` | Threshold for flagging first tools (default: 0.15) |
| `--no-balance` | Skip tool balancing |
| `--keep-recovery` | Keep samples with recovery patterns |
| `--keep-broken` | Keep samples with broken/placeholder responses |

YAML Config Example:

# filter-config.yaml
broken_patterns:
  - "\\{\\{[^}]+\\}\\}"  # Template placeholders
  - "error.*not found"   # Error responses

recovery_sequences:
  - [get_object_info, get_scene_info]  # Blender-specific
  - [get_pod, list_pods]               # Kubernetes-specific

balance_tools:
  - get_scene_info
  - list_pods

max_same_tool_calls: 3
similarity_threshold: 0.85
balance_target_percentage: 0.10

dedupe_graph.py

When generating topic trees with LLMs, the same or very similar topics can appear multiple times in different branches. This creates redundancy in your dataset and can lead to repetitive training examples. This script detects and removes duplicate topics from JSON graph files, preserving the tree structure by merging children and updating all references.

Features:

  • Multiple matching strategies: exact hash, case-insensitive, fuzzy
  • Reports duplicate groups with statistics
  • Merges children when removing duplicates
  • Updates all parent/child references

Usage:

Detecting Duplicates

Before removing anything, run the script in report mode to see what duplicates exist. This shows you groups of duplicate topics, their node IDs, and the topic text. By default, it uses exact matching based on SHA256 hashes of the topic text.

python tools/dedupe_graph.py --input graph.json
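
Exact-match grouping by SHA256 can be sketched as below. The node schema (`id` and `topic` keys) is an assumption for illustration; the actual graph JSON may differ.

```python
import hashlib
from collections import defaultdict

def topic_hash(text):
    """SHA256 hex digest of the topic text, used as the exact-match key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def exact_duplicate_groups(nodes):
    """Group node ids whose topic text hashes identically."""
    groups = defaultdict(list)
    for node in nodes:
        groups[topic_hash(node["topic"])].append(node["id"])
    return [ids for ids in groups.values() if len(ids) > 1]
```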

Case-Insensitive Matching

Sometimes topics are duplicated with different capitalization (e.g., "Machine Learning" vs "machine learning"). Case-insensitive matching normalizes all text to lowercase before comparing, catching these variations.

python tools/dedupe_graph.py --input graph.json --strategy case-insensitive

Fuzzy Matching

For near-duplicates that aren't exact matches (e.g., "Introduction to Python" vs "Python Introduction"), fuzzy matching uses string similarity to find topics that are close enough to count as duplicates. The threshold (0.0 to 1.0) controls how similar topics must be; 0.85 means 85% similar.

python tools/dedupe_graph.py --input graph.json --strategy fuzzy --threshold 0.85
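
Since the strategies list below names `SequenceMatcher` as the similarity measure, the comparison can be sketched as:

```python
from difflib import SequenceMatcher

def fuzzy_similarity(a, b):
    """SequenceMatcher ratio in [0.0, 1.0]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()
```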

Removing Duplicates

Once you've reviewed the duplicates and are satisfied with what will be removed, run in remove mode. The script keeps the node with the lowest ID from each duplicate group, merges children from removed nodes into the kept node, and updates all parent/child references throughout the graph.

python tools/dedupe_graph.py --input graph.json --output deduped.json --mode remove
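
The keep-lowest-id merge described above can be sketched as follows. The `id -> {"children": [...]}` layout is an assumed simplification of the graph JSON, not the script's actual structure.

```python
def merge_duplicate_group(nodes, dup_ids):
    """Keep the lowest-id node in the group; fold the others' children into
    it, delete them, and remap remaining references to the kept node."""
    keep_id = min(dup_ids)
    removed = set(dup_ids) - {keep_id}
    keep = nodes[keep_id]
    for rid in removed:
        for child in nodes[rid]["children"]:
            if child not in keep["children"]:
                keep["children"].append(child)
        del nodes[rid]
    # Point surviving references at the kept node, dropping repeats in order.
    for node in nodes.values():
        node["children"] = list(dict.fromkeys(
            keep_id if c in removed else c for c in node["children"]
        ))
    return nodes
```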

Options:

| Option | Description |
| --- | --- |
| `--input, -i` | Input JSON graph file (required) |
| `--output, -o` | Output file for deduplicated graph |
| `--mode, -m` | `report` (default) or `remove` |
| `--strategy, -s` | `exact` (default), `case-insensitive`, or `fuzzy` |
| `--threshold, -t` | Similarity threshold for fuzzy matching (default: 0.9) |
| `--verbose, -v` | Show detailed output |

Strategies:

  • exact: Matches by SHA256 hash of topic text (uses pre-computed topic_hash if available)
  • case-insensitive: Normalizes to lowercase before hashing
  • fuzzy: Uses SequenceMatcher for similarity-based grouping

Common Workflows

Cleaning a Tool-Calling Dataset

# 1. Analyze the dataset first
python tools/filter_tool_dataset.py --analyze my-dataset.jsonl

# 2. Review the output and decide on filtering strategy

# 3. Apply filters (auto mode for convenience)
python tools/filter_tool_dataset.py my-dataset.jsonl my-dataset-clean.jsonl --auto

# Or with explicit settings based on analysis
python tools/filter_tool_dataset.py my-dataset.jsonl my-dataset-clean.jsonl \
    --balance-tools "get_scene_info" \
    --recovery-patterns "get_object_info:get_scene_info"

Deduplicating a Topic Graph

# 1. Check for duplicates
python tools/dedupe_graph.py --input topics.json --verbose

# 2. If duplicates found, remove them
python tools/dedupe_graph.py --input topics.json --output topics-deduped.json --mode remove