Utility scripts for analyzing, filtering, and cleaning synthetic datasets generated by DeepFabric.
Generic quality filter for tool-calling datasets. Removes problematic patterns that can cause models to develop bad habits during training.
Features:
- Auto-detection mode discovers issues from the data itself
- Schema-agnostic: works with any tool-calling dataset (Blender, Kubernetes, GitHub, etc.)
- Removes samples with template placeholders (e.g., `{{objectName}}`)
- Filters excessive same-tool calls
- Detects and removes recovery/fallback patterns
- Deduplicates similar samples
- Balances over-represented tool distributions
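The placeholder check, for instance, can be approximated with a regex scan over each sample's serialized JSON (a minimal sketch; the script's actual implementation may differ):

```python
import json
import re

# Matches unresolved template placeholders such as {{objectName}}
PLACEHOLDER_RE = re.compile(r"\{\{[^}]+\}\}")

def has_placeholder(sample: dict) -> bool:
    """Return True if any field of the sample contains a {{...}} placeholder."""
    return bool(PLACEHOLDER_RE.search(json.dumps(sample)))

print(has_placeholder({"response": "Deleted {{objectName}}"}))  # True
print(has_placeholder({"response": "Deleted Cube.001"}))        # False
```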
Usage:
Before filtering, you should analyze your dataset to understand what issues exist. The analyze mode scans your dataset and reports statistics about tool usage patterns, identifies tools that appear too frequently as the first call (which can bias models), detects suspicious sequences that may indicate error recovery behavior, and counts samples containing template placeholders or excessive repeated tool calls. This is a read-only operation that helps you decide what filtering to apply.
```bash
python tools/filter_tool_dataset.py --analyze input.jsonl
```

The simplest way to clean a dataset is to use auto-detection mode. The script will analyze your dataset, automatically identify problematic patterns based on statistical thresholds, and apply appropriate filters. This works well for most datasets without requiring you to know the specific tool names or patterns in advance.
```bash
python tools/filter_tool_dataset.py input.jsonl output.jsonl --auto
```

When training data contains too many samples that start with the same tool, models learn to over-rely on that tool. For example, if 40% of your samples begin with get_scene_info, the model may learn to call it unnecessarily. Tool balancing downsamples over-represented tools to create a more even distribution. Use this when you know which specific tools are problematic in your dataset.
```bash
python tools/filter_tool_dataset.py input.jsonl output.jsonl \
    --balance-tools "get_scene_info,list_pods"
```

Recovery patterns are tool sequences where one tool is consistently followed by another, often indicating error-handling or fallback behavior in the training data. For example, if get_object_info frequently fails and is followed by get_scene_info as a fallback, the model may learn this unhelpful pattern. Specify patterns as colon-separated pairs to remove samples containing these sequences.
```bash
python tools/filter_tool_dataset.py input.jsonl output.jsonl \
    --recovery-patterns "get_object_info:get_scene_info"
```

For complex filtering rules, or when you need to apply the same filters across multiple datasets, you can define your configuration in a YAML file. This is especially useful for team workflows where you want consistent, reproducible filtering.
```bash
python tools/filter_tool_dataset.py input.jsonl output.jsonl --config filter.yaml
```

Options:
| Option | Description |
|---|---|
| --analyze | Only analyze, don't filter |
| --auto | Auto-detect and apply patterns from data |
| --config FILE | YAML config file for domain-specific rules |
| --balance-tools LIST | Comma-separated tools to balance |
| --recovery-patterns LIST | Comma-separated A:B patterns to filter |
| --max-same-tool N | Max calls to same tool per sample (default: 3) |
| --similarity-threshold N | Deduplication threshold 0.0-1.0 (default: 0.85) |
| --balance-target N | Target percentage for balanced tools (default: 0.10) |
| --first-tool-threshold N | Threshold for flagging first tools (default: 0.15) |
| --no-balance | Skip tool balancing |
| --keep-recovery | Keep samples with recovery patterns |
| --keep-broken | Keep samples with broken/placeholder responses |
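As an illustration of how --similarity-threshold behaves, greedy deduplication with Python's `difflib.SequenceMatcher` looks roughly like this (a sketch over plain string samples, not the script's actual code):

```python
from difflib import SequenceMatcher

def dedupe(samples: list[str], threshold: float = 0.85) -> list[str]:
    """Keep a sample only if it is below `threshold` similarity to every kept one."""
    kept: list[str] = []
    for s in samples:
        if all(SequenceMatcher(None, s, k).ratio() < threshold for k in kept):
            kept.append(s)
    return kept

samples = [
    "Rotate the cube by 90 degrees",
    "Rotate the cube by 45 degrees",   # near-duplicate, dropped at 0.85
    "List all pods in the default namespace",
]
print(dedupe(samples))
```

A higher threshold keeps more borderline pairs; a lower one deduplicates more aggressively.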
YAML Config Example:

```yaml
# filter-config.yaml
broken_patterns:
  - "\\{\\{[^}]+\\}\\}"   # Template placeholders
  - "error.*not found"    # Error responses
recovery_sequences:
  - [get_object_info, get_scene_info]  # Blender-specific
  - [get_pod, list_pods]               # Kubernetes-specific
balance_tools:
  - get_scene_info
  - list_pods
max_same_tool_calls: 3
similarity_threshold: 0.85
balance_target_percentage: 0.10
```

When generating topic trees with LLMs, the same or very similar topics can appear multiple times in different branches. This creates redundancy in your dataset and can lead to repetitive training examples. This script detects and removes duplicate topics from JSON graph files, preserving the tree structure by merging children and updating all references.
Features:
- Multiple matching strategies: exact hash, case-insensitive, fuzzy
- Reports duplicate groups with statistics
- Merges children when removing duplicates
- Updates all parent/child references
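Conceptually, the merge step can be sketched as follows, assuming an id-keyed node map where each node has a `children` list (these field names are assumptions, not the script's actual schema):

```python
def merge_duplicate(nodes: dict[int, dict], keep_id: int, remove_id: int) -> None:
    """Merge `remove_id` into `keep_id`: move children, rewrite references."""
    keep, removed = nodes[keep_id], nodes.pop(remove_id)
    # Move the removed node's children under the kept node, skipping duplicates
    for child in removed.get("children", []):
        if child not in keep.setdefault("children", []):
            keep["children"].append(child)
    # Rewrite any remaining reference to the removed node
    for node in nodes.values():
        node["children"] = [keep_id if c == remove_id else c
                            for c in node.get("children", [])]
    # Drop any self-reference created by the rewrite
    keep["children"] = [c for c in keep["children"] if c != keep_id]

nodes = {
    1: {"topic": "Python", "children": [2, 3]},
    2: {"topic": "python", "children": [4]},   # duplicate of node 1
    3: {"topic": "Rust", "children": []},
    4: {"topic": "Decorators", "children": []},
}
merge_duplicate(nodes, keep_id=1, remove_id=2)
print(nodes[1]["children"])  # [3, 4]
```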
Usage:
Before removing anything, run the script in report mode to see what duplicates exist. This shows you groups of duplicate topics, their node IDs, and the topic text. By default, it uses exact matching based on SHA256 hashes of the topic text.
```bash
python tools/dedupe_graph.py --input graph.json
```

Sometimes topics are duplicated with different capitalization (e.g., "Machine Learning" vs. "machine learning"). Case-insensitive matching normalizes all text to lowercase before comparing, catching these variations.
```bash
python tools/dedupe_graph.py --input graph.json --strategy case-insensitive
```

For near-duplicates that aren't exact matches (e.g., "Introduction to Python" vs. "Python Introduction"), fuzzy matching uses string similarity to find topics that are close enough to be considered duplicates. The threshold (0.0 to 1.0) controls how similar topics must be: 0.85 means 85% similar.
```bash
python tools/dedupe_graph.py --input graph.json --strategy fuzzy --threshold 0.85
```

Once you've reviewed the duplicates and are satisfied with what will be removed, run in remove mode. The script keeps the node with the lowest ID from each duplicate group, merges children from removed nodes into the kept node, and updates all parent/child references throughout the graph.
```bash
python tools/dedupe_graph.py --input graph.json --output deduped.json --mode remove
```

Options:
| Option | Description |
|---|---|
| --input, -i | Input JSON graph file (required) |
| --output, -o | Output file for deduplicated graph |
| --mode, -m | report (default) or remove |
| --strategy, -s | exact (default), case-insensitive, or fuzzy |
| --threshold, -t | Similarity threshold for fuzzy matching (default: 0.9) |
| --verbose, -v | Show detailed output |
Strategies:
- exact: Matches by SHA256 hash of topic text (uses pre-computed `topic_hash` if available)
- case-insensitive: Normalizes to lowercase before hashing
- fuzzy: Uses SequenceMatcher for similarity-based grouping
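The exact strategy amounts to grouping node IDs by a SHA256 digest of the topic text, roughly like this (a sketch; the `id` and `topic` field names are assumptions):

```python
import hashlib
from collections import defaultdict

def group_by_hash(nodes: list[dict]) -> dict[str, list[int]]:
    """Group node ids by SHA256 of their topic text; groups of >1 are duplicates."""
    groups: dict[str, list[int]] = defaultdict(list)
    for node in nodes:
        digest = hashlib.sha256(node["topic"].encode("utf-8")).hexdigest()
        groups[digest].append(node["id"])
    return {h: ids for h, ids in groups.items() if len(ids) > 1}

nodes = [
    {"id": 1, "topic": "Neural Networks"},
    {"id": 5, "topic": "Neural Networks"},  # exact duplicate of id 1
    {"id": 7, "topic": "neural networks"},  # only caught by case-insensitive
]
print(group_by_hash(nodes))  # one duplicate group: ids [1, 5]
```

The case-insensitive strategy is the same idea with `node["topic"].lower()` hashed instead.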
```bash
# 1. Analyze the dataset first
python tools/filter_tool_dataset.py --analyze my-dataset.jsonl

# 2. Review the output and decide on a filtering strategy

# 3. Apply filters (auto mode for convenience)
python tools/filter_tool_dataset.py my-dataset.jsonl my-dataset-clean.jsonl --auto

# Or with explicit settings based on the analysis
python tools/filter_tool_dataset.py my-dataset.jsonl my-dataset-clean.jsonl \
    --balance-tools "get_scene_info" \
    --recovery-patterns "get_object_info:get_scene_info"
```

```bash
# 1. Check for duplicates
python tools/dedupe_graph.py --input topics.json --verbose

# 2. If duplicates are found, remove them
python tools/dedupe_graph.py --input topics.json --output topics-deduped.json --mode remove
```