Comparison of five Claude Code sessions that received the same prompt, with varying skill configurations and prompt refinements.
All sessions received essentially the same base prompt: create a demo Dagster project spanning Fivetran → dbt → Snowflake → PowerBI, plus Alteryx and Domo (being migrated off) and Census/Fivetran Activations, with event-driven sensors and observe/orchestrate modes.
skills-10 received an enhanced prompt with additional explicit instructions: "Make sure any component that connects to an external system is using a state-backed component, uses a local cache and writes a set of mock assets using that cache, and that when it executes it logs a sample message and metadata instead of connecting to the external system. When modifying a component that exists, ALWAYS subclass, do not create a custom component."
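The pattern the enhanced prompt asks for can be sketched roughly as follows. This is a minimal, dependency-free sketch: the `StateBackedComponent` base class and its method names are stand-ins modeled on the `write_state_to_path()` / `execute()` pattern described later in this document, not Dagster's actual API, and the mock state is invented for illustration.

```python
import json
import logging
import tempfile
from pathlib import Path

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("demo")

# Hypothetical stand-in for a state-backed component base class;
# the real Dagster base class and signatures may differ.
class StateBackedComponent:
    def write_state_to_path(self, path: Path) -> None:
        raise NotImplementedError

    def execute(self) -> None:
        raise NotImplementedError

class DemoFivetranComponent(StateBackedComponent):
    """Subclass that serves mock state from a local cache instead of
    connecting to the external system, as the enhanced prompt requires."""

    MOCK_STATE = {"connectors": ["salesforce", "netsuite"], "status": "synced"}

    def write_state_to_path(self, path: Path) -> None:
        # Write mock connector state to the local cache file.
        path.write_text(json.dumps(self.MOCK_STATE))

    def execute(self) -> None:
        # Log a sample message and metadata instead of calling the API.
        log.info("[MOCK] fivetran sync skipped; metadata=%s", self.MOCK_STATE)

component = DemoFivetranComponent()
cache = Path(tempfile.mkdtemp()) / "fivetran_state.json"
component.write_state_to_path(cache)
component.execute()
print(json.loads(cache.read_text())["status"])  # -> synced
```

The key design point is that the subclass overrides only the state-writing and execution hooks, so the rest of the component's behavior (asset definitions, metadata) is inherited unchanged.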
| Aspect | skills-6 | skills-6-no-demo | skills-7 | skills-9 | skills-10 |
|---|---|---|---|---|---|
| Skills used | dagster-demo + dagster-expert | dagster-expert only | dagster-expert only | dagster-expert only | dagster-expert only |
| Prompt | Base | Base | Base | Base | Enhanced (explicit mock/subclass instructions) |
| Project name | `data_stack_demo` | `analytics_orchestrator` | `data_platform` | `demo_data_stack` | `data_platform` |
| Total assets | ~20 | ~39 | ~30+ | ~20-25 | ~28 |
| Custom components | 7 (3 subclass + 4 custom) | 5 (all custom) | 8 (2 subclass + 6 custom) | 6 (3 subclass + 3 custom) | 7 (3 subclass + 4 custom) |
| Jobs | 6 + 4 scheduled | 5 explicit | 7 | 3 | 4 |
| Sensors | 2 asset sensors | 4 run-status sensors | 5 asset sensors + 1 schedule | 5 asset sensors | 2 orchestration + 2 observe |
| Schedules | 4 via ScheduledJobComponent | 1 daily at 6 AM | 1 hourly (streaming checks) | 0 | 0 |
| Demo mode | Yes — subclass overrides | No — real state-backed | Hybrid — demo fallback | Yes — full mock | Yes — DEMO_MODE=True toggle |
| Defs folders | 3 | 12+ | 13 | 7 | 8 |
| Extra features | — | Failure alert sensor | GCP Dataflow (batch + streaming), asset checks | — | Fan-out sensor, explicit demo toggle |
| Python files | ~15 | ~20+ | 13 | 12 | 12 |
| YAML files | ~5 | ~12 | 14 | 12 | 7 |
**skills-6**
- Approach: Subclassed `FivetranAccountComponent`, `DbtProjectComponent`, and `PowerBIWorkspaceComponent` to inject mock data; wrote 4 custom components (Census, Alteryx, Domo, ScheduledJob)
- Structure: 3 defs folders (data_pipeline, orchestration_sensors, pipeline_sensors) — most consolidated layout
- Orchestration: Scheduled jobs as fallback + asset sensors as the primary chain
- Data sources: 2 (Salesforce, NetSuite), 4 staging models, 2 mart models
**skills-6-no-demo**
- Approach: Used library components directly + 5 custom components built from scratch (Alteryx, Census, Domo, FivetranActivation, PowerBIWithDbt)
- Structure: 12+ defs folders (one per integration) — most granular layout
- Orchestration: Single daily schedule at 6 AM + 4 run-status sensors chaining everything + a failure alert sensor
- Data sources: 3 (Salesforce, Stripe, HubSpot), 6 staging models, 4 mart models
- Notable: Production-oriented, requires real credentials, most comprehensive data model
**skills-7**
- Approach: 2 subclassed components (Fivetran, dbt) + 6 custom (Dataflow Batch, Dataflow Streaming, Census, Alteryx, Domo, FivetranActivations)
- Structure: 13 defs folders including 3 GCP Dataflow folders (batch, streaming observed, streaming orchestrated)
- Orchestration: 5 asset sensors + 1 hourly schedule for streaming data checks
- Unique: Added GCP Dataflow integration with bounded batch jobs, unbounded streaming (observed + orchestrated modes), and asset checks for streaming data quality (freshness, volume, schema drift)
- Demo mode: Hybrid — demo fallback when credentials unavailable, real APIs when provided
- Built across 2 main sessions (initial build + Dataflow addition)
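The three streaming data-quality checks skills-7 added (freshness, volume, schema drift) can be illustrated as stand-alone predicates. Function names and thresholds below are hypothetical, not taken from the generated project; there these would run as Dagster asset checks rather than bare functions.

```python
from datetime import datetime, timedelta, timezone

def freshness_ok(last_event: datetime,
                 max_lag: timedelta = timedelta(minutes=15)) -> bool:
    # Fails if the newest record is older than the allowed lag.
    return datetime.now(timezone.utc) - last_event <= max_lag

def volume_ok(row_count: int, expected_min: int = 1000) -> bool:
    # Fails if the stream delivered suspiciously few rows.
    return row_count >= expected_min

def schema_drift_ok(observed_columns: set, expected_columns: set) -> bool:
    # Fails if columns were added or dropped relative to the contract.
    return observed_columns == expected_columns

recent = datetime.now(timezone.utc) - timedelta(minutes=5)
print(freshness_ok(recent), volume_ok(2500),
      schema_drift_ok({"id", "ts"}, {"id", "ts"}))  # -> True True True
```

Attaching each predicate as an asset check keeps data-quality failures visible on the streaming assets themselves instead of buried in job logs.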
**skills-9**
- Approach: 3 subclassed components (FivetranIngest, PowerBI, dbt) + 3 custom (Alteryx, Domo, Census)
- Structure: 7 defs folders with clean domain grouping (ingestion, transform, consumption, reverse_etl, orchestration)
- Orchestration: Purely sensor-driven (0 schedules); 5 asset sensors chain the full pipeline
- Demo mode: Full mock — all components hardcode demo credentials, with a `[MOCK]` log prefix on API calls
- Notable: Cleanest folder structure; all `StateBackedComponent` subclasses with `write_state_to_path()` overrides
**skills-10**
- Approach: 3 subclassed components (DemoFivetran, DemoDbt, DemoPowerBI) + 4 custom `StateBackedComponent` subclasses (Census, Alteryx, Domo, FivetranActivations)
- Structure: 8 defs folders (fivetran_ingestion, dbt_transforms, powerbi, census, alteryx, domo, fivetran_activations, orchestration)
- Orchestration: 2 orchestration sensors — `fivetran_to_dbt_sensor` (an asset sensor) + `dbt_complete_fan_out_sensor` (a run-status sensor that triggers 3 downstream jobs in parallel). 0 schedules.
- Demo mode: Explicit `DEMO_MODE: bool = True` toggle per component; each component overrides `write_state_to_path()` with mock JSON and `execute()` with `[MOCK]` log messages
- Data sources: 2 (Salesforce, Stripe), 6 staging + 3 mart dbt models
- Notable: Most consistent component pattern — every component follows an identical structure (`write_state_to_path` override + `DEMO_MODE` toggle). The fan-out sensor is the most elegant orchestration pattern across all projects. Highest edit count (17) — an iterative refinement approach.
- Prompt effect: The enhanced prompt's explicit instructions about state-backed components and subclassing produced results very similar to what the `dagster-demo` skill achieved in skills-6, suggesting the skill's guidance can be replicated with prompt engineering.
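The fan-out pattern skills-10 used can be sketched in plain Python. The job names below are hypothetical, and in Dagster the function would be a run-status sensor yielding one run request per downstream job; here the requests are modeled as plain dicts to keep the sketch self-contained.

```python
# Hypothetical downstream jobs fanned out after a successful dbt run;
# the real project's job names are not quoted in this document.
DOWNSTREAM_JOBS = ["powerbi_refresh_job", "census_sync_job", "domo_sync_job"]

def dbt_complete_fan_out_sensor(run_status: str) -> list:
    # One sensor evaluation: on dbt success, request all downstream
    # jobs at once so they run in parallel; otherwise request nothing.
    if run_status != "SUCCESS":
        return []
    return [{"job_name": name} for name in DOWNSTREAM_JOBS]

requests = dbt_complete_fan_out_sensor("SUCCESS")
print([r["job_name"] for r in requests])
# -> ['powerbi_refresh_job', 'census_sync_job', 'domo_sync_job']
```

The appeal of this design is that one run-status sensor replaces three separate asset sensors, so the fan-out logic lives in a single place.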
| Metric | skills-6 (with skill) | skills-6-no-demo | skills-7 (main) | skills-7 (cont.) | skills-9 | skills-10 |
|---|---|---|---|---|---|---|
| Wall clock | ~110 min | ~80 min | ~20 min | ~15 min | ~5 hrs (breaks) | ~37 min |
| Total turn duration | 19.1 min | 33.6 min | 5.9 min | 9.2 min | 6.9 min | 13.5 min |
| User messages | 169 | 269 | 95 | 104 | 133 | 103 |
| Assistant turns | 237 | 375 | 117 | 141 | 178 | 143 |
| Output tokens | 56,781 | 81,910 | 18,708 | 36,006 | 40,737 | 47,357 |
| Cache read tokens | 21.8M | 30.9M | 5.9M | 8.4M | 13.6M | 10.3M |
| Cache write tokens | 696K | 1.08M | 275K | 307K | 1.03M | 848K |
| Total tool calls | 165 | 261 | 90 | 100 | 130 | 100 |
| File writes | 33 | 64 | 21 | 24 | 41 | 24 |
| File reads | 34 | 74 | 32 | 36 | 38 | 36 |
| Bash commands | 79 | 101 | 28 | 28 | 29 | 21 |
| Edits | 15 | 5 | 7 | 5 | 6 | 17 |
| Agents spawned | 0 | 0 | 0 | 1 | 9 | 1 |
| Skills invoked | dagster-demo, dagster-expert | dagster-expert | dagster-expert | dagster-expert | dagster-expert | dagster-expert |
| Metric | skills-6 | skills-6-no-demo | skills-7 (2 sessions) | skills-9 | skills-10 |
|---|---|---|---|---|---|
| Output tokens | 56,781 | 81,910 | 54,714 | 40,737 | 47,357 |
| Cache read tokens | 21.8M | 30.9M | 14.3M | 13.6M | 10.3M |
| Total tool calls | 165 | 261 | 190 | 130 | 100 |
| File writes | 33 | 64 | 45 | 41 | 24 |
**skills-6**
- Invoked the `dagster-demo` skill, which provided a structured recipe for creating demo projects
- Also used `dagster-expert` for Dagster-specific guidance
- Focused on subclassing existing components to inject mock data
- Fewer files written (33) — the skill guided a more consolidated structure
- More edits (15) — iterative refinement of fewer files
- Had 3 prior sessions exploring skill behavior before the main build
**skills-6-no-demo**
- Only used `dagster-expert` for general Dagster guidance
- Built everything from scratch without a demo-mode recipe
- Created nearly 2x the files (64) with a more granular folder structure
- Used 74 file reads — extensive exploration/reference needed
- Consumed the most output tokens (~82K)
- Needed the most user interactions (269)
**skills-7**
- Split across 2 main sessions: initial pipeline build (20 min), then a continuation adding more components (15 min)
- First session built the core Fivetran → dbt → downstream pipeline
- Second session added GCP Dataflow integration (batch + streaming) with asset checks — a unique feature not in other projects
- Moderate token usage (~55K total output across both sessions)
- Most efficient per-session turn duration (5.9 min + 9.2 min)
**skills-9**
- Single long session with breaks (wall clock ~5 hours, but only 6.9 min of active turn time)
- Heaviest use of subagents (9 Agent calls) — delegated research and exploration
- Low tool-call count (130) and the lowest output tokens of any project (40.7K)
- Used WebFetch (3) and WebSearch (2) for documentation lookup
- Produced the cleanest folder structure with the most consistent mock patterns
**skills-10**
- Single session, 37 min wall clock, 13.5 min active turn time
- Fewest tool calls overall (100) and fewest file writes (24) — most efficient build
- Highest edit count (17) — wrote fewer files but refined them more iteratively
- Only 21 bash commands — least shell usage of any session
- The enhanced prompt's explicit instructions eliminated exploration overhead: no need to discover the right patterns through trial and error
- Produced the most consistent component pattern across all 7 components
- Fan-out orchestration sensor (1 sensor triggers 3 parallel jobs) is the most elegant design
- **Demo vs Production**: skills-6, skills-9, and skills-10 all created full mock/demo projects. skills-6-no-demo created production-oriented code. skills-7 was a hybrid with demo fallbacks.
- **Component granularity**: skills-6 was the most consolidated (3 folders), skills-6-no-demo the most granular (12+), skills-7 the broadest in scope (13 folders including Dataflow), and skills-9 and skills-10 the cleanest middle ground (7-8 folders).
- **Sensor approach**: skills-6, skills-7, and skills-9 used asset materialization sensors. skills-6-no-demo used run-status sensors. skills-10 uniquely combined both — an asset sensor for Fivetran→dbt plus a run-status fan-out sensor for dbt→downstream.
- **Unique features**: skills-7 was the only project to include GCP Dataflow integration and asset checks for streaming data quality. skills-6 was the only one with a `ScheduledJobComponent`. skills-6-no-demo was the only one with a failure alert sensor. skills-10 had the cleanest fan-out orchestration pattern.
| Session | Efficiency | Scope | Quality |
|---|---|---|---|
| skills-6 (with skill) | Good — skill reduced exploration overhead | Standard pipeline | Clean but consolidated |
| skills-6-no-demo | Worst — 82K tokens, 261 tool calls | Most comprehensive (39 assets, 4 marts) | Production-ready but verbose |
| skills-7 | Good — 55K tokens across 2 sessions | Extended scope (+ Dataflow, asset checks) | Hybrid demo/production |
| skills-9 | Good — 41K tokens, 130 tool calls | Standard pipeline | Cleanest structure and patterns |
| skills-10 (enhanced prompt) | Best — 47K tokens, 100 tool calls, 24 writes | Standard pipeline (~28 assets) | Most consistent component pattern |
The most efficient session was skills-10, which achieved the fewest tool calls (100) and file writes (24) through an enhanced prompt that explicitly specified architectural requirements. This suggests that embedding key design decisions directly in the prompt (state-backed components, subclassing, mock patterns) is more effective than relying on either skills or the model's own exploration. The dagster-demo skill (skills-6) achieved similar architectural outcomes but required more overhead to load and apply the skill's guidance. All five sessions produced working Dagster projects from the same base prompt, demonstrating that prompt specificity has the highest impact on both efficiency and output consistency.