Comparison of five Claude Code sessions that received the same prompt, with varying skill configurations and prompt refinements.
All sessions received essentially the same base prompt: create a demo Dagster project spanning Fivetran → dbt → Snowflake → PowerBI, plus Alteryx and Domo (being migrated off) and Census/Fivetran Activations, with event-driven sensors and observe/orchestrate modes.
skills-10 received an enhanced prompt with additional explicit instructions: "Make sure any component that connects to an external system is using a state-backed component, uses a local cache and writes a set of mock assets using that cache, and that when it executes it logs a sample message and metadata instead of connecting to the external system. When modifying a component that exists, ALWAYS subclass, do not create a custom component."
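The pattern the enhanced prompt asks for can be sketched roughly as follows. This is a minimal, dependency-free sketch: the `StateBackedComponent` base class and its method names are stand-ins modeled on the `write_state_to_path()` / `execute()` pattern described later in this document, not Dagster's actual API, and the mock state is invented for illustration.

```python
import json
import logging
import tempfile
from pathlib import Path

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("demo")

# Hypothetical stand-in for a state-backed component base class;
# the real Dagster base class and signatures may differ.
class StateBackedComponent:
    def write_state_to_path(self, path: Path) -> None:
        raise NotImplementedError

    def execute(self) -> None:
        raise NotImplementedError

class DemoFivetranComponent(StateBackedComponent):
    """Subclass that serves mock state from a local cache instead of
    connecting to the external system, as the enhanced prompt requires."""

    MOCK_STATE = {"connectors": ["salesforce", "netsuite"], "status": "synced"}

    def write_state_to_path(self, path: Path) -> None:
        # Write mock connector state to the local cache file.
        path.write_text(json.dumps(self.MOCK_STATE))

    def execute(self) -> None:
        # Log a sample message and metadata instead of calling the API.
        log.info("[MOCK] fivetran sync skipped; metadata=%s", self.MOCK_STATE)

component = DemoFivetranComponent()
cache = Path(tempfile.mkdtemp()) / "fivetran_state.json"
component.write_state_to_path(cache)
component.execute()
print(json.loads(cache.read_text())["status"])  # -> synced
```

The key design point is that the subclass overrides only the state-writing and execution hooks, so the rest of the component's behavior (asset definitions, metadata) is inherited unchanged.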
| Aspect | skills-6 | skills-6-no-demo | skills-7 | skills-9 | skills-10 |
|---|---|---|---|---|---|
| Skills used | dagster-demo + dagster-expert | dagster-expert only | dagster-expert only | dagster-expert only | dagster-expert only |
| Prompt | Base | Base | Base | Base | Enhanced (explicit mock/subclass instructions) |
| Project name | `data_stack_demo` | `analytics_orchestrator` | `data_platform` | `demo_data_stack` | `data_platform` |
| Total assets | ~20 | ~39 | ~30+ | ~20-25 | ~28 |
| Custom components | 7 (3 subclass + 4 custom) | 5 (all custom) | 8 (2 subclass + 6 custom) | 6 (3 subclass + 3 custom) | 7 (3 subclass + 4 custom) |
| Jobs | 6 + 4 scheduled | 5 explicit | 7 | 3 | 4 |
| Sensors | 2 asset sensors | 4 run-status sensors | 5 asset sensors + 1 schedule | 5 asset sensors | 2 orchestration + 2 observe |
| Schedules | 4 via ScheduledJobComponent | 1 daily at 6 AM | 1 hourly (streaming checks) | 0 | 0 |
| Demo mode | Yes — subclass overrides | No — real state-backed | Hybrid — demo fallback | Yes — full mock | Yes — DEMO_MODE=True toggle |
| Defs folders | 3 | 12+ | 13 | 7 | 8 |
| Extra features | — | Failure alert sensor | GCP Dataflow (batch + streaming), asset checks | — | Fan-out sensor, explicit demo toggle |
| Python files | ~15 | ~20+ | 13 | 12 | 12 |
| YAML files | ~5 | ~12 | 14 | 12 | 7 |
**skills-6**
- Approach: Subclassed `FivetranAccountComponent`, `DbtProjectComponent`, and `PowerBIWorkspaceComponent` to inject mock data; wrote 4 custom components (Census, Alteryx, Domo, ScheduledJob)
- Structure: 3 defs folders (data_pipeline, orchestration_sensors, pipeline_sensors) — most consolidated layout
- Orchestration: Scheduled jobs as fallback + asset sensors as the primary chain
- Data sources: 2 (Salesforce, NetSuite), 4 staging models, 2 mart models
**skills-6-no-demo**
- Approach: Used library components directly + 5 custom components built from scratch (Alteryx, Census, Domo, FivetranActivation, PowerBIWithDbt)
- Structure: 12+ defs folders (one per integration) — most granular layout
- Orchestration: Single daily schedule at 6 AM + 4 run-status sensors chaining everything + a failure alert sensor
- Data sources: 3 (Salesforce, Stripe, HubSpot), 6 staging models, 4 mart models
- Notable: Production-oriented, requires real credentials, most comprehensive data model
**skills-7**
- Approach: 2 subclassed components (Fivetran, dbt) + 6 custom (Dataflow Batch, Dataflow Streaming, Census, Alteryx, Domo, FivetranActivations)
- Structure: 13 defs folders including 3 GCP Dataflow folders (batch, streaming observed, streaming orchestrated)
- Orchestration: 5 asset sensors + 1 hourly schedule for streaming data checks
- Unique: Added GCP Dataflow integration with bounded batch jobs, unbounded streaming (observed + orchestrated modes), and asset checks for streaming data quality (freshness, volume, schema drift)
- Demo mode: Hybrid — demo fallback when credentials unavailable, real APIs when provided
- Built across 2 main sessions (initial build + Dataflow addition)
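The three streaming data-quality checks skills-7 added (freshness, volume, schema drift) can be illustrated as stand-alone predicates. Function names and thresholds below are hypothetical, not taken from the generated project; there these would run as Dagster asset checks rather than bare functions.

```python
from datetime import datetime, timedelta, timezone

def freshness_ok(last_event: datetime,
                 max_lag: timedelta = timedelta(minutes=15)) -> bool:
    # Fails if the newest record is older than the allowed lag.
    return datetime.now(timezone.utc) - last_event <= max_lag

def volume_ok(row_count: int, expected_min: int = 1000) -> bool:
    # Fails if the stream delivered suspiciously few rows.
    return row_count >= expected_min

def schema_drift_ok(observed_columns: set, expected_columns: set) -> bool:
    # Fails if columns were added or dropped relative to the contract.
    return observed_columns == expected_columns

recent = datetime.now(timezone.utc) - timedelta(minutes=5)
print(freshness_ok(recent), volume_ok(2500),
      schema_drift_ok({"id", "ts"}, {"id", "ts"}))  # -> True True True
```

Attaching each predicate as an asset check keeps data-quality failures visible on the streaming assets themselves instead of buried in job logs.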
**skills-9**
- Approach: 3 subclassed components (FivetranIngest, PowerBI, dbt) + 3 custom (Alteryx, Domo, Census)
- Structure: 7 defs folders with clean domain grouping (ingestion, transform, consumption, reverse_etl, orchestration)
- Orchestration: Purely sensor-driven (0 schedules); 5 asset sensors chain the full pipeline
- Demo mode: Full mock — all components hardcode demo credentials, with a `[MOCK]` log prefix on API calls
- Notable: Cleanest folder structure; all `StateBackedComponent` subclasses with `write_state_to_path()` overrides
**skills-10**
- Approach: 3 subclassed components (DemoFivetran, DemoDbt, DemoPowerBI) + 4 custom `StateBackedComponent` subclasses (Census, Alteryx, Domo, FivetranActivations)
- Structure: 8 defs folders (fivetran_ingestion, dbt_transforms, powerbi, census, alteryx, domo, fivetran_activations, orchestration)
- Orchestration: 2 orchestration sensors — `fivetran_to_dbt_sensor` (an asset sensor) + `dbt_complete_fan_out_sensor` (a run-status sensor that triggers 3 downstream jobs in parallel). 0 schedules.
- Demo mode: Explicit `DEMO_MODE: bool = True` toggle per component; each component overrides `write_state_to_path()` with mock JSON and `execute()` with `[MOCK]` log messages
- Data sources: 2 (Salesforce, Stripe), 6 staging + 3 mart dbt models
- Notable: Most consistent component pattern — every component follows an identical structure (`write_state_to_path` override + `DEMO_MODE` toggle). The fan-out sensor is the most elegant orchestration pattern across all projects. Highest edit count (17) — an iterative refinement approach.
- Prompt effect: The enhanced prompt's explicit instructions about state-backed components and subclassing produced results very similar to what the `dagster-demo` skill achieved in skills-6, suggesting the skill's guidance can be replicated with prompt engineering.
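The fan-out pattern skills-10 used can be sketched in plain Python. The job names below are hypothetical, and in Dagster the function would be a run-status sensor yielding one run request per downstream job; here the requests are modeled as plain dicts to keep the sketch self-contained.

```python
# Hypothetical downstream jobs fanned out after a successful dbt run;
# the real project's job names are not quoted in this document.
DOWNSTREAM_JOBS = ["powerbi_refresh_job", "census_sync_job", "domo_sync_job"]

def dbt_complete_fan_out_sensor(run_status: str) -> list:
    # One sensor evaluation: on dbt success, request all downstream
    # jobs at once so they run in parallel; otherwise request nothing.
    if run_status != "SUCCESS":
        return []
    return [{"job_name": name} for name in DOWNSTREAM_JOBS]

requests = dbt_complete_fan_out_sensor("SUCCESS")
print([r["job_name"] for r in requests])
# -> ['powerbi_refresh_job', 'census_sync_job', 'domo_sync_job']
```

The appeal of this design is that one run-status sensor replaces three separate asset sensors, so the fan-out logic lives in a single place.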
| Metric | skills-6 (with skill) | skills-6-no-demo | skills-7 (main) | skills-7 (cont.) | skills-9 | skills-10 |
|---|---|---|---|---|---|---|
| Wall clock | ~110 min | ~80 min | ~20 min | ~15 min | ~5 hrs (breaks) | ~37 min |
| Total turn duration | 19.1 min | 33.6 min | 5.9 min | 9.2 min | 6.9 min | 13.5 min |
| User messages | 169 | 269 | 95 | 104 | 133 | 103 |
| Assistant turns | 237 | 375 | 117 | 141 | 178 | 143 |
| Output tokens | 56,781 | 81,910 | 18,708 | 36,006 | 40,737 | 47,357 |
| Cache read tokens | 21.8M | 30.9M | 5.9M | 8.4M | 13.6M | 10.3M |
| Cache write tokens | 696K | 1.08M | 275K | 307K | 1.03M | 848K |
| Total tool calls | 165 | 261 | 90 | 100 | 130 | 100 |
| File writes | 33 | 64 | 21 | 24 | 41 | 24 |
| File reads | 34 | 74 | 32 | 36 | 38 | 36 |
| Bash commands | 79 | 101 | 28 | 28 | 29 | 21 |
| Edits | 15 | 5 | 7 | 5 | 6 | 17 |
| Agents spawned | 0 | 0 | 0 | 1 | 9 | 1 |
| Skills invoked | dagster-demo, dagster-expert | dagster-expert | dagster-expert | dagster-expert | dagster-expert | dagster-expert |
| Metric | skills-6 | skills-6-no-demo | skills-7 (2 sessions) | skills-9 | skills-10 |
|---|---|---|---|---|---|
| Output tokens | 56,781 | 81,910 | 54,714 | 40,737 | 47,357 |
| Cache read tokens | 21.8M | 30.9M | 14.3M | 13.6M | 10.3M |
| Total tool calls | 165 | 261 | 190 | 130 | 100 |
| File writes | 33 | 64 | 45 | 41 | 24 |
**skills-6**
- Invoked the `dagster-demo` skill, which provided a structured recipe for creating demo projects
- Also used `dagster-expert` for Dagster-specific guidance
- Focused on subclassing existing components to inject mock data
- Fewer files written (33) — the skill guided a more consolidated structure
- More edits (15) — iterative refinement of fewer files
- Had 3 prior sessions exploring skill behavior before the main build
**skills-6-no-demo**
- Only used `dagster-expert` for general Dagster guidance
- Built everything from scratch without a demo-mode recipe
- Created nearly 2x the files (64) with a more granular folder structure
- Used 74 file reads — extensive exploration/reference needed
- Consumed the most output tokens (~82K)
- Needed the most user interactions (269)
**skills-7**
- Split across 2 main sessions: initial pipeline build (20 min), then a continuation adding more components (15 min)
- First session built the core Fivetran → dbt → downstream pipeline
- Second session added GCP Dataflow integration (batch + streaming) with asset checks — a unique feature not in other projects
- Moderate token usage (~55K total output across both sessions)
- Most efficient per-session turn duration (5.9 min + 9.2 min)
**skills-9**
- Single long session with breaks (wall clock ~5 hours, but only 6.9 min of active turn time)
- Heaviest use of subagents (9 Agent calls) — delegated research and exploration
- Low tool-call count (130) and the lowest output tokens of any project (40.7K)
- Used WebFetch (3) and WebSearch (2) for documentation lookup
- Produced the cleanest folder structure with the most consistent mock patterns
**skills-10**
- Single session, 37 min wall clock, 13.5 min active turn time
- Fewest tool calls overall (100) and fewest file writes (24) — most efficient build
- Highest edit count (17) — wrote fewer files but refined them more iteratively
- Only 21 bash commands — least shell usage of any session
- The enhanced prompt's explicit instructions eliminated exploration overhead: no need to discover the right patterns through trial and error
- Produced the most consistent component pattern across all 7 components
- Fan-out orchestration sensor (1 sensor triggers 3 parallel jobs) is the most elegant design
- **Demo vs Production**: skills-6, skills-9, and skills-10 all created full mock/demo projects. skills-6-no-demo created production-oriented code. skills-7 was a hybrid with demo fallbacks.
- **Component granularity**: skills-6 was the most consolidated (3 folders), skills-6-no-demo the most granular (12+), skills-7 the broadest in scope (13 folders including Dataflow), and skills-9 and skills-10 the cleanest middle ground (7-8 folders).
- **Sensor approach**: skills-6, skills-7, and skills-9 used asset materialization sensors. skills-6-no-demo used run-status sensors. skills-10 uniquely combined both — an asset sensor for Fivetran→dbt plus a run-status fan-out sensor for dbt→downstream.
- **Unique features**: skills-7 was the only project to include GCP Dataflow integration and asset checks for streaming data quality. skills-6 was the only one with a `ScheduledJobComponent`. skills-6-no-demo was the only one with a failure alert sensor. skills-10 had the cleanest fan-out orchestration pattern.
| Session | Efficiency | Scope | Quality |
|---|---|---|---|
| skills-6 (with skill) | Good — skill reduced exploration overhead | Standard pipeline | Clean but consolidated |
| skills-6-no-demo | Worst — 82K tokens, 261 tool calls | Most comprehensive (39 assets, 4 marts) | Production-ready but verbose |
| skills-7 | Good — 55K tokens across 2 sessions | Extended scope (+ Dataflow, asset checks) | Hybrid demo/production |
| skills-9 | Good — 41K tokens, 130 tool calls | Standard pipeline | Cleanest structure and patterns |
| skills-10 (enhanced prompt) | Best — 47K tokens, 100 tool calls, 24 writes | Standard pipeline (~28 assets) | Most consistent component pattern |
The most efficient session was skills-10, which achieved the fewest tool calls (100) and file writes (24) through an enhanced prompt that explicitly specified architectural requirements. This suggests that embedding key design decisions directly in the prompt (state-backed components, subclassing, mock patterns) is more effective than relying on either skills or the model's own exploration. The dagster-demo skill (skills-6) achieved similar architectural outcomes but required more overhead to load and apply the skill's guidance. All five sessions produced working Dagster projects from the same base prompt, demonstrating that prompt specificity has the highest impact on both efficiency and output consistency.