lethain/arch-decisions-strategy-messy.md

Engineering Architecture Decision-Making Strategy

Executive Summary: We will implement a federated architecture decision framework that combines team autonomy with coordinated oversight. Architecture Decision Records (ADRs) will document decisions, and a lightweight advisory process will remove ambiguity about who decides while maintaining engineering velocity.

Policy

Our policies for engineering architecture decision-making are:

1. Federated Decision Authority Framework


Team-Level Decisions: Product engineering teams have full authority over architecture decisions that affect only their services and don't create cross-team dependencies
Cross-Team Decisions: Architecture changes affecting multiple teams require input from our Architecture Advisory Group (AAG) - composed of Staff+ engineers representing each major domain
Organization-Level Decisions: Technology choices that impact company-wide standards (new programming languages, major infrastructure changes) require CTO approval with AAG recommendation

This policy directly addresses our diagnosed "Decision Authority Ambiguity" by creating clear decision rights at each organizational level, following Google's Technical Lead Network model.
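
The three decision levels above can be read as a routing rule. The following is a minimal sketch in Python, assuming each proposed decision is tagged with the number of teams it affects and whether it changes a company-wide standard. The names `Decision`, `Authority`, and `required_authority` are hypothetical illustrations, not part of any existing tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Authority(Enum):
    TEAM = "team decides"
    AAG_INPUT = "team decides after AAG input"
    CTO_APPROVAL = "CTO approves with AAG recommendation"


@dataclass(frozen=True)
class Decision:
    # All fields are illustrative assumptions about how a decision might be tagged.
    title: str
    teams_affected: int                   # teams whose services are touched
    creates_cross_team_dependency: bool
    changes_company_standard: bool        # e.g. new language, major infrastructure change


def required_authority(decision: Decision) -> Authority:
    """Map a decision to the level of authority named in policy 1."""
    # Check the broadest scope first so an organization-level change is
    # never misrouted as a team-level one.
    if decision.changes_company_standard:
        return Authority.CTO_APPROVAL
    if decision.teams_affected > 1 or decision.creates_cross_team_dependency:
        return Authority.AAG_INPUT
    return Authority.TEAM
```

A decision that touches one team but introduces a dependency for others still routes to the AAG, which matches the "don't create cross-team dependencies" condition on team-level authority.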

2. Mandatory Architecture Decision Records (ADRs)


All significant architecture decisions must be documented using ADRs within 48 hours of the decision
ADRs must include: context, options considered, decision made, and rationale
ADRs are stored in a searchable, central repository accessible to all engineers
Teams cannot proceed with implementation until the ADR is published and reviewed

This addresses our "Lack of Decision Documentation" constraint while providing the context transfer mechanism needed for onboarding and future decision-making.
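
The ADR requirements above (four required sections and a 48-hour documentation window) could be checked mechanically. The sketch below assumes ADRs are available as structured records; the `ADR` class, its field names, and `adr_problems` are hypothetical and only illustrate the rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

ADR_DEADLINE = timedelta(hours=48)  # documentation window from policy 2


@dataclass
class ADR:
    # Field names are assumptions mirroring the required ADR contents.
    title: str
    context: str
    options_considered: List[str]
    decision: str
    rationale: str
    decided_at: datetime
    published_at: Optional[datetime] = None  # None while still unpublished


def adr_problems(adr: ADR, now: datetime) -> List[str]:
    """Return a list of policy violations; an empty list means the ADR passes."""
    problems = []
    for field_name in ("context", "decision", "rationale"):
        if not getattr(adr, field_name).strip():
            problems.append(f"missing {field_name}")
    if not adr.options_considered:
        problems.append("missing options considered")
    # An unpublished ADR is measured against the current time.
    reference = adr.published_at or now
    if reference - adr.decided_at > ADR_DEADLINE:
        problems.append("not published within 48 hours of the decision")
    return problems
```

A check like this could run when an ADR is submitted, so that incomplete records are caught before the review step rather than during it.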

3. Weekly Architecture Advisory Sessions


30-minute weekly "Architecture Office Hours" where any engineer can present decisions for feedback
AAG members rotate facilitation duties to prevent bottlenecks
Non-binding advisory format: teams receive feedback but make final decisions
Escalation trigger: If 2+ AAG members strongly disagree with a team's decision, it escalates to CTO review

This provides the coordination mechanism needed to prevent "Inconsistent Technical Standards" while maintaining team autonomy.
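
The escalation trigger is the one place where the advisory format becomes binding, so it helps to state it precisely. A minimal sketch, assuming each AAG member records one stance per decision; the `Stance` values and the `escalates_to_cto` helper are hypothetical names used only for illustration.

```python
from enum import Enum
from typing import Dict


class Stance(Enum):
    SUPPORT = "support"
    CONCERN = "concern"
    STRONG_DISAGREE = "strong_disagree"


ESCALATION_THRESHOLD = 2  # "2+ AAG members strongly disagree"


def escalates_to_cto(feedback: Dict[str, Stance]) -> bool:
    """Return True when enough AAG members strongly disagree.

    `feedback` maps an AAG member's name to that member's stance.
    Ordinary concerns are advisory only and never trigger escalation.
    """
    strong = sum(1 for stance in feedback.values() if stance is Stance.STRONG_DISAGREE)
    return strong >= ESCALATION_THRESHOLD
```

Under this reading, any number of ordinary concerns leaves the decision with the team; only strong disagreement from at least two members moves it to CTO review.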

4. Technology Choice Governance


Default technology stack approved for new projects (current: [specify your stack])
Experimental technology trials require AAG approval and a sunset review after 6 months
New language/framework adoption requires a demonstrable advantage over existing options and a commitment to long-term support
Exception requests require written justification addressing operational impact, team expertise, and migration costs

This limits the "Technical Debt Accumulation" caused by inconsistent technology choices while still allowing innovation.
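
The six-month sunset review is easy to lose track of unless the date is computed when the trial is approved. The sketch below shows one way to do that with the Python standard library; `sunset_review_date` and `review_is_due` are hypothetical helpers, and clamping to the last day of a shorter month is an assumption about how "6 months" should be interpreted.

```python
import calendar
from datetime import date

TRIAL_MONTHS = 6  # sunset review window from policy 4


def sunset_review_date(approved_on: date, months: int = TRIAL_MONTHS) -> date:
    """Return the calendar date `months` after the AAG approval date."""
    month_index = approved_on.month - 1 + months
    year = approved_on.year + month_index // 12
    month = month_index % 12 + 1
    # Clamp the day so that, for example, 31 August maps to the last day of February.
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(approved_on.day, last_day))


def review_is_due(approved_on: date, today: date) -> bool:
    """True once the trial has reached its sunset review date."""
    return today >= sunset_review_date(approved_on)
```

Storing the computed date alongside the trial's ADR would let the AAG list upcoming sunset reviews in its monthly retrospective.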

5. Architecture Decision Review Cycles


Monthly AAG retrospectives reviewing decision patterns and identifying process improvements
Quarterly architecture decision audits assessing outcomes of major decisions
Annual technology strategy review evaluating overall architectural direction and technology portfolio
Metrics tracked: decision time-to-resolution, cross-team friction incidents, architectural debt accumulation

This creates the feedback loops necessary to evolve our decision-making process based on outcomes.

Operations

Architecture Advisory Group Structure


Composition: 5-7 Staff+ engineers representing domains (Frontend, Backend, Infrastructure, Data, Security)
Selection: Nominated by teams, confirmed by engineering leadership based on technical expertise and collaborative judgment
Term limits: 18-month rotations to prevent entrenchment and distribute experience
Time commitment: 2-3 hours/week (office hours, retrospectives, decision review)

ADR Implementation Process


ADR Template: Standardized format stored in engineering repository
Review mechanism: Automated Slack notifications when ADRs are published
Search capability: ADRs tagged by technology, team, and decision type for discoverability
Integration: ADRs linked to relevant pull requests and technical documentation
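
Tagging by technology, team, and decision type makes discovery a simple set-membership query. The following sketch assumes tags are stored as `kind:value` strings such as `technology:postgresql`; that convention, the `ADREntry` class, and `find_adrs` are hypothetical, chosen only to illustrate the search capability.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class ADREntry:
    adr_id: str
    title: str
    tags: FrozenSet[str]  # assumed "kind:value" format, e.g. "team:billing"


def find_adrs(entries: Iterable[ADREntry], *required_tags: str) -> List[ADREntry]:
    """Return the entries that carry every requested tag (case-insensitive)."""
    wanted = {tag.lower() for tag in required_tags}
    return [
        entry
        for entry in entries
        if wanted <= {tag.lower() for tag in entry.tags}
    ]
```

Requiring all requested tags (rather than any) keeps results narrow, which suits the common question "what have we already decided about this technology on this team?"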

Escalation and Exception Handling


Standard escalation path: Team → AAG feedback → CTO review if unresolved
Emergency decisions: Can be made without the full process but require a retroactive ADR within 24 hours
Appeals process: Teams can request CTO review of AAG recommendations they strongly disagree with
External consultation: AAG can request input from external technical advisors for complex decisions

Measurement and Metrics


Leading indicators: ADR completion rate, Architecture Office Hours attendance, escalation frequency
Lagging indicators: Cross-team integration issues, technical debt metrics, engineer satisfaction with decision-making
Review cycles: Monthly metrics review in AAG retrospectives, quarterly presentation to engineering leadership
Success criteria: 90% ADR compliance, <5% decision escalation rate, improved engineering satisfaction scores
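
The two numeric success criteria can be evaluated directly from counts gathered for the monthly metrics review. A minimal sketch, assuming both rates are measured against the number of significant decisions in the period; that denominator, the `QuarterCounts` class, and `check_success_criteria` are assumptions made for illustration. The satisfaction criterion is not modeled because the policy sets no numeric target for it.

```python
from dataclasses import dataclass
from typing import Dict

ADR_COMPLIANCE_TARGET_PCT = 90  # "90% ADR compliance"
ESCALATION_CEILING_PCT = 5      # "<5% decision escalation rate"


@dataclass(frozen=True)
class QuarterCounts:
    significant_decisions: int
    decisions_with_adr: int
    escalated_decisions: int


def check_success_criteria(counts: QuarterCounts) -> Dict[str, bool]:
    """Compare observed rates against the two numeric success criteria."""
    total = counts.significant_decisions
    if total <= 0:
        raise ValueError("need at least one significant decision to compute rates")
    # Integer arithmetic avoids floating-point surprises at the exact thresholds.
    return {
        "adr_compliance_met": counts.decisions_with_adr * 100 >= ADR_COMPLIANCE_TARGET_PCT * total,
        "escalation_rate_met": counts.escalated_decisions * 100 < ESCALATION_CEILING_PCT * total,
    }
```

Note that the escalation criterion is a strict inequality, so a period with exactly 5% of decisions escalated does not meet it.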

Communication and Training


Onboarding integration: New engineers receive ADR training and review recent architectural decisions
Documentation maintenance: AAG maintains decision-making guidebook with examples and common patterns
Process evolution: Quarterly updates to decision-making framework based on retrospective feedback
Transparency: Monthly "Architecture Decisions" newsletter highlighting significant choices and their rationale


This strategy combines the team autonomy model successfully used at Netflix and Amazon with the coordinated oversight of Google's Technical Lead Networks, while avoiding the bureaucratic overhead that makes traditional enterprise architecture governance ineffective. The federated approach directly addresses our diagnosed constraints and provides the documentation and coordination mechanisms needed to scale architectural decision-making as the organization grows.

Exploration Section

Industry Patterns and Precedents

Our research into how technology organizations manage architecture decision-making surfaced three distinct approaches, each with documented successes and failures:

1. Advisory Architecture Process (Stripe/Netflix Model)

Stripe's approach:

Architecture decisions are made by the implementing team
A group of senior engineers (Staff+) provides feedback and guidance
No formal approval is required, but teams are expected to incorporate feedback
Escalations go to engineering leadership only for disagreements on critical decisions

Netflix's "Freedom and Responsibility":

Engineers make decisions within their sphere of responsibility
Context is shared broadly through documentation and RFCs
The "keeper test," a retention heuristic applied by managers, is meant to keep high performers in the roles that drive decisions
Strong emphasis on documentation to enable distributed decision-making


2. Federated Architecture Councils (Amazon/Google Model)

Amazon's "Two-Pizza Team" model:

Each service team owns its architecture decisions
The AWS "Well-Architected Framework" provides consistent evaluation criteria

Google's approach uses Technical Lead Networks:

Technical Leads (TLs) in each area coordinate architecture decisions
Area-specific expertise concentrated in dedicated roles
Regular "Architecture Review Committee" for company-wide decisions
Strong emphasis on written design documents and peer review


3. Centralized Architecture Authority (Traditional Enterprise)

Microsoft's historical model (pre-cloud transformation):

Central architecture board approves all significant decisions
Detailed architecture standards and governance processes
Technology choices limited to approved stack
Strong emphasis on consistency and risk management

Traditional enterprise approach seen at companies like IBM and Oracle:

Enterprise architects define technology standards
Project approval gates require architecture compliance
Centralized technology evaluation and vendor management
Risk mitigation prioritized over speed

Framework References from Established Strategy Resources

Architecture Decision Records (ADRs)

Michael Nygard's pattern, widely adopted across industry:

Lightweight documentation of architecture decisions
Captures the context, the decision and its status, and the resulting consequences
Immutable record enabling future teams to understand reasoning
Successfully implemented at ThoughtWorks, Spotify, and numerous startups

Technology Strategy Patterns (Eben Hewitt)

The "Architectural Decision Authority" pattern:

Clearly defined decision rights at different organizational levels
Escalation paths for cross-cutting concerns
Balance between autonomy and coordination
Specific implementation guidance for different organization sizes

Diagnosis

Based on our analysis of the current state and industry research, we've identified the following root causes and constraints:

Technical Constraints


Decision Authority Ambiguity: There is no clear framework for determining who has final authority on architecture decisions, leading to inconsistent outcomes where the most persistent voices prevail rather than the most informed ones.
Inconsistent Technical Standards: Without coordinated decision-making, teams make incompatible technology choices that create integration challenges, operational overhead, and knowledge fragmentation across the organization.
Lack of Decision Documentation: Architecture decisions are made in meetings, Slack discussions, or informal conversations, leaving future engineers without context for why systems were designed as they were.

Organizational Constraints


Staff+ Engineer Utilization: Senior engineers spend significant time in reactive architectural debates rather than proactive technical leadership, reducing their impact on strategic technical initiatives.
Team Autonomy vs. Coordination Tension: Teams want sufficient autonomy to move quickly, but the absence of coordination mechanisms creates downstream problems that ultimately slow everyone down.
Onboarding and Context Transfer: New engineers struggle to understand architectural patterns and decision-making precedents, leading to repeated debates about previously settled questions.

Business Impact


Reduced Engineering Velocity: The combination of unclear decision rights and lack of precedent documentation means architectural questions consume disproportionate engineering time.
Technical Debt Accumulation: Inconsistent architectural decisions create technical debt that becomes expensive to resolve as the codebase grows.
Talent Retention Risk: Experienced engineers become frustrated with inefficient decision-making processes, while newer engineers feel excluded from important technical discussions.

Cultural Factors


Engineering Culture Mismatch: The organization values both technical excellence and rapid iteration, but the current ad-hoc decision-making process satisfies neither value effectively.
Knowledge Hoarding: Without formal documentation requirements, architectural knowledge remains concentrated in individuals rather than being institutionalized.

Constraints We Must Work Within


Team Size and Growth: We cannot significantly expand the number of senior engineers dedicated to architecture coordination, so any solution must scale efficiently.
Existing Technical Diversity: We already have multiple programming languages and architectural patterns in production, so we cannot impose uniform standards retroactively.
Product Development Pressure: Product teams have aggressive delivery timelines that cannot accommodate lengthy approval processes.

This diagnosis aligns with patterns documented in "Technology Strategy Patterns" where Hewitt notes that architectural decision-making problems typically stem from unclear decision rights rather than technical incompetence. Similarly, the case studies in "Crafting Engineering Strategy" demonstrate that successful organizations explicitly define decision-making authority rather than leaving it implicit.
The symptoms we're experiencing - where "highly opinionated engineers can effectively overrule others' work" - match the "loud voice wins" anti-pattern identified in Netflix's engineering culture documentation, where they emphasize the need for explicit decision-making frameworks to prevent this dysfunction.
Our situation requires balancing the autonomy that enables velocity (as demonstrated in Amazon's "two-pizza team" model) with the coordination that prevents architectural fragmentation (as implemented in Google's Technical Lead networks). The solution must be lightweight enough to avoid the bureaucratic overhead that killed architectural governance at many traditional enterprises, while providing enough structure to eliminate the current ambiguity.