# New Relic Deployment Markers: User's Guide
## What Are Deployment Markers?
Deployment markers are timestamped events in New Relic that record when code changes are released to your applications. They appear as vertical lines or annotations overlaid on your APM performance charts, creating a clear visual boundary between "before the change" and "after the change."
**Think of them as bookmarks in your application's performance timeline** - they help you answer the critical question: "What changed when performance shifted?"
## The Fundamental Problem Deployment Markers Solve
Your application generates thousands of metrics every minute: response times, error rates, throughput, CPU usage, memory consumption. These metrics constantly fluctuate due to traffic patterns, user behavior, and external dependencies.
**The challenge**: When metrics suddenly change, how do you know why?
**Common scenarios:**
```
Scenario 1: Error rate doubles at 3:47 PM
Possible causes:
- Code deployment introduced a bug
- Infrastructure failure (database down)
- Traffic spike (DDoS, viral content)
- External API outage
- Configuration change
- Database query plan changed
Without markers: Check all possibilities manually (30+ minutes)
With markers: See deployment at 3:45 PM immediately (30 seconds)
```
```
Scenario 2: Response time gradually increases over 2 weeks
Possible causes:
- Code gradually getting slower (technical debt)
- Database growing larger (query performance degrading)
- Memory leak accumulating
- Cache hit rate declining
Without markers: Unclear when degradation started
With markers: Each deployment shows incremental impact
```
## Understanding the Deployment Detail View
When you click on a deployment marker in New Relic, you see a detailed breakdown of that deployment's impact:
### Header Information
**Entity**: Which application/service was deployed
- Example: "API Gateway - Production", "Frontend - Staging"
- Critical in microservices: ensures you're looking at the right service
**Timestamp**: Exact moment the deployment completed
- Example: "Jan 17, 2026, 04:59:40.067 PM"
- Precision to the millisecond for accurate correlation
**Version**: Unique identifier for this release
- Example: "v2.5.1", "2026-01-17-build-8885", "commit-a3f9c2d"
- Should map to your source control (git tag, commit hash)
- Allows you to know exactly which code is running
**Deployment ID**: New Relic's internal identifier
- Used for API queries and programmatic access
### Key Impacts Section
Shows high-level changes in transaction volume.
**Example: "12.3K occurrences, +15.2%"**
This means 15.2% more transactions occurred in the comparison period after deployment vs. before.
**Interpreting the change:**
**Positive indicators:**
- ✅ Increased transactions + stable error rate = healthy growth
- ✅ Decreased errors + stable traffic = bug fixes working
- ✅ Decreased slow transactions = performance improvement
**Negative indicators:**
- ❌ Decreased transactions + normal error rate = users can't reach endpoint
- ❌ Increased errors + stable traffic = new bugs introduced
- ❌ Increased slow transactions = performance regression
**Neutral indicators:**
- 📊 Small changes (<5%) = normal variance, likely unrelated to deployment
- 📊 Expected changes = feature removal, traffic shifting to new endpoints
### Related Errors Section
Shows errors that occurred during or immediately after the deployment.
**"235 related errors found"** means:
- Errors occurred in the time window around this deployment
- New Relic detected correlation between deployment timing and error timing
- Click through to see error details, stack traces, and affected users
**"No related errors found"** means:
- No errors during deployment execution
- No error spikes immediately after deployment
- ⚠️ Note: Doesn't guarantee zero errors - check the Errors page separately for ongoing issues
**Using this data:**
```
Clean deployment:
✓ No related errors
✓ Action: Monitor for delayed effects (next 24 hours)
Problematic deployment:
✗ 1,247 related errors
✗ Action: Click through to identify error type
→ If critical: rollback immediately
→ If minor: plan hotfix
```
### Related Alerts Section
Shows if any alert conditions triggered around the deployment.
**"High Error Rate Alert triggered 3 minutes after deployment"** tells you:
- Deployment breached a predefined threshold
- Impact was severe enough to warrant notification
- Automatic incident created for tracking
**"No related alerts found"** means:
- Metrics stayed within acceptable bounds
- No automatic incidents created
- Changes were either positive or within tolerance
**Interpreting alert correlation:**
```
Alert triggered immediately (0-5 min):
→ Strong indication deployment caused the issue
→ Errors/performance degradation in changed code paths
Alert triggered later (30+ min):
→ Possible correlation but verify
→ Could be delayed effect (memory leak, cache warming)
→ Could be coincidental (unrelated traffic spike)
No alert but metrics changed:
→ Change exists but below alert threshold
→ Review whether thresholds need adjustment
→ Good candidate for optimization work
```
### Web Transaction Impacts
This is the most powerful section: it shows exactly which endpoints were affected by the deployment.
**Example display:**
```
Transaction: WebTransaction/Action/api/checkout
3 hours before: 285.3 ms average
3 hours after: 312.7 ms average
Impact: +27.4 ms (+9.6% slower)
Transaction: WebTransaction/Action/api/search
3 hours before: 156.8 ms average
3 hours after: 142.1 ms average
Impact: -14.7 ms (-9.4% faster)
```
**What this tells you:**
1. **Specific impact**: Not all endpoints affected equally
- Checkout got slower (regression)
- Search got faster (improvement)
2. **Magnitude**: Quantified change in milliseconds and percentage
- 27 ms may seem small, but a nearly 10% regression is significant
- Helps prioritize which issues to fix
3. **Transaction isolation**: Identifies exactly where to investigate
- Look at checkout code changes
- Verify search optimization worked
**Analysis patterns:**
**Single transaction slower, others unchanged:**
```
Likely cause: Change specific to that endpoint
Action: Review code changes for that transaction
Look for: New database queries, external API calls, inefficient algorithms
```
**All transactions slower by similar amount:**
```
Likely cause: Global overhead added (middleware, logging, monitoring)
Action: Review infrastructure changes, framework updates
Look for: New request interceptors, increased logging verbosity
```
**Some transactions slower, some faster:**
```
Likely cause: Refactoring shifted performance characteristics
Action: Verify slower transactions are acceptable trade-off
Look for: Shared resources (database, cache, external services)
```
**Specific transaction disappeared from list:**
```
Likely cause: Endpoint removed, renamed, or broken
Action: Check if removal was intentional
Look for: Error logs showing 404s or routing failures
```
### Deployment Attributes
Additional metadata about the deployment:
**Version**: Your application version identifier
- Should be meaningful and sortable
- Best practices: semantic versioning (v1.2.3), date-based (2026-01-17.1)
**Changelog**: Optional field for release notes
- Best practice: Include ticket numbers, feature summaries
- Example: "JIRA-5432: Optimize database queries for user dashboard. JIRA-5441: Fix null pointer in payment processing."
- Makes future troubleshooting much easier
**User**: Who triggered the deployment (if recorded)
- Helpful for accountability and context
- Example: "deploy-bot" vs "john.doe@company.com"
**Description**: Additional context about the deployment
- Use for: deployment type (hotfix, regular release, rollback)
- Environment details (canary, blue-green, rolling)
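These attributes correspond to the fields you supply when the marker is created. Markers can be created through New Relic's NerdGraph (GraphQL) API (see the documentation links at the end of this guide); the sketch below is a minimal, illustrative example assuming a User API key and the entity GUID of your APM application, with field names taken from the change tracking mutation, so verify them against your account before wiring this into a pipeline.
```python
# Minimal sketch: create a deployment marker via NerdGraph change tracking.
# Assumes a User API key in NEW_RELIC_API_KEY and the entity GUID of the
# target application; verify field names against the change tracking docs.
import os
import requests

NERDGRAPH_URL = "https://api.newrelic.com/graphql"  # US region endpoint

mutation = """
mutation {
  changeTrackingCreateDeployment(
    deployment: {
      entityGuid: "YOUR_ENTITY_GUID"
      version: "2026-01-17-build-8885"
      changelog: "JIRA-5432: Optimize database queries for user dashboard"
      description: "Regular weekly release"
      user: "deploy-bot"
    }
  ) {
    deploymentId
  }
}
"""

resp = requests.post(
    NERDGRAPH_URL,
    headers={"API-Key": os.environ["NEW_RELIC_API_KEY"]},
    json={"query": mutation},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # on success, includes the deploymentId shown in the detail view
```
Most CI/CD pipelines call something equivalent to this as the final step of the deploy job, which also keeps the marker timestamp aligned with deployment completion rather than deployment start (a pitfall covered later in this guide).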
## How to Use Deployment Markers Effectively
### 1. Post-Deployment Validation (The Critical First Hour)
**Minute 0-5: Immediate Health Check**
As soon as the deployment marker appears:
```
Quick scan checklist:
☐ Related errors: Any new exceptions?
☐ Related alerts: Did thresholds breach?
☐ Key transactions: Response times acceptable?
☐ Throughput: Traffic flowing normally?
☐ Error rate: Within baseline?
```
**What you're looking for:**
**Green light (proceed with confidence):**
- No related errors
- No related alerts
- Transaction times improved or stable (<5% change)
- Throughput stable or increased
- Error rate stable or decreased
**Yellow light (monitor closely):**
- Small number of errors (1-10) in non-critical paths
- Transaction times slightly increased (5-15%)
- Single transaction affected, others fine
- No alerts but metrics approaching thresholds
**Red light (prepare to rollback):**
- Hundreds of related errors
- Critical alerts triggered
- Transaction times doubled or more
- Throughput dropped significantly
- Error rate spiked >3x baseline
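These green/yellow/red criteria can also be scripted so the first check runs the moment the marker appears. A rough sketch, assuming a User API key, an account ID, and NerdGraph's NRQL query endpoint; the thresholds are illustrative and should be replaced with ratios against your own baseline:
```python
# Rough sketch: automate the first-five-minutes check by running NRQL through
# NerdGraph. Assumes a User API key and account ID; thresholds are illustrative.
import os
import requests

NERDGRAPH_URL = "https://api.newrelic.com/graphql"  # US region endpoint
ACCOUNT_ID = 1234567                                # hypothetical account ID
APP_NAME = "YourApp - Production"

def run_nrql(nrql):
    """Run one NRQL query via NerdGraph and return the result rows."""
    graphql = f'{{ actor {{ account(id: {ACCOUNT_ID}) {{ nrql(query: "{nrql}") {{ results }} }} }} }}'
    resp = requests.post(
        NERDGRAPH_URL,
        headers={"API-Key": os.environ["NEW_RELIC_API_KEY"]},
        json={"query": graphql},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["actor"]["account"]["nrql"]["results"]

# Result keys follow the aliases used in the NRQL.
error_rate = run_nrql(
    f"SELECT percentage(count(*), WHERE error IS true) AS 'errorRate' "
    f"FROM Transaction WHERE appName = '{APP_NAME}' SINCE 5 minutes ago"
)[0]["errorRate"]

avg_duration = run_nrql(
    f"SELECT average(duration) AS 'avgDuration' "
    f"FROM Transaction WHERE appName = '{APP_NAME}' SINCE 5 minutes ago"
)[0]["avgDuration"]

# Illustrative thresholds; replace with ratios against your own baseline.
if error_rate > 3.0:
    print(f"RED: error rate {error_rate:.1f}% - prepare to roll back")
elif error_rate > 1.0 or avg_duration > 1.0:
    print(f"YELLOW: error rate {error_rate:.1f}%, avg {avg_duration:.2f}s - monitor closely")
else:
    print(f"GREEN: error rate {error_rate:.1f}%, avg {avg_duration:.2f}s - proceed")
```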
**Minute 5-30: Deeper Analysis**
After initial health check passes, dig deeper:
```
1. Review "Top 10 web transactions"
- Click each transaction showing significant change
- Examine transaction traces from after deployment
- Identify specific slow operations (DB queries, external calls)
2. Check transaction distribution
- APM → Transactions
- Sort by throughput
- Verify traffic distribution matches expectations
3. Review database performance
- APM → Databases
- Check for new slow queries
- Verify query counts are reasonable
4. Check external services
- APM → External services
- Verify third-party API response times
- Check for new external calls
5. Review error details
- APM → Errors
- Group by error class
- Read error messages and stack traces
- Verify errors make sense given code changes
```
**Minute 30-60: User Impact Assessment**
After technical validation, assess user experience:
```
1. Check Apdex score
- APM → Summary
- Apdex shows user satisfaction (0.0 = awful, 1.0 = perfect)
- Target: >0.9 for most applications
- Compare pre/post deployment
2. Review browser monitoring (if enabled)
- Browser → Summary
- Check page load times from user perspective
- Verify frontend performance acceptable
3. Check synthetic monitors (if configured)
- Synthetics → Monitors
- Verify scripted checks passing
- Confirm critical user flows working
4. Monitor real user metrics
- Browser → Session traces
- Sample actual user sessions
- Identify any broken workflows
```
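For reference, the Apdex score checked above is a simple ratio: requests completing within the threshold T count as satisfied, requests within 4T count as tolerating (at half weight), and anything slower counts as frustrated. A small sketch with made-up sample durations:
```python
# Standard Apdex calculation: satisfied requests complete within T,
# tolerating requests within 4T, everything slower counts as frustrated.
def apdex(durations_s, t=0.5):
    satisfied = sum(1 for d in durations_s if d <= t)
    tolerating = sum(1 for d in durations_s if t < d <= 4 * t)
    return (satisfied + tolerating / 2) / len(durations_s)

# Illustrative sample: most requests fast, a few slow.
sample = [0.2, 0.3, 0.4, 0.6, 1.1, 2.5]
print(f"Apdex(T=0.5s) = {apdex(sample):.2f}")  # 3 satisfied, 2 tolerating -> 0.67
```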
### 2. Investigating Performance Issues
**Scenario**: Users report slowness, or you notice degraded metrics
**Step 1: Identify the timeframe**
```
When did the issue start?
- Check alerts for when threshold breached
- Ask users when they first noticed
- Review metric charts for inflection point
```
**Step 2: Find relevant deployment markers**
```
Navigate to: APM → Your Application
Set time range: Start from when issue began, look back 24 hours
Scan timeline: Identify all deployment markers in that window
Example:
Issue reported: 5:15 PM
Time range: 4:00 PM - 5:30 PM
Markers found:
- 4:45 PM - Backend API v2.3.1
- 5:02 PM - Frontend v1.8.2
```
**Step 3: Evaluate each deployment**
```
For each marker, click and review:
Deployment at 4:45 PM (Backend API):
✓ No related errors
✓ Transaction times stable
✗ Throughput decreased 15%
→ Not the smoking gun, but worth noting
Deployment at 5:02 PM (Frontend):
✗ 847 related errors
✗ Page load time increased 200%
✗ Related alert triggered at 5:04 PM
→ FOUND IT: This deployment caused the issue
```
**Step 4: Drill into the problematic deployment**
```
Click the deployment marker at 5:02 PM
Review Web Transaction Impacts:
/checkout: 1.2s → 3.8s (+217% slower) ← Problem here
/search: 0.3s → 0.3s (no change)
/home: 0.5s → 0.5s (no change)
Conclusion: Checkout endpoint specifically affected
```
**Step 5: Identify root cause**
```
Click on the slow transaction: /checkout
View transaction traces:
- Find slowest trace from after 5:02 PM
- Examine trace details
- Identify which segment is slow
Example trace breakdown:
Middleware: 10ms
Controller: 15ms
Database query: 3,200ms ← This is the problem
Rendering: 100ms
Root cause: New database query taking 3+ seconds
```
**Step 6: Correlate with code changes**
```
Check deployment version: v1.8.2
Look up in version control: git show v1.8.2
Review changes to /checkout endpoint
Find: Added new query to fetch user's full order history
Realize: Query has no index, scanning millions of rows
```
**Step 7: Plan remediation**
```
Options:
1. Immediate rollback to v1.8.1
- Restores performance immediately
- Loses new features in v1.8.2
2. Emergency hotfix
- Add database index
- Deploy as v1.8.3
- Takes 20-30 minutes
3. Optimize query
- Rewrite to only fetch recent orders
- Add caching
- Deploy as v1.8.3
- Takes 2-3 hours
Decision: Rollback now, prepare proper fix for tomorrow
```
### 3. Analyzing Trends Over Time
**Scenario**: Understanding long-term performance trajectory
**View deployment history:**
```
Navigate to: APM → Deployments (in left sidebar)
You'll see a chronological list:
Jan 20, 3:00 PM - v2.1.5
Jan 18, 2:15 PM - v2.1.4
Jan 15, 4:30 PM - v2.1.3
Jan 12, 1:45 PM - v2.1.2
Jan 10, 10:00 AM - v2.1.1
...
```
**Track baseline shifts:**
Create a spreadsheet or dashboard tracking:
| Deployment | Date | Avg Response Time | Error Rate | Throughput | Apdex |
|------------|------|-------------------|------------|------------|-------|
| v2.1.1 | Jan 10 | 420ms | 0.3% | 2,100 rpm | 0.94 |
| v2.1.2 | Jan 12 | 435ms | 0.3% | 2,050 rpm | 0.93 |
| v2.1.3 | Jan 15 | 480ms | 0.5% | 2,000 rpm | 0.89 |
| v2.1.4 | Jan 18 | 520ms | 0.8% | 1,950 rpm | 0.85 |
| v2.1.5 | Jan 20 | 440ms | 0.4% | 2,100 rpm | 0.92 |
**Identify patterns:**
```
v2.1.1 → v2.1.2: Slight degradation (acceptable)
v2.1.2 → v2.1.3: Noticeable degradation (10% slower, higher errors)
v2.1.3 → v2.1.4: Continued degradation (trend established)
v2.1.4 → v2.1.5: Significant improvement (optimization work paid off)
Conclusion:
- v2.1.3 introduced performance regression
- v2.1.4 made it worse
- v2.1.5 fixed both and improved beyond v2.1.1 baseline
```
**Investigate the regression:**
```
Click on v2.1.3 deployment marker
Review Web Transaction Impacts
Identify which transactions got slower
Compare with v2.1.2:
- What features were added?
- What refactoring occurred?
- What dependencies were updated?
Common culprits:
- ORM changes (N+1 queries introduced)
- Dependency updates (framework overhead increased)
- New features (expensive operations in critical path)
- Logging changes (excessive debug logging in production)
```
### 4. Comparing Across Environments
Use deployment markers to validate your promotion pipeline:
**The ideal pattern:**
```
Development Environment:
Deploy: Monday 10:00 AM - v2.2.0-dev
Monitor: 24 hours
Result: ✓ No issues, metrics stable
Staging Environment:
Deploy: Tuesday 10:00 AM - v2.2.0-staging
Monitor: 24 hours
Result: ✓ No issues, metrics stable
Production Environment:
Deploy: Wednesday 10:00 AM - v2.2.0
Expected: Similar metrics to staging
Result: ✓ Metrics match staging prediction
```
**When production differs from staging:**
```
Staging: Response time 300ms, 0.2% errors
Production: Response time 800ms, 2.5% errors
Investigation checklist:
☐ Traffic volume: Production has 10x traffic?
☐ Data volume: Production database 100x larger?
☐ External dependencies: Different APIs in prod vs staging?
☐ Infrastructure: Production servers under-provisioned?
☐ Configuration: Environment-specific settings causing issues?
☐ Geographic distribution: Production users globally distributed?
```
**Building confidence through consistency:**
```
Track prediction accuracy:
Deployment #1:
Staging impact: +15ms response time
Predicted prod: +15ms
Actual prod: +18ms
Accuracy: 83%
Deployment #2:
Staging impact: -30ms response time
Predicted prod: -30ms
Actual prod: -28ms
Accuracy: 93%
Deployment #3:
Staging impact: +5ms response time
Predicted prod: +5ms
Actual prod: +85ms
Accuracy: 6% ← INVESTIGATE
Why so different?
→ Found: Production has a caching layer staging doesn't
→ The code change invalidated cache frequently
→ Staging didn't show this because no cache to invalidate
```
### 5. Making Rollback Decisions
Deployment markers provide objective data for rollback decisions.
**Define rollback criteria in advance:**
```
CRITICAL - Rollback immediately, no questions asked:
✗ Error rate > 10x baseline
✗ Availability < 95% (users can't access site)
✗ Critical business function completely broken (payments, login)
✗ Data corruption detected
✗ Security vulnerability exposed
MAJOR - Rollback within 30 minutes unless fix available:
✗ Error rate 3-10x baseline for 15+ minutes
✗ Response time > 2x baseline for 20+ minutes
✗ Significant user complaints (>10 in 10 minutes)
✗ Revenue-impacting feature degraded
MINOR - Monitor closely, fix forward if possible:
✗ Error rate 1.5-3x baseline
✗ Response time 1.25-2x baseline
✗ Non-critical features affected
✗ Small number of users affected
ACCEPTABLE - Monitor but no action needed:
✓ Error rate <1.5x baseline
✓ Response time <1.25x baseline
✓ Known/expected issues with acceptable impact
```
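Criteria like these are easier to apply consistently mid-incident if they are written down as code. The sketch below is one possible encoding of the metric-based thresholds above (ratios are current value divided by the pre-deployment baseline); the non-metric criteria such as data corruption or security exposure still require human judgment.
```python
# One possible encoding of the rollback criteria above.
# Ratios are the current value divided by the pre-deployment baseline.
def rollback_severity(error_rate_ratio, response_time_ratio, availability_pct):
    """Return CRITICAL / MAJOR / MINOR / ACCEPTABLE for the metric-based criteria."""
    if error_rate_ratio > 10 or availability_pct < 95:
        return "CRITICAL"   # rollback immediately
    if error_rate_ratio >= 3 or response_time_ratio >= 2:
        return "MAJOR"      # rollback within 30 minutes unless a fix is ready
    if error_rate_ratio >= 1.5 or response_time_ratio >= 1.25:
        return "MINOR"      # monitor closely, fix forward if possible
    return "ACCEPTABLE"     # keep monitoring, no action needed

# Example: error rate 4x baseline, response time 1.8x baseline, site still up.
print(rollback_severity(4.0, 1.8, 99.9))  # MAJOR
```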
**Using markers to execute rollbacks:**
**Step 1: Identify last known good version**
```
Current deployment: v2.5.3 (causing issues)
Review deployment history in reverse:
v2.5.3 - Jan 20, 5:00 PM:
Error rate: 5.2% (PROBLEM)
Response time: 850ms
v2.5.2 - Jan 18, 2:00 PM:
Error rate: 0.4% (GOOD)
Response time: 420ms
Decision: Rollback to v2.5.2
```
**Step 2: Verify the baseline was healthy**
```
Click v2.5.2 deployment marker
Check "3 hours after" metrics:
✓ Error rate: 0.4%
✓ Response time: 420ms
✓ Apdex: 0.94
✓ No related alerts
Confirm: v2.5.2 is a safe rollback target
```
**Step 3: Execute rollback**
```
Deploy v2.5.2 to production
New deployment marker appears:
"v2.5.2-rollback" or "v2.5.2" (redeployment)
Timestamp: Jan 20, 5:35 PM
```
**Step 4: Validate recovery**
```
Monitor the rollback deployment marker:
Minute 0-5:
☐ Error rate dropping back to baseline?
☐ Response time improving?
☐ Alerts clearing?
Minute 5-15:
☐ Metrics stable at previous baseline?
☐ User complaints stopped?
☐ Business functions restored?
Success criteria:
✓ Error rate back to 0.4%
✓ Response time back to 420ms
✓ Apdex back to 0.94
✓ All alerts cleared
Result: Rollback successful, incident resolved
```
**Step 5: Post-mortem**
```
Review v2.5.3 marker to understand what went wrong:
- What changed between v2.5.2 and v2.5.3?
- Why didn't testing catch this?
- What can prevent this in the future?
- When can we safely retry deploying the new features?
```
## Advanced Analysis Techniques
### 1. Correlating Multiple Signals
Deployment markers are most powerful when combined with other data sources:
**Cross-reference with infrastructure events:**
```
Timeline view:
4:45 PM - Code deployment (marker)
4:46 PM - Auto-scaling added 3 servers (infrastructure event)
4:50 PM - Response time improved 30%
Analysis:
Code deployment triggered traffic increase
→ Auto-scaling responded appropriately
→ System handled load well
→ Deployment successful
```
**Cross-reference with external services:**
```
Timeline view:
3:00 PM - Code deployment (marker)
3:02 PM - External API response time spiked
3:05 PM - Your app error rate spiked
Analysis:
Your deployment increased calls to external API
→ External API couldn't handle increased load
→ Your app experienced cascading failures
→ Need to add circuit breaker or rate limiting
```
**Cross-reference with business metrics:**
```
Timeline view:
2:00 PM - Code deployment (marker)
2:15 PM - Checkout conversion rate dropped 15%
2:30 PM - Revenue per hour decreased $2,000
Analysis:
Technical metrics looked fine (response time OK, errors low)
→ But users abandoning checkout due to UX changes
→ Performance isn't the only deployment success metric
→ Need to monitor business KPIs alongside technical metrics
```
### 2. Using NRQL to Query Deployment Data
New Relic Query Language (NRQL) allows programmatic access to deployment data.
**Find all deployments in a time range:**
```sql
SELECT *
FROM Deployment
WHERE appName = 'YourApp - Production'
SINCE 7 days ago
```
**Count deployments per day:**
```sql
SELECT count(*) as 'Deployments'
FROM Deployment
WHERE appName = 'YourApp - Production'
FACET dateOf(timestamp)
SINCE 30 days ago
```
This shows deployment frequency trends over time.
**Find error count by deployment version:**
```sql
SELECT count(*) as 'Total Errors'
FROM TransactionError
WHERE appName = 'YourApp - Production'
FACET deployment.version
SINCE 7 days ago
ORDER BY count(*) DESC
```
Identifies which deployment versions had the most errors.
**Compare response times across deployments:**
```sql
SELECT average(duration) as 'Avg Response (sec)',
percentile(duration, 95) as 'p95 Response (sec)'
FROM Transaction
WHERE appName = 'YourApp - Production'
FACET deployment.version
SINCE 3 days ago
```
Shows performance characteristics of each deployment.
**Identify deployments that caused alert violations:**
```sql
SELECT count(*) as 'Alert Violations'
FROM Alert
WHERE entity.name = 'YourApp - Production'
FACET deployment.version
SINCE 14 days ago
```
Correlates deployments with alert frequency.
**Create a deployment success dashboard:**
```sql
-- Panel 1: Deployment count over time
SELECT count(*) FROM Deployment
WHERE appName = 'YourApp - Production'
TIMESERIES AUTO
SINCE 30 days ago
-- Panel 2: Error rate by deployment
SELECT percentage(count(*), WHERE error IS true) as 'Error %'
FROM Transaction
FACET deployment.version
SINCE 7 days ago
-- Panel 3: Apdex by deployment
SELECT apdex(duration, t:0.5) as 'Apdex Score'
FROM Transaction
FACET deployment.version
SINCE 7 days ago
-- Panel 4: Time between deployments
SELECT average(timestamp - lag(timestamp)) / 3600000 as 'Hours Between Deploys'
FROM Deployment
WHERE appName = 'YourApp - Production'
TIMESERIES AUTO
SINCE 30 days ago
```
### 3. Building Deployment Alerts
Create alerts that fire based on deployment impact:
**Post-deployment error spike alert:**
```
NRQL Alert Condition:
SELECT percentage(count(*), WHERE error IS true)
FROM Transaction
WHERE appName = 'YourApp - Production'
Threshold:
Critical: Error rate > 5% for at least 5 minutes
Condition:
Only evaluate for 30 minutes after a deployment marker appears
Action:
Page on-call engineer
Include: Deployment version, error count, affected transactions
```
**Post-deployment performance degradation alert:**
```
NRQL Alert Condition:
SELECT percentile(duration, 95)
FROM Transaction
WHERE appName = 'YourApp - Production'
Threshold:
Warning: p95 response time > 1.5x last deployment baseline
Critical: p95 response time > 2x last deployment baseline
Baseline:
Dynamic baseline from 1 hour before deployment
Action:
Slack notification with deployment details and transaction breakdown
```
**Deployment frequency anomaly alert:**
```
NRQL Alert Condition:
SELECT count(*)
FROM Deployment
WHERE appName = 'YourApp - Production'
Threshold:
Warning: No deployments in 7 days (team might be stuck)
Warning: >20 deployments in 1 day (possible deployment instability)
Action:
Notify engineering leadership
```
### 4. Analyzing Multi-Service Deployments
In microservices architectures, deployments often span multiple services.
**Scenario: E-commerce platform with multiple services**
```
Services:
- Frontend (React SPA)
- API Gateway (Node.js)
- User Service (Python)
- Order Service (Java)
- Payment Service (Go)
```
**Coordinated deployment:**
```
10:00 AM - Payment Service v3.2.0 deployed
Impact: Payment processing time improved 40ms
10:15 AM - Order Service v2.8.1 deployed
Impact: Calls new Payment Service endpoint
10:30 AM - API Gateway v1.5.3 deployed
Impact: Routes updated for new Order Service behavior
10:45 AM - Frontend v4.1.0 deployed
Impact: UI updated to show enhanced payment options
```
**Cross-service correlation:**
```
User reports error at 10:50 AM
Investigation:
✓ Frontend deployment (10:45): No errors, clean deployment
✓ API Gateway deployment (10:30): No errors, clean deployment
✗ Order Service deployment (10:15): 234 errors starting at 10:20
✓ Payment Service deployment (10:00): No errors, clean deployment
Root cause:
Order Service had a bug in how it called Payment Service
Bug was introduced in v2.8.1 deployment at 10:15
Resolution:
Rollback Order Service to v2.8.0
Users can complete orders again
Fix bug and redeploy as v2.8.2
```
**Using Service Maps with deployment markers:**
```
Navigate to: APM → Service Maps
Visual representation:
Frontend → API Gateway → Order Service → Payment Service
Click each service to see recent deployments
Trace request path across all services
Identify which service in the chain is slow/failing
Example trace:
Frontend: 50ms (normal)
API Gateway: 20ms (normal)
Order Service: 3,200ms (SLOW - deployed 10:15)
Payment Service: 45ms (normal)
Conclusion: Order Service deployment at 10:15 is the bottleneck
```
## Common Troubleshooting Scenarios
### Scenario 1: Deployment marker shows no impact but users report issues
**Symptoms:**
- Deployment marker at 3:00 PM
- Web Transaction Impacts shows minimal change (<2%)
- Users reporting errors/slowness starting 3:05 PM
**Possible causes and solutions:**
**A) Low traffic during comparison window**
```
Problem:
"3 hours before" period was overnight (low traffic)
"3 hours after" period is daytime (high traffic)
Metrics incomparable due to different traffic patterns
Solution:
Change comparison window to same time of day
Example: Compare 3:00-6:00 PM today vs yesterday
Or: Compare to same day last week at same time
```
**B) Issue affects non-instrumented code paths**
```
Problem:
Deployment changed background job processing
New Relic only instruments web transactions
Background jobs not visible in web transaction metrics
Solution:
Check custom events or application logs
Instrument background jobs
Look for scheduled task failures
```
**C) Business logic error without technical error**
```
Problem:
Code processes successfully (no exceptions thrown)
But produces wrong business results
Example: Checkout calculates tax incorrectly
Solution:
Review business KPIs (conversion rate, revenue)
Check application-specific metrics
Review logic in deployment code changes
Test scenarios reported by users
```
**D) Gradual resource exhaustion**
```
Problem:
Deployment introduced memory leak
First 3 hours look fine (memory slowly filling)
Hour 4-6: Performance degrades as GC thrashing begins
Solution:
Check JVM/runtime metrics over longer timeframe
Look for gradual memory increase
Monitor garbage collection frequency
Review object lifecycle in new code
```
### Scenario 2: Multiple deployment markers, can't tell which caused issue
**Symptoms:**
- Error spike at 4:30 PM
- Deployments at 4:00, 4:15, 4:25, 4:35 PM
- Unclear which is responsible
**Investigation approach:**
**Step 1: Review timeline precision**
```
Error spike: 4:30:00 PM
Deployments:
4:00 PM - Frontend v2.1 (30 min before spike)
4:15 PM - API v1.8 (15 min before spike)
4:25 PM - Database service config (5 min before spike)
4:35 PM - Cache service v3.2 (5 min after spike)
Initial conclusion:
Cache service deployment AFTER spike, unlikely cause
Database config change VERY close to spike, likely cause
```
**Step 2: Check error details**
```
Click on error spike, group by error message:
"Connection timeout to database" - 1,240 errors
Started: 4:30:15 PM
"Cache miss rate exceeded" - 15 errors
Started: 4:36:00 PM
Conclusion:
Database errors started immediately (4:30:15)
Database config change was at 4:25:00
5 minute gap is reasonable for config propagation
→ Database config change caused the spike
```
**Step 3: Verify by reviewing each deployment**
```
Click deployment marker: Database service config (4:25 PM)
Related errors: 1,240 ← Matches error spike
Error message: "Connection timeout" ← Matches error type
Confirmed: Database config change caused the issue
```
### Scenario 3: Deployment marker timestamp doesn't match actual deployment
**Symptoms:**
- Deployment completed at 2:45 PM
- Marker shows 2:32 PM
- Metrics comparison seems off
**Possible causes:**
**A) Marker created at deployment start, not completion**
```
Problem:
Deployment process:
2:32 PM - Deployment starts, marker created
2:32-2:45 PM - Code rolls out across servers
2:45 PM - Last server deployed, process complete
Result:
Marker at 2:32 shows mixture of old/new code
"3 hours after" includes partial deployment period
Metrics don't cleanly separate old vs new
Solution:
Fix deployment automation to create marker after completion
Or manually adjust analysis window to start at 2:45
```
**B) Time zone confusion**
```
Problem:
Deployment logs show 2:45 PM EST
New Relic marker shows 7:45 PM UTC
Developer in PST sees 11:45 AM PST
Everyone thinks deployment was at different time
Solution:
Always use UTC for correlation
Convert all times to UTC before analyzing
Set New Relic UI to display UTC
```
**C) Staged/rolling deployment**
```
Problem:
Canary deployment process:
2:32 PM - 10% of servers deployed
2:40 PM - 50% of servers deployed
2:45 PM - 100% of servers deployed
Marker created at first deployment (2:32)
But full impact not visible until 2:45
Solution:
Create multiple markers for staged deployments
Label each stage: v2.0-canary-10pct, v2.0-canary-50pct, v2.0-full
Compare metrics at each stage
```
### Scenario 4: Clean deployment marker but errors increase days later
**Symptoms:**
- Deployment at Monday 2:00 PM
- Marker shows clean deployment (no issues)
- Wednesday 3:00 PM - error rate spikes
- Errors trace back to code from Monday deployment
**Investigation:**
**Step 1: Verify the connection**
```
Check error stack traces:
Are errors in code paths changed Monday?
Or in unrelated code?
Example:
Error: "Index out of bounds in report generation"
Stack trace shows: generateMonthlyReport() method
Monday deployment added: Monthly report feature
Connection confirmed: Monday code is responsible
```
**Step 2: Understand the delay**
```
Why did it take 2 days to surface?
Common reasons:
A) Time-based trigger:
Monthly report runs on first Wednesday of month
Code never executed until Wednesday
B) Data volume threshold:
Code works fine with small datasets
Database grew over 2 days
Wednesday: Dataset crossed threshold, code fails
C) Rare edge case:
99% of inputs work fine
Wednesday: User hit the 1% edge case
D) Resource accumulation:
Small memory leak in Monday code
Takes 48 hours to accumulate enough to cause issues
Wednesday: Memory exhausted, errors start
```
**Step 3: Improve testing**
```
Why didn't testing catch this?
Gap analysis:
- No scheduled job testing (time-based trigger missed)
- Test data too small (volume threshold not tested)
- Test cases don't cover edge cases (rare input missed)
- No long-running test environments (leak not detected)
Improvements:
- Add scheduled job tests to CI/CD
- Use production-sized datasets in staging
- Increase edge case coverage
- Run performance tests for 24+ hours
```
### Scenario 5: Deployment shows improvement but users complain
**Symptoms:**
- Deployment marker shows response time improved 20%
- Error rate decreased 50%
- Technical metrics all positive
- Support tickets increased 300%
**Investigation:**
**Check what metrics don't measure:**
**A) Business logic errors**
```
Technical success, business failure:
Old code: Tax calculated incorrectly (overcharged)
New code: Tax calculated correctly (charges appropriate amount)
User perception: "Prices went up 8%"
Technical metrics: All green (code working correctly)
Business reality: Users angry about correct pricing
```
**B) User experience changes**
```
Technical success, UX regression:
Old code: 1-click checkout (response time 800ms)
New code: 2-step checkout with validation (400ms per step)
Technical metrics: 50% faster response time!
User perception: "Checkout takes longer and is more annoying"
Reality: Extra step adds friction despite faster code
```
**C) Removed functionality**
```
Technical success, feature loss:
Old code: Advanced search with 20 filters (slow)
New code: Simple search with 5 filters (fast)
Technical metrics: 80% faster search!
User perception: "I can't find products anymore"
Reality: Performance improvement by removing features users needed
```
**Solution approach:**
```
1. Monitor business KPIs alongside technical metrics:
- Conversion rates
- Revenue per user
- Session duration
- Feature usage rates
2. Collect qualitative feedback:
- Support ticket themes
- User surveys
- Session replay analysis
- A/B test results
3. Balance technical and business success:
- Fast but broken is failure
- Slow but functional might be acceptable
- Best: Fast AND meets user needs
```
## Best Practices for Working with Deployment Markers
### 1. Establish Deployment Hygiene
**Create meaningful version identifiers:**
```
Good version formats:
✓ Semantic versioning: v2.5.3 (major.minor.patch)
✓ Date-based: 2026-01-17.1 (date.sequence)
✓ Build number: build-18885
✓ Git commit: a3f9c2d
✓ Combined: 2026-01-17-build-18885-a3f9c2d
Bad version formats:
✗ "Latest"
✗ "Production"
✗ "Jan deployment"
✗ Random strings: "xK8mP2q"
Why it matters:
- Sortable versions allow chronological ordering
- Descriptive versions map to source control
- Unique versions prevent confusion
```
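A combined identifier like the last good example can be generated automatically in CI. A rough sketch, assuming a BUILD_NUMBER environment variable and a git checkout; adapt it to whatever your pipeline actually exposes:
```python
# Rough sketch: build a sortable, source-mapped version string in CI.
# Assumes a BUILD_NUMBER environment variable and a git checkout are available.
import os
import subprocess
from datetime import datetime, timezone

def build_version():
    date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    build = os.environ.get("BUILD_NUMBER", "0")
    sha = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    return f"{date}-build-{build}-{sha}"

print(build_version())  # e.g. 2026-01-17-build-18885-a3f9c2d
```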
**Populate changelog fields:**
```
Good changelog:
"JIRA-5432: Optimize database queries for user dashboard (-40ms).
JIRA-5441: Fix null pointer in payment processing.
JIRA-5450: Add support for EU tax calculations.
Breaking change: Removed deprecated /api/v1/legacy endpoint."
Bad changelog:
"Bug fixes and improvements"
"Weekly release"
""
Why it matters:
- Future you (6 months later) needs context
- Incident responders need to know what changed
- Compliance/audit requires change documentation
```
**Deploy one environment at a time:**
```
Good deployment flow:
Monday 10 AM: Deploy to dev, monitor 24h
Tuesday 10 AM: Deploy to staging, monitor 24h
Wednesday 10 AM: Deploy to production, monitor ongoing
Bad deployment flow:
Monday 10 AM: Deploy to dev, staging, prod simultaneously
Why it matters:
- Issues caught in dev don't reach production
- Staging validates production behavior
- Rollback doesn't affect multiple environments
- Clear deployment markers per environment
```
### 2. Build Post-Deployment Rituals
**The 5-Minute Check:**
```
Every deployment, without exception:
☐ Open deployment marker
☐ Check "Related errors" count
☐ Check "Related alerts" status
☐ Scan "Web Transaction Impacts" for regressions >10%
☐ Verify throughput maintained or increased
☐ Quick check of error rate on Errors page
☐ Glance at Apdex score
If all green: Continue monitoring
If any red: Investigate immediately
```
**The 30-Minute Deep Dive:**
```
For significant deployments:
☐ Review all changed transactions in detail
☐ Compare database query performance
☐ Check external service response times
☐ Review error messages and stack traces
☐ Check business metrics (if available)
☐ Review sample of transaction traces
☐ Verify no gradual degradation trends
Document findings for future reference
```
**The 24-Hour Retrospective:**
```
Day after each deployment:
☐ Review full day of metrics post-deployment
☐ Compare to same day previous week
☐ Check for delayed effects (memory leaks, etc.)
☐ Review support tickets related to deployment
☐ Assess whether deployment met goals
☐ Document lessons learned
Feed learnings into next deployment
```
### 3. Build Deployment Intelligence
**Track deployment success metrics:**
```
Maintain a deployment scorecard:
Deployment Frequency:
- Deployments per week
- Trend over time (increasing = good)
Deployment Size:
- Files changed per deployment
- Trend over time (decreasing = good)
Deployment Success Rate:
- % with no issues: 85%
- % with minor issues: 10%
- % requiring rollback: 5%
- Target: >80% clean deployments
Mean Time to Detect:
- How long until issues discovered
- Target: <5 minutes
Mean Time to Resolve:
- How long until issues fixed
- Target: <30 minutes
Change Failure Rate:
- % of deployments causing incidents
- DORA metric target: <15%
```
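Most of this scorecard can be computed automatically if you log each deployment's outcome somewhere queryable. The sketch below works from a hypothetical in-memory deployment log; the field names are made up for illustration, so map them to whatever your own records capture.
```python
# Sketch: compute scorecard metrics from a hypothetical deployment log.
# Field names (outcome, detect_minutes, resolve_minutes) are made up for illustration.
deployments = [
    {"version": "v2.1.1", "outcome": "clean", "detect_minutes": None, "resolve_minutes": None},
    {"version": "v2.1.2", "outcome": "clean", "detect_minutes": None, "resolve_minutes": None},
    {"version": "v2.1.3", "outcome": "incident", "detect_minutes": 4, "resolve_minutes": 35},
    {"version": "v2.1.4", "outcome": "minor", "detect_minutes": 12, "resolve_minutes": 60},
    {"version": "v2.1.5", "outcome": "clean", "detect_minutes": None, "resolve_minutes": None},
]

total = len(deployments)
clean = sum(1 for d in deployments if d["outcome"] == "clean")
failed = [d for d in deployments if d["outcome"] == "incident"]
detected = [d for d in deployments if d["detect_minutes"] is not None]

print(f"Clean deployments:    {clean / total:.0%}  (target >80%)")
print(f"Change failure rate:  {len(failed) / total:.0%}  (DORA target <15%)")
if detected:
    mttd = sum(d["detect_minutes"] for d in detected) / len(detected)
    mttr = sum(d["resolve_minutes"] for d in detected) / len(detected)
    print(f"Mean time to detect:  {mttd:.0f} min  (target <5)")
    print(f"Mean time to resolve: {mttr:.0f} min  (target <30)")
```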
**Learn from patterns:**
```
Review last 20 deployments:
Identify correlations:
- Day of week: Friday deployments have 3x failure rate
- Time of day: Deployments during peak traffic risky
- Team member: Junior devs need more code review
- Code area: Payment module needs more testing
- Change type: Database migrations frequently problematic
Adjust process:
- No Friday deployments (wait until Monday)
- Deploy during low-traffic hours
- Pair junior developers with seniors
- Add payment-specific test suite
- Require DB migration dry runs in staging
```
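The day-of-week correlation and similar patterns are easy to surface once deployments are logged with a date and an outcome. A small sketch, again using made-up record fields:
```python
# Sketch: failure rate by day of week from a hypothetical deployment log.
from collections import defaultdict
from datetime import date

log = [
    {"day": date(2026, 1, 9),  "failed": True},   # Friday
    {"day": date(2026, 1, 12), "failed": False},  # Monday
    {"day": date(2026, 1, 14), "failed": False},  # Wednesday
    {"day": date(2026, 1, 16), "failed": True},   # Friday
    {"day": date(2026, 1, 19), "failed": False},  # Monday
]

by_weekday = defaultdict(lambda: {"total": 0, "failed": 0})
for entry in log:
    stats = by_weekday[entry["day"].strftime("%A")]
    stats["total"] += 1
    stats["failed"] += entry["failed"]

for weekday, stats in by_weekday.items():
    print(f"{weekday}: {stats['failed']}/{stats['total']} deployments failed")
```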
### 4. Collaborate Using Deployment Data
**Share context with your team:**
```
In deployment channel (Slack/Teams):
"Deploying v2.5.1 to production at 2:00 PM
Changes: JIRA-5432 (perf optimization), JIRA-5441 (bug fix)
Expected impact: 20% faster /dashboard response time
Monitoring: Will check at 2:05, 2:30, and 3:00 PM
Rollback plan: v2.5.0 if error rate >2%"
After deployment:
"v2.5.1 deployed successfully at 2:03 PM ✓
New Relic marker: [link]
Metrics: Response time -22% (better than expected)
Error rate: 0.3% (baseline)
No issues detected"
```
**Escalate with data:**
```
When issues occur, include deployment context:
"Incident: Error rate spike to 5.2%
Started: 4:32 PM
Deployment: v2.5.3 at 4:30 PM [New Relic marker link]
Affected: /api/checkout endpoint specifically
Impact: ~500 users experiencing checkout failures
Root cause: Database query timeout (new query added)
Action: Rollback to v2.5.2 in progress
ETA: 4:45 PM
New Relic comparison:
v2.5.2: 420ms avg, 0.4% errors
v2.5.3: 850ms avg, 5.2% errors"
```
**Build institutional knowledge:**
```
Document deployment failures:
Incident post-mortem template:
1. What happened?
"v2.5.3 deployment caused 5% error rate in checkout"
2. Deployment details
Marker: [link]
Version: v2.5.3
Time: Jan 20, 4:30 PM
Changes: [list]
3. Root cause
"Added unindexed database query in checkout flow"
4. Detection
"New Relic alert fired at 4:32 PM (2 min after deployment)"
"On-call engineer reviewed deployment marker"
5. Resolution
"Rolled back to v2.5.2 at 4:45 PM"
"Error rate returned to 0.4% by 4:47 PM"
6. Prevention
"Add query analysis to code review checklist"
"Require database execution plan review for new queries"
"Add database performance tests to CI/CD"
```
## Understanding What Deployment Markers Don't Show
### Limitations to Be Aware Of
**1. Deployment markers are correlation, not always causation:**
```
Example:
2:00 PM - Your deployment
2:05 PM - Error spike
Could be:
✓ Your deployment caused errors (causation)
✗ Upstream service failed coincidentally (correlation)
✗ DDoS attack started (unrelated)
✗ Infrastructure issue (network, database)
Always verify the causal link by:
- Checking error details
- Reviewing code changes
- Confirming errors are in changed code paths
```
**2. Some changes don't create deployment markers:**
```
Untracked changes:
- Feature flag toggles
- Configuration file updates
- Database migrations (separate from code deploy)
- Infrastructure changes (server upgrades)
- CDN cache purges
- Third-party service updates
Solution:
Record these as custom events in New Relic
Create your own markers for significant changes
```
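Feature flag toggles, config updates, and similar untracked changes can be recorded with New Relic's Event API so they appear in NRQL queries alongside your deployments. A minimal sketch, assuming a US-region account and an ingest (license/insert) key; the ConfigChange event type and its attributes are invented for illustration:
```python
# Minimal sketch: record a config change as a custom event via the Event API.
# Assumes a US-region account and an ingest (license/insert) key; the
# "ConfigChange" event type and its attributes are made up for illustration.
import os
import requests

ACCOUNT_ID = 1234567  # hypothetical account ID
EVENT_API_URL = f"https://insights-collector.newrelic.com/v1/accounts/{ACCOUNT_ID}/events"

event = {
    "eventType": "ConfigChange",
    "appName": "YourApp - Production",
    "change": "Feature flag enable_new_checkout set to true",
    "user": "deploy-bot",
}

resp = requests.post(
    EVENT_API_URL,
    headers={"Api-Key": os.environ["NEW_RELIC_INSERT_KEY"]},
    json=[event],  # the Event API accepts a JSON array of events
    timeout=10,
)
resp.raise_for_status()

# Afterwards the change is queryable, e.g.:
#   SELECT * FROM ConfigChange SINCE 1 day ago
```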
**3. Gradual rollouts appear as single markers:**
```
Canary deployment:
1:00 PM - Deploy to 5% of servers (marker created)
1:30 PM - Deploy to 25% of servers (no new marker)
2:00 PM - Deploy to 100% of servers (no new marker)
Metric changes happen gradually but marker shows only first deployment
User impact spreads over an hour, not instantly at 1:00 PM
Solution:
Create multiple markers for each rollout stage
Label clearly: v2.0-5pct, v2.0-25pct, v2.0-100pct
```
**4. Deployment success metrics can be misleading:**
```
False positives:
Technical metrics green, but business impact negative
Example: Fast but broken feature, correct pricing that users hate
False negatives:
Technical metrics show regression, but acceptable trade-off
Example: Slower response time but critical security fix applied
Solution:
Balance technical and business metrics
Context matters: not all "red" metrics are bad
```
## Official Documentation and Resources
### New Relic Documentation
- **[Track changes using NerdGraph (GraphQL)](https://docs.newrelic.com/docs/change-tracking/change-tracking-graphql/)** - Official guide for creating deployment markers via GraphQL API
- **[Introduction to NerdGraph](https://docs.newrelic.com/docs/apis/nerdgraph/get-started/introduction-new-relic-nerdgraph/)** - Getting started with New Relic's GraphQL API
- **[Capture and analyze changes in your systems](https://docs.newrelic.com/docs/change-tracking/change-tracking-introduction/)** - Overview of New Relic's change tracking capabilities
- **[Record and view deployments](https://docs.newrelic.com/docs/apm/apm-ui-pages/events/record-deployments/)** - Legacy REST API documentation (migration to GraphQL recommended)
- **[How to view and analyze your changes](https://docs.newrelic.com/docs/change-tracking/change-tracking-view-analyze/)** - Guide to using the change tracking UI
### Community Resources and Tutorials
- **[Deployment Tracking 101: CI/CD Best Practices](https://newrelic.com/blog/news/change-tracking)** - New Relic blog post on change tracking best practices
- **[Change Tracking for Performance Velocity](https://newrelic.com/blog/how-to-relic/change-tracking-for-performance-velocity)** - Tutorial on using deployment markers for performance analysis
- **[Getting Started With NerdGraph—The New Relic GraphQL API Explorer](https://newrelic.com/blog/how-to-relic/graphql-api)** - Interactive guide to using the GraphQL API explorer
### API Tools
- **[New Relic NerdGraph GraphQL API Collection (Postman)](https://www.postman.com/new-relic/new-relic-graphql-api-collection/documentation/btuxnnc/new-relic-nerdgraph-graphql-api-collection)** - Pre-built Postman collection for testing deployment marker API calls
## Conclusion
Deployment markers transform New Relic from a reactive monitoring tool ("something broke, figure out why") into a proactive validation system ("I deployed this change, did it improve or degrade performance?").
**The key insight**: Every deployment is an experiment. Deployment markers provide the measurement framework to evaluate that experiment objectively.
**Use them to:**
- ✅ Validate that deployments improved what you intended
- ✅ Catch regressions immediately after deployment
- ✅ Build confidence in your deployment process
- ✅ Make data-driven rollback decisions
- ✅ Track team velocity and deployment success rates
- ✅ Correlate code changes with business metrics
**Start simple**: After every deployment, spend 5 minutes reviewing the deployment marker. Check for related errors, related alerts, and transaction impacts. That one habit will catch most issues early and build your deployment confidence over time.