# New Relic Deployment Markers: User's Guide
## What Are Deployment Markers?
Deployment markers are timestamped events in New Relic that record when code changes are released to your applications. They appear as vertical lines or annotations overlaid on your APM performance charts, creating a clear visual boundary between "before the change" and "after the change."
**Think of them as bookmarks in your application's performance timeline** - they help you answer the critical question: "What changed when performance shifted?"
## The Fundamental Problem Deployment Markers Solve
Your application generates thousands of metrics every minute: response times, error rates, throughput, CPU usage, memory consumption. These metrics constantly fluctuate due to traffic patterns, user behavior, and external dependencies.
**The challenge**: When metrics suddenly change, how do you know why?
**Common scenarios:**
```
Scenario 1: Error rate doubles at 3:47 PM
Possible causes:
- Code deployment introduced a bug
- Infrastructure failure (database down)
- Traffic spike (DDoS, viral content)
- External API outage
- Configuration change
- Database query plan changed
Without markers: Check all possibilities manually (30+ minutes)
With markers: See deployment at 3:45 PM immediately (30 seconds)
```
```
Scenario 2: Response time gradually increases over 2 weeks
Possible causes:
- Code gradually getting slower (technical debt)
- Database growing larger (query performance degrading)
- Memory leak accumulating
- Cache hit rate declining
Without markers: Unclear when degradation started
With markers: Each deployment shows incremental impact
```
## Understanding the Deployment Detail View
When you click on a deployment marker in New Relic, you see a detailed breakdown of that deployment's impact:
### Header Information
**Entity**: Which application/service was deployed
- Example: "API Gateway - Production", "Frontend - Staging"
- Critical in microservices: ensures you're looking at the right service
**Timestamp**: Exact moment the deployment completed
- Example: "Jan 17, 2026, 04:59:40.067 PM"
- Precision to the millisecond for accurate correlation
**Version**: Unique identifier for this release
- Example: "v2.5.1", "2026-01-17-build-8885", "commit-a3f9c2d"
- Should map to your source control (git tag, commit hash)
- Allows you to know exactly which code is running
**Deployment ID**: New Relic's internal identifier
- Used for API queries and programmatic access
### Key Impacts Section
Shows high-level changes in transaction volume.
**Example: "12.3K occurrences, +15.2%"**
This means 15.2% more transactions occurred in the comparison period after deployment vs. before.
**Interpreting the change:**
**Positive indicators:**
- ✅ Increased transactions + stable error rate = healthy growth
- ✅ Decreased errors + stable traffic = bug fixes working
- ✅ Decreased slow transactions = performance improvement
**Negative indicators:**
- ❌ Decreased transactions + normal error rate = users can't reach endpoint
- ❌ Increased errors + stable traffic = new bugs introduced
- ❌ Increased slow transactions = performance regression
**Neutral indicators:**
- 📊 Small changes (<5%) = normal variance, likely unrelated to deployment
- 📊 Expected changes = feature removal, traffic shifting to new endpoints
### Related Errors Section
Shows errors that occurred during or immediately after the deployment.
**"235 related errors found"** means:
- Errors occurred in the time window around this deployment
- New Relic detected correlation between deployment timing and error timing
- Click through to see error details, stack traces, and affected users
**"No related errors found"** means:
- No errors during deployment execution
- No error spikes immediately after deployment
- ⚠️ Note: Doesn't guarantee zero errors - check the Errors page separately for ongoing issues
**Using this data:**
```
Clean deployment:
✓ No related errors
✓ Action: Monitor for delayed effects (next 24 hours)
Problematic deployment:
✗ 1,247 related errors
✗ Action: Click through to identify error type
→ If critical: rollback immediately
→ If minor: plan hotfix
```
### Related Alerts Section
Shows if any alert conditions triggered around the deployment.
**"High Error Rate Alert triggered 3 minutes after deployment"** tells you:
- Deployment breached a predefined threshold
- Impact was severe enough to warrant notification
- Automatic incident created for tracking
**"No related alerts found"** means:
- Metrics stayed within acceptable bounds
- No automatic incidents created
- Changes were either positive or within tolerance
**Interpreting alert correlation:**
```
Alert triggered immediately (0-5 min):
→ Strong indication deployment caused the issue
→ Errors/performance degradation in changed code paths
Alert triggered later (30+ min):
→ Possible correlation but verify
→ Could be delayed effect (memory leak, cache warming)
→ Could be coincidental (unrelated traffic spike)
No alert but metrics changed:
→ Change exists but below alert threshold
→ Review whether thresholds need adjustment
→ Good candidate for optimization work
```
### Web Transaction Impacts
This is the most powerful section: it shows exactly which endpoints were affected by the deployment.
**Example display:**
```
Transaction: WebTransaction/Action/api/checkout
3 hours before: 285.3 ms average
3 hours after: 312.7 ms average
Impact: +27.4 ms (+9.6% slower)
Transaction: WebTransaction/Action/api/search
3 hours before: 156.8 ms average
3 hours after: 142.1 ms average
Impact: -14.7 ms (-9.4% faster)
```
**What this tells you:**
1. **Specific impact**: Not all endpoints affected equally
- Checkout got slower (regression)
- Search got faster (improvement)
2. **Magnitude**: Quantified change in milliseconds and percentage
- 27 ms may seem small, but a nearly 10% regression is significant
- Helps prioritize which issues to fix
3. **Transaction isolation**: Identifies exactly where to investigate
- Look at checkout code changes
- Verify search optimization worked
**Analysis patterns:**
**Single transaction slower, others unchanged:**
```
Likely cause: Change specific to that endpoint
Action: Review code changes for that transaction
Look for: New database queries, external API calls, inefficient algorithms
```
**All transactions slower by similar amount:**
```
Likely cause: Global overhead added (middleware, logging, monitoring)
Action: Review infrastructure changes, framework updates
Look for: New request interceptors, increased logging verbosity
```
**Some transactions slower, some faster:**
```
Likely cause: Refactoring shifted performance characteristics
Action: Verify slower transactions are acceptable trade-off
Look for: Shared resources (database, cache, external services)
```
**Specific transaction disappeared from list:**
```
Likely cause: Endpoint removed, renamed, or broken
Action: Check if removal was intentional
Look for: Error logs showing 404s or routing failures
```
### Deployment Attributes
Additional metadata about the deployment:
**Version**: Your application version identifier
- Should be meaningful and sortable
- Best practices: semantic versioning (v1.2.3), date-based (2026-01-17.1)
**Changelog**: Optional field for release notes
- Best practice: Include ticket numbers, feature summaries
- Example: "JIRA-5432: Optimize database queries for user dashboard. JIRA-5441: Fix null pointer in payment processing."
- Makes future troubleshooting much easier
**User**: Who triggered the deployment (if recorded)
- Helpful for accountability and context
- Example: "deploy-bot" vs "john.doe@company.com"
**Description**: Additional context about the deployment
- Use for: deployment type (hotfix, regular release, rollback)
- Environment details (canary, blue-green, rolling)
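These attributes correspond to the fields you supply when the marker is created. Markers can be created through New Relic's NerdGraph (GraphQL) API (see the documentation links at the end of this guide); the sketch below is a minimal, illustrative example assuming a User API key and the entity GUID of your APM application, with field names taken from the change tracking mutation, so verify them against your account before wiring this into a pipeline.
```python
# Minimal sketch: create a deployment marker via NerdGraph change tracking.
# Assumes a User API key in NEW_RELIC_API_KEY and the entity GUID of the
# target application; verify field names against the change tracking docs.
import os
import requests

NERDGRAPH_URL = "https://api.newrelic.com/graphql"  # US region endpoint

mutation = """
mutation {
  changeTrackingCreateDeployment(
    deployment: {
      entityGuid: "YOUR_ENTITY_GUID"
      version: "2026-01-17-build-8885"
      changelog: "JIRA-5432: Optimize database queries for user dashboard"
      description: "Regular weekly release"
      user: "deploy-bot"
    }
  ) {
    deploymentId
  }
}
"""

resp = requests.post(
    NERDGRAPH_URL,
    headers={"API-Key": os.environ["NEW_RELIC_API_KEY"]},
    json={"query": mutation},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # on success, includes the deploymentId shown in the detail view
```
Most CI/CD pipelines call something equivalent to this as the final step of the deploy job, which also keeps the marker timestamp aligned with deployment completion rather than deployment start (a pitfall covered later in this guide).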
## How to Use Deployment Markers Effectively
### 1. Post-Deployment Validation (The Critical First Hour)
**Minute 0-5: Immediate Health Check**
As soon as the deployment marker appears:
```
Quick scan checklist:
☐ Related errors: Any new exceptions?
☐ Related alerts: Did thresholds breach?
☐ Key transactions: Response times acceptable?
☐ Throughput: Traffic flowing normally?
☐ Error rate: Within baseline?
```
**What you're looking for:**
**Green light (proceed with confidence):**
- No related errors
- No related alerts
- Transaction times improved or stable (<5% change)
- Throughput stable or increased
- Error rate stable or decreased
**Yellow light (monitor closely):**
- Small number of errors (1-10) in non-critical paths
- Transaction times slightly increased (5-15%)
- Single transaction affected, others fine
- No alerts but metrics approaching thresholds
**Red light (prepare to rollback):**
- Hundreds of related errors
- Critical alerts triggered
- Transaction times doubled or more
- Throughput dropped significantly
- Error rate spiked >3x baseline
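These green/yellow/red criteria can also be scripted so the first check runs the moment the marker appears. A rough sketch, assuming a User API key, an account ID, and NerdGraph's NRQL query endpoint; the thresholds are illustrative and should be replaced with ratios against your own baseline:
```python
# Rough sketch: automate the first-five-minutes check by running NRQL through
# NerdGraph. Assumes a User API key and account ID; thresholds are illustrative.
import os
import requests

NERDGRAPH_URL = "https://api.newrelic.com/graphql"  # US region endpoint
ACCOUNT_ID = 1234567                                # hypothetical account ID
APP_NAME = "YourApp - Production"

def run_nrql(nrql):
    """Run one NRQL query via NerdGraph and return the result rows."""
    graphql = f'{{ actor {{ account(id: {ACCOUNT_ID}) {{ nrql(query: "{nrql}") {{ results }} }} }} }}'
    resp = requests.post(
        NERDGRAPH_URL,
        headers={"API-Key": os.environ["NEW_RELIC_API_KEY"]},
        json={"query": graphql},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["actor"]["account"]["nrql"]["results"]

# Result keys follow the aliases used in the NRQL.
error_rate = run_nrql(
    f"SELECT percentage(count(*), WHERE error IS true) AS 'errorRate' "
    f"FROM Transaction WHERE appName = '{APP_NAME}' SINCE 5 minutes ago"
)[0]["errorRate"]

avg_duration = run_nrql(
    f"SELECT average(duration) AS 'avgDuration' "
    f"FROM Transaction WHERE appName = '{APP_NAME}' SINCE 5 minutes ago"
)[0]["avgDuration"]

# Illustrative thresholds; replace with ratios against your own baseline.
if error_rate > 3.0:
    print(f"RED: error rate {error_rate:.1f}% - prepare to roll back")
elif error_rate > 1.0 or avg_duration > 1.0:
    print(f"YELLOW: error rate {error_rate:.1f}%, avg {avg_duration:.2f}s - monitor closely")
else:
    print(f"GREEN: error rate {error_rate:.1f}%, avg {avg_duration:.2f}s - proceed")
```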
**Minute 5-30: Deeper Analysis**
After initial health check passes, dig deeper:
```
1. Review "Top 10 web transactions"
- Click each transaction showing significant change
- Examine transaction traces from after deployment
- Identify specific slow operations (DB queries, external calls)
2. Check transaction distribution
- APM → Transactions
- Sort by throughput
- Verify traffic distribution matches expectations
3. Review database performance
- APM → Databases
- Check for new slow queries
- Verify query counts are reasonable
4. Check external services
- APM → External services
- Verify third-party API response times
- Check for new external calls
5. Review error details
- APM → Errors
- Group by error class
- Read error messages and stack traces
- Verify errors make sense given code changes
```
**Minute 30-60: User Impact Assessment**
After technical validation, assess user experience:
```
1. Check Apdex score
- APM → Summary
- Apdex shows user satisfaction (0.0 = awful, 1.0 = perfect)
- Target: >0.9 for most applications
- Compare pre/post deployment
2. Review browser monitoring (if enabled)
- Browser → Summary
- Check page load times from user perspective
- Verify frontend performance acceptable
3. Check synthetic monitors (if configured)
- Synthetics → Monitors
- Verify scripted checks passing
- Confirm critical user flows working
4. Monitor real user metrics
- Browser → Session traces
- Sample actual user sessions
- Identify any broken workflows
```
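For reference, the Apdex score checked above is a simple ratio: requests completing within the threshold T count as satisfied, requests within 4T count as tolerating (at half weight), and anything slower counts as frustrated. A small sketch with made-up sample durations:
```python
# Standard Apdex calculation: satisfied requests complete within T,
# tolerating requests within 4T, everything slower counts as frustrated.
def apdex(durations_s, t=0.5):
    satisfied = sum(1 for d in durations_s if d <= t)
    tolerating = sum(1 for d in durations_s if t < d <= 4 * t)
    return (satisfied + tolerating / 2) / len(durations_s)

# Illustrative sample: most requests fast, a few slow.
sample = [0.2, 0.3, 0.4, 0.6, 1.1, 2.5]
print(f"Apdex(T=0.5s) = {apdex(sample):.2f}")  # 3 satisfied, 2 tolerating -> 0.67
```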
### 2. Investigating Performance Issues
**Scenario**: Users report slowness, or you notice degraded metrics
**Step 1: Identify the timeframe**
```
When did the issue start?
- Check alerts for when threshold breached
- Ask users when they first noticed
- Review metric charts for inflection point
```
**Step 2: Find relevant deployment markers**
```
Navigate to: APM → Your Application
Set time range: Start from when issue began, look back 24 hours
Scan timeline: Identify all deployment markers in that window
Example:
Issue reported: 5:15 PM
Time range: 4:00 PM - 5:30 PM
Markers found:
- 4:45 PM - Backend API v2.3.1
- 5:02 PM - Frontend v1.8.2
```
**Step 3: Evaluate each deployment**
```
For each marker, click and review:
Deployment at 4:45 PM (Backend API):
✓ No related errors
✓ Transaction times stable
✗ Throughput decreased 15%
→ Not the smoking gun, but worth noting
Deployment at 5:02 PM (Frontend):
✗ 847 related errors
✗ Page load time increased 200%
✗ Related alert triggered at 5:04 PM
→ FOUND IT: This deployment caused the issue
```
**Step 4: Drill into the problematic deployment**
```
Click the deployment marker at 5:02 PM
Review Web Transaction Impacts:
/checkout: 1.2s → 3.8s (+217% slower) ← Problem here
/search: 0.3s → 0.3s (no change)
/home: 0.5s → 0.5s (no change)
Conclusion: Checkout endpoint specifically affected
```
**Step 5: Identify root cause**
```
Click on the slow transaction: /checkout
View transaction traces:
- Find slowest trace from after 5:02 PM
- Examine trace details
- Identify which segment is slow
Example trace breakdown:
Middleware: 10ms
Controller: 15ms
Database query: 3,200ms ← This is the problem
Rendering: 100ms
Root cause: New database query taking 3+ seconds
```
**Step 6: Correlate with code changes**
```
Check deployment version: v1.8.2
Look up in version control: git show v1.8.2
Review changes to /checkout endpoint
Find: Added new query to fetch user's full order history
Realize: Query has no index, scanning millions of rows
```
**Step 7: Plan remediation**
```
Options:
1. Immediate rollback to v1.8.1
- Restores performance immediately
- Loses new features in v1.8.2
2. Emergency hotfix
- Add database index
- Deploy as v1.8.3
- Takes 20-30 minutes
3. Optimize query
- Rewrite to only fetch recent orders
- Add caching
- Deploy as v1.8.3
- Takes 2-3 hours
Decision: Rollback now, prepare proper fix for tomorrow
```
### 3. Analyzing Trends Over Time
**Scenario**: Understanding long-term performance trajectory
**View deployment history:**
```
Navigate to: APM → Deployments (in left sidebar)
You'll see a chronological list:
Jan 20, 3:00 PM - v2.1.5
Jan 18, 2:15 PM - v2.1.4
Jan 15, 4:30 PM - v2.1.3
Jan 12, 1:45 PM - v2.1.2
Jan 10, 10:00 AM - v2.1.1
...
```
**Track baseline shifts:**
Create a spreadsheet or dashboard tracking:
| Deployment | Date | Avg Response Time | Error Rate | Throughput | Apdex |
|------------|------|-------------------|------------|------------|-------|
| v2.1.1 | Jan 10 | 420ms | 0.3% | 2,100 rpm | 0.94 |
| v2.1.2 | Jan 12 | 435ms | 0.3% | 2,050 rpm | 0.93 |
| v2.1.3 | Jan 15 | 480ms | 0.5% | 2,000 rpm | 0.89 |
| v2.1.4 | Jan 18 | 520ms | 0.8% | 1,950 rpm | 0.85 |
| v2.1.5 | Jan 20 | 440ms | 0.4% | 2,100 rpm | 0.92 |
**Identify patterns:**
```
v2.1.1 → v2.1.2: Slight degradation (acceptable)
v2.1.2 → v2.1.3: Noticeable degradation (10% slower, higher errors)
v2.1.3 → v2.1.4: Continued degradation (trend established)
v2.1.4 → v2.1.5: Significant improvement (optimization work paid off)
Conclusion:
- v2.1.3 introduced performance regression
- v2.1.4 made it worse
- v2.1.5 fixed both and improved beyond v2.1.1 baseline
```
**Investigate the regression:**
```
Click on v2.1.3 deployment marker
Review Web Transaction Impacts
Identify which transactions got slower
Compare with v2.1.2:
- What features were added?
- What refactoring occurred?
- What dependencies were updated?
Common culprits:
- ORM changes (N+1 queries introduced)
- Dependency updates (framework overhead increased)
- New features (expensive operations in critical path)
- Logging changes (excessive debug logging in production)
```
### 4. Comparing Across Environments
Use deployment markers to validate your promotion pipeline:
**The ideal pattern:**
```
Development Environment:
Deploy: Monday 10:00 AM - v2.2.0-dev
Monitor: 24 hours
Result: ✓ No issues, metrics stable
Staging Environment:
Deploy: Tuesday 10:00 AM - v2.2.0-staging
Monitor: 24 hours
Result: ✓ No issues, metrics stable
Production Environment:
Deploy: Wednesday 10:00 AM - v2.2.0
Expected: Similar metrics to staging
Result: ✓ Metrics match staging prediction
```
**When production differs from staging:**
```
Staging: Response time 300ms, 0.2% errors
Production: Response time 800ms, 2.5% errors
Investigation checklist:
☐ Traffic volume: Production has 10x traffic?
☐ Data volume: Production database 100x larger?
☐ External dependencies: Different APIs in prod vs staging?
☐ Infrastructure: Production servers under-provisioned?
☐ Configuration: Environment-specific settings causing issues?
☐ Geographic distribution: Production users globally distributed?
```
**Building confidence through consistency:**
```
Track prediction accuracy:
Deployment #1:
Staging impact: +15ms response time
Predicted prod: +15ms
Actual prod: +18ms
Accuracy: 83%
Deployment #2:
Staging impact: -30ms response time
Predicted prod: -30ms
Actual prod: -28ms
Accuracy: 93%
Deployment #3:
Staging impact: +5ms response time
Predicted prod: +5ms
Actual prod: +85ms
Accuracy: 6% ← INVESTIGATE
Why so different?
→ Found: Production has a caching layer staging doesn't
→ The code change invalidated cache frequently
→ Staging didn't show this because no cache to invalidate
```
### 5. Making Rollback Decisions
Deployment markers provide objective data for rollback decisions.
**Define rollback criteria in advance:**
```
CRITICAL - Rollback immediately, no questions asked:
✗ Error rate > 10x baseline
✗ Availability < 95% (users can't access site)
✗ Critical business function completely broken (payments, login)
✗ Data corruption detected
✗ Security vulnerability exposed
MAJOR - Rollback within 30 minutes unless fix available:
✗ Error rate 3-10x baseline for 15+ minutes
✗ Response time > 2x baseline for 20+ minutes
✗ Significant user complaints (>10 in 10 minutes)
✗ Revenue-impacting feature degraded
MINOR - Monitor closely, fix forward if possible:
✗ Error rate 1.5-3x baseline
✗ Response time 1.25-2x baseline
✗ Non-critical features affected
✗ Small number of users affected
ACCEPTABLE - Monitor but no action needed:
✓ Error rate <1.5x baseline
✓ Response time <1.25x baseline
✓ Known/expected issues with acceptable impact
```
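Criteria like these are easier to apply consistently mid-incident if they are written down as code. The sketch below is one possible encoding of the metric-based thresholds above (ratios are current value divided by the pre-deployment baseline); the non-metric criteria such as data corruption or security exposure still require human judgment.
```python
# One possible encoding of the rollback criteria above.
# Ratios are the current value divided by the pre-deployment baseline.
def rollback_severity(error_rate_ratio, response_time_ratio, availability_pct):
    """Return CRITICAL / MAJOR / MINOR / ACCEPTABLE for the metric-based criteria."""
    if error_rate_ratio > 10 or availability_pct < 95:
        return "CRITICAL"   # rollback immediately
    if error_rate_ratio >= 3 or response_time_ratio >= 2:
        return "MAJOR"      # rollback within 30 minutes unless a fix is ready
    if error_rate_ratio >= 1.5 or response_time_ratio >= 1.25:
        return "MINOR"      # monitor closely, fix forward if possible
    return "ACCEPTABLE"     # keep monitoring, no action needed

# Example: error rate 4x baseline, response time 1.8x baseline, site still up.
print(rollback_severity(4.0, 1.8, 99.9))  # MAJOR
```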
**Using markers to execute rollbacks:**
**Step 1: Identify last known good version**
```
Current deployment: v2.5.3 (causing issues)
Review deployment history in reverse:
v2.5.3 - Jan 20, 5:00 PM:
Error rate: 5.2% (PROBLEM)
Response time: 850ms
v2.5.2 - Jan 18, 2:00 PM:
Error rate: 0.4% (GOOD)
Response time: 420ms
Decision: Rollback to v2.5.2
```
**Step 2: Verify the baseline was healthy**
```
Click v2.5.2 deployment marker
Check "3 hours after" metrics:
✓ Error rate: 0.4%
✓ Response time: 420ms
✓ Apdex: 0.94
✓ No related alerts
Confirm: v2.5.2 is a safe rollback target
```
**Step 3: Execute rollback**
```
Deploy v2.5.2 to production
New deployment marker appears:
"v2.5.2-rollback" or "v2.5.2" (redeployment)
Timestamp: Jan 20, 5:35 PM
```
**Step 4: Validate recovery**
```
Monitor the rollback deployment marker:
Minute 0-5:
☐ Error rate dropping back to baseline?
☐ Response time improving?
☐ Alerts clearing?
Minute 5-15:
☐ Metrics stable at previous baseline?
☐ User complaints stopped?
☐ Business functions restored?
Success criteria:
✓ Error rate back to 0.4%
✓ Response time back to 420ms
✓ Apdex back to 0.94
✓ All alerts cleared
Result: Rollback successful, incident resolved
```
**Step 5: Post-mortem**
```
Review v2.5.3 marker to understand what went wrong:
- What changed between v2.5.2 and v2.5.3?
- Why didn't testing catch this?
- What can prevent this in the future?
- When can we safely retry deploying the new features?
```
## Advanced Analysis Techniques
### 1. Correlating Multiple Signals
Deployment markers are most powerful when combined with other data sources:
**Cross-reference with infrastructure events:**
```
Timeline view:
4:45 PM - Code deployment (marker)
4:46 PM - Auto-scaling added 3 servers (infrastructure event)
4:50 PM - Response time improved 30%
Analysis:
Code deployment triggered traffic increase
→ Auto-scaling responded appropriately
→ System handled load well
→ Deployment successful
```
**Cross-reference with external services:**
```
Timeline view:
3:00 PM - Code deployment (marker)
3:02 PM - External API response time spiked
3:05 PM - Your app error rate spiked
Analysis:
Your deployment increased calls to external API
→ External API couldn't handle increased load
→ Your app experienced cascading failures
→ Need to add circuit breaker or rate limiting
```
**Cross-reference with business metrics:**
```
Timeline view:
2:00 PM - Code deployment (marker)
2:15 PM - Checkout conversion rate dropped 15%
2:30 PM - Revenue per hour decreased $2,000
Analysis:
Technical metrics looked fine (response time OK, errors low)
→ But users abandoning checkout due to UX changes
→ Performance isn't the only deployment success metric
→ Need to monitor business KPIs alongside technical metrics
```
### 2. Using NRQL to Query Deployment Data
New Relic Query Language (NRQL) allows programmatic access to deployment data.
**Find all deployments in a time range:**
```sql
SELECT *
FROM Deployment
WHERE appName = 'YourApp - Production'
SINCE 7 days ago
```
**Count deployments per day:**
```sql
SELECT count(*) as 'Deployments'
FROM Deployment
WHERE appName = 'YourApp - Production'
FACET dateOf(timestamp)
SINCE 30 days ago
```
This shows deployment frequency trends over time.
**Find error count by deployment version:**
```sql
SELECT count(*) as 'Total Errors'
FROM TransactionError
WHERE appName = 'YourApp - Production'
FACET deployment.version
SINCE 7 days ago
ORDER BY count(*) DESC
```
Identifies which deployment versions had the most errors.
**Compare response times across deployments:**
```sql
SELECT average(duration) as 'Avg Response (sec)',
percentile(duration, 95) as 'p95 Response (sec)'
FROM Transaction
WHERE appName = 'YourApp - Production'
FACET deployment.version
SINCE 3 days ago
```
Shows performance characteristics of each deployment.
**Identify deployments that caused alert violations:**
```sql
SELECT count(*) as 'Alert Violations'
FROM Alert
WHERE entity.name = 'YourApp - Production'
FACET deployment.version
SINCE 14 days ago
```
Correlates deployments with alert frequency.
**Create a deployment success dashboard:**
```sql
-- Panel 1: Deployment count over time
SELECT count(*) FROM Deployment
WHERE appName = 'YourApp - Production'
TIMESERIES AUTO
SINCE 30 days ago
-- Panel 2: Error rate by deployment
SELECT percentage(count(*), WHERE error IS true) as 'Error %'
FROM Transaction
FACET deployment.version
SINCE 7 days ago
-- Panel 3: Apdex by deployment
SELECT apdex(duration, t:0.5) as 'Apdex Score'
FROM Transaction
FACET deployment.version
SINCE 7 days ago
-- Panel 4: Time between deployments
SELECT average(timestamp - lag(timestamp)) / 3600000 as 'Hours Between Deploys'
FROM Deployment
WHERE appName = 'YourApp - Production'
TIMESERIES AUTO
SINCE 30 days ago
```
### 3. Building Deployment Alerts
Create alerts that fire based on deployment impact:
**Post-deployment error spike alert:**
```
NRQL Alert Condition:
SELECT percentage(count(*), WHERE error IS true)
FROM Transaction
WHERE appName = 'YourApp - Production'
Threshold:
Critical: Error rate > 5% for at least 5 minutes
Condition:
Only evaluate for 30 minutes after a deployment marker appears
Action:
Page on-call engineer
Include: Deployment version, error count, affected transactions
```
**Post-deployment performance degradation alert:**
```
NRQL Alert Condition:
SELECT percentile(duration, 95)
FROM Transaction
WHERE appName = 'YourApp - Production'
Threshold:
Warning: p95 response time > 1.5x last deployment baseline
Critical: p95 response time > 2x last deployment baseline
Baseline:
Dynamic baseline from 1 hour before deployment
Action:
Slack notification with deployment details and transaction breakdown
```
**Deployment frequency anomaly alert:**
```
NRQL Alert Condition:
SELECT count(*)
FROM Deployment
WHERE appName = 'YourApp - Production'
Threshold:
Warning: No deployments in 7 days (team might be stuck)
Warning: >20 deployments in 1 day (possible deployment instability)
Action:
Notify engineering leadership
```
### 4. Analyzing Multi-Service Deployments
In microservices architectures, deployments often span multiple services.
**Scenario: E-commerce platform with multiple services**
```
Services:
- Frontend (React SPA)
- API Gateway (Node.js)
- User Service (Python)
- Order Service (Java)
- Payment Service (Go)
```
**Coordinated deployment:**
```
10:00 AM - Payment Service v3.2.0 deployed
Impact: Payment processing time improved 40ms
10:15 AM - Order Service v2.8.1 deployed
Impact: Calls new Payment Service endpoint
10:30 AM - API Gateway v1.5.3 deployed
Impact: Routes updated for new Order Service behavior
10:45 AM - Frontend v4.1.0 deployed
Impact: UI updated to show enhanced payment options
```
**Cross-service correlation:**
```
User reports error at 10:50 AM
Investigation:
✓ Frontend deployment (10:45): No errors, clean deployment
✓ API Gateway deployment (10:30): No errors, clean deployment
✗ Order Service deployment (10:15): 234 errors starting at 10:20
✓ Payment Service deployment (10:00): No errors, clean deployment
Root cause:
Order Service had a bug in how it called Payment Service
Bug was introduced in v2.8.1 deployment at 10:15
Resolution:
Rollback Order Service to v2.8.0
Users can complete orders again
Fix bug and redeploy as v2.8.2
```
**Using Service Maps with deployment markers:**
```
Navigate to: APM → Service Maps
Visual representation:
Frontend → API Gateway → Order Service → Payment Service
Click each service to see recent deployments
Trace request path across all services
Identify which service in the chain is slow/failing
Example trace:
Frontend: 50ms (normal)
API Gateway: 20ms (normal)
Order Service: 3,200ms (SLOW - deployed 10:15)
Payment Service: 45ms (normal)
Conclusion: Order Service deployment at 10:15 is the bottleneck
```
## Common Troubleshooting Scenarios
### Scenario 1: Deployment marker shows no impact but users report issues
**Symptoms:**
- Deployment marker at 3:00 PM
- Web Transaction Impacts shows minimal change (<2%)
- Users reporting errors/slowness starting 3:05 PM
**Possible causes and solutions:**
**A) Low traffic during comparison window**
```
Problem:
"3 hours before" period was overnight (low traffic)
"3 hours after" period is daytime (high traffic)
Metrics incomparable due to different traffic patterns
Solution:
Change comparison window to same time of day
Example: Compare 3:00-6:00 PM today vs yesterday
Or: Compare to same day last week at same time
```
**B) Issue affects non-instrumented code paths**
```
Problem:
Deployment changed background job processing
New Relic only instruments web transactions
Background jobs not visible in web transaction metrics
Solution:
Check custom events or application logs
Instrument background jobs
Look for scheduled task failures
```
**C) Business logic error without technical error**
```
Problem:
Code processes successfully (no exceptions thrown)
But produces wrong business results
Example: Checkout calculates tax incorrectly
Solution:
Review business KPIs (conversion rate, revenue)
Check application-specific metrics
Review logic in deployment code changes
Test scenarios reported by users
```
**D) Gradual resource exhaustion**
```
Problem:
Deployment introduced memory leak
First 3 hours look fine (memory slowly filling)
Hour 4-6: Performance degrades as GC thrashing begins
Solution:
Check JVM/runtime metrics over longer timeframe
Look for gradual memory increase
Monitor garbage collection frequency
Review object lifecycle in new code
```
### Scenario 2: Multiple deployment markers, can't tell which caused issue
**Symptoms:**
- Error spike at 4:30 PM
- Deployments at 4:00, 4:15, 4:25, 4:35 PM
- Unclear which is responsible
**Investigation approach:**
**Step 1: Review timeline precision**
```
Error spike: 4:30:00 PM
Deployments:
4:00 PM - Frontend v2.1 (30 min before spike)
4:15 PM - API v1.8 (15 min before spike)
4:25 PM - Database service config (5 min before spike)
4:35 PM - Cache service v3.2 (5 min after spike)
Initial conclusion:
Cache service deployment AFTER spike, unlikely cause
Database config change VERY close to spike, likely cause
```
**Step 2: Check error details**
```
Click on error spike, group by error message:
"Connection timeout to database" - 1,240 errors
Started: 4:30:15 PM
"Cache miss rate exceeded" - 15 errors
Started: 4:36:00 PM
Conclusion:
Database errors started immediately (4:30:15)
Database config change was at 4:25:00
5 minute gap is reasonable for config propagation
→ Database config change caused the spike
```
**Step 3: Verify by reviewing each deployment**
```
Click deployment marker: Database service config (4:25 PM)
Related errors: 1,240 ← Matches error spike
Error message: "Connection timeout" ← Matches error type
Confirmed: Database config change caused the issue
```
### Scenario 3: Deployment marker timestamp doesn't match actual deployment
**Symptoms:**
- Deployment completed at 2:45 PM
- Marker shows 2:32 PM
- Metrics comparison seems off
**Possible causes:**
**A) Marker created at deployment start, not completion**
```
Problem:
Deployment process:
2:32 PM - Deployment starts, marker created
2:32-2:45 PM - Code rolls out across servers
2:45 PM - Last server deployed, process complete
Result:
Marker at 2:32 shows mixture of old/new code
"3 hours after" includes partial deployment period
Metrics don't cleanly separate old vs new
Solution:
Fix deployment automation to create marker after completion
Or manually adjust analysis window to start at 2:45
```
**B) Time zone confusion**
```
Problem:
Deployment logs show 2:45 PM EST
New Relic marker shows 7:45 PM UTC
Developer in PST sees 11:45 AM PST
Everyone thinks deployment was at different time
Solution:
Always use UTC for correlation
Convert all times to UTC before analyzing
Set New Relic UI to display UTC
```
**C) Staged/rolling deployment**
```
Problem:
Canary deployment process:
2:32 PM - 10% of servers deployed
2:40 PM - 50% of servers deployed
2:45 PM - 100% of servers deployed
Marker created at first deployment (2:32)
But full impact not visible until 2:45
Solution:
Create multiple markers for staged deployments
Label each stage: v2.0-canary-10pct, v2.0-canary-50pct, v2.0-full
Compare metrics at each stage
```
### Scenario 4: Clean deployment marker but errors increase days later
**Symptoms:**
- Deployment at Monday 2:00 PM
- Marker shows clean deployment (no issues)
- Wednesday 3:00 PM - error rate spikes
- Errors trace back to code from Monday deployment
**Investigation:**
**Step 1: Verify the connection**
```
Check error stack traces:
Are errors in code paths changed Monday?
Or in unrelated code?
Example:
Error: "Index out of bounds in report generation"
Stack trace shows: generateMonthlyReport() method
Monday deployment added: Monthly report feature
Connection confirmed: Monday code is responsible
```
**Step 2: Understand the delay**
```
Why did it take 2 days to surface?
Common reasons:
A) Time-based trigger:
Monthly report runs on first Wednesday of month
Code never executed until Wednesday
B) Data volume threshold:
Code works fine with small datasets
Database grew over 2 days
Wednesday: Dataset crossed threshold, code fails
C) Rare edge case:
99% of inputs work fine
Wednesday: User hit the 1% edge case
D) Resource accumulation:
Small memory leak in Monday code
Takes 48 hours to accumulate enough to cause issues
Wednesday: Memory exhausted, errors start
```
**Step 3: Improve testing**
```
Why didn't testing catch this?
Gap analysis:
- No scheduled job testing (time-based trigger missed)
- Test data too small (volume threshold not tested)
- Test cases don't cover edge cases (rare input missed)
- No long-running test environments (leak not detected)
Improvements:
- Add scheduled job tests to CI/CD
- Use production-sized datasets in staging
- Increase edge case coverage
- Run performance tests for 24+ hours
```
### Scenario 5: Deployment shows improvement but users complain
**Symptoms:**
- Deployment marker shows response time improved 20%
- Error rate decreased 50%
- Technical metrics all positive
- Support tickets increased 300%
**Investigation:**
**Check what metrics don't measure:**
**A) Business logic errors**
```
Technical success, business failure:
Old code: Tax calculated incorrectly (overcharged)
New code: Tax calculated correctly (charges appropriate amount)
User perception: "Prices went up 8%"
Technical metrics: All green (code working correctly)
Business reality: Users angry about correct pricing
```
**B) User experience changes**
```
Technical success, UX regression:
Old code: 1-click checkout (response time 800ms)
New code: 2-step checkout with validation (400ms per step)
Technical metrics: 50% faster response time!
User perception: "Checkout takes longer and is more annoying"
Reality: Extra step adds friction despite faster code
```
**C) Removed functionality**
```
Technical success, feature loss:
Old code: Advanced search with 20 filters (slow)
New code: Simple search with 5 filters (fast)
Technical metrics: 80% faster search!
User perception: "I can't find products anymore"
Reality: Performance improvement by removing features users needed
```
**Solution approach:**
```
1. Monitor business KPIs alongside technical metrics:
- Conversion rates
- Revenue per user
- Session duration
- Feature usage rates
2. Collect qualitative feedback:
- Support ticket themes
- User surveys
- Session replay analysis
- A/B test results
3. Balance technical and business success:
- Fast but broken is failure
- Slow but functional might be acceptable
- Best: Fast AND meets user needs
```
## Best Practices for Working with Deployment Markers
### 1. Establish Deployment Hygiene
**Create meaningful version identifiers:**
```
Good version formats:
✓ Semantic versioning: v2.5.3 (major.minor.patch)
✓ Date-based: 2026-01-17.1 (date.sequence)
✓ Build number: build-18885
✓ Git commit: a3f9c2d
✓ Combined: 2026-01-17-build-18885-a3f9c2d
Bad version formats:
✗ "Latest"
✗ "Production"
✗ "Jan deployment"
✗ Random strings: "xK8mP2q"
Why it matters:
- Sortable versions allow chronological ordering
- Descriptive versions map to source control
- Unique versions prevent confusion
```
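A combined identifier like the last good example can be generated automatically in CI. A rough sketch, assuming a BUILD_NUMBER environment variable and a git checkout; adapt it to whatever your pipeline actually exposes:
```python
# Rough sketch: build a sortable, source-mapped version string in CI.
# Assumes a BUILD_NUMBER environment variable and a git checkout are available.
import os
import subprocess
from datetime import datetime, timezone

def build_version():
    date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    build = os.environ.get("BUILD_NUMBER", "0")
    sha = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    return f"{date}-build-{build}-{sha}"

print(build_version())  # e.g. 2026-01-17-build-18885-a3f9c2d
```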
**Populate changelog fields:**
```
Good changelog:
"JIRA-5432: Optimize database queries for user dashboard (-40ms).
JIRA-5441: Fix null pointer in payment processing.
JIRA-5450: Add support for EU tax calculations.
Breaking change: Removed deprecated /api/v1/legacy endpoint."
Bad changelog:
"Bug fixes and improvements"
"Weekly release"
""
Why it matters:
- Future you (6 months later) needs context
- Incident responders need to know what changed
- Compliance/audit requires change documentation
```
**Deploy one environment at a time:**
```
Good deployment flow:
Monday 10 AM: Deploy to dev, monitor 24h
Tuesday 10 AM: Deploy to staging, monitor 24h
Wednesday 10 AM: Deploy to production, monitor ongoing
Bad deployment flow:
Monday 10 AM: Deploy to dev, staging, prod simultaneously
Why it matters:
- Issues caught in dev don't reach production
- Staging validates production behavior
- Rollback doesn't affect multiple environments
- Clear deployment markers per environment
```
### 2. Build Post-Deployment Rituals
**The 5-Minute Check:**
```
Every deployment, without exception:
☐ Open deployment marker
☐ Check "Related errors" count
☐ Check "Related alerts" status
☐ Scan "Web Transaction Impacts" for regressions >10%
☐ Verify throughput maintained or increased
☐ Quick check of error rate on Errors page
☐ Glance at Apdex score
If all green: Continue monitoring
If any red: Investigate immediately
```
**The 30-Minute Deep Dive:**
```
For significant deployments:
☐ Review all changed transactions in detail
☐ Compare database query performance
☐ Check external service response times
☐ Review error messages and stack traces
☐ Check business metrics (if available)
☐ Review sample of transaction traces
☐ Verify no gradual degradation trends
Document findings for future reference
```
**The 24-Hour Retrospective:**
```
Day after each deployment:
☐ Review full day of metrics post-deployment
☐ Compare to same day previous week
☐ Check for delayed effects (memory leaks, etc.)
☐ Review support tickets related to deployment
☐ Assess whether deployment met goals
☐ Document lessons learned
Feed learnings into next deployment
```
### 3. Build Deployment Intelligence
**Track deployment success metrics:**
```
Maintain a deployment scorecard:
Deployment Frequency:
- Deployments per week
- Trend over time (increasing = good)
Deployment Size:
- Files changed per deployment
- Trend over time (decreasing = good)
Deployment Success Rate:
- % with no issues: 85%
- % with minor issues: 10%
- % requiring rollback: 5%
- Target: >80% clean deployments
Mean Time to Detect:
- How long until issues discovered
- Target: <5 minutes
Mean Time to Resolve:
- How long until issues fixed
- Target: <30 minutes
Change Failure Rate:
- % of deployments causing incidents
- DORA metric target: <15%
```
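Most of this scorecard can be computed automatically if you log each deployment's outcome somewhere queryable. The sketch below works from a hypothetical in-memory deployment log; the field names are made up for illustration, so map them to whatever your own records capture.
```python
# Sketch: compute scorecard metrics from a hypothetical deployment log.
# Field names (outcome, detect_minutes, resolve_minutes) are made up for illustration.
deployments = [
    {"version": "v2.1.1", "outcome": "clean", "detect_minutes": None, "resolve_minutes": None},
    {"version": "v2.1.2", "outcome": "clean", "detect_minutes": None, "resolve_minutes": None},
    {"version": "v2.1.3", "outcome": "incident", "detect_minutes": 4, "resolve_minutes": 35},
    {"version": "v2.1.4", "outcome": "minor", "detect_minutes": 12, "resolve_minutes": 60},
    {"version": "v2.1.5", "outcome": "clean", "detect_minutes": None, "resolve_minutes": None},
]

total = len(deployments)
clean = sum(1 for d in deployments if d["outcome"] == "clean")
failed = [d for d in deployments if d["outcome"] == "incident"]
detected = [d for d in deployments if d["detect_minutes"] is not None]

print(f"Clean deployments:    {clean / total:.0%}  (target >80%)")
print(f"Change failure rate:  {len(failed) / total:.0%}  (DORA target <15%)")
if detected:
    mttd = sum(d["detect_minutes"] for d in detected) / len(detected)
    mttr = sum(d["resolve_minutes"] for d in detected) / len(detected)
    print(f"Mean time to detect:  {mttd:.0f} min  (target <5)")
    print(f"Mean time to resolve: {mttr:.0f} min  (target <30)")
```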
**Learn from patterns:**
```
Review last 20 deployments:
Identify correlations:
- Day of week: Friday deployments have 3x failure rate
- Time of day: Deployments during peak traffic risky
- Team member: Junior devs need more code review
- Code area: Payment module needs more testing
- Change type: Database migrations frequently problematic
Adjust process:
- No Friday deployments (wait until Monday)
- Deploy during low-traffic hours
- Pair junior developers with seniors
- Add payment-specific test suite
- Require DB migration dry runs in staging
```
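The day-of-week correlation and similar patterns are easy to surface once deployments are logged with a date and an outcome. A small sketch, again using made-up record fields:
```python
# Sketch: failure rate by day of week from a hypothetical deployment log.
from collections import defaultdict
from datetime import date

log = [
    {"day": date(2026, 1, 9),  "failed": True},   # Friday
    {"day": date(2026, 1, 12), "failed": False},  # Monday
    {"day": date(2026, 1, 14), "failed": False},  # Wednesday
    {"day": date(2026, 1, 16), "failed": True},   # Friday
    {"day": date(2026, 1, 19), "failed": False},  # Monday
]

by_weekday = defaultdict(lambda: {"total": 0, "failed": 0})
for entry in log:
    stats = by_weekday[entry["day"].strftime("%A")]
    stats["total"] += 1
    stats["failed"] += entry["failed"]

for weekday, stats in by_weekday.items():
    print(f"{weekday}: {stats['failed']}/{stats['total']} deployments failed")
```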
### 4. Collaborate Using Deployment Data
**Share context with your team:**
```
In deployment channel (Slack/Teams):
"Deploying v2.5.1 to production at 2:00 PM
Changes: JIRA-5432 (perf optimization), JIRA-5441 (bug fix)
Expected impact: 20% faster /dashboard response time
Monitoring: Will check at 2:05, 2:30, and 3:00 PM
Rollback plan: v2.5.0 if error rate >2%"
After deployment:
"v2.5.1 deployed successfully at 2:03 PM ✓
New Relic marker: [link]
Metrics: Response time -22% (better than expected)
Error rate: 0.3% (baseline)
No issues detected"
```
**Escalate with data:**
```
When issues occur, include deployment context:
"Incident: Error rate spike to 5.2%
Started: 4:32 PM
Deployment: v2.5.3 at 4:30 PM [New Relic marker link]
Affected: /api/checkout endpoint specifically
Impact: ~500 users experiencing checkout failures
Root cause: Database query timeout (new query added)
Action: Rollback to v2.5.2 in progress
ETA: 4:45 PM
New Relic comparison:
v2.5.2: 420ms avg, 0.4% errors
v2.5.3: 850ms avg, 5.2% errors"
```
**Build institutional knowledge:**
```
Document deployment failures:
Incident post-mortem template:
1. What happened?
"v2.5.3 deployment caused 5% error rate in checkout"
2. Deployment details
Marker: [link]
Version: v2.5.3
Time: Jan 20, 4:30 PM
Changes: [list]
3. Root cause
"Added unindexed database query in checkout flow"
4. Detection
"New Relic alert fired at 4:32 PM (2 min after deployment)"
"On-call engineer reviewed deployment marker"
5. Resolution
"Rolled back to v2.5.2 at 4:45 PM"
"Error rate returned to 0.4% by 4:47 PM"
6. Prevention
"Add query analysis to code review checklist"
"Require database execution plan review for new queries"
"Add database performance tests to CI/CD"
```
## Understanding What Deployment Markers Don't Show
### Limitations to Be Aware Of
**1. Deployment markers are correlation, not always causation:**
```
Example:
2:00 PM - Your deployment
2:05 PM - Error spike
Could be:
✓ Your deployment caused errors (causation)
✗ Upstream service failed coincidentally (correlation)
✗ DDoS attack started (unrelated)
✗ Infrastructure issue (network, database)
Always verify the causal link by:
- Checking error details
- Reviewing code changes
- Confirming errors are in changed code paths
```
**2. Some changes don't create deployment markers:**
```
Untracked changes:
- Feature flag toggles
- Configuration file updates
- Database migrations (separate from code deploy)
- Infrastructure changes (server upgrades)
- CDN cache purges
- Third-party service updates
Solution:
Record these as custom events in New Relic
Create your own markers for significant changes
```
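Feature flag toggles, config updates, and similar untracked changes can be recorded with New Relic's Event API so they appear in NRQL queries alongside your deployments. A minimal sketch, assuming a US-region account and an ingest (license/insert) key; the ConfigChange event type and its attributes are invented for illustration:
```python
# Minimal sketch: record a config change as a custom event via the Event API.
# Assumes a US-region account and an ingest (license/insert) key; the
# "ConfigChange" event type and its attributes are made up for illustration.
import os
import requests

ACCOUNT_ID = 1234567  # hypothetical account ID
EVENT_API_URL = f"https://insights-collector.newrelic.com/v1/accounts/{ACCOUNT_ID}/events"

event = {
    "eventType": "ConfigChange",
    "appName": "YourApp - Production",
    "change": "Feature flag enable_new_checkout set to true",
    "user": "deploy-bot",
}

resp = requests.post(
    EVENT_API_URL,
    headers={"Api-Key": os.environ["NEW_RELIC_INSERT_KEY"]},
    json=[event],  # the Event API accepts a JSON array of events
    timeout=10,
)
resp.raise_for_status()

# Afterwards the change is queryable, e.g.:
#   SELECT * FROM ConfigChange SINCE 1 day ago
```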
**3. Gradual rollouts appear as single markers:**
```
Canary deployment:
1:00 PM - Deploy to 5% of servers (marker created)
1:30 PM - Deploy to 25% of servers (no new marker)
2:00 PM - Deploy to 100% of servers (no new marker)
Metric changes happen gradually but marker shows only first deployment
User impact spreads over an hour, not instantly at 1:00 PM
Solution:
Create multiple markers for each rollout stage
Label clearly: v2.0-5pct, v2.0-25pct, v2.0-100pct
```
**4. Deployment success metrics can be misleading:**
```
False positives:
Technical metrics green, but business impact negative
Example: Fast but broken feature, correct pricing that users hate
False negatives:
Technical metrics show regression, but acceptable trade-off
Example: Slower response time but critical security fix applied
Solution:
Balance technical and business metrics
Context matters: not all "red" metrics are bad
```
## Official Documentation and Resources
### New Relic Documentation
- **[Track changes using NerdGraph (GraphQL)](https://docs.newrelic.com/docs/change-tracking/change-tracking-graphql/)** - Official guide for creating deployment markers via GraphQL API
- **[Introduction to NerdGraph](https://docs.newrelic.com/docs/apis/nerdgraph/get-started/introduction-new-relic-nerdgraph/)** - Getting started with New Relic's GraphQL API
- **[Capture and analyze changes in your systems](https://docs.newrelic.com/docs/change-tracking/change-tracking-introduction/)** - Overview of New Relic's change tracking capabilities
- **[Record and view deployments](https://docs.newrelic.com/docs/apm/apm-ui-pages/events/record-deployments/)** - Legacy REST API documentation (migration to GraphQL recommended)
- **[How to view and analyze your changes](https://docs.newrelic.com/docs/change-tracking/change-tracking-view-analyze/)** - Guide to using the change tracking UI
### Community Resources and Tutorials
- **[Deployment Tracking 101: CI/CD Best Practices](https://newrelic.com/blog/news/change-tracking)** - New Relic blog post on change tracking best practices
- **[Change Tracking for Performance Velocity](https://newrelic.com/blog/how-to-relic/change-tracking-for-performance-velocity)** - Tutorial on using deployment markers for performance analysis
- **[Getting Started With NerdGraph—The New Relic GraphQL API Explorer](https://newrelic.com/blog/how-to-relic/graphql-api)** - Interactive guide to using the GraphQL API explorer
### API Tools
- **[New Relic NerdGraph GraphQL API Collection (Postman)](https://www.postman.com/new-relic/new-relic-graphql-api-collection/documentation/btuxnnc/new-relic-nerdgraph-graphql-api-collection)** - Pre-built Postman collection for testing deployment marker API calls
## Conclusion
Deployment markers transform New Relic from a reactive monitoring tool ("something broke, figure out why") into a proactive validation system ("I deployed this change, did it improve or degrade performance?").
**The key insight**: Every deployment is an experiment. Deployment markers provide the measurement framework to evaluate that experiment objectively.
**Use them to:**
- ✅ Validate that deployments improved what you intended
- ✅ Catch regressions immediately after deployment
- ✅ Build confidence in your deployment process
- ✅ Make data-driven rollback decisions
- ✅ Track team velocity and deployment success rates
- ✅ Correlate code changes with business metrics
**Start simple**: After every deployment, spend 5 minutes reviewing the deployment marker. Check for related errors, related alerts, and transaction impacts. That one habit will catch most issues early and build your deployment confidence over time.