# New Relic Deployment Markers: User's Guide
| ## What Are Deployment Markers? | |
| Deployment markers are timestamped events in New Relic that record when code changes are released to your applications. They appear as vertical lines or annotations overlaid on your APM performance charts, creating a clear visual boundary between "before the change" and "after the change." | |
| **Think of them as bookmarks in your application's performance timeline** - they help you answer the critical question: "What changed when performance shifted?" | |
| ## The Fundamental Problem Deployment Markers Solve | |
| Your application generates thousands of metrics every minute: response times, error rates, throughput, CPU usage, memory consumption. These metrics constantly fluctuate due to traffic patterns, user behavior, and external dependencies. | |
| **The challenge**: When metrics suddenly change, how do you know why? | |
| **Common scenarios:** | |
| ``` | |
| Scenario 1: Error rate doubles at 3:47 PM | |
| Possible causes: | |
| - Code deployment introduced a bug | |
| - Infrastructure failure (database down) | |
| - Traffic spike (DDoS, viral content) | |
| - External API outage | |
| - Configuration change | |
| - Database query plan changed | |
| Without markers: Check all possibilities manually (30+ minutes) | |
| With markers: See deployment at 3:45 PM immediately (30 seconds) | |
| ``` | |
| ``` | |
| Scenario 2: Response time gradually increases over 2 weeks | |
| Possible causes: | |
| - Code gradually getting slower (technical debt) | |
| - Database growing larger (query performance degrading) | |
| - Memory leak accumulating | |
| - Cache hit rate declining | |
| Without markers: Unclear when degradation started | |
| With markers: Each deployment shows incremental impact | |
| ``` | |
| ## Understanding the Deployment Detail View | |
| When you click on a deployment marker in New Relic, you see a detailed breakdown of that deployment's impact: | |
| ### Header Information | |
| **Entity**: Which application/service was deployed | |
| - Example: "API Gateway - Production", "Frontend - Staging" | |
| - Critical in microservices: ensures you're looking at the right service | |
| **Timestamp**: Exact moment the deployment completed | |
| - Example: "Jan 17, 2026, 04:59:40.067 PM" | |
| - Precision to the millisecond for accurate correlation | |
| **Version**: Unique identifier for this release | |
| - Example: "v2.5.1", "2026-01-17-build-8885", "commit-a3f9c2d" | |
| - Should map to your source control (git tag, commit hash) | |
| - Allows you to know exactly which code is running | |
| **Deployment ID**: New Relic's internal identifier | |
| - Used for API queries and programmatic access | |
| ### Key Impacts Section | |
| Shows high-level changes in transaction volume. | |
| **Example: "12.3K occurrences, +15.2%"** | |
| This means 15.2% more transactions occurred in the comparison period after deployment vs. before. | |
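The percentage is relative to the transaction count in the matching window before the deployment. A tiny worked example (the "before" count of ~10.7K is back-computed here purely for illustration):

```python
# Worked example of the figure above; the "before" count is illustrative.
before, after = 10_677, 12_300
print(f"{(after - before) / before:+.1%}")   # +15.2%
```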
| **Interpreting the change:** | |
| **Positive indicators:** | |
| - ✅ Increased transactions + stable error rate = healthy growth | |
| - ✅ Decreased errors + stable traffic = bug fixes working | |
| - ✅ Decreased slow transactions = performance improvement | |
| **Negative indicators:** | |
| - ❌ Decreased transactions + normal error rate = users can't reach endpoint | |
| - ❌ Increased errors + stable traffic = new bugs introduced | |
| - ❌ Increased slow transactions = performance regression | |
| **Neutral indicators:** | |
| - 📊 Small changes (<5%) = normal variance, likely unrelated to deployment | |
| - 📊 Expected changes = feature removal, traffic shifting to new endpoints | |
| ### Related Errors Section | |
| Shows errors that occurred during or immediately after the deployment. | |
| **"235 related errors found"** means: | |
| - Errors occurred in the time window around this deployment | |
| - New Relic detected correlation between deployment timing and error timing | |
| - Click through to see error details, stack traces, and affected users | |
| **"No related errors found"** means: | |
| - No errors during deployment execution | |
| - No error spikes immediately after deployment | |
| - ⚠️ Note: Doesn't guarantee zero errors - check the Errors page separately for ongoing issues | |
| **Using this data:** | |
| ``` | |
| Clean deployment: | |
| ✓ No related errors | |
| ✓ Action: Monitor for delayed effects (next 24 hours) | |
| Problematic deployment: | |
| ✗ 1,247 related errors | |
| ✗ Action: Click through to identify error type | |
| → If critical: rollback immediately | |
| → If minor: plan hotfix | |
| ``` | |
| ### Related Alerts Section | |
| Shows if any alert conditions triggered around the deployment. | |
| **"High Error Rate Alert triggered 3 minutes after deployment"** tells you: | |
| - Deployment breached a predefined threshold | |
| - Impact was severe enough to warrant notification | |
| - Automatic incident created for tracking | |
| **"No related alerts found"** means: | |
| - Metrics stayed within acceptable bounds | |
| - No automatic incidents created | |
| - Changes were either positive or within tolerance | |
| **Interpreting alert correlation:** | |
| ``` | |
| Alert triggered immediately (0-5 min): | |
| → Strong indication deployment caused the issue | |
| → Errors/performance degradation in changed code paths | |
| Alert triggered later (30+ min): | |
| → Possible correlation but verify | |
| → Could be delayed effect (memory leak, cache warming) | |
| → Could be coincidental (unrelated traffic spike) | |
| No alert but metrics changed: | |
| → Change exists but below alert threshold | |
| → Review whether thresholds need adjustment | |
| → Good candidate for optimization work | |
| ``` | |
| ### Web Transaction Impacts | |
This is the most powerful section: it shows exactly which endpoints were affected by the deployment.
| **Example display:** | |
| ``` | |
| Transaction: WebTransaction/Action/api/checkout | |
| 3 hours before: 285.3 ms average | |
| 3 hours after: 312.7 ms average | |
| Impact: +27.4 ms (+9.6% slower) | |
| Transaction: WebTransaction/Action/api/search | |
| 3 hours before: 156.8 ms average | |
| 3 hours after: 142.1 ms average | |
| Impact: -14.7 ms (-9.4% faster) | |
| ``` | |
| **What this tells you:** | |
| 1. **Specific impact**: Not all endpoints affected equally | |
| - Checkout got slower (regression) | |
| - Search got faster (improvement) | |
| 2. **Magnitude**: Quantified change in milliseconds and percentage | |
| - 27ms may seem small but 10% is significant | |
| - Helps prioritize which issues to fix | |
| 3. **Transaction isolation**: Identifies exactly where to investigate | |
| - Look at checkout code changes | |
| - Verify search optimization worked | |
| **Analysis patterns:** | |
| **Single transaction slower, others unchanged:** | |
| ``` | |
| Likely cause: Change specific to that endpoint | |
| Action: Review code changes for that transaction | |
| Look for: New database queries, external API calls, inefficient algorithms | |
| ``` | |
| **All transactions slower by similar amount:** | |
| ``` | |
| Likely cause: Global overhead added (middleware, logging, monitoring) | |
| Action: Review infrastructure changes, framework updates | |
| Look for: New request interceptors, increased logging verbosity | |
| ``` | |
| **Some transactions slower, some faster:** | |
| ``` | |
| Likely cause: Refactoring shifted performance characteristics | |
| Action: Verify slower transactions are acceptable trade-off | |
| Look for: Shared resources (database, cache, external services) | |
| ``` | |
| **Specific transaction disappeared from list:** | |
| ``` | |
| Likely cause: Endpoint removed, renamed, or broken | |
| Action: Check if removal was intentional | |
| Look for: Error logs showing 404s or routing failures | |
| ``` | |
| ### Deployment Attributes | |
| Additional metadata about the deployment: | |
| **Version**: Your application version identifier | |
| - Should be meaningful and sortable | |
| - Best practices: semantic versioning (v1.2.3), date-based (2026-01-17.1) | |
| **Changelog**: Optional field for release notes | |
| - Best practice: Include ticket numbers, feature summaries | |
| - Example: "JIRA-5432: Optimize database queries for user dashboard. JIRA-5441: Fix null pointer in payment processing." | |
| - Makes future troubleshooting much easier | |
| **User**: Who triggered the deployment (if recorded) | |
| - Helpful for accountability and context | |
| - Example: "deploy-bot" vs "john.doe@company.com" | |
| **Description**: Additional context about the deployment | |
| - Use for: deployment type (hotfix, regular release, rollback) | |
| - Environment details (canary, blue-green, rolling) | |
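These attributes are set when the marker is created. As a concrete illustration, here is a minimal Python sketch that records a marker through the NerdGraph change tracking mutation referenced in the documentation links at the end of this guide. The endpoint, environment variable names, and entity GUID are assumptions to adapt, and exact field and scalar names are easiest to confirm in the NerdGraph API explorer.

```python
# Minimal sketch (not a drop-in tool): record a deployment marker via
# NerdGraph. Assumes a User API key in NEW_RELIC_API_KEY and the US
# endpoint; EU accounts use https://api.eu.newrelic.com/graphql.
import os
import requests

NERDGRAPH_URL = "https://api.newrelic.com/graphql"

MUTATION = """
mutation($guid: EntityGuid!, $version: String!, $changelog: String,
         $user: String, $description: String) {
  changeTrackingCreateDeployment(deployment: {
    entityGuid: $guid, version: $version, changelog: $changelog,
    user: $user, description: $description
  }) {
    deploymentId
    timestamp
  }
}
"""

def create_deployment_marker(entity_guid, version, changelog="", user="", description=""):
    """Create one change-tracking deployment event and return the API response."""
    resp = requests.post(
        NERDGRAPH_URL,
        headers={"API-Key": os.environ["NEW_RELIC_API_KEY"]},
        json={"query": MUTATION,
              "variables": {"guid": entity_guid, "version": version,
                            "changelog": changelog, "user": user,
                            "description": description}},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

print(create_deployment_marker(
    entity_guid="YOUR_ENTITY_GUID",   # placeholder
    version="v2.5.1",
    changelog="JIRA-5432: Optimize dashboard queries. JIRA-5441: Fix payment NPE.",
    user="deploy-bot",
    description="regular release",
))
```

Calling something like this as the last step of the deploy job (after all servers have rolled) also avoids the "marker at start vs. completion" timestamp problem discussed later in this guide.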
| ## How to Use Deployment Markers Effectively | |
| ### 1. Post-Deployment Validation (The Critical First Hour) | |
| **Minute 0-5: Immediate Health Check** | |
| As soon as the deployment marker appears: | |
| ``` | |
| Quick scan checklist: | |
| ☐ Related errors: Any new exceptions? | |
| ☐ Related alerts: Did thresholds breach? | |
| ☐ Key transactions: Response times acceptable? | |
| ☐ Throughput: Traffic flowing normally? | |
| ☐ Error rate: Within baseline? | |
| ``` | |
| **What you're looking for:** | |
| **Green light (proceed with confidence):** | |
| - No related errors | |
| - No related alerts | |
| - Transaction times improved or stable (<5% change) | |
| - Throughput stable or increased | |
| - Error rate stable or decreased | |
| **Yellow light (monitor closely):** | |
| - Small number of errors (1-10) in non-critical paths | |
| - Transaction times slightly increased (5-15%) | |
| - Single transaction affected, others fine | |
| - No alerts but metrics approaching thresholds | |
| **Red light (prepare to rollback):** | |
| - Hundreds of related errors | |
| - Critical alerts triggered | |
| - Transaction times doubled or more | |
| - Throughput dropped significantly | |
| - Error rate spiked >3x baseline | |
| **Minute 5-30: Deeper Analysis** | |
| After initial health check passes, dig deeper: | |
| ``` | |
| 1. Review "Top 10 web transactions" | |
| - Click each transaction showing significant change | |
| - Examine transaction traces from after deployment | |
| - Identify specific slow operations (DB queries, external calls) | |
| 2. Check transaction distribution | |
| - APM → Transactions | |
| - Sort by throughput | |
| - Verify traffic distribution matches expectations | |
| 3. Review database performance | |
| - APM → Databases | |
| - Check for new slow queries | |
| - Verify query counts are reasonable | |
| 4. Check external services | |
| - APM → External services | |
| - Verify third-party API response times | |
| - Check for new external calls | |
| 5. Review error details | |
| - APM → Errors | |
| - Group by error class | |
| - Read error messages and stack traces | |
| - Verify errors make sense given code changes | |
| ``` | |
| **Minute 30-60: User Impact Assessment** | |
| After technical validation, assess user experience: | |
| ``` | |
| 1. Check Apdex score | |
| - APM → Summary | |
| - Apdex shows user satisfaction (0.0 = awful, 1.0 = perfect) | |
| - Target: >0.9 for most applications | |
| - Compare pre/post deployment | |
| 2. Review browser monitoring (if enabled) | |
| - Browser → Summary | |
| - Check page load times from user perspective | |
| - Verify frontend performance acceptable | |
| 3. Check synthetic monitors (if configured) | |
| - Synthetics → Monitors | |
| - Verify scripted checks passing | |
| - Confirm critical user flows working | |
| 4. Monitor real user metrics | |
| - Browser → Session traces | |
| - Sample actual user sessions | |
| - Identify any broken workflows | |
| ``` | |
| ### 2. Investigating Performance Issues | |
| **Scenario**: Users report slowness, or you notice degraded metrics | |
| **Step 1: Identify the timeframe** | |
| ``` | |
| When did the issue start? | |
| - Check alerts for when threshold breached | |
| - Ask users when they first noticed | |
| - Review metric charts for inflection point | |
| ``` | |
| **Step 2: Find relevant deployment markers** | |
| ``` | |
| Navigate to: APM → Your Application | |
| Set time range: Start from when issue began, look back 24 hours | |
| Scan timeline: Identify all deployment markers in that window | |
| Example: | |
| Issue reported: 5:15 PM | |
| Time range: 4:00 PM - 5:30 PM | |
| Markers found: | |
| - 4:45 PM - Backend API v2.3.1 | |
| - 5:02 PM - Frontend v1.8.2 | |
| ``` | |
| **Step 3: Evaluate each deployment** | |
| ``` | |
| For each marker, click and review: | |
| Deployment at 4:45 PM (Backend API): | |
| ✓ No related errors | |
| ✓ Transaction times stable | |
| ✗ Throughput decreased 15% | |
| → Not the smoking gun, but worth noting | |
| Deployment at 5:02 PM (Frontend): | |
| ✗ 847 related errors | |
| ✗ Page load time increased 200% | |
| ✗ Related alert triggered at 5:04 PM | |
| → FOUND IT: This deployment caused the issue | |
| ``` | |
| **Step 4: Drill into the problematic deployment** | |
| ``` | |
| Click the deployment marker at 5:02 PM | |
| Review Web Transaction Impacts: | |
| /checkout: 1.2s → 3.8s (+217% slower) ← Problem here | |
| /search: 0.3s → 0.3s (no change) | |
| /home: 0.5s → 0.5s (no change) | |
| Conclusion: Checkout endpoint specifically affected | |
| ``` | |
| **Step 5: Identify root cause** | |
| ``` | |
| Click on the slow transaction: /checkout | |
| View transaction traces: | |
| - Find slowest trace from after 5:02 PM | |
| - Examine trace details | |
| - Identify which segment is slow | |
| Example trace breakdown: | |
| Middleware: 10ms | |
| Controller: 15ms | |
| Database query: 3,200ms ← This is the problem | |
| Rendering: 100ms | |
| Root cause: New database query taking 3+ seconds | |
| ``` | |
| **Step 6: Correlate with code changes** | |
| ``` | |
| Check deployment version: v1.8.2 | |
| Look up in version control: git show v1.8.2 | |
| Review changes to /checkout endpoint | |
| Find: Added new query to fetch user's full order history | |
| Realize: Query has no index, scanning millions of rows | |
| ``` | |
| **Step 7: Plan remediation** | |
| ``` | |
| Options: | |
| 1. Immediate rollback to v1.8.1 | |
| - Restores performance immediately | |
| - Loses new features in v1.8.2 | |
| 2. Emergency hotfix | |
| - Add database index | |
| - Deploy as v1.8.3 | |
| - Takes 20-30 minutes | |
| 3. Optimize query | |
| - Rewrite to only fetch recent orders | |
| - Add caching | |
| - Deploy as v1.8.3 | |
| - Takes 2-3 hours | |
| Decision: Rollback now, prepare proper fix for tomorrow | |
| ``` | |
| ### 3. Analyzing Trends Over Time | |
| **Scenario**: Understanding long-term performance trajectory | |
| **View deployment history:** | |
| ``` | |
| Navigate to: APM → Deployments (in left sidebar) | |
| You'll see a chronological list: | |
| Jan 20, 3:00 PM - v2.1.5 | |
| Jan 18, 2:15 PM - v2.1.4 | |
| Jan 15, 4:30 PM - v2.1.3 | |
| Jan 12, 1:45 PM - v2.1.2 | |
| Jan 10, 10:00 AM - v2.1.1 | |
| ... | |
| ``` | |
| **Track baseline shifts:** | |
| Create a spreadsheet or dashboard tracking: | |
| | Deployment | Date | Avg Response Time | Error Rate | Throughput | Apdex | | |
| |------------|------|-------------------|------------|------------|-------| | |
| | v2.1.1 | Jan 10 | 420ms | 0.3% | 2,100 rpm | 0.94 | | |
| | v2.1.2 | Jan 12 | 435ms | 0.3% | 2,050 rpm | 0.93 | | |
| | v2.1.3 | Jan 15 | 480ms | 0.5% | 2,000 rpm | 0.89 | | |
| | v2.1.4 | Jan 18 | 520ms | 0.8% | 1,950 rpm | 0.85 | | |
| | v2.1.5 | Jan 20 | 440ms | 0.4% | 2,100 rpm | 0.92 | | |
| **Identify patterns:** | |
| ``` | |
| v2.1.1 → v2.1.2: Slight degradation (acceptable) | |
| v2.1.2 → v2.1.3: Noticeable degradation (10% slower, higher errors) | |
| v2.1.3 → v2.1.4: Continued degradation (trend established) | |
| v2.1.4 → v2.1.5: Significant improvement (optimization work paid off) | |
| Conclusion: | |
| - v2.1.3 introduced performance regression | |
| - v2.1.4 made it worse | |
- v2.1.5 fixed both and brought metrics close to (though not quite back to) the v2.1.1 baseline
| ``` | |
| **Investigate the regression:** | |
| ``` | |
| Click on v2.1.3 deployment marker | |
| Review Web Transaction Impacts | |
| Identify which transactions got slower | |
| Compare with v2.1.2: | |
| - What features were added? | |
| - What refactoring occurred? | |
| - What dependencies were updated? | |
| Common culprits: | |
| - ORM changes (N+1 queries introduced) | |
| - Dependency updates (framework overhead increased) | |
| - New features (expensive operations in critical path) | |
| - Logging changes (excessive debug logging in production) | |
| ``` | |
| ### 4. Comparing Across Environments | |
| Use deployment markers to validate your promotion pipeline: | |
| **The ideal pattern:** | |
| ``` | |
| Development Environment: | |
| Deploy: Monday 10:00 AM - v2.2.0-dev | |
| Monitor: 24 hours | |
| Result: ✓ No issues, metrics stable | |
| Staging Environment: | |
| Deploy: Tuesday 10:00 AM - v2.2.0-staging | |
| Monitor: 24 hours | |
| Result: ✓ No issues, metrics stable | |
| Production Environment: | |
| Deploy: Wednesday 10:00 AM - v2.2.0 | |
| Expected: Similar metrics to staging | |
| Result: ✓ Metrics match staging prediction | |
| ``` | |
| **When production differs from staging:** | |
| ``` | |
| Staging: Response time 300ms, 0.2% errors | |
| Production: Response time 800ms, 2.5% errors | |
| Investigation checklist: | |
| ☐ Traffic volume: Production has 10x traffic? | |
| ☐ Data volume: Production database 100x larger? | |
| ☐ External dependencies: Different APIs in prod vs staging? | |
| ☐ Infrastructure: Production servers under-provisioned? | |
| ☐ Configuration: Environment-specific settings causing issues? | |
| ☐ Geographic distribution: Production users globally distributed? | |
| ``` | |
| **Building confidence through consistency:** | |
| ``` | |
| Track prediction accuracy: | |
| Deployment #1: | |
| Staging impact: +15ms response time | |
| Predicted prod: +15ms | |
| Actual prod: +18ms | |
Accuracy: 83%
| Deployment #2: | |
| Staging impact: -30ms response time | |
| Predicted prod: -30ms | |
| Actual prod: -28ms | |
| Accuracy: 93% | |
| Deployment #3: | |
| Staging impact: +5ms response time | |
| Predicted prod: +5ms | |
| Actual prod: +85ms | |
| Accuracy: 6% ← INVESTIGATE | |
| Why so different? | |
| → Found: Production has a caching layer staging doesn't | |
| → The code change invalidated cache frequently | |
| → Staging didn't show this because no cache to invalidate | |
| ``` | |
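One way to compute the "accuracy" figure used above is the smaller absolute impact divided by the larger, so a perfect staging-to-production prediction scores 100%. A rough sketch:

```python
# Rough sketch of the prediction-accuracy figure used above:
# smaller absolute impact divided by the larger one.
def prediction_accuracy(staging_impact_ms: float, prod_impact_ms: float) -> float:
    a, b = abs(staging_impact_ms), abs(prod_impact_ms)
    if max(a, b) == 0:
        return 100.0
    return 100.0 * min(a, b) / max(a, b)

print(round(prediction_accuracy(+15, +18)))  # 83
print(round(prediction_accuracy(-30, -28)))  # 93
print(round(prediction_accuracy(+5, +85)))   # 6
```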
| ### 5. Making Rollback Decisions | |
| Deployment markers provide objective data for rollback decisions. | |
| **Define rollback criteria in advance:** | |
| ``` | |
| CRITICAL - Rollback immediately, no questions asked: | |
| ✗ Error rate > 10x baseline | |
| ✗ Availability < 95% (users can't access site) | |
| ✗ Critical business function completely broken (payments, login) | |
| ✗ Data corruption detected | |
| ✗ Security vulnerability exposed | |
| MAJOR - Rollback within 30 minutes unless fix available: | |
| ✗ Error rate 3-10x baseline for 15+ minutes | |
| ✗ Response time > 2x baseline for 20+ minutes | |
| ✗ Significant user complaints (>10 in 10 minutes) | |
| ✗ Revenue-impacting feature degraded | |
| MINOR - Monitor closely, fix forward if possible: | |
| ✗ Error rate 1.5-3x baseline | |
| ✗ Response time 1.25-2x baseline | |
| ✗ Non-critical features affected | |
| ✗ Small number of users affected | |
| ACCEPTABLE - Monitor but no action needed: | |
| ✓ Error rate <1.5x baseline | |
| ✓ Response time <1.25x baseline | |
| ✓ Known/expected issues with acceptable impact | |
| ``` | |
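These criteria are easy to codify so the decision is not made from memory at 2 AM. A minimal sketch follows; the sustained-duration conditions ("for 15+ minutes") are omitted for brevity, and the input ratios are assumed to come from your own monitoring queries:

```python
# Minimal sketch of the rollback criteria above, e.g. for a post-deploy
# check in CI. Thresholds mirror the list; durations are omitted.
from dataclasses import dataclass

@dataclass
class DeployHealth:
    error_rate_ratio: float      # current error rate / baseline error rate
    response_time_ratio: float   # current p95 / baseline p95
    availability_pct: float      # e.g. 99.9
    critical_function_broken: bool = False

def rollback_severity(h: DeployHealth) -> str:
    if (h.error_rate_ratio > 10 or h.availability_pct < 95
            or h.critical_function_broken):
        return "CRITICAL: rollback immediately"
    if h.error_rate_ratio > 3 or h.response_time_ratio > 2:
        return "MAJOR: rollback within 30 minutes unless a fix is ready"
    if h.error_rate_ratio > 1.5 or h.response_time_ratio > 1.25:
        return "MINOR: monitor closely, fix forward if possible"
    return "ACCEPTABLE: monitor, no action needed"

# Using the v2.5.3 incident numbers from the next example:
print(rollback_severity(DeployHealth(error_rate_ratio=5.2 / 0.4,
                                     response_time_ratio=850 / 420,
                                     availability_pct=99.5)))
# -> CRITICAL (error rate is 13x baseline)
```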
| **Using markers to execute rollbacks:** | |
| **Step 1: Identify last known good version** | |
| ``` | |
| Current deployment: v2.5.3 (causing issues) | |
| Review deployment history in reverse: | |
| v2.5.3 - Jan 20, 5:00 PM: | |
| Error rate: 5.2% (PROBLEM) | |
| Response time: 850ms | |
| v2.5.2 - Jan 18, 2:00 PM: | |
| Error rate: 0.4% (GOOD) | |
| Response time: 420ms | |
| Decision: Rollback to v2.5.2 | |
| ``` | |
| **Step 2: Verify the baseline was healthy** | |
| ``` | |
| Click v2.5.2 deployment marker | |
| Check "3 hours after" metrics: | |
| ✓ Error rate: 0.4% | |
| ✓ Response time: 420ms | |
| ✓ Apdex: 0.94 | |
| ✓ No related alerts | |
| Confirm: v2.5.2 is a safe rollback target | |
| ``` | |
| **Step 3: Execute rollback** | |
| ``` | |
| Deploy v2.5.2 to production | |
| New deployment marker appears: | |
| "v2.5.2-rollback" or "v2.5.2" (redeployment) | |
| Timestamp: Jan 20, 5:35 PM | |
| ``` | |
| **Step 4: Validate recovery** | |
| ``` | |
| Monitor the rollback deployment marker: | |
| Minute 0-5: | |
| ☐ Error rate dropping back to baseline? | |
| ☐ Response time improving? | |
| ☐ Alerts clearing? | |
| Minute 5-15: | |
| ☐ Metrics stable at previous baseline? | |
| ☐ User complaints stopped? | |
| ☐ Business functions restored? | |
| Success criteria: | |
| ✓ Error rate back to 0.4% | |
| ✓ Response time back to 420ms | |
| ✓ Apdex back to 0.94 | |
| ✓ All alerts cleared | |
| Result: Rollback successful, incident resolved | |
| ``` | |
| **Step 5: Post-mortem** | |
| ``` | |
| Review v2.5.3 marker to understand what went wrong: | |
| - What changed between v2.5.2 and v2.5.3? | |
| - Why didn't testing catch this? | |
| - What can prevent this in the future? | |
| - When can we safely retry deploying the new features? | |
| ``` | |
| ## Advanced Analysis Techniques | |
| ### 1. Correlating Multiple Signals | |
| Deployment markers are most powerful when combined with other data sources: | |
| **Cross-reference with infrastructure events:** | |
| ``` | |
| Timeline view: | |
| 4:45 PM - Code deployment (marker) | |
| 4:46 PM - Auto-scaling added 3 servers (infrastructure event) | |
| 4:50 PM - Response time improved 30% | |
| Analysis: | |
| Code deployment triggered traffic increase | |
| → Auto-scaling responded appropriately | |
| → System handled load well | |
| → Deployment successful | |
| ``` | |
| **Cross-reference with external services:** | |
| ``` | |
| Timeline view: | |
| 3:00 PM - Code deployment (marker) | |
| 3:02 PM - External API response time spiked | |
| 3:05 PM - Your app error rate spiked | |
| Analysis: | |
| Your deployment increased calls to external API | |
| → External API couldn't handle increased load | |
| → Your app experienced cascading failures | |
| → Need to add circuit breaker or rate limiting | |
| ``` | |
| **Cross-reference with business metrics:** | |
| ``` | |
| Timeline view: | |
| 2:00 PM - Code deployment (marker) | |
| 2:15 PM - Checkout conversion rate dropped 15% | |
| 2:30 PM - Revenue per hour decreased $2,000 | |
| Analysis: | |
| Technical metrics looked fine (response time OK, errors low) | |
| → But users abandoning checkout due to UX changes | |
| → Performance isn't the only deployment success metric | |
| → Need to monitor business KPIs alongside technical metrics | |
| ``` | |
| ### 2. Using NRQL to Query Deployment Data | |
| New Relic Query Language (NRQL) allows programmatic access to deployment data. | |
| **Find all deployments in a time range:** | |
| ```sql | |
| SELECT * | |
| FROM Deployment | |
| WHERE appName = 'YourApp - Production' | |
| SINCE 7 days ago | |
| ``` | |
| **Count deployments per day:** | |
| ```sql | |
| SELECT count(*) as 'Deployments' | |
| FROM Deployment | |
| WHERE appName = 'YourApp - Production' | |
| FACET dateOf(timestamp) | |
| SINCE 30 days ago | |
| ``` | |
| This shows deployment frequency trends over time. | |
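These queries can also be run from scripts or CI jobs. A minimal sketch using NerdGraph's NRQL field, assuming a User API key and account ID in environment variables (variable names are illustrative):

```python
# Minimal sketch: run an NRQL query programmatically via NerdGraph.
import os
import requests

NERDGRAPH_URL = "https://api.newrelic.com/graphql"

def run_nrql(nrql):
    """Execute an NRQL query and return the result rows."""
    query = """
    query($accountId: Int!, $nrql: Nrql!) {
      actor {
        account(id: $accountId) {
          nrql(query: $nrql) { results }
        }
      }
    }
    """
    resp = requests.post(
        NERDGRAPH_URL,
        headers={"API-Key": os.environ["NEW_RELIC_API_KEY"]},
        json={"query": query,
              "variables": {"accountId": int(os.environ["NEW_RELIC_ACCOUNT_ID"]),
                            "nrql": nrql}},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["actor"]["account"]["nrql"]["results"]

deployments_per_day = run_nrql(
    "SELECT count(*) as 'Deployments' FROM Deployment "
    "WHERE appName = 'YourApp - Production' FACET dateOf(timestamp) SINCE 30 days ago"
)
print(deployments_per_day)
```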
| **Find error count by deployment version:** | |
| ```sql | |
| SELECT count(*) as 'Total Errors' | |
| FROM TransactionError | |
| WHERE appName = 'YourApp - Production' | |
| FACET deployment.version | |
SINCE 7 days ago
-- facet results are already sorted highest-count first; NRQL has no ORDER BY
| ``` | |
| Identifies which deployment versions had the most errors. | |
| **Compare response times across deployments:** | |
| ```sql | |
| SELECT average(duration) as 'Avg Response (sec)', | |
| percentile(duration, 95) as 'p95 Response (sec)' | |
| FROM Transaction | |
| WHERE appName = 'YourApp - Production' | |
| FACET deployment.version | |
| SINCE 3 days ago | |
| ``` | |
| Shows performance characteristics of each deployment. | |
| **Identify deployments that caused alert violations:** | |
| ```sql | |
| SELECT count(*) as 'Alert Violations' | |
| FROM Alert | |
| WHERE entity.name = 'YourApp - Production' | |
| FACET deployment.version | |
| SINCE 14 days ago | |
| ``` | |
| Correlates deployments with alert frequency. | |
| **Create a deployment success dashboard:** | |
| ```sql | |
| -- Panel 1: Deployment count over time | |
| SELECT count(*) FROM Deployment | |
| WHERE appName = 'YourApp - Production' | |
| TIMESERIES AUTO | |
| SINCE 30 days ago | |
| -- Panel 2: Error rate by deployment | |
| SELECT percentage(count(*), WHERE error IS true) as 'Error %' | |
| FROM Transaction | |
| FACET deployment.version | |
| SINCE 7 days ago | |
| -- Panel 3: Apdex by deployment | |
| SELECT apdex(duration, t:0.5) as 'Apdex Score' | |
| FROM Transaction | |
| FACET deployment.version | |
| SINCE 7 days ago | |
-- Panel 4: Most recent deployment per version
-- (NRQL has no lag()/window functions, so spacing between
--  deploys is easiest to read from the latest timestamps)
SELECT latest(timestamp) as 'Deployed At'
FROM Deployment
WHERE appName = 'YourApp - Production'
FACET version
SINCE 30 days ago
| ``` | |
| ### 3. Building Deployment Alerts | |
| Create alerts that fire based on deployment impact: | |
| **Post-deployment error spike alert:** | |
| ``` | |
| NRQL Alert Condition: | |
| SELECT percentage(count(*), WHERE error IS true) | |
| FROM Transaction | |
| WHERE appName = 'YourApp - Production' | |
| Threshold: | |
| Critical: Error rate > 5% for at least 5 minutes | |
| Condition: | |
| Only evaluate for 30 minutes after a deployment marker appears | |
| Action: | |
| Page on-call engineer | |
| Include: Deployment version, error count, affected transactions | |
| ``` | |
| **Post-deployment performance degradation alert:** | |
| ``` | |
| NRQL Alert Condition: | |
| SELECT percentile(duration, 95) | |
| FROM Transaction | |
| WHERE appName = 'YourApp - Production' | |
| Threshold: | |
| Warning: p95 response time > 1.5x last deployment baseline | |
| Critical: p95 response time > 2x last deployment baseline | |
| Baseline: | |
| Dynamic baseline from 1 hour before deployment | |
| Action: | |
| Slack notification with deployment details and transaction breakdown | |
| ``` | |
| **Deployment frequency anomaly alert:** | |
| ``` | |
| NRQL Alert Condition: | |
| SELECT count(*) | |
| FROM Deployment | |
| WHERE appName = 'YourApp - Production' | |
| Threshold: | |
| Warning: No deployments in 7 days (team might be stuck) | |
| Warning: >20 deployments in 1 day (possible deployment instability) | |
| Action: | |
| Notify engineering leadership | |
| ``` | |
| ### 4. Analyzing Multi-Service Deployments | |
| In microservices architectures, deployments often span multiple services. | |
| **Scenario: E-commerce platform with multiple services** | |
| ``` | |
| Services: | |
| - Frontend (React SPA) | |
| - API Gateway (Node.js) | |
| - User Service (Python) | |
| - Order Service (Java) | |
| - Payment Service (Go) | |
| ``` | |
| **Coordinated deployment:** | |
| ``` | |
| 10:00 AM - Payment Service v3.2.0 deployed | |
| Impact: Payment processing time improved 40ms | |
| 10:15 AM - Order Service v2.8.1 deployed | |
| Impact: Calls new Payment Service endpoint | |
| 10:30 AM - API Gateway v1.5.3 deployed | |
| Impact: Routes updated for new Order Service behavior | |
| 10:45 AM - Frontend v4.1.0 deployed | |
| Impact: UI updated to show enhanced payment options | |
| ``` | |
| **Cross-service correlation:** | |
| ``` | |
| User reports error at 10:50 AM | |
| Investigation: | |
| ✓ Frontend deployment (10:45): No errors, clean deployment | |
| ✓ API Gateway deployment (10:30): No errors, clean deployment | |
| ✗ Order Service deployment (10:15): 234 errors starting at 10:20 | |
| ✓ Payment Service deployment (10:00): No errors, clean deployment | |
| Root cause: | |
| Order Service had a bug in how it called Payment Service | |
| Bug was introduced in v2.8.1 deployment at 10:15 | |
| Resolution: | |
| Rollback Order Service to v2.8.0 | |
| Users can complete orders again | |
| Fix bug and redeploy as v2.8.2 | |
| ``` | |
| **Using Service Maps with deployment markers:** | |
| ``` | |
| Navigate to: APM → Service Maps | |
| Visual representation: | |
| Frontend → API Gateway → Order Service → Payment Service | |
| Click each service to see recent deployments | |
| Trace request path across all services | |
| Identify which service in the chain is slow/failing | |
| Example trace: | |
| Frontend: 50ms (normal) | |
| API Gateway: 20ms (normal) | |
| Order Service: 3,200ms (SLOW - deployed 10:15) | |
| Payment Service: 45ms (normal) | |
| Conclusion: Order Service deployment at 10:15 is the bottleneck | |
| ``` | |
| ## Common Troubleshooting Scenarios | |
| ### Scenario 1: Deployment marker shows no impact but users report issues | |
| **Symptoms:** | |
| - Deployment marker at 3:00 PM | |
| - Web Transaction Impacts shows minimal change (<2%) | |
| - Users reporting errors/slowness starting 3:05 PM | |
| **Possible causes and solutions:** | |
| **A) Low traffic during comparison window** | |
| ``` | |
| Problem: | |
| "3 hours before" period was overnight (low traffic) | |
| "3 hours after" period is daytime (high traffic) | |
| Metrics incomparable due to different traffic patterns | |
| Solution: | |
| Change comparison window to same time of day | |
| Example: Compare 3:00-6:00 PM today vs yesterday | |
| Or: Compare to same day last week at same time | |
| ``` | |
| **B) Issue affects non-instrumented code paths** | |
| ``` | |
| Problem: | |
| Deployment changed background job processing | |
| New Relic only instruments web transactions | |
| Background jobs not visible in web transaction metrics | |
| Solution: | |
| Check custom events or application logs | |
| Instrument background jobs | |
| Look for scheduled task failures | |
| ``` | |
| **C) Business logic error without technical error** | |
| ``` | |
| Problem: | |
| Code processes successfully (no exceptions thrown) | |
| But produces wrong business results | |
| Example: Checkout calculates tax incorrectly | |
| Solution: | |
| Review business KPIs (conversion rate, revenue) | |
| Check application-specific metrics | |
| Review logic in deployment code changes | |
| Test scenarios reported by users | |
| ``` | |
| **D) Gradual resource exhaustion** | |
| ``` | |
| Problem: | |
| Deployment introduced memory leak | |
| First 3 hours look fine (memory slowly filling) | |
| Hour 4-6: Performance degrades as GC thrashing begins | |
| Solution: | |
| Check JVM/runtime metrics over longer timeframe | |
| Look for gradual memory increase | |
| Monitor garbage collection frequency | |
| Review object lifecycle in new code | |
| ``` | |
| ### Scenario 2: Multiple deployment markers, can't tell which caused issue | |
| **Symptoms:** | |
| - Error spike at 4:30 PM | |
| - Deployments at 4:00, 4:15, 4:25, 4:35 PM | |
| - Unclear which is responsible | |
| **Investigation approach:** | |
| **Step 1: Review timeline precision** | |
| ``` | |
| Error spike: 4:30:00 PM | |
| Deployments: | |
| 4:00 PM - Frontend v2.1 (30 min before spike) | |
| 4:15 PM - API v1.8 (15 min before spike) | |
| 4:25 PM - Database service config (5 min before spike) | |
| 4:35 PM - Cache service v3.2 (5 min after spike) | |
| Initial conclusion: | |
| Cache service deployment AFTER spike, unlikely cause | |
| Database config change VERY close to spike, likely cause | |
| ``` | |
| **Step 2: Check error details** | |
| ``` | |
| Click on error spike, group by error message: | |
| "Connection timeout to database" - 1,240 errors | |
| Started: 4:30:15 PM | |
| "Cache miss rate exceeded" - 15 errors | |
| Started: 4:36:00 PM | |
| Conclusion: | |
| Database errors started immediately (4:30:15) | |
| Database config change was at 4:25:00 | |
| 5 minute gap is reasonable for config propagation | |
| → Database config change caused the spike | |
| ``` | |
| **Step 3: Verify by reviewing each deployment** | |
| ``` | |
| Click deployment marker: Database service config (4:25 PM) | |
| Related errors: 1,240 ← Matches error spike | |
| Error message: "Connection timeout" ← Matches error type | |
| Confirmed: Database config change caused the issue | |
| ``` | |
| ### Scenario 3: Deployment marker timestamp doesn't match actual deployment | |
| **Symptoms:** | |
| - Deployment completed at 2:45 PM | |
| - Marker shows 2:32 PM | |
| - Metrics comparison seems off | |
| **Possible causes:** | |
| **A) Marker created at deployment start, not completion** | |
| ``` | |
| Problem: | |
| Deployment process: | |
| 2:32 PM - Deployment starts, marker created | |
| 2:32-2:45 PM - Code rolls out across servers | |
| 2:45 PM - Last server deployed, process complete | |
| Result: | |
| Marker at 2:32 shows mixture of old/new code | |
| "3 hours after" includes partial deployment period | |
| Metrics don't cleanly separate old vs new | |
| Solution: | |
| Fix deployment automation to create marker after completion | |
| Or manually adjust analysis window to start at 2:45 | |
| ``` | |
| **B) Time zone confusion** | |
| ``` | |
| Problem: | |
| Deployment logs show 2:45 PM EST | |
| New Relic marker shows 7:45 PM UTC | |
| Developer in PST sees 11:45 AM PST | |
Everyone thinks the deployment happened at a different time
| Solution: | |
| Always use UTC for correlation | |
| Convert all times to UTC before analyzing | |
| Set New Relic UI to display UTC | |
| ``` | |
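A small standard-library sketch of the conversion (Python 3.9+ `zoneinfo`), matching the times in the example:

```python
# Convert a deployment log timestamp to UTC (and other zones) before
# correlating it with New Relic markers.
from datetime import datetime
from zoneinfo import ZoneInfo

deploy_log_time = datetime(2026, 1, 20, 14, 45, tzinfo=ZoneInfo("America/New_York"))

utc = deploy_log_time.astimezone(ZoneInfo("UTC"))
pacific = deploy_log_time.astimezone(ZoneInfo("America/Los_Angeles"))

print(utc.strftime("%Y-%m-%d %H:%M %Z"))      # 2026-01-20 19:45 UTC
print(pacific.strftime("%Y-%m-%d %H:%M %Z"))  # 2026-01-20 11:45 PST
```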
| **C) Staged/rolling deployment** | |
| ``` | |
| Problem: | |
| Canary deployment process: | |
| 2:32 PM - 10% of servers deployed | |
| 2:40 PM - 50% of servers deployed | |
| 2:45 PM - 100% of servers deployed | |
| Marker created at first deployment (2:32) | |
| But full impact not visible until 2:45 | |
| Solution: | |
| Create multiple markers for staged deployments | |
| Label each stage: v2.0-canary-10pct, v2.0-canary-50pct, v2.0-full | |
| Compare metrics at each stage | |
| ``` | |
| ### Scenario 4: Clean deployment marker but errors increase days later | |
| **Symptoms:** | |
| - Deployment at Monday 2:00 PM | |
| - Marker shows clean deployment (no issues) | |
| - Wednesday 3:00 PM - error rate spikes | |
| - Errors trace back to code from Monday deployment | |
| **Investigation:** | |
| **Step 1: Verify the connection** | |
| ``` | |
| Check error stack traces: | |
| Are errors in code paths changed Monday? | |
| Or in unrelated code? | |
| Example: | |
| Error: "Index out of bounds in report generation" | |
| Stack trace shows: generateMonthlyReport() method | |
| Monday deployment added: Monthly report feature | |
| Connection confirmed: Monday code is responsible | |
| ``` | |
| **Step 2: Understand the delay** | |
| ``` | |
| Why did it take 2 days to surface? | |
| Common reasons: | |
| A) Time-based trigger: | |
| Monthly report runs on first Wednesday of month | |
| Code never executed until Wednesday | |
| B) Data volume threshold: | |
| Code works fine with small datasets | |
| Database grew over 2 days | |
| Wednesday: Dataset crossed threshold, code fails | |
| C) Rare edge case: | |
| 99% of inputs work fine | |
| Wednesday: User hit the 1% edge case | |
| D) Resource accumulation: | |
| Small memory leak in Monday code | |
| Takes 48 hours to accumulate enough to cause issues | |
| Wednesday: Memory exhausted, errors start | |
| ``` | |
| **Step 3: Improve testing** | |
| ``` | |
| Why didn't testing catch this? | |
| Gap analysis: | |
| - No scheduled job testing (time-based trigger missed) | |
| - Test data too small (volume threshold not tested) | |
| - Test cases don't cover edge cases (rare input missed) | |
| - No long-running test environments (leak not detected) | |
| Improvements: | |
| - Add scheduled job tests to CI/CD | |
| - Use production-sized datasets in staging | |
| - Increase edge case coverage | |
| - Run performance tests for 24+ hours | |
| ``` | |
| ### Scenario 5: Deployment shows improvement but users complain | |
| **Symptoms:** | |
| - Deployment marker shows response time improved 20% | |
| - Error rate decreased 50% | |
| - Technical metrics all positive | |
| - Support tickets increased 300% | |
| **Investigation:** | |
| **Check what metrics don't measure:** | |
| **A) Business logic errors** | |
| ``` | |
| Technical success, business failure: | |
| Old code: Tax calculated incorrectly (overcharged) | |
| New code: Tax calculated correctly (charges appropriate amount) | |
| User perception: "Prices went up 8%" | |
| Technical metrics: All green (code working correctly) | |
| Business reality: Users angry about correct pricing | |
| ``` | |
| **B) User experience changes** | |
| ``` | |
| Technical success, UX regression: | |
| Old code: 1-click checkout (response time 800ms) | |
| New code: 2-step checkout with validation (400ms per step) | |
| Technical metrics: 50% faster response time! | |
| User perception: "Checkout takes longer and is more annoying" | |
| Reality: Extra step adds friction despite faster code | |
| ``` | |
| **C) Removed functionality** | |
| ``` | |
| Technical success, feature loss: | |
| Old code: Advanced search with 20 filters (slow) | |
| New code: Simple search with 5 filters (fast) | |
| Technical metrics: 80% faster search! | |
| User perception: "I can't find products anymore" | |
| Reality: Performance improvement by removing features users needed | |
| ``` | |
| **Solution approach:** | |
| ``` | |
| 1. Monitor business KPIs alongside technical metrics: | |
| - Conversion rates | |
| - Revenue per user | |
| - Session duration | |
| - Feature usage rates | |
| 2. Collect qualitative feedback: | |
| - Support ticket themes | |
| - User surveys | |
| - Session replay analysis | |
| - A/B test results | |
| 3. Balance technical and business success: | |
| - Fast but broken is failure | |
| - Slow but functional might be acceptable | |
| - Best: Fast AND meets user needs | |
| ``` | |
| ## Best Practices for Working with Deployment Markers | |
| ### 1. Establish Deployment Hygiene | |
| **Create meaningful version identifiers:** | |
| ``` | |
| Good version formats: | |
| ✓ Semantic versioning: v2.5.3 (major.minor.patch) | |
| ✓ Date-based: 2026-01-17.1 (date.sequence) | |
| ✓ Build number: build-18885 | |
| ✓ Git commit: a3f9c2d | |
| ✓ Combined: 2026-01-17-build-18885-a3f9c2d | |
| Bad version formats: | |
| ✗ "Latest" | |
| ✗ "Production" | |
| ✗ "Jan deployment" | |
| ✗ Random strings: "xK8mP2q" | |
| Why it matters: | |
| - Sortable versions allow chronological ordering | |
| - Descriptive versions map to source control | |
| - Unique versions prevent confusion | |
| ``` | |
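Version strings in the "combined" format are easy to generate automatically. A minimal sketch, assuming the script runs inside a git checkout and that CI exposes a `BUILD_NUMBER` environment variable (a hypothetical name):

```python
# Build a sortable, descriptive version identifier: date + build number + short SHA.
import os
import subprocess
from datetime import datetime, timezone

def build_version() -> str:
    sha = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    build = os.environ.get("BUILD_NUMBER", "0")   # hypothetical CI variable
    return f"{date}-build-{build}-{sha}"

print(build_version())   # e.g. 2026-01-17-build-18885-a3f9c2d
```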
| **Populate changelog fields:** | |
| ``` | |
| Good changelog: | |
| "JIRA-5432: Optimize database queries for user dashboard (-40ms). | |
| JIRA-5441: Fix null pointer in payment processing. | |
| JIRA-5450: Add support for EU tax calculations. | |
| Breaking change: Removed deprecated /api/v1/legacy endpoint." | |
| Bad changelog: | |
| "Bug fixes and improvements" | |
| "Weekly release" | |
| "" | |
| Why it matters: | |
| - Future you (6 months later) needs context | |
| - Incident responders need to know what changed | |
| - Compliance/audit requires change documentation | |
| ``` | |
| **Deploy one environment at a time:** | |
| ``` | |
| Good deployment flow: | |
| Monday 10 AM: Deploy to dev, monitor 24h | |
| Tuesday 10 AM: Deploy to staging, monitor 24h | |
| Wednesday 10 AM: Deploy to production, monitor ongoing | |
| Bad deployment flow: | |
| Monday 10 AM: Deploy to dev, staging, prod simultaneously | |
| Why it matters: | |
| - Issues caught in dev don't reach production | |
| - Staging validates production behavior | |
| - Rollback doesn't affect multiple environments | |
| - Clear deployment markers per environment | |
| ``` | |
| ### 2. Build Post-Deployment Rituals | |
| **The 5-Minute Check:** | |
| ``` | |
| Every deployment, without exception: | |
| ☐ Open deployment marker | |
| ☐ Check "Related errors" count | |
| ☐ Check "Related alerts" status | |
| ☐ Scan "Web Transaction Impacts" for regressions >10% | |
| ☐ Verify throughput maintained or increased | |
| ☐ Quick check of error rate on Errors page | |
| ☐ Glance at Apdex score | |
| If all green: Continue monitoring | |
| If any red: Investigate immediately | |
| ``` | |
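Part of this check can be scripted. A minimal sketch comparing the post-deploy error rate against a pre-deploy baseline window via NerdGraph NRQL; environment variable names and thresholds are illustrative, and result field names are worth confirming in the query builder first:

```python
# Compare error rate just after a deploy with a pre-deploy baseline window.
import os
import requests

NERDGRAPH_URL = "https://api.newrelic.com/graphql"
APP = "YourApp - Production"

def nrql_first_row(query):
    gql = """
    query($id: Int!, $q: Nrql!) {
      actor { account(id: $id) { nrql(query: $q) { results } } }
    }
    """
    r = requests.post(
        NERDGRAPH_URL,
        headers={"API-Key": os.environ["NEW_RELIC_API_KEY"]},
        json={"query": gql,
              "variables": {"id": int(os.environ["NEW_RELIC_ACCOUNT_ID"]), "q": query}},
        timeout=10,
    )
    r.raise_for_status()
    return r.json()["data"]["actor"]["account"]["nrql"]["results"][0]

ERROR_RATE = ("SELECT percentage(count(*), WHERE error IS true) AS 'rate' "
              "FROM Transaction WHERE appName = '{app}' {window}")

after_rate = nrql_first_row(
    ERROR_RATE.format(app=APP, window="SINCE 10 minutes ago")).get("rate") or 0.0
before_rate = nrql_first_row(
    ERROR_RATE.format(app=APP, window="SINCE 70 minutes ago UNTIL 10 minutes ago")).get("rate") or 0.0

ratio = after_rate / max(before_rate, 0.01)
status = "RED" if ratio > 3 else ("YELLOW" if ratio > 1.5 else "GREEN")
print(f"Error rate: {after_rate:.2f}% vs baseline {before_rate:.2f}% -> {status}")
```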
| **The 30-Minute Deep Dive:** | |
| ``` | |
| For significant deployments: | |
| ☐ Review all changed transactions in detail | |
| ☐ Compare database query performance | |
| ☐ Check external service response times | |
| ☐ Review error messages and stack traces | |
| ☐ Check business metrics (if available) | |
| ☐ Review sample of transaction traces | |
| ☐ Verify no gradual degradation trends | |
| Document findings for future reference | |
| ``` | |
| **The 24-Hour Retrospective:** | |
| ``` | |
| Day after each deployment: | |
| ☐ Review full day of metrics post-deployment | |
| ☐ Compare to same day previous week | |
| ☐ Check for delayed effects (memory leaks, etc.) | |
| ☐ Review support tickets related to deployment | |
| ☐ Assess whether deployment met goals | |
| ☐ Document lessons learned | |
| Feed learnings into next deployment | |
| ``` | |
| ### 3. Build Deployment Intelligence | |
| **Track deployment success metrics:** | |
| ``` | |
| Maintain a deployment scorecard: | |
| Deployment Frequency: | |
| - Deployments per week | |
| - Trend over time (increasing = good) | |
| Deployment Size: | |
| - Files changed per deployment | |
| - Trend over time (decreasing = good) | |
| Deployment Success Rate: | |
| - % with no issues: 85% | |
| - % with minor issues: 10% | |
| - % requiring rollback: 5% | |
| - Target: >80% clean deployments | |
| Mean Time to Detect: | |
| - How long until issues discovered | |
| - Target: <5 minutes | |
| Mean Time to Resolve: | |
| - How long until issues fixed | |
| - Target: <30 minutes | |
| Change Failure Rate: | |
| - % of deployments causing incidents | |
| - DORA metric target: <15% | |
| ``` | |
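The scorecard math is simple to automate if you keep a small per-deployment log. A minimal sketch with illustrative field names and sample data:

```python
# Compute scorecard metrics from a hypothetical per-deployment log.
from statistics import mean

deployments = [
    {"version": "v2.5.1", "outcome": "clean"},
    {"version": "v2.5.2", "outcome": "minor_issue", "mttd_min": 4, "mttr_min": 25},
    {"version": "v2.5.3", "outcome": "rollback",    "mttd_min": 2, "mttr_min": 15},
    {"version": "v2.5.4", "outcome": "clean"},
]

total = len(deployments)
clean = sum(d["outcome"] == "clean" for d in deployments)
failed = sum(d["outcome"] in ("minor_issue", "rollback") for d in deployments)
incidents = [d for d in deployments if "mttd_min" in d]

print(f"Clean deployment rate: {100 * clean / total:.0f}%")    # target >80%
print(f"Change failure rate:   {100 * failed / total:.0f}%")   # DORA target <15%
print(f"Mean time to detect:   {mean(d['mttd_min'] for d in incidents):.0f} min")
print(f"Mean time to resolve:  {mean(d['mttr_min'] for d in incidents):.0f} min")
```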
| **Learn from patterns:** | |
| ``` | |
| Review last 20 deployments: | |
| Identify correlations: | |
| - Day of week: Friday deployments have 3x failure rate | |
| - Time of day: Deployments during peak traffic risky | |
| - Team member: Junior devs need more code review | |
| - Code area: Payment module needs more testing | |
| - Change type: Database migrations frequently problematic | |
| Adjust process: | |
| - No Friday deployments (wait until Monday) | |
| - Deploy during low-traffic hours | |
| - Pair junior developers with seniors | |
| - Add payment-specific test suite | |
| - Require DB migration dry runs in staging | |
| ``` | |
| ### 4. Collaborate Using Deployment Data | |
| **Share context with your team:** | |
| ``` | |
| In deployment channel (Slack/Teams): | |
| "Deploying v2.5.1 to production at 2:00 PM | |
| Changes: JIRA-5432 (perf optimization), JIRA-5441 (bug fix) | |
| Expected impact: 20% faster /dashboard response time | |
| Monitoring: Will check at 2:05, 2:30, and 3:00 PM | |
| Rollback plan: v2.5.0 if error rate >2%" | |
| After deployment: | |
| "v2.5.1 deployed successfully at 2:03 PM ✓ | |
| New Relic marker: [link] | |
| Metrics: Response time -22% (better than expected) | |
| Error rate: 0.3% (baseline) | |
| No issues detected" | |
| ``` | |
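Teams often script this announcement so the format never drifts. A minimal sketch posting to a Slack incoming webhook (the webhook URL environment variable name is hypothetical):

```python
# Post a standard-format deployment announcement to Slack.
import os
import requests

def announce_deployment(version, changes, expected_impact, rollback_plan, marker_link):
    text = (
        f"Deploying {version} to production\n"
        f"Changes: {changes}\n"
        f"Expected impact: {expected_impact}\n"
        f"Rollback plan: {rollback_plan}\n"
        f"New Relic marker: {marker_link}"
    )
    resp = requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": text}, timeout=10)
    resp.raise_for_status()

announce_deployment(
    version="v2.5.1",
    changes="JIRA-5432 (perf optimization), JIRA-5441 (bug fix)",
    expected_impact="20% faster /dashboard response time",
    rollback_plan="v2.5.0 if error rate >2%",
    marker_link="https://one.newrelic.com/...",   # placeholder
)
```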
| **Escalate with data:** | |
| ``` | |
| When issues occur, include deployment context: | |
| "Incident: Error rate spike to 5.2% | |
| Started: 4:32 PM | |
| Deployment: v2.5.3 at 4:30 PM [New Relic marker link] | |
| Affected: /api/checkout endpoint specifically | |
| Impact: ~500 users experiencing checkout failures | |
| Root cause: Database query timeout (new query added) | |
| Action: Rollback to v2.5.2 in progress | |
| ETA: 4:45 PM | |
| New Relic comparison: | |
| v2.5.2: 420ms avg, 0.4% errors | |
| v2.5.3: 850ms avg, 5.2% errors" | |
| ``` | |
| **Build institutional knowledge:** | |
| ``` | |
| Document deployment failures: | |
| Incident post-mortem template: | |
| 1. What happened? | |
| "v2.5.3 deployment caused 5% error rate in checkout" | |
| 2. Deployment details | |
| Marker: [link] | |
| Version: v2.5.3 | |
| Time: Jan 20, 4:30 PM | |
| Changes: [list] | |
| 3. Root cause | |
| "Added unindexed database query in checkout flow" | |
| 4. Detection | |
| "New Relic alert fired at 4:32 PM (2 min after deployment)" | |
| "On-call engineer reviewed deployment marker" | |
| 5. Resolution | |
| "Rolled back to v2.5.2 at 4:45 PM" | |
| "Error rate returned to 0.4% by 4:47 PM" | |
| 6. Prevention | |
| "Add query analysis to code review checklist" | |
| "Require database execution plan review for new queries" | |
| "Add database performance tests to CI/CD" | |
| ``` | |
| ## Understanding What Deployment Markers Don't Show | |
| ### Limitations to Be Aware Of | |
| **1. Deployment markers are correlation, not always causation:** | |
| ``` | |
| Example: | |
| 2:00 PM - Your deployment | |
| 2:05 PM - Error spike | |
| Could be: | |
| ✓ Your deployment caused errors (causation) | |
| ✗ Upstream service failed coincidentally (correlation) | |
| ✗ DDoS attack started (unrelated) | |
| ✗ Infrastructure issue (network, database) | |
| Always verify the causal link by: | |
| - Checking error details | |
| - Reviewing code changes | |
| - Confirming errors are in changed code paths | |
| ``` | |
| **2. Some changes don't create deployment markers:** | |
| ``` | |
| Untracked changes: | |
| - Feature flag toggles | |
| - Configuration file updates | |
| - Database migrations (separate from code deploy) | |
| - Infrastructure changes (server upgrades) | |
| - CDN cache purges | |
| - Third-party service updates | |
| Solution: | |
| Record these as custom events in New Relic | |
| Create your own markers for significant changes | |
| ``` | |
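For those untracked changes, a custom event works well as a homemade marker. A minimal sketch using the Event API; the endpoint, key header, and the `SystemChange` event type name are assumptions to adapt (check the endpoint for your region and account type):

```python
# Record an untracked change (feature flag flip, config update, CDN purge, ...)
# as a custom event so it can sit alongside deployment markers in NRQL.
import os
import time
import requests

ACCOUNT_ID = os.environ["NEW_RELIC_ACCOUNT_ID"]
EVENT_API_URL = f"https://insights-collector.newrelic.com/v1/accounts/{ACCOUNT_ID}/events"

def record_change(change_type, description, user):
    """Send one custom 'SystemChange' event (hypothetical event type name)."""
    event = [{
        "eventType": "SystemChange",   # queryable later: SELECT * FROM SystemChange
        "changeType": change_type,     # e.g. "feature_flag", "config", "cdn_purge"
        "description": description,
        "user": user,
        "timestamp": int(time.time()),
    }]
    resp = requests.post(
        EVENT_API_URL,
        headers={"Api-Key": os.environ["NEW_RELIC_INSERT_KEY"]},
        json=event,
        timeout=10,
    )
    resp.raise_for_status()

record_change("feature_flag", "Enabled new checkout flow for 10% of users", "jane.doe")
```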
| **3. Gradual rollouts appear as single markers:** | |
| ``` | |
| Canary deployment: | |
| 1:00 PM - Deploy to 5% of servers (marker created) | |
| 1:30 PM - Deploy to 25% of servers (no new marker) | |
| 2:00 PM - Deploy to 100% of servers (no new marker) | |
| Metric changes happen gradually but marker shows only first deployment | |
| User impact spreads over an hour, not instantly at 1:00 PM | |
| Solution: | |
| Create multiple markers for each rollout stage | |
| Label clearly: v2.0-5pct, v2.0-25pct, v2.0-100pct | |
| ``` | |
| **4. Deployment success metrics can be misleading:** | |
| ``` | |
| False positives: | |
| Technical metrics green, but business impact negative | |
| Example: Fast but broken feature, correct pricing that users hate | |
| False negatives: | |
| Technical metrics show regression, but acceptable trade-off | |
| Example: Slower response time but critical security fix applied | |
| Solution: | |
| Balance technical and business metrics | |
| Context matters: not all "red" metrics are bad | |
| ``` | |
| ## Official Documentation and Resources | |
| ### New Relic Documentation | |
| - **[Track changes using NerdGraph (GraphQL)](https://docs.newrelic.com/docs/change-tracking/change-tracking-graphql/)** - Official guide for creating deployment markers via GraphQL API | |
| - **[Introduction to NerdGraph](https://docs.newrelic.com/docs/apis/nerdgraph/get-started/introduction-new-relic-nerdgraph/)** - Getting started with New Relic's GraphQL API | |
| - **[Capture and analyze changes in your systems](https://docs.newrelic.com/docs/change-tracking/change-tracking-introduction/)** - Overview of New Relic's change tracking capabilities | |
| - **[Record and view deployments](https://docs.newrelic.com/docs/apm/apm-ui-pages/events/record-deployments/)** - Legacy REST API documentation (migration to GraphQL recommended) | |
| - **[How to view and analyze your changes](https://docs.newrelic.com/docs/change-tracking/change-tracking-view-analyze/)** - Guide to using the change tracking UI | |
| ### Community Resources and Tutorials | |
| - **[Deployment Tracking 101: CI/CD Best Practices](https://newrelic.com/blog/news/change-tracking)** - New Relic blog post on change tracking best practices | |
| - **[Change Tracking for Performance Velocity](https://newrelic.com/blog/how-to-relic/change-tracking-for-performance-velocity)** - Tutorial on using deployment markers for performance analysis | |
| - **[Getting Started With NerdGraph—The New Relic GraphQL API Explorer](https://newrelic.com/blog/how-to-relic/graphql-api)** - Interactive guide to using the GraphQL API explorer | |
| ### API Tools | |
| - **[New Relic NerdGraph GraphQL API Collection (Postman)](https://www.postman.com/new-relic/new-relic-graphql-api-collection/documentation/btuxnnc/new-relic-nerdgraph-graphql-api-collection)** - Pre-built Postman collection for testing deployment marker API calls | |
| ## Conclusion | |
| Deployment markers transform New Relic from a reactive monitoring tool ("something broke, figure out why") into a proactive validation system ("I deployed this change, did it improve or degrade performance?"). | |
| **The key insight**: Every deployment is an experiment. Deployment markers provide the measurement framework to evaluate that experiment objectively. | |
| **Use them to:** | |
| - ✅ Validate that deployments improved what you intended | |
| - ✅ Catch regressions immediately after deployment | |
| - ✅ Build confidence in your deployment process | |
| - ✅ Make data-driven rollback decisions | |
| - ✅ Track team velocity and deployment success rates | |
| - ✅ Correlate code changes with business metrics | |
| **Start simple**: After every deployment, spend 5 minutes reviewing the deployment marker. Check for related errors, related alerts, and transaction impacts. That one habit will catch most issues early and build your deployment confidence over time. |