Tech Due Diligence OPS-1: Monitoring and Observability
What This Control Requires
The assessor evaluates the monitoring and observability stack, including metrics collection, alerting configuration, distributed tracing, dashboard quality, and the team's ability to detect and diagnose production issues rapidly.
In Plain Language
Can your team tell what is happening in production right now, without waiting for a user to complain? That is the fundamental question here. Observability rests on three pillars: metrics (quantitative measurements of system behaviour), logs (detailed event records), and traces (request flow through distributed systems).
Assessors will dig into infrastructure monitoring (CPU, memory, disk, network), application performance (response times, error rates, throughput), business metrics (signup rates, transaction volumes, feature usage), alerting configuration (thresholds, routing, escalation), dashboard quality (actionable data, not vanity metrics), distributed tracing for multi-service systems, and on-call processes.
No monitoring at all is a critical finding - it means the team is blind to problems until users report them. But alert fatigue from excessive or poorly tuned alerts is also a red flag, because it trains people to ignore alerts, and real incidents get missed. Assessors want to see a balanced, thoughtful approach that provides genuine visibility without overwhelming the team.
How to Implement
Build out all three observability pillars. For metrics, use a time-series metrics system such as Prometheus, Datadog, or CloudWatch. For logs, centralise collection with the ELK stack (Elasticsearch, Logstash, Kibana), Datadog, or Grafana Loki. For traces, implement distributed tracing with Jaeger, Zipkin, or Datadog APM to follow requests across service boundaries.
Monitor the Four Golden Signals for each service: latency (time to serve a request), traffic (requests per second), errors (error rate), and saturation (resource utilisation). Together, these give you a comprehensive picture of service health.
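As a rough illustration of how the four signals can be derived from raw request data, here is a minimal Python sketch. The `Request` record, the `golden_signals` function, the choice of the 95th percentile for latency, and the treatment of HTTP 5xx responses as errors are all assumptions made for this example; in practice your metrics platform computes these values for you.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def golden_signals(requests, window_seconds, cpu_utilisation):
    """Summarise one monitoring window as the four golden signals."""
    count = len(requests)
    if count == 0:
        return {"latency_p95_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": cpu_utilisation}

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile: rank = ceil(0.95 * count), in integer maths.
    rank = (95 * count + 99) // 100
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": durations[rank - 1],   # latency
        "traffic_rps": count / window_seconds,   # traffic
        "error_rate": errors / count,            # errors
        "saturation": cpu_utilisation,           # saturation (supplied by the host)
    }
```

Latency is summarised with a percentile rather than a mean because a handful of very slow requests can hide behind a healthy-looking average.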
Set up actionable alerts. Alert on symptoms (high error rate, slow responses) rather than causes (high CPU). Base thresholds on user impact, not arbitrary numbers. Include context in every alert message - what broke, where, the impact, and a link to the relevant runbook. Route alerts to the right team through the right channel, and eliminate noisy alerts that do not require action.
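The sketch below shows one way a symptom-based alert with built-in context might look. It is not tied to any particular alerting tool: the `Alert` structure, the `check_error_rate` and `format_alert` functions, the "double the threshold" paging rule, and the runbook URL in the usage note are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    service: str      # where it broke
    severity: str     # "page" or "ticket"
    summary: str      # what broke
    impact: str       # who is affected and how badly
    runbook_url: str  # where to find the fix steps


def check_error_rate(service, error_rate, threshold, runbook_url) -> Optional[Alert]:
    """Return an Alert when the user-facing error rate breaches the threshold."""
    if error_rate <= threshold:
        return None  # healthy: no alert, no noise
    # Assumed policy: page a human only when the rate is at least double the threshold.
    severity = "page" if error_rate >= 2 * threshold else "ticket"
    return Alert(
        service=service,
        severity=severity,
        summary=f"High error rate on {service}",
        impact=f"{error_rate:.1%} of requests are failing (threshold {threshold:.1%})",
        runbook_url=runbook_url,
    )


def format_alert(alert: Alert) -> str:
    """Render the alert so the responder sees what, where, impact, and runbook."""
    return "\n".join([
        f"[{alert.severity.upper()}] {alert.summary}",
        f"Service: {alert.service}",
        f"Impact: {alert.impact}",
        f"Runbook: {alert.runbook_url}",
    ])
```

For example, `check_error_rate("checkout", 0.03, 0.01, "https://runbooks.example.com/checkout-errors")` would produce a paging alert, while an error rate at or below the threshold produces nothing at all, which is what keeps the channel quiet enough to be trusted.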
Build a dashboard hierarchy: service overview dashboards that show health at a glance, service-specific dashboards with detailed metrics per component, investigation dashboards for drilling into problems, and business dashboards that track key metrics. These should be used daily, not just during incidents.
Define SLIs (Service Level Indicators) and SLOs (Service Level Objectives) for critical services. Pick measurable indicators of quality - availability, latency, error rate - and set target thresholds. Use error budgets to balance reliability investment against feature velocity.
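To make the error budget idea concrete, here is a small worked example in Python. It assumes a request-based availability SLI (successful requests divided by total requests); the `error_budget` function and the figures in the usage note are illustrative, not prescribed targets.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Compare a measured availability SLI against its SLO and report the budget."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    sli = 1 - failed_requests / total_requests          # measured availability
    allowed_failures = (1 - slo_target) * total_requests  # the error budget, in requests
    # Fraction of the budget already spent; a 100% target leaves no budget at all.
    consumed = failed_requests / allowed_failures if allowed_failures > 0 else float("inf")

    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": 1 - consumed,
    }
```

With a 99.9% SLO and 1,000,000 requests in the period, the budget is roughly 1,000 failed requests. If 250 requests have failed so far, about a quarter of the budget is spent and the team still has room to ship risky changes; once the remaining budget approaches zero, reliability work takes priority over new features.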
Establish on-call rotations with clear escalation procedures. Set response time expectations by alert severity, make sure on-call engineers have the access and runbooks they need, and compensate on-call time fairly.
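An escalation procedure can be written down as data so that it is unambiguous and easy to review. The Python sketch below is one possible shape: the severity names, the roles, and the minute thresholds in `POLICY` are placeholder assumptions, and `who_to_notify` is a hypothetical helper rather than the API of any paging tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes without acknowledgement before this step fires
    notify: str         # role to notify at this step


# Assumed example policy: tighter response expectations for higher severities.
POLICY = {
    "critical": [
        EscalationStep(0, "primary on-call"),
        EscalationStep(15, "secondary on-call"),
        EscalationStep(30, "engineering manager"),
    ],
    "warning": [
        EscalationStep(0, "primary on-call"),
        EscalationStep(60, "secondary on-call"),
    ],
}


def who_to_notify(severity, minutes_unacknowledged):
    """List every role that should have been notified by this point."""
    steps = POLICY[severity]
    return [s.notify for s in steps if s.after_minutes <= minutes_unacknowledged]
```

Under this example policy, a critical alert left unacknowledged for 20 minutes has reached both the primary and secondary on-call engineers, whereas a warning of the same age has reached only the primary.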
Evidence Your Auditor Will Request
- Monitoring stack architecture (metrics, logs, traces)
- Alert configuration with thresholds and routing
- Production dashboards demonstrating operational visibility
- SLI/SLO definitions for critical services
- On-call rotation and escalation procedures
Common Mistakes
- No application-level monitoring; only basic infrastructure metrics
- Alert fatigue: hundreds of alerts with many false positives
- Dashboards exist but are not used; team relies on user reports for issue detection
- No distributed tracing; impossible to diagnose issues across service boundaries
- No on-call rotation; production issues handled ad-hoc by whoever is available