Tech Due Diligence OPS-1: Monitoring and Observability

What This Control Requires

The assessor evaluates the monitoring and observability stack, including metrics collection, alerting configuration, distributed tracing, dashboard quality, and the team's ability to detect and diagnose production issues rapidly.

In Plain Language

Can your team tell what is happening in production right now, without waiting for a user to complain? That is the fundamental question here. Observability rests on three pillars: metrics (quantitative measurements of system behaviour), logs (detailed event records), and traces (request flow through distributed systems).

Assessors will dig into infrastructure monitoring (CPU, memory, disk, network), application performance (response times, error rates, throughput), business metrics (signup rates, transaction volumes, feature usage), alerting configuration (thresholds, routing, escalation), dashboard quality (actionable data, not vanity metrics), distributed tracing for multi-service systems, and on-call processes.

No monitoring at all is a critical finding - it means the team is blind to problems until users report them. But alert fatigue from excessive or poorly tuned alerts is also a red flag, because it trains people to ignore alerts, and real incidents get missed. Assessors want to see a balanced, thoughtful approach that provides genuine visibility without overwhelming the team.

How to Implement

Build out all three observability pillars. For metrics, use a time-series metrics backend such as Prometheus, Datadog, or CloudWatch. For logs, centralise collection with the ELK stack (Elasticsearch, Logstash, Kibana), Datadog, or Grafana Loki. For traces, implement distributed tracing with Jaeger, Zipkin, or Datadog APM to follow requests across service boundaries.

Monitor the Four Golden Signals for each service: latency (time to serve a request), traffic (requests per second), errors (error rate), and saturation (resource utilisation). Together, these give you a comprehensive picture of service health.
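As a rough illustration of what these four signals look like as numbers, the sketch below summarises them for one service over one time window. It assumes request records are already available in memory; the `Request` type, the `golden_signals` function, the choice of a 95th percentile for latency, and the use of CPU utilisation as the saturation measure are all hypothetical choices for the example. In practice your monitoring platform would compute these for you.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape for this sketch)."""
    duration_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_seconds, cpu_utilisation):
    """Summarise the four golden signals for one service over one window.

    cpu_utilisation is a fraction between 0 and 1 and stands in for
    saturation here; a real service may be bound by memory, I/O or queues.
    """
    if not requests:
        return {
            "latency_p95_ms": None,
            "traffic_rps": 0.0,
            "error_rate": 0.0,
            "saturation": cpu_utilisation,
        }

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile: the smallest value with at least
    # 95% of observations at or below it.
    rank = max(1, math.ceil(0.95 * len(durations)))
    server_errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": server_errors / len(requests),
        "saturation": cpu_utilisation,
    }


if __name__ == "__main__":
    sample = [Request(100.0, 200)] * 9 + [Request(900.0, 500)]
    print(golden_signals(sample, window_seconds=10, cpu_utilisation=0.4))
```

Counting only 5xx responses as errors is itself an assumption; some teams also count slow or semantically wrong responses.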

Set up actionable alerts. Alert on symptoms (high error rate, slow responses) rather than causes (high CPU). Base thresholds on user impact, not arbitrary numbers. Include context in every alert message - what broke, where, the impact, and a link to the relevant runbook. Route alerts to the right team through the right channel, and eliminate noisy alerts that do not require action.
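To make the idea of a symptom-based alert with context concrete, here is a minimal sketch. It fires on a user-facing symptom (error rate) and carries what broke, where, the impact, a runbook link, and a routing target. The `Alert` type, the `evaluate_error_rate` function, the threshold value, the team name, and the runbook URL are all illustrative assumptions, not the API of any particular alerting tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    """Everything the on-call engineer needs in one message (hypothetical shape)."""
    service: str
    summary: str      # what broke
    impact: str       # who is affected and how
    runbook_url: str  # where to start
    route: str        # which team or channel receives it


def evaluate_error_rate(service, error_rate, threshold, runbook_url, route) -> Optional[Alert]:
    """Return an Alert when the error rate exceeds the threshold, else None.

    The threshold should come from user impact (for example an SLO),
    not from an arbitrary number.
    """
    if error_rate <= threshold:
        return None
    return Alert(
        service=service,
        summary=f"{service} error rate is {error_rate:.1%} (threshold {threshold:.1%})",
        impact=f"Roughly {error_rate:.1%} of user requests to {service} are failing",
        runbook_url=runbook_url,
        route=route,
    )


if __name__ == "__main__":
    alert = evaluate_error_rate(
        service="checkout",
        error_rate=0.05,
        threshold=0.02,
        runbook_url="https://runbooks.example.com/checkout/high-error-rate",  # placeholder URL
        route="payments-oncall",  # placeholder team name
    )
    print(alert)
```

Returning `None` below the threshold reflects the principle above: if no action is required, no alert is sent.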

Build a dashboard hierarchy: service overview dashboards that show health at a glance, service-specific dashboards with detailed metrics per component, investigation dashboards for drilling into problems, and business dashboards that track key metrics. The team should use these daily, not just during incidents.

Define SLIs (Service Level Indicators) and SLOs (Service Level Objectives) for critical services. Pick measurable indicators of quality - availability, latency, error rate - and set target thresholds. Use error budgets to balance reliability investment against feature velocity.
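The error budget arithmetic can be sketched as follows, assuming a request-based availability SLI (good requests divided by total requests). The function name and the example figures are hypothetical; time-based SLOs follow the same logic with minutes instead of requests.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Compute error budget usage for a request-based availability SLO.

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    # The budget is the number of failures the SLO tolerates in this period.
    allowed_failures = (1 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": consumed,       # 1.0 means the budget is fully spent
        "exhausted": failed_requests > allowed_failures,
    }


if __name__ == "__main__":
    # A 99.9% SLO over 1,000,000 requests tolerates about 1,000 failures.
    print(error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250))
```

A common way to use the result is to slow feature releases when the budget is close to exhausted and to spend it on riskier changes when plenty remains.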

Establish on-call rotations with clear escalation procedures. Set response time expectations by alert severity, make sure on-call engineers have the access and runbooks they need, and compensate on-call time fairly.
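An escalation procedure with response times by severity can be written down as data, which also makes it easy to hand to an assessor. The sketch below is one possible shape; the severity names, time limits, and roles are illustrative assumptions, and most paging tools let you configure the equivalent without code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes without acknowledgement before this step fires
    notify: str         # role to page at this step


# Hypothetical policy: critical alerts escalate faster than warnings.
POLICY = {
    "critical": [
        EscalationStep(0, "primary on-call"),
        EscalationStep(15, "secondary on-call"),
        EscalationStep(30, "engineering manager"),
    ],
    "warning": [
        EscalationStep(0, "primary on-call"),
        EscalationStep(60, "secondary on-call"),
    ],
}


def who_to_notify(severity, minutes_unacknowledged):
    """Return every role that should have been paged by now."""
    return [
        step.notify
        for step in POLICY[severity]
        if step.after_minutes <= minutes_unacknowledged
    ]


if __name__ == "__main__":
    print(who_to_notify("critical", 20))
    print(who_to_notify("warning", 20))
```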

Evidence Your Auditor Will Request

Monitoring stack architecture (metrics, logs, traces)
Alert configuration with thresholds and routing
Production dashboards demonstrating operational visibility
SLI/SLO definitions for critical services
On-call rotation and escalation procedures

Common Mistakes

No application-level monitoring; only basic infrastructure metrics
Alert fatigue: hundreds of alerts with many false positives
Dashboards exist but are not used; team relies on user reports for issue detection
No distributed tracing; impossible to diagnose issues across service boundaries
No on-call rotation; production issues handled ad-hoc by whoever is available

Related Controls Across Frameworks

Framework	Control ID	Relationship
ISO 27001	A.8.16	Related
SOC 2	CC7.2	Related

Frequently Asked Questions

What monitoring platform should we use?

It depends on budget, team expertise, and scale. Datadog gives you an integrated platform for metrics, logs, and traces but gets expensive at scale. Prometheus/Grafana/Loki is powerful and cost-effective but takes more effort to manage. Cloud-native options like CloudWatch or Azure Monitor integrate well with their respective platforms. For due diligence purposes, the tool matters far less than how comprehensive your coverage is.

How many alerts is too many?

The guiding principle is that every alert should require human action. As a rough guide, fatigue becomes a real risk once the team receives more than five to ten actionable alerts per on-call shift. And if the team is routinely ignoring alerts, the system has already failed regardless of how many are configured.
