Back to overview
Degraded
2025-08-29 prod6 attestation generation failure
Aug 29 at 11:43pm UTC
Affected services
Prod6 VMM
Resolved
Aug 30 at 03:05am UTC
2025-08-29 prod6 attestation generation failure
1. Incident Summary
- Incident Name: prod6 attestation generation failure
- Date and Time: 2025/08/29 23:43:22 - 2025/08/30 03:05:00
- Duration: 201 minutes
- Severity Level: P0
- Impact: All CVMs in Prod6 are unable to get attestation.
- Participants: Leechael, Kevin, Hang Yin
2. Timeline of Events
- Pre-Incident Events:
- 2025/08/29 23:43:22 - qgsd emitted
[QPL] No certificate data for this platform.
in the log.
- 2025/08/29 23:43:22 - qgsd emitted
- Incident Detection:
- 2025/08/30 00:54 - Customer reported attestation service failures
- Incident Response:
- 2025/08/30 01:04 - Engineering team acknowledged the issue and initiated investigation
- 2025/08/30 01:36 - Isolated the impact to CVMs on prod6
- 2025/08/30 01:44 - Escalated to additional engineering resources for deeper investigation
- Incident Resolution:
- 2025/08/30 02:43 - Identified QGS daemon (qgsd) failure and initiated service restart
- 2025/08/30 03:00 - Service restoration confirmed, customer notified
- 2025/08/30 03:05 - Full functionality verified across all affected systems
3. Root Cause Analysis (RCA)
- Primary Cause:
- The primary cause was the unexpected termination of the Quote Generation Service (QGS) on the affected production server. The QGS, a critical Intel service component that hosts the TD Quoting Enclave, experienced a network-related failure that went undetected due to insufficient monitoring coverage. As the QGS must run on the same physical machine as the corresponding Trusted Domain for attestation verification, the service failure created a complete attestation outage for all CVMs on that host.
- When we check the source code, it returns
NO certificate data for this platform
is because GQS sent malformed requests to PCCS and received 404 error.
Contributing Factors:
Insufficient Monitoring and Alerting
The network failure affecting QGS was not detected promptly due to gaps in monitoring and alerting for service health and network connectivity. No automated alerts were triggered for the QGS process termination or its inability to reach the PCCS server.
4. Impact Analysis
- Technical Impact & Business Impact
- All running CVMs in prod6 and guest-agent API.
- No new CVMs can launch successfully while dealing with the incident.
5. Resolution Details
- Steps Taken to Resolve the Incident:
- After restarting the qgsd service, everything is back online.
6. Lessons Learned
- Platform Monitoring Gap: The absence of dedicated monitoring on Cloud platform for the QGS service delayed detection and diagnosis of the failure
Affected services
Created
Aug 29 at 11:43pm UTC
prod6 attestation generation failure
Affected services