Major Outage - Internal Network failure

Incident Report for Teradig

Postmortem

📄 Incident Postmortem – Major Service Outage (MSO)

Incident Start: 2025-08-29 23:04:53 (Europe/Paris)
Partial Restoration (DNS + critical services): 2025-09-02 12:00:00 (Europe/Paris)
Full Restoration: 2025-09-03 19:40:15 (Europe/Paris)
Duration: ~4 days 21 hours (partial restoration after ~3.5 days)
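
As a quick check on the timeline above, the elapsed times can be computed from the three timestamps. This is a minimal sketch: it assumes a fixed UTC+2 offset, which is what Europe/Paris (CEST) observes on these dates, and the helper name `describe` is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Europe/Paris is on CEST (UTC+2) for all three timestamps,
# so a fixed offset is enough and no time zone database is needed.
CEST = timezone(timedelta(hours=2))

start = datetime(2025, 8, 29, 23, 4, 53, tzinfo=CEST)
partial = datetime(2025, 9, 2, 12, 0, 0, tzinfo=CEST)
full = datetime(2025, 9, 3, 19, 40, 15, tzinfo=CEST)

time_to_partial = partial - start
time_to_full = full - start


def describe(delta: timedelta) -> str:
    """Format a timedelta as days, hours, and minutes (seconds dropped)."""
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    return f"{delta.days}d {hours}h {minutes}m"


print("Time to partial restoration:", describe(time_to_partial))
print("Time to full restoration:", describe(time_to_full))
```

The update log further down is stamped in CAT, which is also UTC+2, so its times line up with the Europe/Paris times without conversion.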

Impact: All Teradig LTD systems and services were inaccessible, including client websites, email, and domain services.

🔎 Summary

Between August 29 and September 3, 2025, Teradig LTD experienced a major service outage (MSO) due to an internal network failure. This outage made all systems inaccessible.

Starting at noon on September 2, we restored DNS services for critical clients by updating nameservers (NS). We also shared files and databases directly with clients who needed to remain live. Full restoration of all systems was completed on September 3 at 19:40 (Europe/Paris).

⚠️ Root Cause

The outage originated in our internal networking layer: a NAT and routing failure prevented external traffic from reaching our systems.

  • Nodes were functional but isolated.
  • DNS/NS servers were down, which made hosted domains unreachable.
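
The difference between a node that is down and a node that is running but unreachable can be illustrated with a small sketch. It assumes two hypothetical probes per node, one from inside the network and one from outside; the names `ProbeResult`, `NodeState`, and `classify` are illustrative and are not taken from Teradig's tooling.

```python
from dataclasses import dataclass
from enum import Enum


class NodeState(Enum):
    HEALTHY = "healthy"
    ISOLATED = "isolated"  # running, but unreachable from outside
    DOWN = "down"


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical probe outcome for one node."""
    internal_ok: bool  # health check answered from inside the network
    external_ok: bool  # same check answered from the public internet


def classify(probe: ProbeResult) -> NodeState:
    """Map a pair of probe outcomes to a node state."""
    if probe.external_ok:
        # Reachable from outside, which is what clients care about.
        return NodeState.HEALTHY
    if probe.internal_ok:
        # The node itself works; the path to it (NAT, routing) does not.
        return NodeState.ISOLATED
    return NodeState.DOWN


print(classify(ProbeResult(internal_ok=True, external_ok=False)).value)
```

A check of this kind points at the network path instead of the nodes themselves, which matches the situation described above.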

🛠️ Resolution

  • Stabilized nodes early in the process.
  • Identified NAT and routing failure as the root cause.
  • A full restoration of our systems was required; because of the large amount of data, it took considerable time.
  • Provided temporary workarounds: moved some email systems, shared databases/files directly.
  • Restored DNS and live access for priority clients on September 2 at noon.
  • Achieved full system restoration on September 3 at 19:40 (Europe/Paris).

📊 Impact

  • All hosted services (websites, portals, email, domains) unavailable for multiple days.
  • Key academic clients were significantly affected at the start of the academic year.
  • Communication delays early on added pressure for affected users.

✅ Preventive Measures

To avoid recurrence, Teradig LTD has:

  1. Strengthened internal network resilience and NAT redundancy.
  2. Enhanced monitoring and real-time alerting.
  3. Published a dedicated status page (https://teradig.statuspage.io) for transparent communication.
  4. Improved backup and disaster recovery to shorten restoration times.
  5. Introduced clearer client communication protocols during outages.
  6. Version-controlled our infrastructure on GitHub, ensuring faster recovery.
  7. Established new points of recovery to reduce downtime in future incidents.
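
This report does not describe how the monitoring and alerting in item 2 are implemented. Purely as an illustration, one common pattern is to alert only after several consecutive failed checks, so that a single dropped probe does not page anyone. The class name `ConsecutiveFailureAlert` and the threshold of 3 are assumptions made for this sketch.

```python
class ConsecutiveFailureAlert:
    """Fire once when a check fails `threshold` times in a row (hypothetical)."""

    def __init__(self, threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.failures = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result; return True if an alert should fire now."""
        if ok:
            # Any success resets the streak.
            self.failures = 0
            return False
        self.failures += 1
        # Fire exactly once, on the probe that reaches the threshold.
        return self.failures == self.threshold


alert = ConsecutiveFailureAlert(threshold=3)
results = [True, False, False, False, False, True]
fired = [alert.record(ok) for ok in results]
print(fired)
```

In this run the alert fires on the third consecutive failure only; the fourth failure stays quiet, and the final success resets the counter.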

🙏 Closing Note

We sincerely apologize for the disruption and the inconvenience caused. Throughout this incident, our top priority was to preserve client data, and we are pleased to confirm that no data was lost.

We thank our clients for their patience and trust, and we are committed to stronger resilience, clearer communication, and faster recovery in the future.

Christophe RENZAHO
Managing Director
Teradig LTD

Posted Sep 03, 2025 - 20:32 CAT

Resolved

This incident has been resolved.
Posted Sep 03, 2025 - 20:19 CAT

Update

✅ Service Restored – Nominal Operations
We are pleased to confirm that services are back to nominal operation.
Our technical team reports 100% confidence in system stability, and we continue close monitoring.
Posted Sep 03, 2025 - 20:19 CAT

Update

⚠️ Partial Service Restoration
Core systems and email are fully operational, and all data has been preserved.
The Management Portal is still down, and our team is actively working to restore it. We'll continue to provide updates as progress is made.
Posted Sep 03, 2025 - 19:35 CAT

Monitoring

✅ Service Restored – Monitoring Ongoing
Our systems are now fully operational, and all services, including email, are back online. We are continuing to monitor performance closely to ensure stability. Thank you for your patience.
Posted Sep 03, 2025 - 18:40 CAT

Update

We are continuing to work on a fix for this issue.
Posted Sep 03, 2025 - 10:00 CAT

Update

Our backup attempt has failed. The IT team is now implementing a new solution and working to complete the process as quickly as possible.
Posted Sep 02, 2025 - 10:03 CAT

Update

We are continuing to work on a fix for this issue.
Posted Sep 01, 2025 - 20:20 CAT

Update

We are continuing to work on a fix for this issue.
Posted Sep 01, 2025 - 16:48 CAT

Update

We have been required to restore our systems, which contain a large amount of data, and this process is taking considerable time.
Posted Sep 01, 2025 - 15:11 CAT

Update

We are continuing to work on a fix for this issue.
Posted Sep 01, 2025 - 15:02 CAT

Update

We are continuing to work on a fix for this issue.
Posted Sep 01, 2025 - 13:10 CAT

Update

We are continuing to work on a fix for this issue.
Posted Sep 01, 2025 - 13:07 CAT

Update

The issue has been identified and a fix is being implemented.
Posted Sep 01, 2025 - 11:00 CAT

Identified

All systems and services, including domain services for hosted clients, are temporarily inaccessible. Internal communications between instances were initially affected.
Posted Aug 30, 2025 - 00:15 CAT