Last updated by: daezel, Last updated on: 14/05/2025
After-Action Report: Data Warehouse Crash and Recovery (April 2025)
Summary
This document outlines the events, diagnosis, response, and lessons learned from a critical incident involving the Redback Data Warehouse (DW) VM, where all containers went down and the system became unresponsive due to a combination of user error and a rogue service (Suricata) flooding the disk.
This report serves both as documentation of leadership and technical response, and as a reference for future team members on how to handle similar VM-level issues.
Incident Overview
- Date of Incident: 15 April 2025
- VM Services Affected: All core DW services (MinIO, Dremio, File Upload Service, Postgres, etc.)
- Initial Trigger: Accidental execution of
docker rm -f dremio minioserver
followed bydocker compose up -d
by a teammate - Result: Containers restarted incorrectly, data volumes not attached, and later VM became completely unresponsive due to full disk
Timeline & Root Cause Analysis
Phase 1: User Error
-
A teammate mistakenly believed they were working in an isolated environment and ran the following command:
docker rm -f dremio minioserver
docker compose up -d -
Unfortunately, the wrong Docker Compose file (from the Streamlit file upload app) was used, which does not map to the correct persistent volumes.
-
As a result, the new containers started without links to the historical data — making it appear as if data had been lost.
Phase 2: Leadership Response
- As the DW leader, I called my mentor Ben Stephens immediately, recognizing the severity of the incident.
- Ben advised checking whether volumes still existed — they did.
- We concluded that the data wasn't lost but rather the current containers weren’t mapped correctly.
- I attempted to locate and run the correct Docker Compose setup but couldn’t find the core DW folder.
Phase 3: Secondary Failure - Disk Full
-
While troubleshooting with Sumit (another team lead with data engineering experience), we found:
docker compose up -d
=> Error: no space left on device -
VM disk usage was at 100%, preventing any container restarts.
-
Ben later found the root cause: Suricata service had created 379GB of logs, overwhelming the system.
Phase 4: Mentor Resolution
-
Ben:
- Disabled Suricata using
systemctl
- Removed excess logs and files (including 2GB initially without success)
- Eventually recovered ~400GB of space
- Confirmed it was safe to bring the DW stack back online
- Disabled Suricata using
Phase 5: My Recovery Actions
-
I searched for the
.env
file required to run the core DWdocker-compose.yml
. -
Located it in the previous team leader’s directory.
-
Once
.env
was in place, I ran:docker compose --env-file .env up -d
-
All containers restarted successfully, data volumes were reattached, and DW was fully operational again.
Lessons Learned
Technical
-
Containers in Docker are not user-isolated by default.
- Running
docker rm
on shared containers can affect everyone.
- Running
-
Volumes are persistent only if correctly mapped.
- Wrong compose file = containers without data.
-
Disk usage alerts/log rotation for system services should be monitored — rogue services like Suricata can crash unrelated environments.
Team & Leadership
- Immediate escalation helped avoid damage.
- Involving experienced teammates (e.g., Sumit) accelerated recovery.
- Calm triage under pressure prevented further accidental damage.
- Mentor involvement at the right moment solved a system-level issue.
Recommendations
Technical Safeguards
- Use container labels/namespaces or Docker Compose project names to logically separate projects.
- Establish a read-only directory for critical compose files to prevent accidental edits or usage.
- Set up disk usage monitoring with alerts for system-wide usage > 80%.
Training & Awareness
-
Onboard all users with a clear understanding of:
- Shared nature of the VM
- Which Docker Compose files to use
- Where persistent volumes are declared and mounted
-
Create a guide on "How to safely run Docker containers in the DW VM"
Preventive Measures
To proactively warn all users about the shared nature of the environment, a system-wide login message was added using the VM’s dynamic MOTD (Message of the Day) system. This banner appears every time a user logs in via SSH and highlights the risk of running destructive Docker commands like docker rm -f
. This aims to prevent accidental container deletion and reinforces the importance of checking with the Data Warehouse leader before making changes.
-
Added a disclaimer to critical command guides like:
DO NOT run `docker rm -f` on shared containers unless confirmed with mentor or DW lead.
Conclusion
This incident highlighted the importance of:
- Good communication
- Team collaboration
- Calm leadership
- Technical clarity and documentation
Despite being outside my technical comfort zone, I led the response by:
- Immediately escalating to our mentor
- Bringing in experienced help
- Coordinating a recovery plan
- Taking ownership of getting DW back online
This documentation ensures that future leaders and contributors do not repeat the same mistakes and can follow a clear recovery path if a similar issue arises again.
Document prepared by Daezel Goyal, Data Warehouse Leader – Redback Operations, May 2025