Project Name: CAAT Distributed Storage System for Video Streaming
Focus Component: Fault-Tolerant Design: Streaming continues even if one or more storage nodes fail.
Team: Sudo !! (Sudo Bang-Bang)
Team Members:
- Amilcar Armmand - System Architect
- Tiffany Choe - System Architect
Document Version: Draft v1.0 Last Updated: 11-05-25
Distributed video streaming systems must tolerate storage node failures without interrupting service. Centralized storage creates a single point of failure: when a storage node fails, users experience buffering, playback errors, and a degraded viewing experience.
Implement a distributed, redundant storage architecture that ensures video content remains available and streaming continues uninterrupted even when multiple storage nodes experience failures.
Primary User Persona:
- Name & Role: "Arun, System Administrator"
- Needs: Monitor and maintain storage cluster health with minimal downtime
- Pain Points: Storage node failures causing service interruptions
- Goals: 99.99% availability despite hardware failures
Secondary User Persona:
- Name & Role: "Taylor, Viewer"
- Needs: Uninterrupted video streaming experience
- Pain Points: Buffering and playback failures during system issues
- Goals: Continuous streaming even during storage node maintenance
US001: As a system admin, I want to detect storage node failures within 30 seconds so that recovery can begin immediately.
Acceptance Criteria:
- Heartbeat monitoring with 10-second intervals
- Failure detection within 3 missed heartbeats
- Automatic node status update to "suspected"
- Alert generation for administrator notification
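The detection rule above (10-second heartbeats, suspect after 3 misses, so 30 seconds worst case) can be sketched as a small monitor. This is an illustrative sketch, not the project's actual implementation; the class and field names are assumptions.

```javascript
const HEARTBEAT_INTERVAL_MS = 10_000; // 10-second heartbeat interval (US001)
const MISSED_LIMIT = 3;               // mark "suspected" after 3 missed beats

class HeartbeatMonitor {
  // Injectable clock makes the detector testable without real timers.
  constructor(now = Date.now) {
    this.now = now;
    this.lastSeen = new Map(); // nodeId -> timestamp of last heartbeat
    this.status = new Map();   // nodeId -> "healthy" | "suspected"
  }

  recordHeartbeat(nodeId) {
    this.lastSeen.set(nodeId, this.now());
    this.status.set(nodeId, 'healthy');
  }

  // Called periodically to re-evaluate node status.
  sweep() {
    const deadline = MISSED_LIMIT * HEARTBEAT_INTERVAL_MS; // 30 s
    for (const [nodeId, ts] of this.lastSeen) {
      if (this.now() - ts >= deadline) {
        this.status.set(nodeId, 'suspected');
        // an alert to the administrator would be emitted here
      }
    }
  }
}
```

In production the sweep would run on a timer (e.g. once per heartbeat interval) and feed the alerting pipeline.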
US002: As a system admin, I want to reconstruct video data from surviving fragments when storage nodes fail so that streaming continues uninterrupted.
Acceptance Criteria:
- Erasure coding with 6+3 configuration (6 data, 3 parity)
- Reconstruction from any 6 of 9 fragments
- Reconstruction completes within 2 seconds
- No user-visible impact during reconstruction
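To illustrate the reconstruction idea, here is a deliberately simplified sketch using a single XOR parity fragment (6+1), which can rebuild any one missing fragment from the survivors. The real system would use Reed-Solomon (6+3), which tolerates any 3 lost fragments; the function names are illustrative.

```javascript
// Split a chunk into k data fragments and compute one XOR parity fragment.
function encode(data, k = 6) {
  const fragLen = Math.ceil(data.length / k);
  const fragments = [];
  for (let i = 0; i < k; i++) {
    const frag = Buffer.alloc(fragLen); // zero-padded for the last fragment
    data.copy(frag, 0, i * fragLen, (i + 1) * fragLen);
    fragments.push(frag);
  }
  const parity = Buffer.alloc(fragLen);
  for (const frag of fragments)
    for (let b = 0; b < fragLen; b++) parity[b] ^= frag[b];
  return { fragments, parity };
}

// Rebuild one missing data fragment by XOR-ing parity with the survivors.
function reconstruct(fragments, parity, missingIndex) {
  const rebuilt = Buffer.from(parity);
  fragments.forEach((frag, i) => {
    if (i === missingIndex || !frag) return;
    for (let b = 0; b < rebuilt.length; b++) rebuilt[b] ^= frag[b];
  });
  return rebuilt;
}
```

Reed-Solomon generalizes this: instead of one XOR parity, it computes 3 parity fragments over a Galois field, so any 6 of the 9 fragments suffice.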
US003: As a viewer, I want to stream videos without interruption when up to 3 storage nodes fail simultaneously so that my viewing experience isn't affected.
Acceptance Criteria:
- Video playback continues during multiple node failures
- No buffering or quality degradation
- Automatic failover to healthy nodes
- Progress saving continues to work
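The failover behavior above can be sketched as a retrieval strategy: request fragments from all nodes holding them and proceed as soon as any 6 of 9 respond, so up to 3 failed nodes never surface to the viewer. `fetchFragment` is a hypothetical per-node fetch function; this is a sketch, not the project's actual code.

```javascript
// Resolve with the first `needed` fragments; reject only when too many
// nodes have failed to ever reach the threshold.
function fetchEnoughFragments(nodes, fetchFragment, needed = 6) {
  return new Promise((resolve, reject) => {
    const got = [];
    let failures = 0;
    nodes.forEach((node) => {
      fetchFragment(node)
        .then((frag) => {
          got.push(frag);
          if (got.length === needed) resolve(got); // enough to reconstruct
        })
        .catch(() => {
          failures += 1;
          if (nodes.length - failures < needed) {
            reject(new Error('too few surviving fragments'));
          }
        });
    });
  });
}
```

Because all requests are issued in parallel, slow or dead nodes simply lose the race rather than stalling playback.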
US004: As an administrator, I want to view real-time storage cluster health so that I can monitor system status.
Acceptance Criteria:
- Dashboard showing node status (healthy/suspected/failed)
- Fragment distribution visualization
- Reconstruction events counter
- Performance metrics display
US005: As a system admin, I want to automatically regenerate lost fragments when nodes fail permanently so that redundancy is maintained.
Acceptance Criteria:
- Background fragment regeneration
- Priority regeneration for frequently accessed content
- Storage rebalancing across healthy nodes
- Completion within 1 hour for 1GB files
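The priority rule above (frequently accessed content first) amounts to ordering the regeneration queue by access count. A minimal sketch, assuming an `accessCount` metadata field and a caller-supplied `rebuild` function; both names are illustrative.

```javascript
// Rebuild lost fragments sequentially in the background,
// most-accessed content first. Returns the order of completion.
async function regenerateInBackground(lostFragments, rebuild) {
  const ordered = [...lostFragments].sort(
    (a, b) => b.accessCount - a.accessCount
  );
  for (const frag of ordered) {
    // rebuild would decode from surviving fragments, re-encode,
    // and place the new fragment on a healthy node
    await rebuild(frag);
  }
  return ordered.map((f) => f.id);
}
```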
- Distributed Chunk Storage - Video chunks distributed across multiple storage nodes with redundancy
- Erasure Coding - Efficient fault tolerance with 50% storage overhead (6+3 configuration)
- Automatic Failure Recovery - Detection and reconstruction without manual intervention
- Intelligent Fragment Distribution - Smart placement across failure domains
- Health Monitoring - Real-time node health and performance metrics
- Heartbeat monitoring with 10-second intervals
- Erasure coding with Reed-Solomon (6+3) configuration
- Parallel fragment retrieval from multiple nodes
- Automatic fragment regeneration
- Cross-zone fragment distribution
- Chunk Size: 10MB
- Replication Strategy: Erasure coding (6+3) + cross-zone distribution
- Failure Detection: Heartbeat + timeout (30-second detection)
- Recovery Time: < 2 seconds for reconstruction
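A quick sanity check of these numbers: a 6+3 configuration stores 9 fragments for every 6 data fragments, giving the 50% overhead cited above, well under the 3x cost of triple replication.

```javascript
// Back-of-envelope check of the 6+3 specification (values from the spec).
const k = 6; // data fragments
const m = 3; // parity fragments

const expansion = (k + m) / k;             // 1.5x raw bytes stored
const overheadPct = (expansion - 1) * 100; // 50% storage overhead

// A 10 MB chunk becomes nine fragments of roughly 1.67 MB each.
const fragmentBytes = (10 * 1024 * 1024) / k;
```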
- Multi-Tier Architecture
- Tier 1: Global Server Load Balancer (GSLB)
- Routes users to nearest geographic region
- DNS-based routing with health checks
- Tier 2: Application Load Balancer
- Routes to appropriate microservices (User Service, Upload Service, Streaming Service)
- Uses weighted round-robin + least connections
- Tier 3: Storage Node Selection
- Health-weighted random selection for fragment retrieval
- Requests to multiple storage nodes
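Health-weighted random selection can be sketched as follows: each candidate node gets a weight from its health score, and a node is picked with probability proportional to that weight. The `healthScore` field name is an assumption for illustration.

```javascript
// Pick one node with probability proportional to its health score.
// `rand` is injectable so the selection is testable deterministically.
function pickNode(nodes, rand = Math.random) {
  const total = nodes.reduce((sum, n) => sum + n.healthScore, 0);
  let r = rand() * total;
  for (const node of nodes) {
    r -= node.healthScore;
    if (r <= 0) return node;
  }
  return nodes[nodes.length - 1]; // guard against floating-point drift
}
```

Degraded-but-alive nodes thus still receive some traffic, while healthy nodes absorb most of the load.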
- 99.99% availability for video streaming
- Survive up to 3 simultaneous storage node failures
- Automatic recovery without administrator intervention
- Data durability: 99.999999999% (11 nines)
- Fragment retrieval time: < 100ms
- Reconstruction time: < 2 seconds
- Failure detection: < 30 seconds
- Storage overhead: < 55% (including metadata)
- Real-time health dashboard
- Reconstruction event tracking
- Performance degradation alerts
- Storage efficiency metrics
Storage Layer:
- Chunk Storage: Node.js with Express + local file system
- Erasure Coding: Reed-Solomon implementation
- Metadata Storage: Redis for fast fragment mapping
- Health Monitoring: Custom heartbeat service
Infrastructure:
- Storage Nodes: Multiple instances (9+ for demonstration)
- Coordination Service: Central coordinator for fragment mapping
- Monitoring: Custom dashboard with WebSocket updates
Deployment:
- Google Cloud Platform (GCP)
- PM2 for process management
- Nginx for reverse proxy
Third-Party Services:
- Google OAuth 2.0
- ??
Core Components:
- Storage Node Service - Manages chunks, heartbeats, fragment serving
- Coordinator Service - Maintains fragment mapping, handles reconstruction
- Health Monitor - Tracks node status, triggers recovery
- Reconstruction Engine - Manages erasure coding encode/decode
- Monitoring Dashboard - Real-time system visualization
Entity: User
{
_id: ObjectId,
googleId: String,
email: String,
name: String,
profilePicture: String,
createdAt: Date,
lastLogin: Date
}
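A minimal shape check for the User entity above, as plain JavaScript; the real system would presumably express this as a Mongoose schema validated by MongoDB. This is a sketch, not the project's actual validation code.

```javascript
// Required string fields of the User entity, per the schema above.
const userStringFields = ['googleId', 'email', 'name', 'profilePicture'];

function isValidUser(doc) {
  return (
    userStringFields.every((field) => typeof doc[field] === 'string') &&
    doc.createdAt instanceof Date &&
    doc.lastLogin instanceof Date
  );
}
```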
Entity: [Your Main Entity]
{
_id: ObjectId,
[field]: [type],
[field]: [type],
createdBy: ObjectId (ref: User),
createdAt: Date,
updatedAt: Date
}
Architecture Pattern: MVC (Model-View-Controller)
Request Flow:
- User makes request (browser)
- Nginx reverse proxy forwards to Express
- Express routes request to appropriate controller
- Controller interacts with database models
- Data rendered through EJS templates
- Response sent back to user
TODO: Include an architecture diagram showing how components connect.
Timeline: [Week 7-8] Goal: Basic distributed storage with heartbeats
User Stories:
- US-001: Storage node failure detection
- Basic chunk storage across multiple nodes
- Heartbeat monitoring system
Deliverables:
- Working authentication system
- User dashboard (basic version)
- Deployed to GCP
Timeline: [Week 9-10] Goal: [Main feature implementation]
User Stories:
- US003: [Core feature]
- US004: [Core feature]
- US005: [Core feature]
Deliverables:
- [List specific working features]
- Database models for core entities
- Basic UI for main workflows
Timeline: [Week 11-12] Goal: [Complete remaining features and polish]
User Stories:
- US006-US008: [Remaining features]
- UI/UX improvements
- Error handling
Deliverables:
- All core features complete
- Responsive design implementation
- Improved user experience
Timeline: Goal: Production-ready application
Tasks:
- User acceptance testing
- Bug fixes and refinements
- Performance optimization
- Final deployment and documentation
Deliverables:
- Fully tested application
- Production deployment
- User documentation
- Presentation materials
[Team Member 1]:
- Authentication system
- User profile management
- [Other responsibilities]
[Team Member 2]:
- [Core feature] implementation
- Database design and models
- [Other responsibilities]
[Team Member 3]:
- UI/UX design and frontend
- [Core feature] implementation
- [Other responsibilities]
Testing Approaches:
- Manual Testing: Test all user workflows before each sprint review
- User Acceptance Testing: Get feedback from potential users
- Browser Testing: Verify functionality on Chrome, Firefox, Safari
- Mobile Testing: Ensure responsive design works on various devices
- Security Testing: Verify authentication and data protection
Success Criteria:
- All user stories meet acceptance criteria
- No critical bugs in core workflows
- Application loads and performs within requirements
- Positive feedback from user testing
Development Environment:
- Local development with hot reload
- SQLite/MongoDB local database
Production Environment:
- Google Cloud Platform VM
- PM2 process manager
- Nginx reverse proxy
- Production database (MongoDB Atlas / PostgreSQL)
Deployment Process:
- Test locally
- Commit to GitHub
- Pull on GCP VM
- Install dependencies
- Restart PM2 process
- Verify deployment
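The deployment steps above can be collected into one script. The app name "caat", the directory, and the branch are illustrative assumptions; the default mode is a harmless dry run so the script is safe to invoke accidentally.

```shell
#!/bin/sh
# Sketch of the GCP deployment steps; run with --deploy on the VM.
set -e  # stop at the first failing step

APP_DIR="${APP_DIR:-$HOME/caat}"

deploy() {
  cd "$APP_DIR"
  git pull origin main   # pull the latest commit onto the GCP VM
  npm ci --omit=dev      # install exact locked production dependencies
  pm2 restart caat       # restart the PM2-managed process
  pm2 status caat        # verify the process came back up
}

if [ "${1:-}" = "--deploy" ]; then
  deploy
else
  echo "dry run: would deploy from $APP_DIR"
fi
```

`npm ci` is preferred over `npm install` here because it installs exactly what the lockfile specifies, keeping the VM consistent with what was tested locally.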
- Term 1: [Definition]
- Term 2: [Definition]
| Date | Version | Changes | Author |
|---|---|---|---|
| [Date] | v1.0 | Initial draft | [Team] |
| [Date] | v1.1 | [Description] | [Name] |
Document Status: Draft / Review / Final Next Review Date: [Date]