Software Requirements Specification (SRS) for Fault-Tolerant Storage Design for Video Streaming

Software Requirements Specification (SRS)

Project Name: CAAT Distributed Storage System for Video Streaming Storage

Focus Component: Fault-Tolerant Design: Streaming continues even if one or more storage nodes fail.

Team: Sudo !! (Sudo Bang-Bang)

Team Members:

  • Amilcar Armmand - System Architect
  • Tiffany Choe - System Architect

Document Version: Draft v1.0 | Last Updated: 11-05-25


1. Project Overview

1.1 Problem Statement

Storage node failures are unavoidable in distributed video streaming systems, so the system must be designed to keep serving video when they occur. Centralized storage creates single points of failure: when a storage node fails, streaming is interrupted, causing buffering, playback errors, and a poor user experience.

1.2 Solution Overview

Implement a distributed, redundant storage architecture that ensures video content remains available and streaming continues uninterrupted even when multiple storage nodes experience failures.

1.3 Target Users

Primary User Persona:

  • Name & Role: "Arun, System Administrator"
  • Needs: Monitor and maintain storage cluster health with minimal downtime.
  • Pain Points: Storage node failures causing service interruptions
  • Goals: 99.99% availability despite hardware failures

Secondary User Persona:

  • Name & Role: "Taylor, Viewer"
  • Needs: Uninterrupted video streaming experience.
  • Pain Points: Buffering and playback failures during system issues.
  • Goals: Continuous streaming even during storage node maintenance

2. User Stories

Core User Stories (Must Have)

Fault-Tolerant Storage

US001: As a system admin, I want to detect storage node failures within 30 seconds so that recovery can begin immediately.

Acceptance Criteria:

  • Heartbeat monitoring with 10-second intervals
  • Failure detection within 3 missed heartbeats
  • Automatic node status update to "suspected"
  • Alert generation for administrator notification
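
A minimal sketch of this detection logic, assuming an in-memory map of last-heartbeat timestamps inside the Health Monitor; markSuspected and notifyAdmin are illustrative stubs, not an agreed API.

// heartbeat-monitor.js -- illustrative sketch, not the final implementation.
// Nodes report every 10 seconds; 3 missed heartbeats (~30 s of silence)
// mark the node "suspected" and raise an administrator alert.

const HEARTBEAT_INTERVAL_MS = 10000;   // 10-second heartbeat interval
const MISSED_BEATS_THRESHOLD = 3;      // 3 missed beats => suspected failure

const lastSeen = new Map();            // nodeId -> timestamp of last heartbeat

function recordHeartbeat(nodeId) {
  lastSeen.set(nodeId, Date.now());
}

function markSuspected(nodeId) {       // stub: would update cluster metadata
  console.log(`[status] ${nodeId} -> suspected`);
}

function notifyAdmin(message) {        // stub: would send an administrator alert
  console.log(`[alert] ${message}`);
}

function checkNodes() {
  const cutoff = Date.now() - HEARTBEAT_INTERVAL_MS * MISSED_BEATS_THRESHOLD;
  for (const [nodeId, ts] of lastSeen) {
    if (ts < cutoff) {
      markSuspected(nodeId);
      notifyAdmin(`${nodeId} missed ${MISSED_BEATS_THRESHOLD} heartbeats`);
    }
  }
}

// Sweeping once per heartbeat interval keeps detection inside the 30-second target.
setInterval(checkNodes, HEARTBEAT_INTERVAL_MS);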

US002: As a system admin, I want to reconstruct video data from surviving fragments when storage nodes fail so that streaming continues uninterrupted.

Acceptance Criteria:

  • Erasure coding with 6+3 configuration (6 data, 3 parity)
  • Reconstruction from any 6 of 9 fragments.
  • Reconstruction completes within 2 seconds.
  • No user-visible impact during reconstruction.
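
A hedged sketch of the retrieval flow behind these criteria: all 9 fragment requests go out in parallel and decoding starts as soon as any 6 arrive. fetchFragment and rsDecode are hypothetical stand-ins (shown as trivial stubs) for the storage-node HTTP client and the Reed-Solomon decoder.

// reconstruct-chunk.js -- illustrative sketch of the any-6-of-9 retrieval flow.
// Hypothetical stubs; real versions would call the node's HTTP API and a
// Reed-Solomon decoder respectively.
const fetchFragment = async (nodeId, chunkId) => Buffer.from(`${nodeId}:${chunkId}`);
const rsDecode = (fragments) => Buffer.concat(fragments);

const DATA_FRAGMENTS = 6;    // any 6 of the 9 fragments rebuild the chunk
const TOTAL_FRAGMENTS = 9;

function reconstructChunk(chunkId, nodeIds) {
  return new Promise((resolve, reject) => {
    const fragments = [];
    let failures = 0;

    // Request all 9 fragments in parallel; resolve as soon as 6 succeed.
    for (const nodeId of nodeIds.slice(0, TOTAL_FRAGMENTS)) {
      fetchFragment(nodeId, chunkId)
        .then((frag) => {
          fragments.push(frag);
          if (fragments.length === DATA_FRAGMENTS) resolve(rsDecode(fragments));
        })
        .catch(() => {
          failures += 1;
          // More than 3 failures means fewer than 6 fragments can ever arrive.
          if (failures > TOTAL_FRAGMENTS - DATA_FRAGMENTS) {
            reject(new Error(`chunk ${chunkId} unrecoverable`));
          }
        });
    }
  });
}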

US003: As a viewer, I want to stream videos without interruption when up to 3 storage nodes fail simultaneously so that my viewing experience isn't affected.

Acceptance Criteria:

  • Video playback continues during multiple node failures.
  • No buffering or quality degradation.
  • Automatic failover to healthy nodes.
  • Progress saving continues to work

Storage Management

US004: As an administrator, I want to view real-time storage cluster health so that I can monitor system status.

Acceptance Criteria:

  • Dashboard showing node status (healthy/suspected/failed)
  • Fragment distribution visualization
  • Reconstruction events counter
  • Performance metrics display

US005: As a system admin, I want to automatically regenerate lost fragments when nodes fail permanently so that redundancy is maintained.

Acceptance Criteria:

  • Background fragment regeneration
  • Priority regeneration for frequently accessed content
  • Storage rebalancing across healthy nodes
  • Completion within 1 hour for 1GB files
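
One possible shape for the background regeneration loop, sketched under the assumption that lost fragments are queued and rebuilt in order of how often their chunk is streamed; reencodeFragment is a hypothetical stub for the Reed-Solomon re-encode plus write to a healthy node.

// regeneration-queue.js -- illustrative sketch of priority fragment regeneration.
const reencodeFragment = async (chunkId, index) =>        // hypothetical stub
  console.log(`rebuilt fragment ${index} of ${chunkId}`);

const pending = [];   // { chunkId, fragmentIndex, accessCount }

function enqueueLostFragment(chunkId, fragmentIndex, accessCount) {
  pending.push({ chunkId, fragmentIndex, accessCount });
  // Frequently streamed chunks regain full redundancy first.
  pending.sort((a, b) => b.accessCount - a.accessCount);
}

async function regenerationWorker() {
  while (pending.length > 0) {
    const job = pending.shift();
    await reencodeFragment(job.chunkId, job.fragmentIndex);
  }
}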

3. Features & Requirements

3.1 Core Features

  1. Distributed Chunk Storage - Video chunks distributed across multiple storage nodes with redundancy
  2. Erasure Coding - Efficient fault tolerance with 50% storage overhead (6+3 configuration)
  3. Automatic Failure Recovery - Detection and reconstruction without manual intervention
  4. Intelligent Fragment Distribution - Smart placement across failure domains
  5. Health Monitoring - Real-time node health and performance metrics

3.2 Technical Requirements

Fault Tolerance Mechanisms

  • Heartbeat monitoring with 10-second intervals
  • Erasure coding with Reed-Solomon (6+3) configuration
  • Parallel fragment retrieval from multiple nodes
  • Automatic fragment regeneration
  • Cross-zone fragment distribution

Storage Architecture

  • Chunk Size: 10MB
  • Replication Strategy: Erasure coding (6+3) + cross-zone distribution
  • Failure Detection: Heartbeat + timeout (30-second detection)
  • Recovery Time: < 2 seconds for reconstruction
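
For a 10MB chunk, the 6+3 scheme yields six data fragments of roughly 1.67MB plus three parity fragments of the same size, so about 15MB is written per 10MB of video, matching the 50% overhead quoted in 3.1, and any 6 of the 9 fragments are sufficient to serve the chunk within the 2-second reconstruction budget.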

Load Balancing Strategy

Multi-Tier Architecture:

Tier 1: Global Server Load Balancer (GSLB)

  • Routes users to the nearest geographic region
  • DNS-based routing with health checks

Tier 2: Application Load Balancer

  • Routes to appropriate microservices (User Service, Upload Service, Streaming Service)
  • Uses weighted round-robin + least connections

Tier 3: Storage Node Selection

  • Health-weighted random selection for fragment retrieval
  • Requests to multiple storage nodes
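
A minimal sketch of health-weighted random selection, assuming the health monitor assigns each candidate node a score in (0, 1]; the scoring inputs (latency, disk pressure, recent failures) are out of scope here.

// select-node.js -- illustrative sketch of health-weighted random selection.
function pickNode(nodes) {
  // nodes: [{ id, health }] where health is a score in (0, 1] from the monitor
  const totalWeight = nodes.reduce((sum, n) => sum + n.health, 0);
  let r = Math.random() * totalWeight;
  for (const node of nodes) {
    r -= node.health;
    if (r <= 0) return node;
  }
  return nodes[nodes.length - 1];   // guard against floating-point edge cases
}

// A degraded node (health 0.2) is chosen far less often than healthy peers.
console.log(pickNode([
  { id: 'node-1', health: 1.0 },
  { id: 'node-2', health: 0.9 },
  { id: 'node-3', health: 0.2 },
]).id);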

3.3 Non-Functional Requirements

Reliability & Availability

  • 99.99% availability for video streaming
  • Survive up to 3 simultaneous storage node failures
  • Automatic recovery without administrator intervention
  • Data durability: 99.999999999% (eleven 9s)

Performance

  • Fragment retrieval time: < 100ms
  • Reconstruction time: < 2 seconds
  • Failure detection: < 30 seconds
  • Storage overhead: < 55% (including metadata)

Monitoring & Observability

  • Real-time health dashboard
  • Reconstruction event tracking
  • Performance degradation alerts
  • Storage efficiency metrics

4. System Design

4.1 Technology Stack

Storage Layer:

  • Chunk Storage: Node.js with Express + local file system
  • Erasure Coding: Reed-Solomon implementation
  • Metadata Storage: Redis for fast fragment mapping
  • Health Monitoring: Custom heartbeat service
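
A sketch of one possible fragment-mapping layout in Redis, assuming the node-redis v4 client and one hash per chunk keyed by fragment index; the key naming scheme is an assumption for illustration.

// fragment-map.js -- illustrative Redis layout (node-redis v4); key names are assumptions.
const { createClient } = require('redis');

async function main() {
  const redis = createClient();            // defaults to localhost:6379
  await redis.connect();

  // Record which node holds each of the 9 fragments of one chunk.
  const key = 'chunk:video42:007';
  for (let i = 0; i < 9; i++) {
    await redis.hSet(key, String(i), `node-${(i % 9) + 1}`);
  }

  // Streaming-time lookup: fragment index -> node id.
  console.log(await redis.hGetAll(key));   // { '0': 'node-1', '1': 'node-2', ... }

  await redis.quit();
}

main().catch(console.error);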

Infrastructure:

  • Storage Nodes: Multiple instances (9+ for demonstration)
  • Coordination Service: Central coordinator for fragment mapping
  • Monitoring: Custom dashboard with WebSocket updates

Deployment:

  • Google Cloud Platform (GCP)
  • PM2 for process management
  • Nginx for reverse proxy
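
A sketch of how the demonstration cluster could be declared for PM2; script paths, ports, and process names are assumptions, not the final layout. It would be started with pm2 start ecosystem.config.js.

// ecosystem.config.js -- illustrative PM2 configuration (paths and ports are assumptions).
module.exports = {
  apps: [
    {
      name: 'coordinator',
      script: './coordinator/index.js',
      env: { PORT: 4000 },
    },
    // Nine storage-node processes on consecutive ports for the demo cluster.
    ...Array.from({ length: 9 }, (_, i) => ({
      name: `storage-node-${i + 1}`,
      script: './storage-node/index.js',
      env: { PORT: 5001 + i, NODE_ID: `node-${i + 1}` },
    })),
  ],
};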

Third-Party Services:

  • Google OAuth 2.0
  • ??

4.2 System Architecture

Core Components:

  1. Storage Node Service - Manages chunks, heartbeats, fragment serving
  2. Coordinator Service - Maintains fragment mapping, handles reconstruction
  3. Health Monitor - Tracks node status, triggers recovery
  4. Reconstruction Engine - Manages erasure coding encode/decode
  5. Monitoring Dashboard - Real-time system visualization
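
A minimal sketch of how the dashboard's WebSocket updates (see 4.1) could be pushed by the Health Monitor, using the ws package; the payload shape and port are assumptions.

// dashboard-feed.js -- illustrative health broadcast over WebSockets (ws package).
// Payload shape and port are assumptions.
const { WebSocketServer, WebSocket } = require('ws');

const wss = new WebSocketServer({ port: 8081 });

// Push a cluster snapshot to every connected dashboard client.
function broadcastHealth(snapshot) {
  const message = JSON.stringify(snapshot);
  for (const client of wss.clients) {
    if (client.readyState === WebSocket.OPEN) client.send(message);
  }
}

// Example: broadcast on the same 10-second cadence as the heartbeat sweep.
setInterval(() => {
  broadcastHealth({
    nodes: [{ id: 'node-1', status: 'healthy' }, { id: 'node-7', status: 'suspected' }],
    reconstructionEvents: 3,
    timestamp: Date.now(),
  });
}, 10000);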

4.3 Database Schema

Entity: User

{
  _id: ObjectId,
  googleId: String,
  email: String,
  name: String,
  profilePicture: String,
  createdAt: Date,
  lastLogin: Date
}

Entity: [Your Main Entity]

{
  _id: ObjectId,
  [field]: [type],
  [field]: [type],
  createdBy: ObjectId (ref: User),
  createdAt: Date,
  updatedAt: Date
}

Tip: Add all your main data entities with their fields and relationships.
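
The main entity above is still a placeholder; for this component it would most likely describe a video chunk and its fragment placement. A possible shape, offered purely as an assumption in the same style as the User entity:

Entity: VideoChunk (hypothetical)

{
  _id: ObjectId,
  videoId: ObjectId (ref: Video),
  chunkIndex: Number,
  fragments: [
    { index: Number, nodeId: String, status: String }  // healthy / suspected / lost
  ],
  sizeBytes: Number,
  createdBy: ObjectId (ref: User),
  createdAt: Date,
  updatedAt: Date
}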

4.4 Application Architecture and Request Flow

Architecture Pattern: MVC (Model-View-Controller)

Request Flow:

  1. User makes request (browser)
  2. Nginx reverse proxy forwards to Express
  3. Express routes request to appropriate controller
  4. Controller interacts with database models
  5. Data rendered through EJS templates
  6. Response sent back to user
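
A minimal Express sketch of steps 2 through 6, assuming an EJS view named dashboard and a stand-in for the model layer; route, view, and model names are placeholders.

// app.js -- illustrative sketch of the request flow (route -> controller -> model -> EJS view).
const express = require('express');
const app = express();

app.set('view engine', 'ejs');

// Stand-in for the model layer (would be a Mongoose/SQL query in practice).
async function getClusterStatus() {
  return { healthyNodes: 8, suspectedNodes: 1, failedNodes: 0 };
}

// Controller: Nginx forwards the request here; data is rendered via views/dashboard.ejs.
app.get('/dashboard', async (req, res) => {
  const status = await getClusterStatus();
  res.render('dashboard', { status });
});

app.listen(3000);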

TODO: Include an architecture diagram showing how components connect.


5. Implementation Plan

5.1 Sprint Breakdown (4 Sprints)

Sprint 1: Foundation & Basic Storage

Timeline: [Week 7-8] Goal: Basic distributed storage with heartbeats

User Stories:

  • US-001: Storage node failure detection
  • Basic chunk storage across multiple nodes
  • Heartbeat monitoring system

Deliverables:

  • Working authentication system
  • User dashboard (basic version)
  • Deployed to GCP

Sprint 2: Core Feature Development

Timeline: [Week 9-10] Goal: [Main feature implementation]

User Stories:

  • US003: [Core feature]
  • US004: [Core feature]
  • US005: [Core feature]

Deliverables:

  • [List specific working features]
  • Database models for core entities
  • Basic UI for main workflows

Sprint 3: Feature Completion & Enhancement

Timeline: [Week 11-12] Goal: [Complete remaining features and polish]

User Stories:

  • US006-US008: [Remaining features]
  • UI/UX improvements
  • Error handling

Deliverables:

  • All core features complete
  • Responsive design implementation
  • Improved user experience

Sprint 4: Testing, Polish & Deployment

Timeline: [TBD] Goal: Production-ready application

Tasks:

  • User acceptance testing
  • Bug fixes and refinements
  • Performance optimization
  • Final deployment and documentation

Deliverables:

  • Fully tested application
  • Production deployment
  • User documentation
  • Presentation materials

5.2 Team Member Responsibilities

[Team Member 1]:

  • Authentication system
  • User profile management
  • [Other responsibilities]

[Team Member 2]:

  • [Core feature] implementation
  • Database design and models
  • [Other responsibilities]

[Team Member 3]:

  • UI/UX design and frontend
  • [Core feature] implementation
  • [Other responsibilities]

Tip: Assign responsibilities based on team member strengths and learning goals.

5.3 Testing Strategy

Testing Approaches:

  • Manual Testing: Test all user workflows before each sprint review
  • User Acceptance Testing: Get feedback from potential users
  • Browser Testing: Verify functionality on Chrome, Firefox, Safari
  • Mobile Testing: Ensure responsive design works on various devices
  • Security Testing: Verify authentication and data protection

Success Criteria:

  • All user stories meet acceptance criteria
  • No critical bugs in core workflows
  • Application loads and performs within requirements
  • Positive feedback from user testing

5.4 Deployment Strategy

Development Environment:

  • Local development with hot reload
  • SQLite/MongoDB local database

Production Environment:

  • Google Cloud Platform VM
  • PM2 process manager
  • Nginx reverse proxy
  • Production database (MongoDB Atlas / PostgreSQL)

Deployment Process:

  1. Test locally
  2. Commit to GitHub
  3. Pull on GCP VM
  4. Install dependencies
  5. Restart PM2 process
  6. Verify deployment

6. Risk Assessment


Appendix

A. Glossary

  • Term 1: [Definition]
  • Term 2: [Definition]

B. References

C. Change Log

Date      Version   Changes         Author
[Date]    v1.0      Initial draft   [Team]
[Date]    v1.1      [Description]   [Name]

Document Status: Draft | Next Review Date: [Date]
