# multi-agent-fault-diagnosis
**Repository Path**: chenchen_2/multi-agent-fault-diagnosis
## Basic Information
- **Project Name**: multi-agent-fault-diagnosis
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-12-08
- **Last Updated**: 2025-12-08
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Intelligent Fault Diagnosis Multi-Agent System
> **Enterprise-grade intelligent fault diagnosis system leveraging multi-agent AI, stateful workflows, and RAG-powered knowledge retrieval for automated network troubleshooting.**
The system combines CrewAI multi-agent orchestration, LangGraph workflow management, and retrieval-augmented generation on AWS Bedrock to automate fault diagnosis in telecommunications networks. It is built as a demonstration project: generated hypotheses are checked by dedicated validators, and a deterministic demo mode makes runs reproducible.
## **Watch It In Action**

*Multi-agent coordination • RAG pipeline • Real-time validation • Artifact generation*
## **Project Highlights**
### **Advanced Multi-Agent Architecture**
- **CrewAI Orchestration**: Four specialized AI agents with distinct roles and personas
- **Intelligent Collaboration**: Coordinated handoffs and shared state management
- **Domain Expertise**: Telecom-specific knowledge and reasoning capabilities
### **Sophisticated AI Pipeline**
- **RAG-Powered Knowledge**: AWS Bedrock with Claude 3.5 Sonnet and Titan embeddings
- **Stateful Workflows**: LangGraph for complex decision routing and escalation handling
- **Hypothesis Validation**: Multiple specialized validators with confidence scoring
- **Intelligent Routing**: Dynamic workflow paths based on confidence thresholds
### **Production-Ready Engineering**
- **Comprehensive Error Handling**: Graceful fallbacks and robust error recovery
- **Session Management**: Artifact generation with complete audit trails
- **Observability**: Real-time monitoring and detailed logging
- **Deterministic Demos**: Reproducible presentations for stakeholders
---
## **System Architecture**
```mermaid
graph TB
    A[Alert Intake] --> B[Intelligent Router]
    B --> C[Evidence Gathering]
    C --> D[Multi-Agent Crew]

    subgraph "CrewAI Agents"
        D1[NOC Sentinel<br/>Planner]
        D2[Core Network<br/>Analyst]
        D3[Hypothesis<br/>Chair]
        D4[Postmortem<br/>Writer]
    end
    D --> D1
    D1 --> D2
    D2 --> D3
    D3 --> D4

    subgraph "RAG Pipeline"
        E[AWS Bedrock<br/>Claude 3.5]
        F[Titan Embeddings]
        G[ChromaDB<br/>Vector Store]
    end
    D2 --> E
    E --> F
    F --> G

    subgraph "Validation Layer"
        H[Traffic Probe<br/>Validator]
        I[Config Diff<br/>Validator]
        J[Topology<br/>Validator]
    end
    D3 --> H
    D3 --> I
    D3 --> J

    K[LangGraph<br/>State Machine] --> L[Resolution<br/>Engine]
    L --> M[Artifact<br/>Generation]

    style A fill:#e1f5fe
    style D fill:#f3e5f5
    style E fill:#fff3e0
    style K fill:#e8f5e8
```
## **Quick Start**
### **Prerequisites**
- **Python 3.9+** with pip
- **AWS Account** with Bedrock access (optional for demo mode)
- **Git** for version control
### **Instant Demo**
```bash
# Clone and run in 30 seconds
git clone <repository-url>   # clone URL is not given in this README
cd fault-diagnosis-multi-agent
./run_demo.bat               # Windows
# or, on Linux/macOS:
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python -m src.fault_diagnosis.cli fault-diagnosis --session demo
```
### **Full Setup**
1. **Environment Setup**
```bash
# Create virtual environment
python -m venv venv
# Activate environment
venv\Scripts\activate # Windows
source venv/bin/activate # Linux/macOS
# Install dependencies
pip install -r requirements.txt
```
2. **AWS Configuration** (Optional - system works without AWS)
```bash
# Create .env file
cp .env.example .env
# Edit .env with your AWS credentials
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_DEFAULT_REGION=us-east-1
```
3. **Run Demonstration**
```bash
# Full featured demo with RAG
python -m src.fault_diagnosis.cli fault-diagnosis --session alpha
# Quick demo without RAG
python -m src.fault_diagnosis.cli fault-diagnosis --no-rag --session beta
```
---
## **Core Capabilities**
### **Multi-Agent Crew**
| Agent | Role | Specialization |
|-------|------|----------------|
| **NOC Sentinel** | **Planner** | Alert triage, workflow coordination, objective alignment |
| **Core Network Analyst** | **Retriever** | Evidence gathering, RAG queries, knowledge search |
| **Hypothesis Chair** | **Reasoner** | Root cause analysis, hypothesis generation, validation |
| **Postmortem Writer** | **Reporter** | Documentation, remediation plans, stakeholder reports |
### **Intelligent Workflow States**
```text
# Workflow progression with automatic routing
Alert Intake     → Evidence Gathering → Hypothesis Generation
      ↓                   ↓                     ↓
Routing Decision → Validation Loop    → Resolution Planning
      ↓                   ↓                     ↓
Remediation      → Post-Mortem        → Knowledge Update
```
### **Validation Framework**
- **Traffic Probe Validator**: Network performance claim verification
- **Config Diff Validator**: Configuration change impact analysis
- **Topology Validator**: Network topology reference checking
- **Confidence Scoring**: Probabilistic validation with threshold-based routing
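The confidence scoring and threshold-based routing listed above could be sketched as follows. This is a minimal illustration, not the repository's implementation: `ValidatorOutcome`, `aggregate_confidence`, and `route` are hypothetical names, averaging is an assumed aggregation rule, and the 0.7 threshold mirrors the `FAULT_DIAGNOSIS_CONFIDENCE_THRESHOLD` value shown under Configuration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ValidatorOutcome:
    """Result reported by one validator (hypothetical structure)."""
    validator: str
    passed: bool
    confidence: float  # expected range: 0.0 to 1.0


def aggregate_confidence(outcomes: list[ValidatorOutcome]) -> float:
    """Average the confidences; a failed validator contributes 0.0 (assumed rule)."""
    if not outcomes:
        return 0.0
    return mean(o.confidence if o.passed else 0.0 for o in outcomes)


def route(outcomes: list[ValidatorOutcome], threshold: float = 0.7) -> str:
    """Pick the next workflow step from the aggregated confidence."""
    if aggregate_confidence(outcomes) > threshold:
        return "remediation_planning"
    return "escalation_queue"


outcomes = [
    ValidatorOutcome("traffic_probe", True, 0.9),
    ValidatorOutcome("config_diff", True, 0.8),
    ValidatorOutcome("topology", False, 0.6),
]
print(route(outcomes))      # the failed topology check pulls the average below 0.7
print(route(outcomes[:2]))  # only the two passing validators
```

Treating a failed validator as zero confidence is deliberately conservative: a single failed check is enough to push a borderline hypothesis into escalation instead of automated remediation.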
---
## **Demo Experience**
### **Live Execution**
The interactive CLI demo shows multi-agent coordination, RAG queries, and validation as they run.
### **Command Examples**
```bash
# Standard demo run
python -m src.fault_diagnosis.cli fault-diagnosis --session production_demo
# Quiet mode for presentations
python -m src.fault_diagnosis.cli fault-diagnosis --quiet --session stakeholder_demo
# Test mode without external dependencies
python -m src.fault_diagnosis.cli fault-diagnosis --no-rag --session test_run
```
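For orientation, the flags used above could be parsed with something like the sketch below. It only mirrors the command line shown in this README (`fault-diagnosis`, `--session`, `--quiet`, `--no-rag`); the parser layout, defaults, and help strings are assumptions, and the actual `cli.py` may be organised differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags shown in this README (assumed layout)."""
    parser = argparse.ArgumentParser(prog="fault_diagnosis")
    subcommands = parser.add_subparsers(dest="command", required=True)

    run = subcommands.add_parser("fault-diagnosis", help="run a diagnosis session")
    run.add_argument("--session", default="demo", help="label for the session artifacts")
    run.add_argument("--quiet", action="store_true", help="reduce console output")
    run.add_argument("--no-rag", action="store_true", help="skip the RAG pipeline")
    return parser


# Equivalent to: ... fault-diagnosis --no-rag --session test_run
args = build_parser().parse_args(["fault-diagnosis", "--no-rag", "--session", "test_run"])
print(args.command, args.session, args.quiet, args.no_rag)
```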
### **Sample Output**
```
Starting Simple Fault Diagnosis MVP
========================================
Component Status:
[OK] CrewAI Agents: Available
[OK] LangGraph Workflow: Available
[OK] RAG Pipeline: Available
[OK] Validation Framework: Available
Running workflow...
[CrewAI] NOC Sentinel analyzing alert FD-ALRT-017...
[RAG] Retrieved 3 relevant documents from knowledge base
[LangGraph] Routing to hypothesis generation (confidence: 0.87)
[Validator] Traffic probe validation: PASSED
[Workflow] Generating remediation plan...
MVP Demo complete!
Results: Session artifacts saved to outputs/session_production_demo_20241219/
```
---
## **Technology Deep Dive**
### **CrewAI Integration**
```python
# Sophisticated agent orchestration
class FaultDiagnosisCrew:
    def __init__(self):
        self.agents = FaultDiagnosisAgents()
        self.workflow = FaultDiagnosisWorkflow()
        # Excerpt: self.crew (a CrewAI Crew) is assumed to be assembled
        # from self.agents elsewhere in the class.

    def execute_sequential_process(self):
        # Coordinated multi-agent execution
        return self.crew.kickoff()
```
### **LangGraph State Management**
```python
# Stateful workflow with intelligent routing
class FaultDiagnosisWorkflow:
    def route_decision(self, state):
        # Route on the hypothesis confidence; a missing score counts as 0.0
        confidence = state.get("confidence_score", 0.0)
        if confidence > 0.7:
            return "remediation_planning"
        return "escalation_queue"
```
### **RAG Pipeline Architecture**
```python
# AWS Bedrock integration with fallback
class BedrockRAGPipeline:
    def __init__(self):
        self.embeddings = BedrockEmbeddings(
            model_id="amazon.titan-embed-text-v1"
        )
        self.llm = BedrockLLM(
            model_id="anthropic.claude-3-5-sonnet-20241022-v2:0"
        )
        self.vector_store = ChromaDB()
```
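The "with fallback" comment above suggests that retrieval degrades gracefully when Bedrock is unreachable. One way to sketch that pattern with only the standard library is shown below. `bedrock_search` is a hypothetical stand-in that always fails here, and the keyword-overlap fallback is an assumption, not the repository's behaviour.

```python
from typing import Callable

Retriever = Callable[[str, list[str]], list[str]]


def bedrock_search(query: str, documents: list[str]) -> list[str]:
    """Hypothetical stand-in for an embedding search; fails as if AWS were unavailable."""
    raise ConnectionError("Bedrock is not configured")


def keyword_search(query: str, documents: list[str]) -> list[str]:
    """Offline fallback: rank documents by the number of words shared with the query."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked if score > 0]


def retrieve(query: str, documents: list[str],
             primary: Retriever = bedrock_search,
             fallback: Retriever = keyword_search) -> list[str]:
    """Try the primary retriever and switch to the fallback on connection errors."""
    try:
        return primary(query, documents)
    except ConnectionError:
        return fallback(query, documents)


docs = [
    "packet loss on core router after config change",
    "billing system maintenance window",
    "high latency and packet loss on uplink",
]
print(retrieve("packet loss uplink", docs))
```

Passing the retrievers as parameters keeps the fallback decision in one place and makes the offline path (as with `--no-rag`) easy to exercise in tests.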
---
## **Project Structure**
```
fault-diagnosis-multi-agent/
├── src/fault_diagnosis/              # Core system implementation
│   ├── agents/                       # Multi-agent orchestration
│   │   ├── crew_orchestration.py     # CrewAI crew setup and management
│   │   ├── factory.py                # Agent factory with role definitions
│   │   ├── tasks.py                  # Task definitions and coordination
│   │   └── crew.py                   # Agent personas and capabilities
│   ├── workflow/                     # Stateful workflow management
│   │   ├── orchestrator.py           # Main workflow coordinator
│   │   ├── state_machine.py          # LangGraph state transitions
│   │   └── workflow.py               # Workflow execution logic
│   ├── rag/                          # RAG pipeline implementation
│   │   └── pipeline.py               # AWS Bedrock RAG integration
│   ├── routing/                      # Intelligent decision routing
│   │   └── intelligent_router.py     # Dynamic workflow routing
│   ├── validation/                   # Hypothesis validation framework
│   │   └── validators.py             # Specialized domain validators
│   ├── monitoring/                   # System observability
│   │   └── observability.py          # Metrics and monitoring
│   ├── data/                         # Data management and fixtures
│   │   └── fixtures.py               # Test data and scenarios
│   ├── artifacts/                    # Report and artifact generation
│   │   └── generators.py             # Output formatting and reports
│   ├── shared/                       # Shared utilities
│   │   ├── console.py                # CLI output formatting
│   │   └── files.py                  # File system operations
│   └── cli.py                        # Command-line interface
├── fixtures/                         # Demo data and test scenarios
├── outputs/                          # Generated session artifacts
├── docs/                             # Technical documentation
├── requirements.txt                  # Python dependencies
├── run_demo.bat                      # Quick demo launcher
└── README.md                         # This file
```
---
## **Configuration**
### **Environment Variables**
```bash
# AWS Bedrock Configuration
AWS_ACCESS_KEY_ID=your_access_key_here
AWS_SECRET_ACCESS_KEY=your_secret_key_here
AWS_DEFAULT_REGION=us-east-1
# Model Configuration
BEDROCK_LLM_MODEL=anthropic.claude-3-5-sonnet-20241022-v2:0
BEDROCK_EMBEDDING_MODEL=amazon.titan-embed-text-v1
# System Configuration
FAULT_DIAGNOSIS_CONFIDENCE_THRESHOLD=0.7
FAULT_DIAGNOSIS_VERBOSE=true
```
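A loader for these variables might look like the sketch below. The variable names and default values are taken from the listing above; the `Settings` container, the `load_settings` helper, and the rule for parsing booleans are illustrative assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Typed view of the variables listed above (hypothetical container)."""
    region: str
    llm_model: str
    embedding_model: str
    confidence_threshold: float
    verbose: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment, using the README values as defaults."""
    return Settings(
        region=env.get("AWS_DEFAULT_REGION", "us-east-1"),
        llm_model=env.get("BEDROCK_LLM_MODEL", "anthropic.claude-3-5-sonnet-20241022-v2:0"),
        embedding_model=env.get("BEDROCK_EMBEDDING_MODEL", "amazon.titan-embed-text-v1"),
        confidence_threshold=float(env.get("FAULT_DIAGNOSIS_CONFIDENCE_THRESHOLD", "0.7")),
        # Assumed convention: "1", "true", and "yes" (any case) enable verbose output
        verbose=env.get("FAULT_DIAGNOSIS_VERBOSE", "true").strip().lower() in {"1", "true", "yes"},
    )


# Pass a plain dict instead of os.environ to make the example reproducible
settings = load_settings({
    "FAULT_DIAGNOSIS_CONFIDENCE_THRESHOLD": "0.85",
    "FAULT_DIAGNOSIS_VERBOSE": "false",
})
print(settings.confidence_threshold, settings.verbose, settings.region)
```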
### **Component Features**
- **RAG Pipeline**: Semantic search with AWS Bedrock embeddings
- **Multi-Agent**: CrewAI orchestration with specialized roles
- **State Management**: LangGraph workflow with decision routing
- **Validation**: Multi-layer hypothesis verification
- **Monitoring**: Real-time observability and metrics
- **Routing**: Intelligent workflow path selection
---
## **Generated Artifacts**
Each session produces comprehensive outputs:
### **Reports & Documentation**
- `fault_diagnosis_report.html` - Executive stakeholder report
- `fault_diagnosis_report.pdf` - Printable documentation
- `hypothesis_board.md` - Detailed technical analysis
- `remediation_plan.md` - Step-by-step action guide
### **Technical Artifacts**
- `session.log` - Complete execution transcript
- `alert_context.json` - Processed alert data
- `validation_trace.json` - Validator results and decisions
- `rag_index.json` - Knowledge retrieval citations
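As a rough illustration of how a session directory and one of the JSON artifacts above might be written: the `session_<name>_<YYYYMMDD>` naming is inferred from the sample output path `outputs/session_production_demo_20241219/`, while the helper names and the payload fields are hypothetical.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def session_dir(root: Path, session: str, day: date) -> Path:
    """Create a directory named session_<name>_<YYYYMMDD> (format inferred from the sample output)."""
    path = root / f"session_{session}_{day:%Y%m%d}"
    path.mkdir(parents=True, exist_ok=True)
    return path


def write_json_artifact(directory: Path, name: str, payload: dict) -> Path:
    """Serialise one artifact as pretty-printed JSON and return its path."""
    target = directory / name
    target.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return target


# A temporary directory stands in for outputs/ so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    out = session_dir(Path(tmp), "production_demo", date(2024, 12, 19))
    trace = write_json_artifact(
        out, "validation_trace.json", {"validator": "traffic_probe", "passed": True}
    )
    loaded = json.loads(trace.read_text(encoding="utf-8"))
    print(out.name, loaded["passed"])
```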
### **Visualizations**
- KPI trend plots and network metrics
- Confidence score distributions
- Validation outcome summaries
- Session timeline visualizations
---
## **Testing & Validation**
### **Smoke Tests**
```bash
# Quick system validation
python -m src.fault_diagnosis.cli fault-diagnosis --quiet --session smoke_test
# Component isolation testing
python -m src.fault_diagnosis.cli fault-diagnosis --no-rag --session component_test
```
### **Demo Scenarios**
```bash
# Stakeholder presentation mode
python -m src.fault_diagnosis.cli fault-diagnosis --session stakeholder_demo
# Technical deep-dive mode
python -m src.fault_diagnosis.cli fault-diagnosis --session technical_demo
```
---
## **Development & Extension**
### **Adding New Agents**
```python
# Extend the agent factory
class CustomFaultAgent:
    def __init__(self):
        self.role = "Custom Specialist"
        self.backstory = "Domain-specific expertise..."
        self.goal = "Specialized analysis objective"
```
### **Custom Validators**
```python
# Implement domain-specific validation
class CustomValidator:
    def validate_hypothesis(self, hypothesis: str) -> ValidationResult:
        # Custom validation logic
        return ValidationResult(passed=True, confidence=0.85)
```
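The snippet above references `ValidationResult` without defining it. A self-contained version might look like the sketch below. The `passed` and `confidence` fields follow the call in the snippet, while the `reason` field, the `ConfigChangeValidator` name, and its keyword heuristic are purely illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationResult:
    passed: bool
    confidence: float
    reason: str = ""  # illustrative extra field, not taken from the project


class ConfigChangeValidator:
    """Toy validator: accepts hypotheses that mention a configuration change."""

    KEYWORDS = ("config", "configuration", "rollback")

    def validate_hypothesis(self, hypothesis: str) -> ValidationResult:
        text = hypothesis.lower()
        hits = [word for word in self.KEYWORDS if word in text]
        if hits:
            return ValidationResult(True, 0.85, f"matched: {', '.join(hits)}")
        return ValidationResult(False, 0.2, "no configuration-related terms found")


validator = ConfigChangeValidator()
result = validator.validate_hypothesis("Recent config push broke BGP peering")
print(result.passed, result.confidence, result.reason)
```

A real validator would compare the hypothesis against configuration diffs or probe data. The keyword match only keeps the example runnable without external inputs.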
### **Workflow Extensions**
```python
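# Add new workflow states
# (`workflow` and `WorkflowState` are assumed to be provided by the project)
@workflow.step
def custom_analysis_step(state: WorkflowState):
    # Custom processing logic: copy the state and record this step's result
    updated_state = {**state, "custom_analysis": "done"}
    return updated_state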
```
---
## **Business Value & Use Cases**
### **Demonstration Modes**
**For Executives & Decision Makers**
```bash
python -m src.fault_diagnosis.cli fault-diagnosis --quiet --session stakeholder_demo
```
*ROI focus • Business impact • Cost reduction*

**For Technical Teams**
```bash
python -m src.fault_diagnosis.cli fault-diagnosis --session technical_demo
```
*Architecture • Code quality • Implementation*
### **Enterprise Applications**
- **Network Operations Centers**: Automated first-level incident response
- **Cloud Infrastructure**: Multi-cloud fault diagnosis and remediation
- **Industrial IoT**: Equipment failure prediction and root cause analysis
- **DevOps**: Application performance issue diagnosis and resolution
### **Technical Advantages**
- **Rapid Deployment**: Minutes from clone to running demo
- **Modular Architecture**: Easy component swapping and extension
- **Rich Observability**: Complete audit trails and session recordings
- **Domain Adaptable**: Easily customizable for different industries
### **Stakeholder Benefits**
- **Executives**: ROI demonstration through automated incident response
- **Engineers**: Advanced AI tooling for complex problem solving
- **Operations**: Reduced MTTR and improved service reliability
- **Learning**: Comprehensive example of production AI engineering
---
## **Contributing**
This project demonstrates advanced AI engineering patterns and welcomes contributions:
```bash
# Development setup
git clone <repository-url>   # clone URL is not given in this README
cd fault-diagnosis-multi-agent
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
pip install -r requirements.txt
# Run tests
python -m src.fault_diagnosis.cli fault-diagnosis --session test
# Submit improvements
git checkout -b feature/enhancement
# Make changes...
git commit -m "feat: add new capability"
git push origin feature/enhancement
```
---
## **Documentation**
- **[Technical Deep Dive](docs/Fault_Diagnosis.md)** - Comprehensive system documentation
- **[Architecture Guide](docs/Architecture.md)** - System design and patterns
- **[Configuration Reference](docs/Configuration.md)** - Setup and customization
- **[Demo Scripts](docs/Demos.md)** - Presentation scenarios and examples
## **Interactive Demo Features**
### **Available Demo Modes**
| Feature | Command | Description |
|---------|---------|-------------|
| **Quick Start** | `./run_demo.bat` | Zero to running demo in 30 seconds |
| **Multi-Agent** | `--session alpha` | Full CrewAI agent coordination |
| **Architecture** | `--session technical_demo` | Technical design walkthrough |
| **RAG Pipeline** | `--session production_demo` | AWS Bedrock knowledge retrieval |
| **Live Terminal** | `--session demo` | Real-time execution footage |
| **Executive Mode** | `--quiet --session stakeholder_demo` | Business value & ROI focus |
| **Technical Mode** | `--session technical_demo` | Code review & implementation |
| **Test Mode** | `--no-rag --session test_run` | Offline demonstration |
---
## **Portfolio Highlights**
This project demonstrates mastery of:
### **Advanced AI Engineering**
- Multi-agent system orchestration with CrewAI
- Stateful workflow management with LangGraph
- Production RAG implementation with AWS Bedrock
- Intelligent routing and decision making
### **Software Architecture**
- Clean, modular, and extensible design patterns
- Comprehensive error handling and resilience
- Professional logging and observability
- Session management and artifact generation
### **Cloud & Enterprise**
- AWS Bedrock integration for production AI
- Scalable vector database architecture
- Configuration management and environment handling
- Enterprise-ready security and monitoring
### **Data & Analytics**
- Vector embeddings and semantic search
- Hypothesis validation and confidence scoring
- Real-time monitoring and metrics collection
- Comprehensive reporting and visualization
---
## **License**
This project is developed as a portfolio demonstration of advanced AI engineering capabilities. See the project structure and documentation for detailed implementation patterns and best practices.
---
## **Future Enhancements**
- **Web Interface**: React-based dashboard for real-time monitoring
- **Mobile App**: iOS/Android client for field operations
- **API Gateway**: RESTful API for system integration
- **A/B Testing**: Hypothesis validation strategy optimization
- **Advanced Analytics**: Machine learning model performance tracking
- **Auto-Scaling**: Kubernetes deployment with auto-scaling capabilities
---
**Built with ❤️ using cutting-edge AI and modern software engineering practices**
[**Live Demo**](./run_demo.bat) | [**Documentation**](docs/) | [**Contribute**](#contributing)