# multi-agent-fault-diagnosis

**Repository Path**: chenchen_2/multi-agent-fault-diagnosis

## Basic Information

- **Project Name**: multi-agent-fault-diagnosis
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-12-08
- **Last Updated**: 2025-12-08

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# 🔧 Intelligent Fault Diagnosis Multi-Agent System

[![Python](https://img.shields.io/badge/Python-3.9+-3776ab.svg?style=flat&logo=python&logoColor=white)](https://python.org)
[![CrewAI](https://img.shields.io/badge/CrewAI-0.186+-ff6b6b.svg?style=flat)](https://crewai.com)
[![LangGraph](https://img.shields.io/badge/LangGraph-0.2.55+-1f77b4.svg?style=flat)](https://langchain-ai.github.io/langgraph/)
[![AWS Bedrock](https://img.shields.io/badge/AWS_Bedrock-Claude_3.5-ff9900.svg?style=flat&logo=amazon-aws)](https://aws.amazon.com/bedrock/)
[![License](https://img.shields.io/badge/License-Portfolio-blue.svg)](https://choosealicense.com/licenses/mit/)

> **Enterprise-grade intelligent fault diagnosis system leveraging multi-agent AI, stateful workflows, and RAG-powered knowledge retrieval for automated network troubleshooting.**

This AI-powered system demonstrates multi-agent orchestration, stateful workflow management, and retrieval-augmented generation for automated fault diagnosis in telecommunications networks. The project showcases enterprise-ready AI engineering with production-quality architecture, comprehensive validation, and deterministic demonstration capabilities.

## 🎬 **Watch It In Action**
![Fault Diagnosis Multi-Agent System Demo](demo.gif)

*Multi-agent coordination • RAG pipeline • Real-time validation • Artifact generation*
## 🎯 **Project Highlights**

### **🤖 Advanced Multi-Agent Architecture**

- **CrewAI Orchestration**: Four specialized AI agents with distinct roles and personas
- **Intelligent Collaboration**: Coordinated handoffs and shared state management
- **Domain Expertise**: Telecom-specific knowledge and reasoning capabilities

### **🧠 Sophisticated AI Pipeline**

- **RAG-Powered Knowledge**: AWS Bedrock with Claude 3.5 Sonnet and Titan embeddings
- **Stateful Workflows**: LangGraph for complex decision routing and escalation handling
- **Hypothesis Validation**: Multiple specialized validators with confidence scoring
- **Intelligent Routing**: Dynamic workflow paths based on confidence thresholds

### **🏗️ Production-Ready Engineering**

- **Comprehensive Error Handling**: Graceful fallbacks and robust error recovery
- **Session Management**: Artifact generation with complete audit trails
- **Observability**: Real-time monitoring and detailed logging
- **Deterministic Demos**: Reproducible presentations for stakeholders

---

## 🏛️ **System Architecture**

```mermaid
graph TB
    A[Alert Intake] --> B[Intelligent Router]
    B --> C[Evidence Gathering]
    C --> D[Multi-Agent Crew]

    subgraph "CrewAI Agents"
        D1[NOC Sentinel<br/>Planner]
        D2[Core Network<br/>Analyst]
        D3[Hypothesis<br/>Chair]
        D4[Postmortem<br/>Writer]
    end

    D --> D1
    D1 --> D2
    D2 --> D3
    D3 --> D4

    subgraph "RAG Pipeline"
        E[AWS Bedrock<br/>Claude 3.5]
        F[Titan Embeddings]
        G[ChromaDB<br/>Vector Store]
    end

    D2 --> E
    E --> F
    F --> G

    subgraph "Validation Layer"
        H[Traffic Probe<br/>Validator]
        I[Config Diff<br/>Validator]
        J[Topology<br/>Validator]
    end

    D3 --> H
    D3 --> I
    D3 --> J

    K[LangGraph<br/>State Machine] --> L[Resolution<br/>Engine]
    L --> M[Artifact<br/>Generation]

    style A fill:#e1f5fe
    style D fill:#f3e5f5
    style E fill:#fff3e0
    style K fill:#e8f5e8
```

## 🚀 **Quick Start**

### **Prerequisites**

- **Python 3.9+** with pip
- **AWS Account** with Bedrock access (optional for demo mode)
- **Git** for version control

### **⚡ Instant Demo**

```bash
# Clone and run in 30 seconds
git clone <repository-url>
cd fault-diagnosis-multi-agent
./run_demo.bat  # Windows
# or
python -m venv venv && source venv/bin/activate  # Linux/macOS
pip install -r requirements.txt
python -m src.fault_diagnosis.cli fault-diagnosis --session demo
```

### **🔧 Full Setup**

1. **Environment Setup**

   ```bash
   # Create virtual environment
   python -m venv venv

   # Activate environment
   venv\Scripts\activate      # Windows
   source venv/bin/activate   # Linux/macOS

   # Install dependencies
   pip install -r requirements.txt
   ```

2. **AWS Configuration** (Optional - system works without AWS)

   ```bash
   # Create .env file
   cp .env.example .env

   # Edit .env with your AWS credentials
   AWS_ACCESS_KEY_ID=your_access_key
   AWS_SECRET_ACCESS_KEY=your_secret_key
   AWS_DEFAULT_REGION=us-east-1
   ```

3. **Run Demonstration**

   ```bash
   # Full featured demo with RAG
   python -m src.fault_diagnosis.cli fault-diagnosis --session alpha

   # Quick demo without RAG
   python -m src.fault_diagnosis.cli fault-diagnosis --no-rag --session beta
   ```

---

## 💼 **Core Capabilities**

### **🎭 Multi-Agent Crew**

| Agent | Role | Specialization |
|-------|------|----------------|
| **NOC Sentinel** | 🎯 **Planner** | Alert triage, workflow coordination, objective alignment |
| **Core Network Analyst** | 🔍 **Retriever** | Evidence gathering, RAG queries, knowledge search |
| **Hypothesis Chair** | 🧠 **Reasoner** | Root cause analysis, hypothesis generation, validation |
| **Postmortem Writer** | 📝 **Reporter** | Documentation, remediation plans, stakeholder reports |

### **🔄 Intelligent Workflow States**

```text
# Workflow progression with automatic routing
Alert Intake     → Evidence Gathering → Hypothesis Generation
      ↓                   ↓                     ↓
Routing Decision → Validation Loop    → Resolution Planning
      ↓                   ↓                     ↓
Remediation      → Post-Mortem        → Knowledge Update
```

### **🎯 Validation Framework**

- **Traffic Probe Validator**: Network performance claim verification
- **Config Diff Validator**: Configuration change impact analysis
- **Topology Validator**: Network topology reference checking
- **Confidence Scoring**: Probabilistic validation with threshold-based routing

---

## 🎪 **Demo Experience**

### **🎥 Live Execution**

The interactive CLI demo shows multi-agent coordination, RAG queries, and validation as they run.
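
The threshold-based routing named under the Validation Framework can be illustrated with a small, self-contained sketch. It is not the project's implementation: the `ValidationResult` fields, the averaging rule in `aggregate_confidence`, and the validator names are assumptions made for illustration. The 0.7 threshold and the two route names mirror values this README uses in its configuration and workflow excerpts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ValidationResult:
    """Outcome of one validator (hypothetical shape, for illustration only)."""
    validator: str
    passed: bool
    confidence: float  # expected range: 0.0 to 1.0


def aggregate_confidence(results: List[ValidationResult]) -> float:
    """Average the confidences, counting failed validators as 0.0.

    The averaging rule is an assumption; the real system may weight
    validators differently.
    """
    if not results:
        return 0.0
    total = sum(r.confidence if r.passed else 0.0 for r in results)
    return total / len(results)


def route(results: List[ValidationResult], threshold: float = 0.7) -> str:
    """Choose the next workflow state from the aggregated confidence."""
    score = aggregate_confidence(results)
    return "remediation_planning" if score > threshold else "escalation_queue"


results = [
    ValidationResult("traffic_probe", passed=True, confidence=0.9),
    ValidationResult("config_diff", passed=True, confidence=0.8),
    ValidationResult("topology", passed=True, confidence=0.85),
]
print(route(results))
```

With all three validators passing, the averaged score is above the threshold and the sketch routes to remediation planning. A single failed validator pulls the average down and can push the same alert into the escalation queue.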
### **Command Examples**

```bash
# Standard demo run
python -m src.fault_diagnosis.cli fault-diagnosis --session production_demo

# Quiet mode for presentations
python -m src.fault_diagnosis.cli fault-diagnosis --quiet --session stakeholder_demo

# Test mode without external dependencies
python -m src.fault_diagnosis.cli fault-diagnosis --no-rag --session test_run
```

### **Sample Output**

```
Starting Simple Fault Diagnosis MVP
========================================
Component Status:
[OK] CrewAI Agents: Available
[OK] LangGraph Workflow: Available
[OK] RAG Pipeline: Available
[OK] Validation Framework: Available

Running workflow...
[CrewAI] NOC Sentinel analyzing alert FD-ALRT-017...
[RAG] Retrieved 3 relevant documents from knowledge base
[LangGraph] Routing to hypothesis generation (confidence: 0.87)
[Validator] Traffic probe validation: PASSED
[Workflow] Generating remediation plan...

MVP Demo complete!
Results: Session artifacts saved to outputs/session_production_demo_20241219/
```

---

## 🧪 **Technology Deep Dive**

### **🤖 CrewAI Integration**

```python
# Sophisticated agent orchestration
class FaultDiagnosisCrew:
    def __init__(self):
        self.agents = FaultDiagnosisAgents()
        self.workflow = FaultDiagnosisWorkflow()

    def execute_sequential_process(self):
        # Coordinated multi-agent execution
        return self.crew.kickoff()
```

### **🧠 LangGraph State Management**

```python
# Stateful workflow with intelligent routing
class FaultDiagnosisWorkflow:
    def route_decision(self, state):
        confidence = state.get("confidence_score", 0.0)
        if confidence > 0.7:
            return "remediation_planning"
        else:
            return "escalation_queue"
```

### **📚 RAG Pipeline Architecture**

```python
# AWS Bedrock integration with fallback
class BedrockRAGPipeline:
    def __init__(self):
        self.embeddings = BedrockEmbeddings(
            model_id="amazon.titan-embed-text-v1"
        )
        self.llm = BedrockLLM(
            model_id="anthropic.claude-3-5-sonnet-20241022-v2:0"
        )
        self.vector_store = ChromaDB()
```

---

## 📊 **Project Structure**
```
fault-diagnosis-multi-agent/
├── 🎯 src/fault_diagnosis/          # Core system implementation
│   ├── 🤖 agents/                   # Multi-agent orchestration
│   │   ├── crew_orchestration.py    # CrewAI crew setup and management
│   │   ├── factory.py               # Agent factory with role definitions
│   │   ├── tasks.py                 # Task definitions and coordination
│   │   └── crew.py                  # Agent personas and capabilities
│   ├── 🔄 workflow/                 # Stateful workflow management
│   │   ├── orchestrator.py          # Main workflow coordinator
│   │   ├── state_machine.py         # LangGraph state transitions
│   │   └── workflow.py              # Workflow execution logic
│   ├── 📚 rag/                      # RAG pipeline implementation
│   │   └── pipeline.py              # AWS Bedrock RAG integration
│   ├── 🛣️ routing/                  # Intelligent decision routing
│   │   └── intelligent_router.py    # Dynamic workflow routing
│   ├── ✅ validation/               # Hypothesis validation framework
│   │   └── validators.py            # Specialized domain validators
│   ├── 📊 monitoring/               # System observability
│   │   └── observability.py         # Metrics and monitoring
│   ├── 🗂️ data/                     # Data management and fixtures
│   │   └── fixtures.py              # Test data and scenarios
│   ├── 🎨 artifacts/                # Report and artifact generation
│   │   └── generators.py            # Output formatting and reports
│   ├── 🔧 shared/                   # Shared utilities
│   │   ├── console.py               # CLI output formatting
│   │   └── files.py                 # File system operations
│   └── 🖥️ cli.py                    # Command-line interface
├── 📁 fixtures/                     # Demo data and test scenarios
├── 📁 outputs/                      # Generated session artifacts
├── 📁 docs/                         # Technical documentation
├── 🔧 requirements.txt              # Python dependencies
├── 🚀 run_demo.bat                  # Quick demo launcher
└── 📖 README.md                     # This file
```

---

## ⚙️ **Configuration**

### **Environment Variables**

```bash
# AWS Bedrock Configuration
AWS_ACCESS_KEY_ID=your_access_key_here
AWS_SECRET_ACCESS_KEY=your_secret_key_here
AWS_DEFAULT_REGION=us-east-1

# Model Configuration
BEDROCK_LLM_MODEL=anthropic.claude-3-5-sonnet-20241022-v2:0
BEDROCK_EMBEDDING_MODEL=amazon.titan-embed-text-v1

# System Configuration
FAULT_DIAGNOSIS_CONFIDENCE_THRESHOLD=0.7
FAULT_DIAGNOSIS_VERBOSE=true
```

### **Component Features**

- **🔄 RAG Pipeline**: Semantic search with AWS Bedrock embeddings
- **🤖 Multi-Agent**: CrewAI orchestration with specialized roles
- **📊 State Management**: LangGraph workflow with decision routing
- **✅ Validation**: Multi-layer hypothesis verification
- **📈 Monitoring**: Real-time observability and metrics
- **🎯 Routing**: Intelligent workflow path selection

---

## 📈 **Generated Artifacts**

Each session produces comprehensive outputs:

### **📊 Reports & Documentation**

- `fault_diagnosis_report.html` - Executive stakeholder report
- `fault_diagnosis_report.pdf` - Printable documentation
- `hypothesis_board.md` - Detailed technical analysis
- `remediation_plan.md` - Step-by-step action guide

### **🔍 Technical Artifacts**

- `session.log` - Complete execution transcript
- `alert_context.json` - Processed alert data
- `validation_trace.json` - Validator results and decisions
- `rag_index.json` - Knowledge retrieval citations

### **📊 Visualizations**

- KPI trend plots and network metrics
- Confidence score distributions
- Validation outcome summaries
- Session timeline visualizations

---

## 🧪 **Testing & Validation**

### **Smoke Tests**

```bash
# Quick system validation
python -m src.fault_diagnosis.cli fault-diagnosis --quiet --session smoke_test

# Component isolation testing
python -m src.fault_diagnosis.cli fault-diagnosis --no-rag --session component_test
```

### **Demo Scenarios**

```bash
# Stakeholder presentation mode
python -m src.fault_diagnosis.cli fault-diagnosis --session stakeholder_demo

# Technical deep-dive mode
python -m src.fault_diagnosis.cli fault-diagnosis --session technical_demo
```

---

## 🛠️ **Development & Extension**

### **Adding New Agents**

```python
# Extend the agent factory
class CustomFaultAgent:
    def __init__(self):
        self.role = "Custom Specialist"
        self.backstory = "Domain-specific expertise..."
        self.goal = "Specialized analysis objective"
```

### **Custom Validators**

```python
# Implement domain-specific validation
class CustomValidator:
    def validate_hypothesis(self, hypothesis: str) -> ValidationResult:
        # Custom validation logic
        return ValidationResult(passed=True, confidence=0.85)
```

### **Workflow Extensions**

```python
# Add new workflow states
@workflow.step
def custom_analysis_step(state: WorkflowState):
    # Custom processing logic
    return updated_state
```

---

## 🎯 **Business Value & Use Cases**

### **🎬 Demonstration Modes**
**👔 For Executives & Decision Makers**

```bash
python -m src.fault_diagnosis.cli fault-diagnosis --quiet --session stakeholder_demo
```

*ROI focus • Business impact • Cost reduction*

**👨‍💻 For Technical Teams**

```bash
python -m src.fault_diagnosis.cli fault-diagnosis --session technical_demo
```

*Architecture • Code quality • Implementation*
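
For scripted presentations, the two invocations above can be assembled programmatically. The sketch below only builds argument lists from the flags shown in this README (`--quiet`, `--no-rag`, `--session`). The helper `build_demo_command` and the `DEMO_MODES` mapping are hypothetical names introduced for illustration and are not part of the project.

```python
import sys
from typing import Dict, List


def build_demo_command(session: str, quiet: bool = False, no_rag: bool = False) -> List[str]:
    """Assemble the CLI invocation shown in this README as an argument list."""
    cmd = [sys.executable, "-m", "src.fault_diagnosis.cli", "fault-diagnosis"]
    if quiet:
        cmd.append("--quiet")
    if no_rag:
        cmd.append("--no-rag")
    cmd += ["--session", session]
    return cmd


# Hypothetical mapping from audience to the options used in the examples above.
DEMO_MODES: Dict[str, dict] = {
    "executive": {"session": "stakeholder_demo", "quiet": True},
    "technical": {"session": "technical_demo"},
}

for audience, options in DEMO_MODES.items():
    # Skip the interpreter path so the printed line matches the README commands.
    print(audience, "->", " ".join(build_demo_command(**options)[1:]))
```

Passing one of these lists to `subprocess.run(cmd, check=True)` from the repository root should launch the corresponding demo, assuming the dependencies from `requirements.txt` are installed.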
### **Enterprise Applications**

- **🏢 Network Operations Centers**: Automated first-level incident response
- **☁️ Cloud Infrastructure**: Multi-cloud fault diagnosis and remediation
- **🏭 Industrial IoT**: Equipment failure prediction and root cause analysis
- **🚀 DevOps**: Application performance issue diagnosis and resolution

### **Technical Advantages**

- **⚡ Rapid Deployment**: Minutes from clone to running demo
- **🔧 Modular Architecture**: Easy component swapping and extension
- **📊 Rich Observability**: Complete audit trails and session recordings
- **🎯 Domain Adaptable**: Easily customizable for different industries

### **Stakeholder Benefits**

- **👔 Executives**: ROI demonstration through automated incident response
- **👨‍💻 Engineers**: Advanced AI tooling for complex problem solving
- **📋 Operations**: Reduced MTTR and improved service reliability
- **🎓 Learning**: Comprehensive example of production AI engineering

---

## 🤝 **Contributing**

This project demonstrates advanced AI engineering patterns and welcomes contributions:

```bash
# Development setup
git clone <repository-url>
cd fault-diagnosis-multi-agent
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt

# Run tests
python -m src.fault_diagnosis.cli fault-diagnosis --session test

# Submit improvements
git checkout -b feature/enhancement
# Make changes...
git commit -m "feat: add new capability"
git push origin feature/enhancement
```

---

## 📚 **Documentation**

- **📖 [Technical Deep Dive](docs/Fault_Diagnosis.md)** - Comprehensive system documentation
- **🏗️ [Architecture Guide](docs/Architecture.md)** - System design and patterns
- **🔧 [Configuration Reference](docs/Configuration.md)** - Setup and customization
- **🎯 [Demo Scripts](docs/Demos.md)** - Presentation scenarios and examples

## 🎮 **Interactive Demo Features**

### **📊 Available Demo Modes**

| Feature | Command | Description |
|---------|---------|-------------|
| **🚀 Quick Start** | `./run_demo.bat` | Zero to running demo in 30 seconds |
| **🤖 Multi-Agent** | `--session alpha` | Full CrewAI agents coordination |
| **🏗️ Architecture** | `--session technical_demo` | Technical design walkthrough |
| **📚 RAG Pipeline** | `--session production_demo` | AWS Bedrock knowledge retrieval |
| **🎯 Live Terminal** | `--session demo` | Real-time execution footage |
| **👔 Executive Mode** | `--quiet --session stakeholder_demo` | Business value & ROI focus |
| **👨‍💻 Technical Mode** | `--session technical_demo` | Code review & implementation |
| **🧪 Test Mode** | `--no-rag --session test_run` | Offline demonstration |

---

## 🏆 **Portfolio Highlights**

This project demonstrates mastery of:

### **🤖 Advanced AI Engineering**

- Multi-agent system orchestration with CrewAI
- Stateful workflow management with LangGraph
- Production RAG implementation with AWS Bedrock
- Intelligent routing and decision making

### **🏗️ Software Architecture**

- Clean, modular, and extensible design patterns
- Comprehensive error handling and resilience
- Professional logging and observability
- Session management and artifact generation

### **☁️ Cloud & Enterprise**

- AWS Bedrock integration for production AI
- Scalable vector database architecture
- Configuration management and environment handling
- Enterprise-ready security and monitoring

### **📊 Data & Analytics**

- Vector embeddings and semantic search
- Hypothesis validation and confidence scoring
- Real-time monitoring and metrics collection
- Comprehensive reporting and visualization

---

## 📄 **License**

This project is developed as a portfolio demonstration of advanced AI engineering capabilities. See the project structure and documentation for detailed implementation patterns and best practices.

---

## 🔮 **Future Enhancements**

- **🌐 Web Interface**: React-based dashboard for real-time monitoring
- **📱 Mobile App**: iOS/Android client for field operations
- **🔗 API Gateway**: RESTful API for system integration
- **🧪 A/B Testing**: Hypothesis validation strategy optimization
- **📊 Advanced Analytics**: Machine learning model performance tracking
- **🔄 Auto-Scaling**: Kubernetes deployment with auto-scaling capabilities

---
**Built with ❤️ using cutting-edge AI and modern software engineering practices**

[🎯 **Live Demo**](./run_demo.bat) | [📚 **Documentation**](docs/) | [🤝 **Contribute**](#contributing)