AI-powered automation platform for DevOps teams

Industry

AI & ML

Location

USA

Platform

Web

Cooperation

4+ years

About the project

Our client, a company developing a DevOps platform, needed a tool that could analyze Kubernetes clusters and detect, fix, and predict issues. The goal was to maximize the detection rate of potential issues while minimizing false positives, ensuring uninterrupted Kubernetes cluster health monitoring without human involvement.

Challenge

The original system relied on predefined tests, so it could only detect issues that had been explicitly accounted for: new or unexpected issues went unnoticed unless a specific test already existed for them. To address this limitation, the client envisioned integrating AI-powered automation to analyze all cluster resources, identify potential issues, and provide engineers with actionable insights, all without requiring predefined problem definitions.

Development process

The project required balancing cost and accuracy. We experimented with multiple machine learning models, comparing their performance and pricing. GPT models demonstrated nearly 99% precision but were too expensive for real-time, large-scale processing. After extensive evaluation, we selected Llama 3.1 (8B parameters) as the most cost-effective model: it delivered high accuracy while keeping operational costs under control.

Another major challenge was ensuring synchronous, parallelized processing across Kubernetes clusters, which can contain thousands of diverse resources. Analyzing each resource sequentially would have been inefficient and costly. To achieve optimal efficiency, we:

Implemented a step-limited AI model, ensuring precise issue identification without excessive processing overhead.
Developed an AI agent capable of executing Kubernetes CLI commands, enabling real-time interaction with cluster resources.
Leveraged LangChain and LangGraph to streamline workflow orchestration, result visualization, and structured AI-driven conclusions.
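The first two points above can be sketched as a bounded agent loop. This is a minimal illustration, not the project's actual implementation: the `decide` callable stands in for the model call, and the read-only verb allowlist, the function names, and the default step limit are all assumptions made for the example.

```python
import shlex
import subprocess
from typing import Callable, List, Optional, Tuple

# Assumption: the agent is restricted to read-only kubectl verbs.
READ_ONLY_VERBS = {"get", "describe", "logs", "top"}


def run_kubectl(command: str, timeout: int = 30) -> str:
    """Run a kubectl command, rejecting anything outside the allowlist."""
    args = shlex.split(command)
    if len(args) < 2 or args[0] != "kubectl" or args[1] not in READ_ONLY_VERBS:
        raise ValueError(f"command not allowed: {command!r}")
    result = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    return result.stdout if result.returncode == 0 else result.stderr


def run_agent(
    decide: Callable[[List[Tuple[str, str]]], Optional[str]],
    execute: Callable[[str], str] = run_kubectl,
    max_steps: int = 5,
) -> List[Tuple[str, str]]:
    """Ask `decide` for the next command until it returns None or the step limit is hit."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        command = decide(history)  # hypothetical stand-in for the LLM call
        if command is None:        # the model has enough evidence to conclude
            break
        history.append((command, execute(command)))
    return history
```

Capping the loop at `max_steps` bounds the number of model calls per resource, which is one way to keep the processing overhead predictable.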

The system was designed to process resources with varying levels of complexity. High-complexity resources (Pods, Nodes) required in-depth analysis. Lower-complexity resources (ConfigMaps, Services, Ingresses, PVCs, CronJobs) were analyzed using lightweight models for cost-efficiency.

To automate model selection and cost management, we implemented a dynamic model-switching mechanism. The AI system automatically determined the optimal model based on resource complexity and error probability.
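A model-switching rule of this kind could look roughly like the sketch below. The model identifiers `deep-model` and `light-model`, the 0.5 threshold, and the function name are hypothetical placeholders; only the split between high-complexity kinds (Pods, Nodes) and the rest comes from the description above.

```python
# Resource kinds the case study describes as high-complexity.
HIGH_COMPLEXITY_KINDS = {"Pod", "Node"}


def select_model(kind: str, error_probability: float, threshold: float = 0.5) -> str:
    """Pick a model tier from resource kind and estimated error probability.

    The threshold and the returned model names are illustrative assumptions.
    """
    if not 0.0 <= error_probability <= 1.0:
        raise ValueError("error_probability must be between 0 and 1")
    if kind in HIGH_COMPLEXITY_KINDS or error_probability >= threshold:
        return "deep-model"   # hypothetical name for the more capable model
    return "light-model"      # hypothetical name for the cheaper model
```

For example, `select_model("ConfigMap", 0.1)` would route to the lightweight tier, while a Service with a high estimated error probability would be escalated to the more capable one.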

This configuration eliminated the need for manual intervention, allowing the system to operate fully autonomously. The solution was further optimized using a multi-agent parallelization approach, where one agent identified resources for analysis, another agent executed Kubernetes commands and diagnostics, and a third agent generated structured reports and issue explanations.
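The three-agent split can be illustrated with plain functions and a thread pool. This is a sketch under stated assumptions: each function is a simplified stand-in for an agent, the cluster is modeled as a dictionary of resource names to statuses, and the real system's orchestration (built with LangChain and LangGraph) is not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def discover(cluster: Dict[str, str]) -> List[str]:
    """Agent 1 stand-in: choose which resources to analyze."""
    return sorted(cluster)


def diagnose(cluster: Dict[str, str], name: str) -> Dict[str, object]:
    """Agent 2 stand-in: inspect one resource (here, just read its status)."""
    status = cluster[name]
    return {"resource": name, "status": status, "healthy": status == "Running"}


def report(findings: List[Dict[str, object]]) -> str:
    """Agent 3 stand-in: turn findings into a structured summary."""
    lines = [f"{f['resource']}: {f['status']}" for f in findings if not f["healthy"]]
    return "\n".join(lines) if lines else "no issues found"


def analyze_cluster(cluster: Dict[str, str], max_workers: int = 8) -> str:
    """Run discovery, diagnose resources in parallel, then build the report."""
    resources = discover(cluster)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so the report is deterministic.
        findings = list(pool.map(lambda name: diagnose(cluster, name), resources))
    return report(findings)
```

Because each resource is diagnosed independently, the middle stage parallelizes cleanly, while discovery and reporting remain single steps at either end.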

The system was rigorously tested using a client-provided Kubernetes test cluster, with our team refining its performance based on key metrics, including issue detection accuracy, response validation, and processing speed. Our target was to achieve 90% precision at an optimized cost, which we successfully reached.
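For reference, precision here is taken in the usual sense: the share of reported issues that turn out to be real. The helper below is a small sketch of how such a check might be computed; the function names and the counts in the usage note are illustrative, not measurements from the project.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of flagged issues that were real: TP / (TP + FP)."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no issues were flagged, precision is undefined")
    return true_positives / flagged


def meets_target(true_positives: int, false_positives: int, target: float = 0.90) -> bool:
    """Check a run against a precision target (0.90 matches the stated goal)."""
    return precision(true_positives, false_positives) >= target
```

With hypothetical counts of 9 confirmed issues and 1 false alarm, `precision(9, 1)` gives 0.9, which would just meet the 90% target.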

Technologies

Python

LangChain

LangGraph

LangSmith

OpenAI

Ollama

Streamlit

Kubernetes

CLI

Business value

By automating Kubernetes health assessments, our solution reduced engineers’ manual workload, enabling them to focus on critical infrastructure improvements rather than routine diagnostics. Traditionally, Kubernetes failures are detected only when a system component crashes or error notifications are triggered. This AI-driven tool allows for continuous, proactive health monitoring, identifying potential issues before they escalate into service disruptions.

When an issue is detected, the system provides:

Detailed diagnostics, including root cause analysis.
Actionable remediation steps for engineers.
A transparent, AI-driven reasoning process, allowing users to understand how conclusions were reached.

The impact of this solution extends beyond operational efficiency: it supports business continuity by preventing unexpected downtime. For companies whose platforms rely on uninterrupted availability, this tool acts as an AI-powered DevOps engineer, automating health checks and optimizing Kubernetes stability.
