We are looking for a seasoned AI/AIOps Engineer with deep expertise in building intelligent operational systems powered by modern LLMs, data engineering, and automation frameworks. The ideal candidate will lead the design and deployment of AI-driven solutions for event correlation, anomaly detection, incident prediction, and operational insights. This role requires strong technical leadership, hands-on development capability, and a proven track record of delivering measurable improvements in IT operations.
Responsibilities:
- Design, develop, and deploy AI/ML solutions for event correlation, log analysis, root cause prediction, and observability enhancement.
- Build and optimize LLM-powered systems, including retrieval-augmented generation (RAG) pipelines, mixture-of-experts (MoE) models, and vector-search architectures.
- Implement multi-model orchestration layers or MCP frameworks to manage diverse AI components and workflows.
- Collaborate with engineering, DevOps, and SRE teams to integrate AIOps capabilities into operational environments.
- Develop scalable APIs and backend services using Python and FastAPI.
- Leverage statistics, ML algorithms, and analytical techniques to generate actionable operational insights.
- Work with MLOps and AIOps toolchains to automate model deployment, monitoring, and maintenance.
- Evaluate system performance, drive continuous improvement initiatives, and deliver quantifiable operational benefits.
- Prepare documentation, architecture diagrams, and best practices for AI implementations.
Qualifications:
- Bachelor’s or Master’s degree in Computer Science, Data Science, or a related technical field.
- 5+ years of experience in AI/ML, Data Engineering, or AIOps domains.
- Demonstrated success implementing LLM-based solutions for log/event correlation, incident prediction, or operational intelligence.
- Strong understanding of AIOps concepts including event correlation, anomaly detection, incident prediction, and topology mapping.
- Hands-on experience with RAG workflows, MoE models, and vector databases or similarity-search libraries (e.g., FAISS, Pinecone, Milvus).
- Proficiency in Python, FastAPI, and AI integration frameworks such as LangChain, LlamaIndex, or Hugging Face Transformers.
- Experience building MCP or multi-model orchestration layers.
- Solid grounding in data science, statistics, and core ML algorithms.
- Familiarity with AIOps platforms such as Moogsoft, BigPanda, Dynatrace, or IBM Cloud Pak for AIOps (formerly Watson AIOps).
- Proven impact with measurable outcomes (e.g., reduced alert noise, faster incident resolution).