The Platform Reliability Engineer (PRE) ensures the reliability, observability, and operational excellence of KnowledgeCity’s cloud infrastructure and internal platforms.
This role focuses on building and maintaining monitoring systems, dashboards, and status pages that provide real-time visibility into infrastructure health and performance across environments.
The PRE is responsible for maintaining the inventory of infrastructure nodes, developing observability tools, coordinating incident response, and driving post-incident reviews and data-driven optimization initiatives.
Through a deep understanding of system reliability, automation, and monitoring technologies, the Platform Reliability Engineer bridges development, DevOps, and support teams, enabling proactive issue detection, transparent communication, and continuous improvement of our SaaS platforms.
Key Responsibilities
Infrastructure Visibility and Monitoring
- Design, implement, and maintain end-to-end observability solutions using Prometheus, Grafana, Loki, Alertmanager, or comparable tools.
- Develop infrastructure inventory systems to track all nodes, environments, and services.
- Ensure system health metrics (CPU, memory, disk, latency, response time) are consistently collected and visualized.
- Create and maintain Grafana dashboards tailored for developers, operations, and clients.
Incident Management and Status Reporting
- Lead incident response processes: detection, escalation, resolution, and post-mortem analysis.
- Manage and update internal and external status pages reflecting real-time service health and historical uptime.
- Publish incident reports and root cause summaries to communicate clearly with stakeholders.
- Define and monitor SLIs, SLOs, and SLAs to measure and improve service reliability.
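As an illustration of the SLI/SLO work described above, the following is a minimal sketch of an availability SLI and the remaining error budget for a request-based SLO. The function names, the 99.9% target, and the request counts are hypothetical and chosen only for the example; they do not reflect KnowledgeCity's actual targets or tooling.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Return the fraction of requests that succeeded (the SLI)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests


def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the share of the error budget still unspent.

    1.0 means no budget used, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been violated.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # Failures the SLO tolerates over this window (hypothetical window).
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 250 failures, 99.9% SLO.
    print(f"SLI: {availability_sli(1_000_000, 250):.5f}")
    print(f"Budget left: {error_budget_remaining(0.999, 1_000_000, 250):.2%}")
```

In practice these figures would typically be derived from the monitoring stack rather than computed by hand; the sketch only shows the arithmetic behind the terms.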
Reliability Engineering and Automation
- Automate reliability checks, uptime probes, and health verifications using scripting or infrastructure-as-code tools.
- Implement synthetic monitoring and proactive alerting for critical paths (API, LMS, Portal, Database, etc.).
- Identify and eliminate recurring incidents by implementing preventive monitoring and self-healing automation.
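To give a sense of the kind of automation involved, here is a minimal sketch of an HTTP uptime probe using only the Python standard library. The names `ProbeResult`, `http_status`, and `probe`, the 2-second latency threshold, and the example URL are assumptions made for illustration, not a description of an existing internal tool.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    url: str
    healthy: bool
    status: Optional[int]  # None when no HTTP response was received
    latency_s: float


def http_status(url: str, timeout: float = 5.0) -> int:
    """Fetch the URL and return its HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses raise HTTPError; the code is still a valid status.
        return exc.code


def probe(url: str,
          fetch: Callable[[str], int] = http_status,
          max_latency_s: float = 2.0) -> ProbeResult:
    """Run one check: healthy means a 2xx/3xx status within the latency limit."""
    start = time.monotonic()
    try:
        status: Optional[int] = fetch(url)
    except OSError:
        # Covers connection errors, DNS failures, and timeouts.
        status = None
    latency = time.monotonic() - start
    healthy = (status is not None
               and 200 <= status < 400
               and latency <= max_latency_s)
    return ProbeResult(url, healthy, status, latency)


if __name__ == "__main__":
    # Hypothetical endpoint; replace with a real health-check URL.
    print(probe("https://example.com/"))
```

The `fetch` parameter is injectable so the check logic can be tested without network access; a production probe would more likely run on a schedule and export its results to the alerting system.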
Collaboration and Knowledge Sharing
- Partner with DevOps, Development, QA, and Support teams to enhance observability and response processes.
- Provide data insights and reliability reports that support development and optimization decisions.
- Maintain documentation and runbooks for monitoring, alerts, and incident handling.
- Advocate for a data-driven reliability culture across all technical teams.
Continuous Improvement
- Continuously refine monitoring dashboards, alerting rules, and uptime metrics for better accuracy and usability.
- Evaluate and integrate emerging observability and reliability tools to enhance the monitoring stack.
- Conduct reliability reviews after major releases or infrastructure changes.
Qualifications
Technical Expertise
- Solid experience with observability and monitoring systems: Grafana, Prometheus, Loki, Alertmanager, ELK Stack, or similar.
- Strong knowledge of cloud infrastructure (AWS, GCP, Oracle Cloud, or Azure) and Linux systems administration.
- Familiarity with Infrastructure as Code (Terraform, Ansible) and CI/CD pipelines.
- Understanding of incident response, post-mortem analysis, and reliability metrics (SLI/SLO/SLA).
- Competency in scripting (Bash, Python, PHP) for automation and tool integration.
- Experience with containers and container orchestration (Docker, Kubernetes) and associated monitoring.
- Knowledge of network performance monitoring and synthetic testing tools.
Problem-Solving
- Analytical mindset with the ability to transform raw metrics into actionable insights.
- Capable of diagnosing performance bottlenecks and improving system reliability through automation.
- Skilled in root-cause analysis and preventive design.
Communication
- Excellent communication and documentation skills for incident summaries and cross-team updates.
- Ability to collaborate effectively with technical and non-technical teams.
- Advanced English proficiency, both written and spoken, to report incidents and produce public-facing updates.