Job Description
We are seeking an experienced DevOps Engineer to join our Data and Commercialization team. This role will help bolster the reliability and performance of the data and AI platforms. You will build and operate infrastructure in GCP, implementing modern DevOps practices to keep services resilient, observable, and scalable. In this role, you will design CI/CD pipelines, manage infrastructure-as-code, and implement monitoring and alerting tools to ensure the smooth delivery of critical internal and user-facing workloads.
The DevOps Engineer will carry out root-cause analysis across distributed systems, manage networking configurations, and harden deployments for security and compliance. The ideal candidate brings strong experience running production workloads in GCP, particularly on Cloud Run and GKE, along with hands-on support skills such as debugging and incident management.
Experience Required: 4+ years
Key Responsibilities:
- Operate and enhance production AI and data systems, ensuring high availability, reliability, and cost-efficient performance.
- Design, implement, and maintain CI/CD pipelines to streamline and control the deployment of AI products.
- Build and manage infrastructure-as-code (Terraform preferred) to support scalable, secure, and repeatable GCP environments.
- Develop observability frameworks, extending the capabilities of Cloud Monitoring and Logging to fit the needs of our team’s solutions.
- Support deployed applications and incident response by contributing to runbooks and alerting strategies that maintain service continuity.
- Perform root-cause analysis and advanced troubleshooting across distributed systems, and apply the findings to make live applications more reliable.
Qualifications:
- Strong, hands-on experience with Google Cloud Platform, including Cloud Run, GKE, Cloud Monitoring, Cloud Logging, Cloud Build, and related services. Familiarity with AWS is a plus.
- Experience deploying infrastructure-as-code with Terraform or another IaC tool.
- Proven ability to debug and troubleshoot distributed applications in production environments, including identifying and remediating networking issues, IAM misconfigurations, container performance, and service-to-service communication.
- Experience implementing CI/CD pipelines for application and/or infrastructure code.
- Strong background in observability, including monitoring, logging, tracing, and error reporting across cloud architectures.
- Solid understanding of cloud security, including IAM role design, service accounts, secrets management, and networking fundamentals.
- Driven mindset, ready to establish secure and efficient patterns that will streamline the development process of a team at the cutting edge of AI and data.