Your mission: build and maintain a secure, automated, and observable AWS foundation so engineers can ship faster, more safely, and at lower cost. You’ll own deployment velocity, system uptime, and cloud cost sanity across our ECS-based microservices.
What You’ll Own:
1. Platform Reliability:
- Design and maintain ECS clusters (Fargate/EC2) for multi-service workloads.
- Implement autoscaling, health checks, and blue/green rollouts for zero-downtime deployments.
- Build observability into everything — logs, metrics, traces — to shorten MTTR.
2. Delivery Automation:
- Architect and maintain CI/CD pipelines using GitHub Actions + CodePipeline/CodeBuild.
- Enforce testing, security scanning, and deployment gates as part of every release.
- Move from semi-manual deploys to fully automated pipelines across environments.
3. Network & Security:
- Manage VPC architectures (subnets, routing, gateways, VPN, endpoints).
- Manage Route 53 for internal/external DNS, health checks, and routing policies, along with SSL/TLS certificates.
- Maintain a multi-account setup with IAM least privilege, KMS encryption, and security baselines.
4. Infrastructure as Code:
- Define all infra in Terraform/CDK; no console drift.
- Use reviewed IaC changes and per-environment configurations to keep infrastructure repeatable and compliant.
5. Data Layer Operations:
- Operate and optimize ClickHouse and PostgreSQL clusters — backups, replication, partitioning, and tuning.
- Ensure RTO/RPO objectives are met and documented.
6. Monitoring & Debugging:
- Aggregate logs (CloudWatch, FireLens, OpenTelemetry).
- Build dashboards and alerts that highlight anomalies, not noise.
- Lead root-cause investigations across network, container, and app layers.
Core Tech Stack:
- AWS: ECS (Fargate/EC2), EC2, S3, VPC, Route 53, CloudWatch, CodePipeline, CodeBuild
- CI/CD: GitHub Actions, Docker, Terraform/CDK
- Databases: ClickHouse, PostgreSQL
- Application stack (a plus): FastAPI (Python), Node.js
- Networking: DNS, VPN, load balancers, PrivateLink, VPC peering, NAT, IGW
- Security: Multi-account strategy, IAM roles/policies, KMS, AWS Config, GuardDuty
Requirements:
- 5+ years running production workloads on AWS.
- Deep knowledge of ECS, CodePipeline, EC2/VPC, S3, and Docker.
- Proven track record of shipping secure, automated deployments.
- Strong understanding of networking and DNS fundamentals.
- Experience managing databases in production.
- Strong debugging and observability mindset.
- Clear written communication and operational discipline.
Nice to Have:
- Familiarity with FastAPI or Node.js applications to optimize deployment flows.
- Hands-on experience with cost optimization and cross-account automation (AWS Organizations, Control Tower).
- Experience setting up VPNs, bastion hosts, or SSO integration.
What Success Looks Like:
- All ECS services deployed via automated pipelines.
- CloudWatch dashboards and alerts in place for core systems.
- Verified ClickHouse and PostgreSQL backups/restores.
- Documented multi-account/VPC network topology.
- No manual deploys, no console changes.
Why This Role Matters:
This role defines the foundation for everything we build. The more you automate, the faster teams deliver.
You’ll directly impact uptime, developer productivity, and cloud spend — three metrics that define operational excellence.