Job Description
Mission
Join the engineering team to support our migration to a modern, scalable, and compliant ML infrastructure on Google Cloud Platform (GCP).
You will learn how to operate production-grade ML systems under the mentorship of a Lead MLOps Engineer, while contributing directly to infrastructure modernization, CI/CD pipelines, and MLOps workflows.
Responsibilities
- Support our cloud migration roadmap toward a modern GCP architecture.
- Implement security & performance improvements: Cloud Armor, CDN, KMS encryption, IAM policies.
- Build observability infrastructure: Prometheus, Grafana dashboards, distributed tracing, SLO monitoring.
- Deploy containerized workloads to Cloud Run and GKE Autopilot with autoscaling and GPU support.
- Set up asynchronous messaging with Pub/Sub, including idempotent message handling and dead-letter topics.
- Build Infrastructure as Code with Terraform/Terraspace and CI/CD pipelines (GitHub Actions).
- Optimize storage & costs: lifecycle policies, tiered storage, resource labeling, budget monitoring.
- Support compliance initiatives (FDA/HIPAA): audit logs, retention policies, least-privilege access.
- Write clear documentation and operational runbooks for all infrastructure components.
Stack You’ll Work With
- Cloud: GCP (Cloud Run, GKE Autopilot, Cloud Armor, Pub/Sub, GCS, Cloud CDN, KMS)
- IaC & CI/CD: Terraform, Terraspace, GitHub Actions, OPA/Conftest
- Languages: Python, Node.js, Bash, YAML/HCL
- Containers: Docker, Kubernetes
- Databases: MongoDB, Cloud SQL
Ideal Profile
- 0–2 years of experience in DevOps or Cloud.
- Basic understanding of cloud services (GCP preferred, AWS/Azure transferable).
- Familiarity with Linux, Docker, and Git.
- Curious about ML infrastructure and automation.
- Eager to learn under mentorship from a Lead MLOps Engineer.
- Good communication and documentation skills in English.
What You’ll Learn
- Design and deploy production-grade infrastructure on GCP.
- Build CI/CD pipelines and Infrastructure as Code with Terraform.
- Implement observability with SLOs, distributed tracing, and monitoring dashboards.
- Manage MLOps workflows (experiment tracking, model versioning, GPU workloads).
- Apply cloud security best practices (IAM, encryption, compliance).
- Optimize costs and performance in a regulated medical AI environment.