Job Description

Mission

Join the engineering team to support our migration to a modern, scalable, and compliant ML infrastructure on Google Cloud Platform (GCP).

You will learn how to operate production-grade ML systems under the mentorship of a Lead MLOps Engineer, while contributing directly to infrastructure modernization, CI/CD pipelines, and MLOps workflows.

Responsibilities

  • Support our cloud migration roadmap toward a modern GCP architecture.
  • Implement security & performance improvements: Cloud Armor, CDN, KMS encryption, IAM policies.
  • Build observability infrastructure: Prometheus, Grafana dashboards, distributed tracing, SLO monitoring.
  • Deploy containerized workloads to Cloud Run and GKE Autopilot with autoscaling and GPU support.
  • Set up asynchronous messaging with Pub/Sub, including idempotent message handling and dead-letter topics.
  • Build Infrastructure as Code with Terraform/Terraspace and CI/CD pipelines (GitHub Actions).
  • Optimize storage & costs: lifecycle policies, tiered storage, resource labeling, budget monitoring.
  • Support compliance initiatives (FDA/HIPAA): audit logs, retention policies, least-privilege access.
  • Write clear documentation and operational runbooks for all infrastructure components.

Stack You’ll Work With

  • Cloud: GCP (Cloud Run, GKE Autopilot, Cloud Armor, Pub/Sub, GCS, Cloud CDN, KMS)
  • IaC & CI/CD: Terraform, Terraspace, GitHub Actions, OPA/Conftest
  • Languages: Python, Node.js, Bash, YAML/HCL
  • Containers: Docker, Kubernetes
  • Databases: MongoDB, Cloud SQL

Ideal Profile

  • 0–2 years of experience in DevOps or Cloud.
  • Basic understanding of cloud services (GCP preferred, AWS/Azure transferable).
  • Familiarity with Linux, Docker, and Git.
  • Curious about ML infrastructure and automation.
  • Eager to learn under mentorship from a Lead MLOps Engineer.
  • Good communication and documentation skills in English.

What You’ll Learn

  • Design and deploy production-grade infrastructure on GCP.
  • Build CI/CD pipelines and Infrastructure as Code with Terraform.
  • Implement observability with SLOs, distributed tracing, and monitoring dashboards.
  • Manage MLOps workflows (experiment tracking, model versioning, GPU workloads).
  • Apply cloud security best practices (IAM, encryption, compliance).
  • Optimize costs and performance in a regulated medical AI environment.