DevOps Engineer

Job Description

Role Summary

We are looking for a Senior DevOps / Site Reliability Engineer (SRE) to ensure the reliability, scalability, performance, and security of our platform and cloud infrastructure. You will play a key role in building and operating cloud-native systems, improving observability, automating operations, implementing SRE best practices (SLOs/SLIs), and supporting development teams to deliver highly available services.

Key Responsibilities

  • Design, implement, and maintain highly available and scalable infrastructure on AWS.
  • Own and improve the reliability of production systems using SRE principles (SLO, SLI, error budgets).
  • Build and manage CI/CD pipelines to support fast and safe software delivery.
  • Develop and maintain Infrastructure as Code (IaC) using Terraform, Ansible, CloudFormation, etc.
  • Manage and optimize container orchestration platforms (Kubernetes, Docker, Helm).
  • Implement and maintain monitoring, logging, and alerting solutions (Prometheus, Grafana, ELK, Datadog, Splunk).
  • Lead incident response, perform root cause analysis, and write postmortems to drive continuous improvement.
  • Improve system performance, capacity planning, scaling strategies, and disaster recovery processes.
  • Collaborate closely with development teams to improve deployment strategies and system resilience.
  • Implement security best practices (IAM, secret management, vulnerability scanning, patching).
  • Define operational standards, runbooks, documentation, and best practices for platform reliability.
  • Participate in the on-call rotation and provide senior-level support for critical production issues.

Required Skills & Qualifications

  • 5+ years of experience in DevOps / SRE / Cloud Infrastructure / Platform Engineering.
  • Strong expertise in Linux systems administration and troubleshooting.
  • Proven experience with Kubernetes in production environments.
  • Strong experience with CI/CD tools (GitLab CI, Jenkins, GitHub Actions, Azure DevOps).
  • Solid knowledge of Infrastructure as Code (Terraform highly preferred).
  • Experience with cloud platforms: AWS, Azure, or Google Cloud.
  • Strong understanding of networking fundamentals (TCP/IP, DNS, load balancing, reverse proxies).
  • Experience with observability tools: monitoring, metrics, logging, tracing.
  • Strong scripting skills (Bash, Python, or similar).
  • Proficiency in English.