Job Description
Role Summary
We are looking for a Senior DevOps / Site Reliability Engineer (SRE) to ensure the reliability, scalability, performance, and security of our platform and cloud infrastructure. You will play a key role in building and operating cloud-native systems, improving observability, automating operations, implementing SRE best practices (SLOs/SLIs), and supporting development teams to deliver highly available services.
Key Responsibilities
- Design, implement, and maintain highly available and scalable infrastructure on AWS.
- Own and improve the reliability of production systems using SRE principles (SLOs, SLIs, error budgets).
- Build and manage CI/CD pipelines to support fast and safe software delivery.
- Develop and maintain Infrastructure as Code (IaC) using tools such as Terraform, Ansible, and CloudFormation.
- Manage and optimize container orchestration platforms (Kubernetes, Docker, Helm).
- Implement and maintain monitoring, logging, and alerting solutions (Prometheus, Grafana, ELK, Datadog, Splunk).
- Lead incident response, perform root cause analysis, and write postmortems to drive continuous improvement.
- Improve system performance, capacity planning, scaling strategies, and disaster recovery processes.
- Collaborate closely with development teams to improve deployment strategies and system resilience.
- Implement security best practices (IAM, secret management, vulnerability scanning, patching).
- Define operational standards, runbooks, documentation, and best practices for platform reliability.
- Participate in the on-call rotation and provide senior-level support for critical production issues.
Required Skills & Qualifications
- 5+ years of experience in DevOps / SRE / Cloud Infrastructure / Platform Engineering.
- Strong expertise in Linux systems administration and troubleshooting.
- Proven experience with Kubernetes in production environments.
- Strong experience with CI/CD tools (GitLab CI, Jenkins, GitHub Actions, Azure DevOps).
- Solid knowledge of Infrastructure as Code (Terraform highly preferred).
- Experience with cloud platforms: AWS, Azure, or Google Cloud.
- Strong understanding of networking fundamentals (TCP/IP, DNS, load balancing, reverse proxies).
- Experience with observability tools: monitoring, metrics, logging, tracing.
- Strong scripting skills (Bash, Python, or similar).
- English proficiency.