Job Description
About Sybilion
Sybilion builds AI-driven market forecasting for process industries (chemicals, packaging, pulp & paper, textiles, and broader manufacturing). We help procurement, supply chain, and commercial teams make better buy/sell decisions by turning messy external signals and internal operational data into clear, defensible forecasts that teams trust and act on.
Our stack includes Python-based microservices, PostgreSQL data infrastructure, and ML/AI workflows that support forecasting models and decision tooling.
About the Role
We’re hiring someone to own both our platform and data infrastructure: Kubernetes administration, Linux systems, CI/CD, observability, and PostgreSQL administration for our data lakes and ML pipelines. You’ll keep production reliable, fast, secure, and scalable, while supporting the day-to-day needs of our engineers and ML workflows.
This is an on-site role in Maia (Porto). We value in-person collaboration and move quickly.
What You’ll Do
Platform / Kubernetes / Systems
- Design, deploy, and operate Kubernetes clusters in production (networking, storage, security)
- Operate Linux server infrastructure (Ubuntu/RHEL), including patching, hardening, and reliability work
- Manage Docker image lifecycle (builds, optimisation, registry management, security scanning)
- Implement and maintain CI/CD pipelines for microservices deployments and infrastructure changes
- Build and maintain Infrastructure as Code (Terraform, Ansible, Helm) and Git workflows
- Operate and improve monitoring, logging, and alerting (Prometheus/Grafana, ELK/EFK/Loki, etc.)
- Manage secrets and credentials securely (Vault, Sealed Secrets, or equivalent)
- Own high availability, capacity planning, incident response, and disaster recovery readiness
- Support GPU-enabled workloads and ML/LLM deployments (resource allocation, utilisation, scaling)
PostgreSQL / Data Infrastructure
- Administer and optimise PostgreSQL databases and data lake infrastructure (performance, reliability, cost)
- Own backup/recovery and disaster recovery procedures (including point-in-time recovery)
- Design schemas, indexing strategies, and query optimisation approaches; analyse execution plans
- Manage migrations and versioning (schema changes, rollout strategies, rollback plans)
- Implement replication/failover/clustering patterns for high availability
- Own database security: access controls, encryption at rest/in transit, audit logging, compliance needs
Python Microservices / Data Pipelines / ML Workflows
- Support deployment and troubleshooting of Python microservices (FastAPI/Flask/Django or similar)
- Help maintain Python environments and dependency management (pip/poetry/conda/mamba)
- Support ETL/ELT pipelines feeding our data lake and ML training workflows
- Implement data quality checks and validation where needed
- Partner with engineers and the ML team to improve runtime performance, reliability, and operational visibility
Must-Have Experience (Required)
- 5+ years of hands-on production experience with Linux, Docker, Kubernetes, and PostgreSQL
- Strong Kubernetes administration skills (clusters, networking, ingress, storage, RBAC, security)
- Strong PostgreSQL administration skills (performance tuning, backups, replication/HA, security)
- Strong Linux systems skills (operations, troubleshooting, hardening)
- CI/CD experience (GitHub Actions/GitLab CI/Jenkins or similar)
- Infrastructure as Code experience (Terraform and/or Ansible; Helm for Kubernetes)
- Observability experience (metrics, logs, alerting; root-cause analysis)
- Solid Python literacy for debugging services and automating operational tasks
- Strong communication skills in English and comfort working independently end-to-end
- Willingness to participate in an on-call rotation for critical systems
Preferred (Nice to Have)
- Startup background (you’ve worked in small teams, moved fast, and owned outcomes end-to-end)
- Experience running ML infrastructure (MLflow, Kubeflow, Airflow, KServe/TorchServe, etc.)
- GPU cluster experience (NVIDIA GPU Operator or similar) and model serving optimisation
- Experience with service mesh (Istio/Linkerd)
- Experience with cloud managed databases (AWS RDS, GCP Cloud SQL, Azure Database)
- Familiarity with data lake / warehouse patterns and data versioning (DVC/MLflow tracking)
- Experience with Redis/MongoDB or other complementary data systems
Soft Skills We Value
- Strong problem-solving and analytical mindset
- Calm, structured incident handling and good judgement under pressure
- Proactive improvement orientation (you spot issues before they become outages)