Site Reliability Engineer – Mistral AI
Location: London, UK (Paris also possible)
Visa Sponsorship: Available (Skilled Worker)
About Mistral AI
At Mistral AI, we believe AI should simplify tasks, save time, and enhance learning and creativity. Our high-performance, open-source AI models and solutions integrate seamlessly into daily work. Our platform includes le Chat, the AI assistant for life and work. We are a collaborative, low-ego, and innovative team with offices in France, the USA, the UK, Germany, and Singapore. Learn more about our culture on the Mistral AI Careers page.
Role Summary
We’re seeking an experienced Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our platform and customer-facing applications. You’ll work closely with software engineers and AI/ML teams to maintain top-tier operational standards.
Responsibilities
Operations (50%)
- Build and maintain scalable, fault-tolerant infrastructure for web services and ML workloads
- Ensure high availability of platforms, training environments, and HPC clusters
- Troubleshoot production issues, participate in on-call rotations, and perform root-cause analysis
- Implement monitoring, alerting, and incident response systems
- Manage CI/CD pipelines, containerization, orchestration, and logging systems
Development (50%)
- Automate infrastructure deployment and orchestration using Kubernetes, Terraform, Flux, etc.
- Collaborate with researchers to enable safe, reproducible model-training experiments
- Build a cloud-agnostic platform bridging science and infrastructure
- Develop workflows, dashboards, scripts, and APIs to improve system reliability and performance
- Ensure security compliance and document best practices
- Contribute to open-source projects and technical publications
Requirements
- Master’s degree in Computer Science, Engineering, or a related field
- 7+ years of experience in DevOps/SRE roles, including work on highly available distributed systems
- Strong knowledge of cloud computing, monitoring, CI/CD, containerization (Docker/Kubernetes), scripting (Python/Go/Bash), and infrastructure-as-code (Terraform/CloudFormation)
- Hands-on experience with observability tools (Prometheus, Grafana, ELK, Datadog)
- Excellent problem-solving and communication skills, and the ability to thrive in a startup environment
Bonus Skills
- Experience in AI/ML environments
- Experience with high-performance computing (HPC) systems and workload managers such as Slurm
- Familiarity with AI-oriented platforms such as Fluidstack, CoreWeave, or Vast
Location & Remote Work
- Primarily based in Paris HQ, with flexibility for remote work
- Remote candidates may be considered in France, UK, Germany, Netherlands, Spain, or Italy with required office visits for onboarding and monthly collaboration
What We Offer
- Competitive salary + equity
- Health insurance, transport & sport allowances, meal vouchers
- Private pension plan & generous parental leave
- Visa sponsorship
Apply Now
Join a pioneering AI company shaping the future of technology