Site Reliability Engineer

Site Reliability Engineer – Mistral AI

Location: London, UK (Paris also possible)
Visa Sponsorship: Available (Skilled Worker)

About Mistral AI

At Mistral AI, we believe AI should simplify tasks, save time, and enhance learning and creativity. Our high-performance, open-source AI models and solutions integrate seamlessly into daily work. Our platform includes le Chat, the AI assistant for life and work. We are a collaborative, low-ego, and innovative team with offices in France, the USA, the UK, Germany, and Singapore. Learn more about our culture: Mistral AI Careers.

Role Summary

We’re seeking an experienced Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our platform and customer-facing applications. You’ll work closely with software engineers and AI/ML teams to maintain top-tier operational standards.

Responsibilities

Operations (50%)

  • Build and maintain scalable, fault-tolerant infrastructure for web services and ML workloads
  • Ensure high availability of platforms, training environments, and HPC clusters
  • Troubleshoot production issues, participate in on-call rotations, and perform root-cause analysis
  • Implement monitoring, alerting, and incident response systems
  • Manage CI/CD pipelines, containerization, orchestration, and logging systems

Development (50%)

  • Automate infrastructure deployment and orchestration using Kubernetes, Terraform, Flux, etc.
  • Collaborate with researchers to enable safe, reproducible model-training experiments
  • Build a cloud-agnostic platform bridging science and infrastructure
  • Develop workflows, dashboards, scripts, and APIs to improve system reliability and performance
  • Ensure security compliance and document best practices
  • Contribute to open-source projects and technical publications

Requirements

  • Master’s in Computer Science, Engineering, or a related field
  • 7+ years in DevOps/SRE roles with experience in highly available distributed systems
  • Strong knowledge of cloud computing, monitoring, CI/CD, containerization (Docker/Kubernetes), scripting (Python/Go/Bash), and infrastructure-as-code (Terraform/CloudFormation)
  • Hands-on experience with observability tools (Prometheus, Grafana, ELK, Datadog)
  • Excellent problem-solving and communication skills, and the ability to thrive in a startup environment

Bonus Skills

  • Experience in AI/ML environments
  • High-performance computing (HPC) systems and workload managers (Slurm)
  • Familiarity with AI-oriented platforms like Fluidstack, CoreWeave, or Vast

Location & Remote Work

  • Primarily based in Paris HQ, with flexibility for remote work
  • Remote candidates may be considered in France, the UK, Germany, the Netherlands, Spain, or Italy, with required office visits for onboarding and monthly collaboration

What We Offer

  • Competitive salary + equity
  • Health insurance, transport & sport allowances, meal vouchers
  • Private pension plan & generous parental leave
  • Visa sponsorship

Apply Now

Join a pioneering AI company shaping the future of technology

 
