Jobgether

via Workable

Senior Cluster Site Reliability Engineer

Anywhere
full-time
Posted 10/3/2025
Key Skills:
Site Reliability Engineering
DevOps
HPC
Cloud Infrastructure
Automation
Monitoring
Scripting
Infrastructure-as-Code
Distributed Computing
Machine Learning
Observability
Security
Containerization
Kubernetes
Batch Compute
Networking

Compensation

Salary Range

$205K - $235K a year

Responsibilities

Ensure the reliability, scalability, and performance of critical research compute clusters while maintaining and optimizing both on-premises and cloud infrastructure. Act as a first responder to cluster outages or performance issues and collaborate with engineering and research teams to drive systemic improvements.

Requirements

Candidates should have 5+ years of experience in SRE, DevOps, or similar roles, with expertise in HPC/batch compute frameworks and ML training systems. Proficiency in scripting and hands-on experience with cloud platforms and distributed storage systems are also required.

Full Description

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Cluster Site Reliability Engineer in California (USA).

This role is designed for a highly skilled engineer to ensure the reliability, scalability, and performance of critical research compute clusters. You will maintain and optimize both on-premises and cloud infrastructure while implementing automation and SRE best practices. Working closely with engineering and research teams, you will solve real-time operational issues, drive systemic improvements, and build observability frameworks to monitor cluster health. Your work will directly impact cutting-edge machine learning research, enabling teams to operate efficiently at scale. This position offers the opportunity to apply your technical expertise to complex distributed systems and HPC environments while collaborating with a high-performing, innovative team.

Accountabilities:
Act as a first responder to cluster outages or performance issues, triaging and resolving urgent problems efficiently.
Maintain high uptime and define, track, and report on SLAs to quantify reliability.
Diagnose recurring systemic issues and engineer long-term solutions in collaboration with engineering teams.
Develop and maintain observability and monitoring frameworks, including custom metrics for cluster health.
Support policy design for fair cluster usage and implement enforcement mechanisms for research teams.
Forecast cluster growth, optimize scaling strategies, and improve operational efficiency across cost, performance, and usability dimensions.
Collaborate with software and research teams to support distributed computing and machine learning workflows.

Requirements:
5+ years of experience in SRE, DevOps, or similar senior engineering roles.
Expertise in HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or ML training systems (Kubeflow, MLflow, Horovod).
Proficiency in scripting (Python, Ruby, or similar) and infrastructure-as-code/configuration management (Terraform, Ansible).
Hands-on experience with cloud platforms (AWS or GCP) and distributed storage systems (Lustre, Ceph, S3).
Strong familiarity with observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry).
Bachelor’s degree in Computer Science or equivalent experience.
Systematic, automation-driven mindset with a focus on reliability engineering.
Experience with HPC frameworks, Kubernetes-based job orchestrators, and distributed computing frameworks (Ray, Dask, Spark).
Knowledge of ML frameworks (PyTorch, TensorFlow, JAX, Horovod, DeepSpeed).
Experience with hybrid or on-prem/cloud environments and HPC networking (InfiniBand, RDMA).
Strong security/IAM understanding, including Zero Trust and cloud IAM.
Proficiency with containerization (Docker, Podman, Singularity) for HPC/batch compute environments.

Benefits:
Base salary: $205,000 – $235,000 (depending on experience and location).
Comprehensive benefits package: medical, dental, and vision coverage; life and AD&D insurance.
Paid time off: 20 vacation days and 9 sick days annually.
Retirement plan: 401(k) with company match.
Opportunities to work on cutting-edge HPC and ML infrastructure at scale.

Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching.

When you apply, your profile goes through our AI-powered screening process designed to identify top talent efficiently and fairly.
🔍 Our AI evaluates your CV and LinkedIn profile thoroughly, analyzing your skills, experience, and achievements.
📊 It compares your profile to the job’s core requirements and past success factors to determine your match score.
🎯 Based on this analysis, we automatically shortlist the 3 candidates with the highest match to the role.
🧠 When necessary, our human team may perform an additional manual review to ensure no strong profile is missed.

The process is transparent, skills-based, and free of bias, focusing solely on your fit for the role. Once the shortlist is completed, we share it directly with the company that owns the job opening. The final decision and next steps (such as interviews or additional assessments) are then made by their internal hiring team.

Thank you for your interest!

This job posting was last updated on 10/4/2025
