via Workable
$131K - $201K a year
Design and implement observability and automation practices to ensure secure, compliant, and highly available cloud production systems, participate in incident management, and establish SRE culture.
5+ years in software engineering, SRE, or DevOps with experience in cloud environments, infrastructure as code, Kubernetes, observability tools, incident management, and preferably regulated environments.
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Software Engineer - Reliability (Remote) in California (USA).

We are seeking a Senior Software Engineer specializing in Reliability to help design, implement, and operate systems that ensure cloud-based production environments remain secure, compliant, and highly available. In this role, you will be a foundational member of a new Site Reliability Engineering (SRE) team, building processes and infrastructure to support mission-critical workloads in regulated environments. You will collaborate with engineering, product, and operational teams to define service-level objectives, develop monitoring and automation, and improve overall system reliability. The ideal candidate is experienced in cloud infrastructure, automation, and observability, and enjoys solving complex distributed system challenges. This role offers the opportunity to shape SRE culture and practices from the ground up while contributing to high-impact projects that support regulated and commercial operations.

Accountabilities
· Design and implement observability practices including metrics, traces, dashboards, logs, and alerting for production systems.
· Partner with engineering, product, and lab teams to define SLIs/SLOs, error budgets, and incident response procedures.
· Develop and maintain operational playbooks and runbooks for reliability and compliance.
· Participate in on-call rotations, championing automation and self-healing for production systems.
· Contribute to deployment processes and infrastructure automation using Infrastructure as Code (IaC).
· Collaborate on incident reviews, postmortems, and disaster recovery exercises to improve system reliability.
· Mentor peers, promote best practices, and help establish the SRE culture and strategy.

Requirements
· Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
· 5+ years of experience in software engineering, SRE, or DevOps roles (Python or Go preferred).
· Hands-on experience deploying and operating production workloads in cloud environments (AWS, GCP, or Azure).
· Expertise in Infrastructure as Code (Terraform, Pulumi, Bicep/ARM).
· Experience with incident management platforms (e.g., incident.io, ServiceNow, Opsgenie, PagerDuty).
· Strong knowledge of Kubernetes (AKS, GKE, EKS) and cloud networking.
· Proficiency with observability platforms such as Datadog, Prometheus/Grafana, or OpenTelemetry.
· Excellent troubleshooting, root-cause analysis, and automation skills.
· Ability to work autonomously and collaborate effectively with cross-functional teams.
· Experience in regulated environments (healthcare, biotech) and familiarity with compliance-driven change management is a plus.

Benefits
· Competitive salary: $131,325–$201,000 USD, with potential for pre-IPO equity and cash bonuses.
· Comprehensive medical, dental, and vision coverage.
· Paid time off and holidays.
· Remote work flexibility.
· Opportunities for professional growth, mentorship, and leadership in a foundational SRE team.
· Participation in shaping processes for high-reliability systems in regulated environments.

Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching. When you apply, your profile goes through our AI-powered screening process designed to identify top talent efficiently and fairly.
🔍 Our AI evaluates your CV and LinkedIn profile thoroughly, analyzing your skills, experience, and achievements.
📊 It compares your profile to the job’s core requirements and past success factors to determine your match score.
🎯 Based on this analysis, we automatically shortlist the 3 candidates with the highest match to the role.
🧠 When necessary, our human team may perform an additional manual review to ensure no strong profile is missed.
The process is transparent, skills-based, and free of bias, focusing solely on your fit for the role. Once the shortlist is completed, we share it directly with the company that owns the job opening. The final decision and next steps (such as interviews or additional assessments) are then made by their internal hiring team.

Thank you for your interest!
This job posting was last updated on 11/26/2025