via Remote Rocketship
$200K - $250K a year
Developing and maintaining large-scale systems, improving observability, and implementing SRE principles.
Extensive experience in full-stack web development, cloud platforms, and modern JavaScript frameworks, but lacking specific SRE, Python, and systems engineering expertise required for this role.
Job Description:
• Develop and maintain large-scale systems supporting critical use-cases including frontier model training for AI Infrastructure, driving reliability, operability, and scalability across global public and private clouds.
• Collaborate on tooling for HPC, GPU Training, and AI Model training workflows.
• Build tools and frameworks to improve observability, define actionable reliability metrics, and enable fast issue resolution, driving continuous improvement in system performance.
• Establish frameworks for operational maturity, lead sustainable incident response protocols, and conduct blameless postmortems to improve team efficiency and system resilience.
• Implement SRE fundamentals, including incident management, monitoring, and performance optimization, while designing automation tools to reduce manual processes and operational overhead.
• Work with engineering teams to deliver innovative solutions, uphold high standards for code and infrastructure, and contribute to hiring for a diverse, high-performing team.

Requirements:
• Degree in Computer Science or related field, or equivalent experience with 5+ years in Software Development, SRE, or Production Engineering.
• Proficiency in Python and at least one other language (C/C++, Go, Perl, Ruby).
• Expertise in systems engineering within Linux or Windows environments and cloud platforms (AWS, Azure, GCP, or OCI).
• Strong understanding of SRE principles, including error budgets, SLOs, SLAs, and Infrastructure as Code tools (e.g., Terraform CDK).
• Hands-on experience with observability platforms (e.g., ELK, Prometheus, Loki) and CI/CD systems (e.g., GitLab).
• Strong communication skills with the ability to convey technical concepts effectively to diverse audiences.
• Commitment to fostering a culture of diversity, curiosity, and continuous improvement.

Benefits:
• Comprehensive benefits package
• Equity
This job posting was last updated on 2/18/2026