$120K - $180K a year
Manage platform operational aspects including availability, latency, throughput, monitoring, incident response, and capacity planning while collaborating with a global engineering team.
7+ years software engineering experience with C++, Java, Python, or Go, Linux and storage technology expertise, configuration management skills, strong analytical and communication abilities, and a Bachelor's degree or equivalent.
Description:
• Expertise with Linux engineering and administration for thousands of bare metal servers and virtual machines
• Responsible for troubleshooting server hardware issues
• Responsible for all operational aspects of our platform: Availability, Latency, Throughput, Monitoring, Issue Response (analysis, remediation, deployment), and Capacity Planning with respect to Latency and Throughput
• Work in a team of highly motivated engineers distributed across the globe
• Use your passion for technology, automation, and tooling to ensure our platform operates 24x7
• Obsess about learning and champion the newest technologies and tricks with others, raising the technical IQ of the team
• Have broad exposure to our entire architecture and become one of our experts in our overall process flow
• Have an intrinsic drive to make things better
• Have experience with modern monitoring and telemetry stacks (ELK, Prometheus, Grafana)
• Gather and analyze metrics from both operating systems and applications to assist in performance tuning
• Ability to lead incident analysis, champion incident response practices, help correlate incidents with systemic problems, and drive toward resolution

Requirements:
• Bachelor's degree and/or equivalent experience in Computer Science
• A minimum of 7 years of experience in software engineering
• A minimum of 7 years of experience in one or more of: C++, Java, Python, Go
• Experience with storage technologies (Examples: SAN, NAS, NFS, Object Storage, FreeNAS, iSCSI)
• Experience with infrastructure technologies (Examples: Linux, Windows, VMware, Docker, Kubernetes)
• Experience writing technical documentation
• Configuration management experience with one or more tools such as Puppet, Chef, Ansible
• Solid understanding of application design, including operational trade-offs of various designs
• Analytical skills coupled with a strong sense of urgency, ownership, and drive
• Ability to work well in a team-focused environment with other SREs and Engineers
• Ability to communicate and present the reliability team's recommended conventions broadly

Benefits:
• Remote-friendly and flexible work culture
• Market leader in compensation and equity awards
• Comprehensive physical and mental wellness programs
• Competitive vacation and holidays to recharge
• Paid parental and adoption leave
• Professional development opportunities for all employees regardless of level or role
• Employee Networks, geographic neighborhood groups, and volunteer opportunities to build connections
• Vibrant office culture with world-class amenities
• Great Place to Work Certified™ across the globe
This job posting was last updated on 10/11/2025