$120K - $180K a year
Lead design and optimization of cloud-native services using Kubernetes, Terraform, and GitOps, develop high-performance distributed systems, and improve infrastructure reliability and scalability.
5+ years managing large-scale cloud-native distributed systems with expertise in Kubernetes, Terraform, GitOps, Golang, observability, and incident response.
Description:
• Lead the design and optimization of cloud-native services using Kubernetes, Terraform, and GitOps tools like ArgoCD
• Develop high-performance integration patterns and manage scalable, distributed systems handling extensive data volumes
• Dive into Golang and TypeScript codebases to identify and resolve performance bottlenecks at scale
• Optimize infrastructure and application code to achieve aggressive performance and reliability targets, with a focus on chess programming at the bit level
• Work closely with development teams to refine cloud service integration architectures and implement best practices
• Monitor and enhance system reliability and performance through effective collaboration and innovative solutions
• Participate in incident response for critical infrastructure issues, ensuring rapid resolution and minimal downtime
• Drive improvements in infrastructure reliability, scalability, and operational efficiency
• Utilize Terraform and Kubernetes to manage and scale our cloud infrastructure, ensuring robust, automated deployment processes

Requirements:
• 5+ years of experience managing and scaling large-scale, cloud-native distributed systems
• Deep understanding of Kubernetes, Terraform, and GitOps practices
• Expertise in observability practices and the ability to support incident response / on-call
• Extensive experience in high-performance service development with Golang
• Proven ability to profile and optimize applications for high throughput and reliable operation
• Strong knowledge of distributed systems design, failure modes, and robust architectural principles
• Experience with data modeling and indexing strategies to support efficient service operations
• Demonstrated experience improving system reliability and performance through deep code-level and architectural analysis
• Excellent written and verbal communication skills
• Experience working in globally distributed teams

Benefits:
• 100% remote (work from anywhere!)
This job posting was last updated on 10/11/2025