Deepgram

via Remote Rocketship

Site Reliability Engineer – AI & ML Infrastructure, Kubernetes, AWS, Terraform

Anywhere

Full-time

Posted 3/9/2026

Key Skills:

AWS

Cloud Infrastructure

CI/CD Pipelines

Python

Bash

System Architecture

Compensation

Salary Range

$85K - $130K a year

Responsibilities

Architect and maintain Kubernetes-based AI/ML infrastructure with Terraform and manage HPC GPU resources.

Requirements

Requires 5+ years in SRE/Platform Engineering with expertise in Kubernetes, Terraform, Slurm, bare metal management, and scripting.

Full Description

Job Description:

• Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services.
• Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.
• Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.
• Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing.
• Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.
• Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
• Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.
• Automate the life cycle of single-tenant, managed deployments.

Requirements:

• 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
• Proven, hands-on experience building and managing production infrastructure with Terraform
• Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment
• Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads
• Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management
• Strong scripting and automation skills (e.g., Python, Go, Bash)

Benefits:

• Medical, dental, vision benefits
• Annual wellness stipend
• Mental health support
• Life, STD, LTD Income Insurance Plans
• Unlimited PTO
• Generous paid parental leave
• Flexible schedule
• 12 Paid US company holidays
• Quarterly personal productivity stipend
• One-time stipend for home office upgrades
• 401(k) plan with company match
• Tax Savings Programs
• Learning / Education stipend
• Participation in talks and conferences
• Employee Resource Groups
• AI enablement workshops / sessions

This job posting was last updated on 3/10/2026
