Find your dream job faster with JobLogr
AI-powered job search, resume help, and more.
Try for Free
Deepgram

Deepgram

via Remote Rocketship

All our jobs are verified from trusted employers and sources. We connect to legitimate platforms only.

Site Reliability Engineer – AI & ML Infrastructure, Kubernetes, AWS, Terraform

Anywhere
Full-time
Posted 3/9/2026
Verified Source
Key Skills:
AWS
Cloud Infrastructure
CI/CD Pipelines
Python
Bash
System Architecture

Compensation

Salary Range

$85K - 130K a year

Responsibilities

Architect and maintain Kubernetes-based AI/ML infrastructure with Terraform and manage HPC GPU resources.

Requirements

Requires 5+ years in SRE/Platform Engineering with expertise in Kubernetes, Terraform, Slurm, bare metal management, and scripting.

Full Description

Job Description: • Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services. • Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated. • Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources. • Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing. • Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments. • Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning. • Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle. • Automate the life cycle of single-tenant, managed deployments Requirements: • 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE) • Proven, hands-on experience building and managing production infrastructure with Terraform • Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment • Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads • Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management • Strong scripting and automation skills (e.g., Python, Go, Bash) Benefits: • Medical, dental, vision benefits • Annual wellness stipend • Mental health support • Life, STD, LTD Income Insurance Plans • Unlimited PTO • Generous paid parental leave • Flexible schedule • 12 Paid US company holidays • Quarterly personal productivity stipend • One-time stipend for home office upgrades • 401(k) plan with company match • Tax Savings Programs • Learning / Education stipend • Participation in talks and conferences • Employee Resource Groups • AI enablement workshops / sessions

This job posting was last updated on 3/10/2026

Ready to have AI work for you in your job search?

Sign-up for free and start using JobLogr today!

Get Started »
JobLogr badgeTinyLaunch BadgeJobLogr - AI Job Search Tools to Land Your Next Job Faster than Ever | Product Hunt