$120K - $160K a year
Manage and ensure reliability, scalability, and performance of NVIDIA DGX and Cisco UCS HPC infrastructure, automate operations using Python, Ansible, Terraform, and Go, and deliver automation via CI/CD pipelines.
Requires experience with NVIDIA DGX or equivalent HPC clusters, Cisco UCS C885A, and Docker, along with automation tools such as Python, Ansible, Terraform, and Go, and CI/CD pipelines.
Position: AI Infra SRE Engineer – DGX
Location: Remote
Duration: Full-time

Must-have
• NVIDIA DGX or equivalent high-performance computing (HPC) clusters (e.g., Cray, HPE, IBM)
• Cisco UCS C885A
• Docker

Good to have
• DevOps automation
• CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins)
• Terraform, Ansible, Jenkins
• Python
• GoLang, C/C++
• Enterprise-grade Kubernetes cluster (Red Hat OpenShift preferred) and/or Google Anthos
• Software development lifecycle, including design, development, testing, packaging, and deployment, using Golang

Roles & Responsibilities
• Technical knowledge of high-performance compute, NVIDIA DGX/GPUs, and/or Cisco Unified Computing System.
• Handle availability, latency, scalability, and efficiency of NVIDIA and Cisco UCS infrastructure by instilling engineering reliability into the development life cycle, with a focus on fault-tolerant approaches.
• Drive capacity planning, performance analysis, instrumentation, and other non-functional systems requirements.
• Automate operational capabilities using Python, Ansible, Terraform, Go, etc.
• Deliver automation through CI/CD pipelines, chatbots, etc.
• Implement metrics-driven processes to ensure service quality targets are met.
This job posting was last updated on 10/11/2025