1 open position available
Manage and ensure the reliability, scalability, and performance of NVIDIA DGX and Cisco UCS HPC infrastructure; automate operations using Python, Ansible, Terraform, and Go; and deliver automation via CI/CD pipelines.

Experience with NVIDIA DGX or equivalent HPC clusters, Cisco UCS C885A, Docker, automation tools such as Python, Ansible, Terraform, and Go, and CI/CD pipelines.

Position: AI Infra SRE Engineer – DGX
Location: Remote
Duration: Full-time

Must-have
• NVIDIA DGX or equivalent high-performance computing (HPC) clusters (e.g., Cray, HPE, IBM)
• Cisco UCS C885A
• Docker

Good to have
• DevOps automation
• CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins)
• Terraform, Ansible, Jenkins
• Python
• Golang, C/C++
• Enterprise-grade Kubernetes cluster (Red Hat OpenShift preferred) and/or Google Anthos
• Software development lifecycle experience, including design, development, testing, packaging, and deployment using Golang

Roles & Responsibilities
• Technical knowledge of high-performance computing, NVIDIA DGX/GPUs, and/or Cisco Unified Computing System (UCS).
• Handle availability, latency, scalability, and efficiency of NVIDIA and Cisco UCS infrastructure by instilling engineering reliability into the development life cycle, with a focus on fault-tolerant approaches.
• Drive capacity planning, performance analysis, instrumentation, and other non-functional systems requirements.
• Automate operational capabilities using Python, Ansible, Terraform, Go, etc.
• Deliver automation through CI/CD pipelines, chatbots, etc.
• Implement metrics-driven processes to ensure service quality targets are met.