NVIDIA

via LinkedIn

Product Manager, Health Automation and Resilience

Redmond, WA

Full-time

Posted 2/20/2026

Key Skills:

Agile Product Management

Cloud-Native Solutions

React

API Development

Data Analytics

Compensation

Salary Range

$168K - $328K a year

Responsibilities

Lead product backlog prioritization and collaboration with cross-functional teams to deliver digital product solutions.

Requirements

8+ years of relevant experience, including demonstrated experience leading technical products within cloud infrastructure, distributed systems, reliability engineering, or related fields.

Full Description

NVIDIA DGX Cloud is searching for a highly technical Product Manager to guide Health Automation and Resilience efforts for AI infrastructure. This role is responsible for developing products for fault detection, failure classification, automated repair workflows, and resilience tooling that enables consistent GPU fleet performance. You will build the next generation of health automation capabilities including detection pipelines, classification mechanisms, repair automation, and distributed resilience methods.

The position lies at the crossroads of distributed systems, observability, GPU hardware, and cloud operations. You will collaborate with engineering teams to transform signals, telemetry, and operational lessons into automation infrastructure that improves cloud provider efficiency and end-user experience at scale. If you are motivated by building foundational systems that enable large AI clusters to operate dependably and efficiently, we would love to hear from you.

What You Will Be Doing

• Establish the product vision and strategy for Health Automation and Resilience across DGX Cloud and partner GPU fleets.
• Partner with engineering on the architecture and delivery of software agents, services, control loops, and distributed health components.
• Convert hardware signals, telemetry pipelines, and operational insights into automation systems that reduce manual intervention.
• Work with cloud providers and enterprise operators to understand failure modes and operational challenges.
• Develop product specifications, technical requirements, and validation criteria for both internal and open-source components.
• Support go-to-market activities including documentation, demos, partner enablement, and release readiness.
• Track trends in observability, SRE practices, distributed systems, and automated operations to define long-term strategy.
• Lead product technical reviews, customer conversations, and planning sessions.
What We Need To See

• Bachelor’s degree in Computer Science, Engineering, or a similar area, or equivalent experience.
• 8+ years of relevant experience including demonstrated experience leading technical products within cloud infrastructure, distributed systems, reliability engineering, or related fields.
• Track record defining multi-quarter strategy and leading execution with multiple engineering teams.
• Ability to craft clear product requirements, work directly with engineering partners on technical decisions, and compose system-level workflows.
• Strong architectural understanding of control planes, telemetry systems, health monitoring, repair workflows, or automated remediation systems.
• Understanding of telemetry signals, SLOs, failure modes, and repair workflows in production environments.
• Experience building automation, resilience, or failure-recovery capabilities for large-scale cloud or HPC environments.
• Experience working with open-source technologies or products for software developers.
• Excellent communication skills across engineering, customers, and executives.

Ways To Stand Out From The Crowd

• Experience with GPU-accelerated compute, HPC systems, or large-scale AI clusters.
• Knowledge of Kubernetes operators, node health workflows, autoscaling, or control-plane automation.
• Experience with modern observability and diagnostics technologies such as Prometheus, OpenTelemetry, eBPF, or distributed tracing.
• Contributions to infrastructure or reliability open-source communities.
• Experience writing detailed build documents for software agents, distributed services, or platform-level components.

NVIDIA is widely considered to be one of the technology world’s most desirable employers! We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 168,000 USD - 258,750 USD for Level 4, and 208,000 USD - 327,750 USD for Level 5. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until January 13, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

JR2009290

This job posting was last updated on 2/24/2026
