Luma AI

via Gem

Software Engineer - Cloud FinOps & Reliability

Anywhere

Full-time

Posted 12/6/2025

Key Skills:

Python

Cloud Cost Management

AWS

GCP

Kubernetes

Docker

SQL

Automation

Data Analysis

Compensation

Salary Range

$120K - $180K a year

Responsibilities

Manage and optimize multi-cloud GPU infrastructure costs through automation, forecasting, and collaboration with SRE and research teams.

Requirements

5+ years in SRE, DevOps, or Cloud Cost Engineering with strong Python skills and deep knowledge of cloud infrastructure and cost models.

Full Description

About Luma AI

Luma's mission is to build multimodal AI to expand human imagination and capabilities. This requires a massive, reliable, and performant GPU infrastructure that pushes the boundaries of scale. Our SRE team is the foundation of our research and product velocity, and this role is a critical new function within it, dedicated to ensuring the financial efficiency of the supercomputer we are building.

Where You Come In

This is a foundational engineering position for a technical, data-driven expert who gets excited about optimization at a massive scale. As a foundational member of our SRE team, you will specialize in FinOps and cloud cost management, owning the financial health of one of the world's largest multi-cloud GPU infrastructures. You will be an SRE who applies a deep understanding of cloud architecture and pricing models to find and eliminate inefficiency. You will use your software engineering skills to build the tools and automation required to govern our cloud spend, providing critical insights that allow us to scale our AI research and products sustainably.

What You'll Do

Analyze & Optimize: Actively monitor and analyze costs across our entire technical ecosystem, including multi-cloud infrastructure (AWS, GCP, OCI), on-premise clusters, and third-party services, to identify and execute on opportunities for cost optimization. Develop forecasting models to predict future spend and inform our capacity planning.

Manage & Commit: Develop and actively manage a multi-million dollar portfolio of Reserved Instances (RIs) and Savings Plans to maximize commitment-based discounts across our global GPU and CPU fleets.

Automate & Build: Apply a software engineering approach to design, build, and maintain custom tools and automation in Python and SQL. Your systems will track, analyze, and report on costs across our entire fleet of providers and services, with a focus on detecting anomalies immediately.

Partner & Advise: Working closely as an embedded member of the SRE team, you will partner with fellow SREs and research teams to model the cost implications of new models and infrastructure designs, providing expert guidance on cost-performance trade-offs.

Visualize & Report: Create and manage a centralized observability stack for cloud costs, building dashboards in tools like Grafana to give a real-time, granular view of our financial posture to all stakeholders.

Who You Are

You have 5+ years of experience in a technical role such as Site Reliability Engineer, DevOps Engineer, Infrastructure Engineer, or a dedicated Cloud Cost Engineer.

You have deep, hands-on expertise with the cost models and optimization levers of at least one major cloud provider (AWS, GCP), and a willingness to learn others.

You are proficient in Python for the purpose of scripting, data analysis, and building automation tooling.

You have a strong, foundational understanding of cloud infrastructure, including containerization (Docker, Kubernetes), networking, and storage.

You are not an accountant; you are a systems thinker who is passionate about applying engineering principles to solve financial challenges at scale.

You are a tenacious troubleshooter and a data-driven decision-maker who thrives on finding the "why" behind the numbers.

What Sets You Apart (Bonus Points)

Experience managing a monthly cloud spend in excess of $1 million.

Relevant certifications, such as the FinOps Certified Practitioner (FOCP).

Experience building custom cost allocation, showback, or chargeback systems from scratch.

A background working with large-scale GPU clusters for AI/ML workloads.

This job posting was last updated on 12/8/2025
