$120K - 180K a year
Build and maintain distributed cloud services for LLM inference, including deployment, monitoring, and optimization.
Experience with cloud distributed systems, LLM hosting and inference, benchmarking tools, GPU software stacks, and distributed inference optimization.
Job Description:
>> Looking for developers with general distributed cloud services experience, with LLM experience as a secondary skill. GPU experience is now low on the list of preferred skills.

Dedicated Inference Service Requirements:
>> Deep experience building services in modern cloud environments on distributed systems (e.g., containerization (Kubernetes, Docker), infrastructure as code, CI/CD pipelines, APIs, authentication and authorization, data storage, deployment, logging, monitoring, alerting, etc.)
>> Experience working with Large Language Models (LLMs), particularly hosting them to run inference
>> Strong verbal and written communication skills. Your job will involve communicating with local and remote colleagues about technical subjects and writing detailed documentation.
>> Experience building or using benchmarking tools for evaluating LLM inference across various model, engine, and GPU combinations
>> Familiarity with LLM performance metrics such as prefill throughput, decode throughput, TPOT, and TTFT
>> Experience with one or more inference engines, e.g., vLLM, SGLang, and Modular Max
>> Familiarity with one or more distributed inference serving frameworks, e.g., llm-d, NVIDIA Dynamo, and Ray Serve
>> Experience with AMD and NVIDIA GPUs, using software such as CUDA, ROCm, AITER, NCCL, and RCCL
>> Knowledge of distributed inference optimization techniques: tensor/data parallelism, KV cache optimizations, smart routing, etc.
This job posting was last updated on 10/18/2025