$130K - $180K a year
Build and maintain scalable AI/ML infrastructure with observability, performance optimization, governance, and collaboration with data scientists and engineers.
7+ years software engineering with 3+ years deploying ML/AI systems, strong Python and Go skills, container orchestration, cloud platform expertise, CI/CD pipelines, and ML lifecycle tools experience.
About Start2Scale

Start2Scale is a talent acquisition advisory and recruitment consultancy, built for tech companies during their startup and scaleup stage. Our founding team brings experience supporting world-class tech companies including Atlassian, Microsoft, Google, AWS and multiple tech startups globally. We are proud to be the preferred TA support for numerous VC-backed scaleups across Australia, United States and internationally.

About Our Client

Our client is a Gartner Magic Quadrant Leader in its market vertical, powering its proprietary tech for thousands of companies worldwide. They're part of the next generation of innovative companies creating developer-friendly APIs that are better than building from scratch. Their customers include household names like major retail chains, global sports brands, and Fortune 500 companies who rely on their AI-powered discovery technology to connect users with what matters most.

The Opportunity

Join an elite 4-person team reporting directly to the CxO level, tackling the highest priority internal problems for capability and organizational posture uplift. This isn't just another engineering role: you'll have direct access to executive leadership and organization-wide impact on the future of their AI-powered technology.

As a Senior AI/MLOps Engineer on this critical team, you'll bridge the gap between cutting-edge data science research and robust, scalable AI services that power experiences for millions of users. You'll be responsible for converting research prototypes into production-ready systems that operate at massive scale with sub-second latency requirements.

This role offers exceptional visibility into strategic initiatives and the opportunity to shape the technical direction of a market-leading platform used by thousands of businesses globally.

What You'll Do

Observability & Reliability
• Define and monitor SLIs/SLOs for model latency, throughput, accuracy, drift, and cost
• Integrate logging, tracing, and metrics (Datadog etc.) and establish alerting & on-call practices

Data & Feature Engineering
• Collaborate with data engineers to create scalable pipelines that ingest clickstream logs, catalog metadata, images, and user signals
• Implement real-time and offline feature extraction, validation, and lineage tracking

Performance & Cost Optimization
• Profile models and services; leverage hardware acceleration (GPU, TPU), libraries (ONNX, OpenVINO), and caching strategies (Redis, Faiss) to meet aggressive latency targets
• Right-size clusters and workloads to balance performance with cloud spend

Governance & Compliance
• Embed security, privacy, and responsible-AI checks in pipelines; manage secrets, IAM roles, and data-access controls via Terraform or CloudFormation
• Ensure auditability and reproducibility through comprehensive documentation and artifact tracking

Collaboration & Mentorship
• Partner closely with Data Scientists, Product Owners, and Site Reliability Engineers to align technical solutions with business goals
• Coach junior engineers on MLOps best practices and contribute to internal knowledge-sharing sessions

You Might Be a Fit If You Have

Essential Experience
• 7+ years in software engineering with 3+ years deploying ML/AI systems at enterprise scale
• Strong coding skills in Python and at least one statically typed language (Go preferred)
• Hands-on expertise with containerization (Docker), Kubernetes orchestration, and cloud platforms (AWS/GCP/Azure)
• Proven track record building CI/CD pipelines and automated testing frameworks for ML workloads
• Deep understanding of REST/gRPC APIs, message queues, and stream/batch processing frameworks

Technical Depth
• Experience implementing monitoring, alerting, and logging for mission-critical services
• Familiarity with ML lifecycle tools (MLflow, Kubeflow, SageMaker, Vertex AI, Feature Stores)
• Working knowledge of feature engineering, model evaluation, A/B testing, and drift detection
• Understanding of performance optimization techniques and cost management at scale

Leadership Qualities
• Ability to influence technical decisions at the executive level
• Experience mentoring engineers and driving adoption of best practices
• Strong communication skills with both technical and non-technical stakeholders
• Track record of solving high-impact, organization-wide technical challenges

Why This Role Is Exceptional

Executive Access & Impact
• Direct reporting line to CxO leadership with organization-wide visibility
• Opportunity to influence strategic technical decisions and product direction
• Access to executive meetings and the ‘strategic layer’

Elite Team Environment
• Work alongside 3 other senior engineers on the most critical internal initiatives
• Fast-track career progression within a market-leading technology company
• Exposure to cutting-edge AI/ML technologies at unprecedented scale

Market Leadership
• Join a Gartner Magic Quadrant Leader with proven market dominance
• Work on technology that processes trillions of searches annually
• Impact millions of users through highly scalable AI-powered systems

Technical Excellence
• Access to world-class infrastructure and a vast array of technical resources
• Opportunity to work with the latest MLOps tools and technologies
• Solve complex problems that few engineers ever encounter

Current Tech Stack
• Backend: Python (Primary), Go
• Infrastructure: Kubernetes, AWS/GCP, Docker
• ML Platform: MLflow, Kubeflow, SageMaker, Vertex AI
• CI/CD: GitHub Actions, Terraform, Automated Testing Frameworks
• Observability: Datadog, Prometheus, Grafana
• Data Processing: Kafka, Spark, Airflow, Feature Stores

Work Arrangement

This is a permanent role offering flexible remote work in the SF region. The company emphasizes impact, contribution, and output over physical location, operating as a high-trust environment where team members have autonomy to choose where and when they work most effectively. The team will need to be based in the SF area whilst being remote first.

Ready to Apply?

If you're passionate about building world-class AI/ML infrastructure at unprecedented scale, and you thrive in an elite environment with direct access to executive leadership, this is an exceptional opportunity to shape the future of search technology.

This role offers immediate impact on organization-wide initiatives, direct exposure to CxO leadership, and the chance to work on technical challenges that define the next generation of AI-powered tech experiences.

Start2Scale is committed to building an inclusive workplace and welcomes applications from talented people regardless of race, age, ancestry, religion, sex, gender identity, sexual orientation, marital status, color, veteran status, disability and socioeconomic background.

This job posting was last updated on 9/23/2025