via Remote Rocketship
$100K - $140K a year
Architect and oversee cloud platform engineering, Kubernetes operations, and reliability engineering in a multi-cloud environment.
Requires strong Kubernetes, cloud infrastructure, multi-language proficiency including Go and Python, and experience with SRE and observability tools.
Job Description:
• Act as broad domain architect for the internal developer platform and all cloud engineering
• Drive architecture for tooling or in-house software
• Mentor other platform engineers to drive strong engineering practices
• Enable platform engineering technical capabilities in our internal client teams in software engineering
• Peer with the senior architects and engineers in software engineering
• Focus architecture and engineering on the GCP environment
• Architect and oversee GKE cluster operations and workload management
• Provide feedback to others and participate in peer reviews / pair programming
• Drive the broad adoption of Test Driven Development by designing, developing, and debugging unit and integration tests for new and existing infrastructure and code
• Stay curious about existing implementations and new technologies, and share findings with the team
• Practice continuous improvement across all job areas, both personally and professionally
• Communicate clearly with platform engineering teams and other stakeholders, providing technical direction while doing so
• Stay current with platform changes and third-party libraries
• Proactively investigate better alternatives to current solutions
• Understand OpenTelemetry and true observability, and how it differs from monitoring and logging
• Grow the engineering culture towards a high-performing team
• Practice self-service, least privilege, and security by default in all solutions
• Define and maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets
• Lead incident response, including on-call rotations, root cause analysis, and post-mortem reviews
• Implement and optimize monitoring, alerting, and observability systems for system reliability
• Collaborate on capacity planning and performance optimization to ensure high availability
• Other duties as assigned

Requirements:
• Bachelor's degree in Computer Science, Computer Engineering, or a related field, or 8+ years of experience as a software engineer
• Proficiency in Kubernetes (optional: CKA, CKAD)
• Extensive experience in Unix / Linux
• Polyglot with proficiency in multiple languages (ideally Golang, Node.js, Python, HCL, and more)
• Knowledge of multi-cloud environments, including GCP, AWS, and Azure (familiarity with at least two of these)
• Experience using Git in trunk-based development models
• Experience using feature flagging in infrastructure and runtime (k8s)
• Experience with backend database technology, including support and performance enhancements, is a plus
• Advanced experience working with and creating public cloud resources in Terraform or other infrastructure as code tools
• Experience participating in a 24/7 on-call schedule without supervision and successfully resolving issues without escalation
• Experience using OpenTelemetry for observability, as well as other monitoring tools such as Datadog and New Relic
• Good understanding of networking and routing principles
• Experience dockerizing applications and orchestrating them with Kubernetes
• Familiarity with security configuration for web/API services (SSL, access control)
• Experience with JIRA or other work tracking systems
• Ability to resolve tickets in priority order and collaborate with the Technical Product Manager to adjust priorities
• Excellent, detailed documentation using Confluence or similar tooling, such as support notes, runbooks, and ADRs
• Familiarity with creating an end-to-end CI/CD pipeline using various tools with artifact storage
• Familiarity with macOS as a desktop and with predominantly CLI interfaces
• A “product mindset”: understanding stakeholder needs, priorities, and business value
• Experience with security compliance frameworks including FedRAMP, NIST, and SOC2
• Proven experience in SRE practices, including incident management and reliability engineering
• Familiarity with monitoring tools like Prometheus, Grafana, or Honeycomb for observability
• Experience with chaos engineering, load testing, or reliability testing frameworks

Benefits:
• Employees are expected to provide a high level of security to any personal or private information accessed as part of their work, whether at a DroneUp facility or remotely.
• Participate in security training.
• Remain sensitive to individual rights to personal privacy.
• Comply with company policies.
This job posting was last updated on 3/2/2026