Vizcom

via Ashby

Senior Platform & Reliability Engineer (SRE)

Anywhere

Full-time

Posted 2/24/2026

Key Skills:

Kubernetes

Reliability Engineering

Incident Command

Compensation

Salary Range

Not specified

Responsibilities

Own service reliability end-to-end, focusing on incident prevention, failure isolation, and fast recovery.

Requirements

Calm, structured incident commander with expertise in failure modes, SLOs, Kubernetes runtime, and incident response.

Full Description

About Vizcom

Vizcom is a visual creation platform that combines modern web tooling with AI-powered workflows. Our stack includes a React/TypeScript frontend, Node/Koa + PostGraphile API services, PostgreSQL, Redis, BullMQ queues, and Kubernetes-based production infrastructure. We’re hiring a senior owner of stability and infrastructure to ensure the platform is reliable, fast, and resilient as we scale.

Role Mission

Own service reliability end-to-end: prevent incidents, reduce blast radius when failures happen, and lead fast, high-quality recovery when production degrades. This is a hands-on technical leadership role with authority to set reliability standards and enforce production guardrails.

What You’ll Own

Reliability bar: Set and enforce SLIs/SLOs/error budgets for critical user flows.

Production architecture resilience: Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.

Kubernetes runtime reliability: Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety.

Queue + job safety (BullMQ/Redis): Own poison pill containment and workload isolation.

Incident command quality: Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, corrective action execution).

Reliability operating system: Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline.

Release safety authority: Gate risky deploys and enforce reliability guardrails when production health is at risk.

Traits We’re Looking For

Calm, structured incident commander under pressure.

Thinks in failure modes and blast radius by default.

Pragmatic: can stabilize quickly, then implement durable fixes.

High ownership and strong written communication.

First 90 Days

Establish baseline reliability metrics and identify top platform risks.

Tighten incident response mechanics (roles, comms cadence, runbooks, status updates).

Deliver high-impact hardening fixes across probes/startup paths/queue safety.

Publish a prioritized 6–12 month reliability roadmap with clear ownership and milestones.

If possible, please write up one incident you personally led and send it to Jordan@vizcom.com, covering: 1) what failed, 2) how you contained it, and 3) what permanent fixes you shipped and measured.

This job posting was last updated on 2/26/2026
