Salary: Not specified
The Site Reliability Engineer will respond to critical service incidents and proactively monitor application health across AWS infrastructure. They will collaborate with development teams to troubleshoot issues and implement best practices for security and performance.
Candidates should have 3+ years of experience in Site Reliability or DevOps roles, with a strong background in AWS and full stack environments. Proficiency in Node.js, MySQL, and containerization technologies is also required.
Responsibilities

Be available to respond to critical service incidents outside of business hours on a rotating on-call schedule.
Proactively monitor application health and performance across cloud infrastructure (AWS).
Troubleshoot and prevent service interruptions in real time, working closely with development teams to resolve incidents efficiently.
Lead and participate in disaster recovery drills and security incident simulations.
Implement Infrastructure as Code (IaC) and maintain scalable deployments using AWS-native tools and services.
Collaborate with development teams to ensure smooth CI/CD workflows using Git and containerized deployments (Docker).
Work closely with stakeholders and product teams to ensure technical reliability aligns with business needs.
Support and improve observability tools, alerting mechanisms, and logging infrastructure to promote transparency and response agility.
Champion best practices in security, availability, performance, and incident response.

Required Technologies & Tools

Cloud Infrastructure: Strong proficiency in Amazon Web Services (AWS), with knowledge of services like EC2, ECS, RDS, CloudWatch, and IAM.
Programming/Scripting: Proficiency in Node.js and scripting for automation and tooling.
Containerization: Experience with Docker for container-based deployment pipelines.
Frontend Awareness: Familiarity with React and Ember.js to understand performance implications at the frontend level.
Backend Stack: Understanding of NestJS and scalable Node-based services.
Databases: Proficiency in MySQL and performance monitoring of relational databases.
Version Control: Proficiency with Git for collaborative code management and DevOps workflow integration.

Core Competencies

Incident Response: Calm and focused under pressure, with a structured approach to resolving outages and degradation.
System Design: Ability to contribute to and review architectural designs for scalability and resiliency.
Collaboration: Strong communication skills to coordinate across developers, QA, and product teams.
Automation & Efficiency: Passion for automation, repeatability, and continuous improvement.
Security Mindset: Consistent implementation of security best practices and a strong grasp of data protection standards.

Qualifications

3+ years of experience in a Site Reliability, DevOps, or related engineering role.
Proven track record managing and scaling applications in a production AWS environment.
Familiarity with full stack environments, particularly those using Node.js.
Experience maintaining and deploying databases such as MySQL, including performance tuning.
Experience with container orchestration (e.g., ECS or Kubernetes) is a plus.
Commitment to uptime, performance, and security in fast-moving SaaS environments.
This job posting was last updated on 9/27/2025