Microsoft

via Eightfold

Senior Incident Manager

Phoenix, Arizona; Washington, District of Columbia; Atlanta, Georgia; San Antonio, Texas; Redmond, Washington
Full-time
Posted 2/26/2026
Key Skills:
Incident Management
Crisis Management
Root Cause Analysis

Compensation

Salary Range

$120K - $235K a year

Responsibilities

Coordinate resources during crises, drive mitigation plans, and conduct root cause analyses to improve crisis response and service stability.

Requirements

Bachelor's degree in a technical field and 3+ years of experience in a data center or critical environment (or equivalent experience), plus the ability to pass security screening.

Full Description

Overview

Microsoft Cloud Infrastructure and Operations (CO+I) is the engine that powers Microsoft's cloud services. The group is responsible for designing, building, and operating Microsoft's global datacenters; managing the programmatic delivery of our critical infrastructure design, equipment procurement, construction delivery, infrastructure innovation, demand planning and capacity utilization of our unified infrastructure; and handling all operations needed to run the physical infrastructure. We focus on smart growth with an emphasis on automation, data-driven engineering, cost-effectiveness, and environmental sustainability. We deliver the core infrastructure and foundational technologies for Microsoft's 200+ online businesses including Azure, Office 365, Bing, Xbox Live, Skype, and OneDrive. Our portfolio is built and managed by a team of subject matter experts working 24x7x365 to support services for more than 1 billion customers and 20 million businesses in over 90 countries worldwide.

Within CO+I, the Data Center Incident Management Team (DCIM) is responsible for 24x7x365 incident management for Microsoft data centers worldwide. Within the DCIM Team, we are seeking a highly motivated and experienced Senior Incident Manager to join our team. If you are a strategic thinker with a passion for driving business success, we encourage you to apply for this exciting opportunity.

This position works a non-traditional schedule. You'll be required to be in the office Wednesday to Saturday. This role is based at one of the hub locations: San Antonio, TX; Redmond, WA; Atlanta, GA; Phoenix, AZ; or Washington, DC. Relocation support will be provided, and successful candidates should relocate or reside within 50 miles of the hub office location. This role is hybrid, with 3 days per week in-office.

Microsoft's mission is to empower every person and every organization on the planet to achieve more.
As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities

Collaboration and Knowledge Sharing

Shares insights and best practices that can be applied to improve development and operations across related sets of the systems, services, platforms, and/or products. Continues to develop their understanding of insights and best practices through interactions with members of product engineering teams and other resources (e.g., conferences, learning sessions, wikis, documentation). Mentors and coaches other engineers to help them identify and propose relevant solutions. Collaborates within and across teams (e.g., within Service Engineering, across a service) by proactively and systematically sharing information with an appropriate level of detail for their audience. Overcomes obstacles by resolving conflicts and issues across interdependent teams and engages with partners and stakeholders so issues can be resolved and mutual objectives are met. Develops, leverages, and drives sharing of information and knowledge base (e.g., customer, product, industry, troubleshooting guides) across teams.

Operational Excellence

Leverages advanced technical expertise, judgment, and decision making to coordinate multiple work streams and resources in crisis situations to drive mitigation plans and resolve, reduce, or mitigate the impact of a crisis by engaging necessary teams and escalating to appropriate stakeholders. Independently conducts root cause analyses and participates in post-incident reviews based on incidents/crises for the purpose of leading continuous improvement. Applies diagnostic expertise. Provides guidance to other engineers working to mitigate and resolve issues.
Communicates customer impact and other relevant information with key stakeholders, leadership, and customers. Develops and drives projects and programs to improve crisis response by creating standard practices (e.g., processes, standard operating procedures) for consistent response across engineering teams. Fosters increased service stability. Reduces future noise by participating in optimization of telemetry and alarming. Influences key stakeholders to adopt new standards and practices to broadly improve crisis and problem management.

Creates, monitors, and takes action on telemetry data and influences telemetry analytics to better identify patterns that reveal errors and unexpected problems that are affecting the system's availability, reliability, performance, and/or efficiency. Develops scripts and/or automation and leverages an understanding of solutions to define, develop, measure, track, change, and improve the quality of telemetry pipelines that support automated monitoring and incident response. Identifies and develops telemetry collaborations that result in better-together services.

Responds to incidents during regular on-call rotations, including complex incidents with major customer or business impact, by identifying the level of impact, troubleshooting, contributing to difficult decisions based on business impact, deploying appropriate fixes to resolve root cause(s), and implementing automations for prevention of recurring incidents through coordinating resources required for incident resolution, which may include product teams, owners, leadership, other engineering teams, and/or subject matter experts. Escalates resolution of highly complex, ambiguous, and impactful incidents as needed. Contributes to postmortems and shares details related to incidents and their resolution through postmortem reports and regular review meetings.
Provides expert incident response assistance to other Service Engineers as needed, and develops incident response and resolution guidance. Adheres to and promotes prescriptive guidance for security, privacy, and compliance standards in alignment with direction from the business and technical experts. Works with security, privacy, and compliance teams to identify and address issues relevant to their services and resolve them within the service level agreement (SLA). Provides assistance to other service engineers as needed. Independently implements reliable, scalable, and high-performance solutions across teams. Contributes to design documents. Owns implementation and rollback plans. Maintains quality checklist and related documentation. Quantifies and ensures the health and compliance of a service according to Engineering and industry standards.

Security Management

Monitors and maintains security by addressing security vulnerabilities through patches, reconfigurations, and/or settings updates. Identifies, prioritizes, and targets solutions to complex security issues that may impact customers and partners, and drives action to promote the adoption of relevant mitigations. Drives program and process of mitigation (e.g., automation), troubleshoots system issues, and partners closely with internal customers and engineering teams to conduct root cause analyses, share end-to-end expertise in services, and mitigate and resolve issues. Communicates and drives adherence to security policies and procedures.

Technical Knowledge and Expertise

Takes ownership of service design by driving efforts within an organization to identify, define, recommend, and build optimal configurations of technology solutions with consideration for cost management, service health, security, resiliency, and reliability, while taking into account the scalability of services.
Develops end-to-end expertise in service and/or system design, interactions between technology layers and components, functions of infrastructure, and dependencies at scale. Independently adjusts configurations and defines infrastructures to improve the availability, reliability, efficiency, observability, and/or performance of supported products and services. Drives collaborative reviews with the engineering teams that develop and/or manage services and other stakeholders, identifying opportunities for efficiencies in operations and sharing learnings and recommendations across engineering teams and other stakeholders working on related services within their organization. Independently designs a service/system in a manner that allows for robust and scalable measurement of quantifiable metrics for assessing health, quality, and functionality. Stays current in knowledge and expertise as technology landscape evolves, maintaining awareness of industry norms. Uses knowledge to drive the adoption of new solutions across engineering teams working with related products within an organization. Provides guidance to others through sharing, coaching, conferences, and other means to drive improvements across teams.

Qualifications

Required Qualifications: Bachelor's Degree in Computer Science, Information Technology, Mechanical Engineering, Electrical Engineering, Aerospace Engineering, Data Science, Cybersecurity, or related field AND 3+ years technical experience in data center or critical environment space OR equivalent experience.

Other Requirements: The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Preferred Qualifications: Master's Degree in Computer Science, Information Technology, Mechanical Engineering, Electrical Engineering, Aerospace Engineering, Data Science, Cybersecurity, or related field AND 6+ years technical experience in data center or critical environment space OR equivalent experience OR Bachelor's Degree in Computer Science, Information Technology, Mechanical Engineering, Electrical Engineering, Aerospace Engineering, Data Science, Cybersecurity, or related field AND 8+ years technical experience in data center or critical environment space OR equivalent experience. 3+ years technical experience working with large-scale cloud or distributed systems.

#COICareers #COICDS

Service Engineering IC4 - The typical base pay range for this role across the U.S. is USD $119,800 - $234,700 per year. There is a different range applicable to specific work locations within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $158,400 - $258,000 per year. Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances.
If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.

This job posting was last updated on 2/27/2026
