via Dice
$70K - 90K a year
Provide 24x7 monitoring, incident coordination, automation, and knowledge management to ensure availability and performance of hybrid cloud infrastructure.
8+ years in IT operations with hands-on experience in monitoring tools, incident resolution coordination, scripting, and excellent communication skills.
Duration: 5-6 months, with possible extension or conversion (depending on performance)
Contract Type: W2 Contract

Job Responsibilities: Hybrid Cloud Infrastructure
You will help ensure the availability, reliability, and performance of business-critical applications and infrastructure by providing 24x7x365 monitoring, proactive incident response, knowledge management, and automation. You'll work with internal technology teams and third-party vendors to quickly detect, escalate, and resolve incidents while reducing manual effort through shift-left practices and scripting/automation.
• Monitor applications/infrastructure using tools such as Dynatrace, Grafana, and Azure Monitor; tune dashboards, baselines, and alerts.
• Serve as an Incident Coordinator for triage and major incidents: run bridge calls, document actions, and support PIRs.
• Drive incident triage and escalation to meet rapid detection goals (e.g., TTD of 5 minutes for major incidents) and support RCA and communications.
• Build and maintain SOPs, knowledge articles, and known error content to improve L1 effectiveness.
• Identify repetitive issues and create scripts/runbooks (PowerShell/Python/Bash) to automate detection and remediation.
• Track and report operational KPIs (e.g., MTTD/MTTR, tickets worked, change validations, major incidents avoided).
• Provide scheduled coverage for 24x7x365 operations, including off-hours and holidays as needed.

Qualifications:
• 8+ years in IT operations, incident management, or application support in a 24/7 environment.
• Hands-on experience with observability/monitoring (Dynatrace, Grafana, and/or Azure Monitor), including alerting and dashboarding.
• Experience supporting or coordinating major incident resolution (bridge calls, documentation, stakeholder communications).
• Familiarity with ITSM tooling and workflows (e.g., ServiceNow).
• Excellent scripting/automation skills (PowerShell, Python, and/or Bash) and experience documenting SOPs/knowledge articles.
• Exceptional verbal and written communication skills; ability to document procedures, incident reports, and root cause analyses clearly.
• Proven ability to provide effective escalation support and guidance to junior engineers and Tier 1/2 teams.
• Bachelor's degree in a related field (or equivalent experience).
• Ability to travel 10%, on average, based on the work you do and the clients and industries/sectors you serve.
This job posting was last updated on 2/23/2026