via North Indian Granite
$187K - 208K a year
Design and implement monitoring and observability solutions for Azure-hosted applications using enterprise tools and collaborate with multiple teams to ensure comprehensive coverage.
5+ years experience with enterprise monitoring tools (APM, NPM, infrastructure monitoring), strong Azure monitoring skills, scripting/automation, networking fundamentals, and incident response support.
Cloud Monitoring and Observability Engineer (Azure)

Seeking a skilled Cloud Monitoring and Observability Engineer (Azure) to design, implement, and optimize end-to-end monitoring and observability solutions for a mission-critical application deployed in Azure. The ideal candidate has hands-on experience with enterprise monitoring tools such as AppDynamics, ThousandEyes, NetScout, and SolarWinds (or equivalent alternatives), and a strong background in building scalable, secure, and compliant observability stacks for cloud deployments. This role collaborates closely with application engineering, cloud platform, network, and security teams to ensure comprehensive coverage across the application, infrastructure, and network layers.

Key Responsibilities
• Design and implement end-to-end monitoring, alerting, and observability for an Azure-hosted application across application, infrastructure, network, and user experience layers.
• Configure, integrate, and maintain enterprise monitoring platforms to deliver actionable telemetry, performance baselines, and SLA/SLO tracking.
• Build dashboards, health checks, synthetic tests, and alerting workflows; tune alert fidelity to reduce noise and improve the signal-to-noise ratio.
• Establish and document telemetry standards (metrics, logs, traces), data collection strategies, and service-level indicators (SLIs) aligned to service-level objectives (SLOs).
• Integrate Azure-native services (Azure Monitor, Log Analytics, Application Insights) with enterprise tools to provide unified visibility and correlation.
• Implement network performance monitoring, path visibility, and internet/extranet testing using NPM tools (e.g., ThousandEyes, NetScout); leverage infrastructure monitoring platforms (e.g., SolarWinds) for device and service health.
• Instrument applications with APM tools (e.g., AppDynamics, Dynatrace, New Relic) for business transaction monitoring, dependency mapping, and root-cause analysis; tune anomaly detection and policy thresholds.
• Collaborate with DevOps/SRE teams to embed monitoring into CI/CD and infrastructure-as-code patterns; ensure new services adhere to observability standards.
• Define runbooks and escalation paths; support incident response and post-incident reviews with data-driven insights and remediation recommendations.
• Ensure monitoring solutions meet applicable security and compliance requirements; support audit requests with clear documentation and evidence.
• Conduct capacity and performance trend analysis; recommend optimization, right-sizing, and resilience improvements.
• Provide knowledge transfer, documentation, and training on monitoring tools, best practices, and operational workflows.

Required Qualifications
• 5+ years implementing enterprise monitoring/observability for cloud or hybrid environments, including mission-critical applications.
• Demonstrable expertise with at least one tool in each category (or equivalent), including production deployments, advanced configuration, and operational use:
  • Application Performance Monitoring (APM): AppDynamics, Dynatrace, or New Relic.
    • Experience instrumenting services for business transaction tracing, code-level diagnostics, service maps, and anomaly detection.
    • Ability to design APM dashboards and create alert policies with appropriate thresholds and baselines.
  • Network Performance Monitoring (NPM) / Digital Experience Monitoring (DEM): ThousandEyes, NetScout, or Kentik.
    • Experience with synthetic tests, path visualization, packet-level analysis, and internet/WAN performance monitoring.
    • Ability to configure endpoint agents, BGP/DNS tests, and multi-hop path monitoring for user experience correlation.
  • Infrastructure Monitoring and Event Management: SolarWinds, Microsoft SCOM, Datadog, or Prometheus/Grafana.
    • Experience monitoring servers, containers, network devices, and cloud services; creating availability and capacity dashboards.
    • Proficiency with alert routing, de-duplication, and event correlation.
• Strong Azure monitoring experience: Azure Monitor, Log Analytics (KQL), Application Insights, and integration with third-party tools.
• Solid understanding of distributed tracing, metrics, and log aggregation; familiarity with OpenTelemetry concepts and data pipelines.
• Scripting/automation skills (PowerShell, Python, or Bash) to automate monitoring configuration, agent deployment, test creation, and reporting.
• Networking fundamentals (DNS, BGP, HTTP, TLS, TCP/IP), CDN concepts, and WAN performance monitoring; ability to correlate application and network telemetry.
• Experience supporting incident response and performance troubleshooting across application, infrastructure, and network layers.
• Excellent documentation and communication skills; collaborative mindset with engineering, operations, and security stakeholders.

Preferred Qualifications
• Background in regulated environments (financial services, government, healthcare) with compliance-aware monitoring design.
• Experience with log aggregation and SIEM/SOAR platforms (e.g., Splunk, Elastic) and integration with APM/NPM tools.
• Integration experience with ITSM platforms (e.g., ServiceNow) for incident, change, and problem management workflows.
• Familiarity with infrastructure-as-code (ARM/Bicep/Terraform) and embedding observability into IaC patterns; experience with CI/CD integration.
• Exposure to SRE practices (SLIs/SLOs, error budgets, reliability reviews) and capacity/performance planning.
• Ability to code in one or more of the following languages for instrumentation, custom telemetry, SDK integration, and tooling automation:
  • Java: Implementing OpenTelemetry SDKs/agents, custom instrumentation, and APM tagging; building synthetic test harnesses.
  • .NET (C#): Instrumenting ASP.NET services, configuring APM auto-instrumentation, writing custom exporters and health probes.
  • Python: Building automation scripts, collectors/exporters, and synthetic tests; integrating with monitoring APIs and SDKs.

Additional Information
• Duration: 12-month contract opportunity
• Hybrid Work Model: 4 days onsite per week required at the NYC; Pittsburgh, PA; or Lake Mary, FL office
• Rate Range: $90-100/hr on W2 (based on experience)
This job posting was last updated on 3/6/2026