The Azure Compute team builds a fault-tolerant, distributed system on top of commodity datacenter hardware to deliver infrastructure for hosting cloud applications in virtual machines (VMs). The team creates the illusion that resources are limitless, infinitely elastic, and always available.
This role is part of the Availability Platform team within Azure Compute, which focuses on ensuring every Azure virtual machine achieves a Service Level Agreement (SLA) of 99.99 percent or higher. Meeting this target requires innovative thinking, data-driven decisions, and intelligent automation. The team owns services that monitor the health of millions of Azure machines and the control plane services that make repair decisions. We use artificial intelligence (AI) and machine learning to build predictive failure models that proactively live-migrate virtual machines before failures occur, minimizing customer impact and improving platform resilience. We are exploring generative artificial intelligence to enhance diagnostics, automate root cause analysis, and accelerate incident resolution. Collaboration with data scientists and AI researchers enables us to continuously evolve the platform with smarter, self-healing capabilities.
As a Senior Software Engineer, you will join a team that emphasizes comprehensive designs, incremental development with high quality, frequent shipping, and rapid adaptation to customer feedback. This role offers hands-on experience with services architecture at hyperscale while pushing the boundaries of scale, reliability, availability, and efficiency.
Job Responsibility:
Partners with appropriate stakeholders across teams and orgs to determine project requirements
Leads the design and architecture of change management features and services in Azure Compute
Identifies dependencies and authors design documents for features and services
Leverages expertise with appropriate stakeholders to develop project plans, release plans, and work items
Develops high quality, extensible, maintainable code and coaches others to do the same
Supports live site as a Designated Responsible Individual (DRI), mentoring engineers across products and solutions and working on-call to monitor systems, products, and services for degradation, downtime, or interruptions
Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that improve the availability, reliability, efficiency, observability, and performance of products, drives consistency in monitoring and operations at scale, and shares knowledge with other engineers
Collaborates with data scientists and ML engineers to design and integrate predictive models that proactively detect hardware anomalies and trigger live migrations, improving VM uptime and SLA compliance
Leads initiatives to embed AI-driven diagnostics and root cause analysis into availability services, reducing time-to-resolution for incidents and improving operational efficiency
Drives the adoption of generative AI tools to automate documentation, incident summaries, and engineering workflows, enhancing team productivity and knowledge sharing
Partners with platform teams to build intelligent observability pipelines that leverage anomaly detection and trend analysis for early warning systems
Evaluates and integrates large-scale AI models into control plane services to enable smarter, context-aware repair decisions across millions of Azure VMs
Requirements:
Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check
Nice to have:
Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
OR Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python