The Site Reliability Engineer (SRE) at NTT DATA is a critical role focused on ensuring the reliability and performance of systems and infrastructure. The ideal candidate will have a bachelor's degree in Computer Science or a related field, along with hands-on experience in SRE or related roles. Proficiency in Linux, cloud platforms, and programming languages such as Python and Java is essential. The role involves monitoring system health, implementing incident response processes, and optimizing system resources. Candidates should have strong problem-solving skills and the ability to collaborate effectively with cross-functional teams.
Job Responsibilities:
Monitors system health, performance metrics, and alerts to identify and respond to incidents promptly; diagnoses issues, troubleshoots problems, and restores services in a timely manner
Implements incident response processes to minimize downtime and improve system availability
Designs, develops, and maintains automation tools, scripts, and processes to streamline system management tasks, deployments, and configuration changes
Implements infrastructure-as-code principles to ensure consistency and repeatability
Optimizes system resources, configurations, and processes to enhance performance, scalability, and efficiency
Uses monitoring tools and performance testing to identify bottlenecks and implement optimizations
Collaborates with teams to forecast system resource needs, plans for capacity growth, and ensures adequate scalability
Leads incident response efforts, coordinates with cross-functional teams, and drives the resolution of system issues
Performs thorough post-incident analysis to identify root causes and implements preventive measures to minimize future incidents
Identifies opportunities for automation and drives the implementation of self-healing, monitoring, and deployment automation tools and frameworks
Continuously improves operational efficiency, system reliability, and availability through process enhancements and automation
Ensures consistency across environments, tracks changes, and enforces configuration standards
Works closely with development teams, operations teams, and other stakeholders to ensure effective collaboration, knowledge sharing, and alignment on reliability goals
Implements security best practices, works with security teams to assess and address vulnerabilities, and ensures compliance with security standards and regulations
Requirements:
Bachelor's degree or equivalent in Computer Science, Information Technology, or a related field
Seasoned hands-on experience in a Site Reliability Engineering role or related roles, including experience in designing and maintaining highly available and scalable systems
Seasoned hands-on experience with Linux/Unix systems, networking, and system administration
In-depth knowledge of cloud platforms (such as AWS, Azure, or Google Cloud) and associated services
Seasoned proficiency in multiple programming languages, such as Python, Java, Go, or Ruby
Seasoned understanding of complex infrastructure architectures, including scalable and fault-tolerant designs
Experience with infrastructure-as-code tools (such as Terraform or CloudFormation) and containerization technologies (such as Docker or Kubernetes)
Seasoned experience in designing and implementing robust automation frameworks, CI/CD pipelines, and deployment strategies
Seasoned experience in incident management, troubleshooting complex system issues, and conducting post-incident analysis
Seasoned understanding of DevOps principles, Agile methodologies, and a strong commitment to continuous improvement and learning
Nice to have:
Relevant certifications, such as AWS Certified DevOps Engineer - Professional, Google Cloud Professional DevOps Engineer, or Certified Kubernetes Administrator (CKA)
Expertise in scripting languages like Bash or PowerShell