As a Site Reliability Engineer (SRE) at Polygon Labs, you will play a key role in helping operate and support the production infrastructure that powers the Polygon network. Working alongside experienced SREs and protocol engineers, you will gain hands-on exposure to running large-scale, distributed blockchain systems while learning best practices for reliability, observability, and incident response. This is an ideal role for someone early in their SRE or infrastructure career who is curious about how production systems work, motivated to learn through real-world operational challenges, and excited to grow within a collaborative and mentorship-driven environment. Your work will directly contribute to the reliability and performance of critical public infrastructure used by developers and users globally.
Job Responsibilities:
Monitoring production systems, alerts, dashboards, and logs across Polygon networks, including Polygon PoS and the Agglayer
Assisting with incident detection, triage, escalation, and resolution under the guidance of senior engineers
Supporting on-call and operational coverage through structured rotations, with training and mentorship
Following, maintaining, and improving runbooks and standard operating procedures
Assisting with routine operational tasks such as service restarts, upgrades, and configuration changes
Helping maintain and improve monitoring, logging, and alerting systems, including dashboards for network health, RPC performance, and node metrics
Learning to improve alert signal quality and reduce operational noise
Supporting cloud-based and containerized infrastructure, including nodes, RPC endpoints, and supporting services
Collaborating with protocol, product, and cross-functional teams to understand production issues and user impact
Participating in post-incident reviews and contributing to root-cause analysis documentation
Continuously building knowledge of blockchain fundamentals, distributed systems, and networking
Requirements:
A foundational understanding of Linux systems, processes, and basic networking concepts
Familiarity with at least one scripting or programming language, such as Python, Bash, or Go
An interest in site reliability, monitoring, and operating production infrastructure
Clear written and verbal communication skills, with a willingness to ask questions and learn
The ability to remain calm, methodical, and responsive during incidents or operational events
Nice to have:
Exposure to cloud platforms such as AWS or GCP
Familiarity with containerization or orchestration technologies, including Docker or Kubernetes
Basic understanding of blockchain or Web3 concepts, such as nodes, RPC services, or validators
Experience with monitoring and observability tools such as Grafana, Prometheus, Datadog, or ELK-based stacks
What we offer:
Remote-first global workforce
Industry-leading medical, dental, and vision insurance
401(k) with a 3% company match
$1,500 home office setup allowance (lifetime maximum)
$75 monthly internet or phone reimbursement
Flexible Time Off
Company-issued laptop
Egg freezing, mental health, and employee wellness benefits