Senior Site Reliability Engineer (SRE) job opportunity at 1GLOBAL.



Date: 2026-04-30T11:49:21.977Z
1GLOBAL Senior Site Reliability Engineer (SRE)
Experience: 5 years
Pattern: Full-time

Degree: High School (S.S.C.E)
Location: Berlin, Germany

About Us

1GLOBAL is a technology-driven global mobile communications provider, delivering global connectivity solutions to enterprises and consumers. Powered by a best-in-class telecom platform (including its own owned and operated global mobile core network, fully fledged eSIM technology developed in-house, and an extensive portfolio of telecom licenses), 1GLOBAL operates as a fully regulated telecommunications provider across 40 countries worldwide. We serve many of the world’s leading banks, enterprises, and digital-first businesses, including neo-banks, global fast-moving consumer goods companies, travel leaders, and payment service providers. Today, 1GLOBAL connects more than 70 million people and devices globally, enabling our customers to launch, scale, and innovate with confidence in the mobile ecosystem.

1GLOBAL is a profitable, fast-growing business. With full-year revenues in 2025 exceeding US$200 million and profits of over US$25 million, we generate strong cash flows to fund our growth, allowing us to continuously invest in infrastructure, platform innovation, and global expansion. Recent years have marked a defining phase in our journey, with major enterprise and mass-consumer client wins accelerating our evolution into a global mobile connectivity powerhouse, purpose-built to enable consumer brands to realize their own aspirations to offer telecommunications services to their clients.

Founded in 2022 by experienced technology entrepreneurs Hakan Koç and Pyrros Koussios, 1GLOBAL has rapidly emerged as a European technology leader shaping the future of global telecommunications. We operate as a fully regulated Mobile Virtual Network Operator (MVNO) in 12 countries and as a regulated telecom operator in an additional 28 markets. Headquartered in the Netherlands, with world-class R&D hubs in Lisbon, Berlin, and São Paulo, our team of close to 500 experts across 15 countries is united by a single ambition: to redefine global mobile connectivity through technology, scale, and execution excellence.

About the Team

We are looking for a talented Senior Site Reliability Engineer (SRE) to join our Technology Department. We are open to hiring this role in Berlin, Germany.

As a Senior SRE, you will be a senior individual contributor responsible for strengthening the stability, scalability, and reliability of our global infrastructure and services across both cloud and on-prem environments. You will work alongside SREs under the guidance of the SRE Team Lead, taking ownership of critical reliability domains and helping drive a data-driven reliability culture based on SLIs, SLOs, and error budgets.

Your mission will be to proactively identify weaknesses across systems and improve reliability through redundancy testing, automation, and observability. You will design, build, and operate the tools and processes that automatically detect, prevent, and recover from incidents, ensuring our services remain reliable and performant for customers around the world.

This role collaborates closely with DevOps, Infrastructure, IP Network, and Security teams to maintain carrier-grade reliability standards across all layers of our platform.

About the Role

- Act as a senior technical contributor within the SRE team, mentoring peers and setting the technical bar for reliability engineering.
- Define, measure, and maintain SLIs and SLOs for core infrastructure and customer-facing services.
- Plan and execute redundancy and resilience testing across service, infrastructure, and networking layers, validating failover, HA configurations, and disaster recovery readiness.
- Design and implement automated recovery mechanisms, self-healing workflows, and intelligent alerting systems.
- Drive incident response, root-cause analysis, and blameless post-mortems, and ensure that the resulting corrective and preventive actions are implemented and tracked to achieve continuous improvement.
- Develop and enhance observability (metrics, logs, traces) using Prometheus, Grafana, Loki, and OpenTelemetry.
- Partner with Infrastructure and DevOps teams to ensure deployment safety, rollback policies, and configuration consistency.
- Proactively identify weaknesses through fault-injection, load, and chaos testing.
- Continuously reduce operational toil through automation and reliability tooling.
- Contribute to on-call practices, improving alert quality, runbooks, escalation procedures, and incident management processes.
- Perform capacity planning, performance benchmarking, and resilience audits across systems.
- Ensure compliance with security, reliability, and availability standards.
- Create and maintain internal documentation, playbooks, and operational guidelines for peers and users.
- Contribute to cloud cost-optimization initiatives, including reserved capacity planning, autoscaling design, storage tiering, workload right-sizing, and continuous anomaly detection.

About You

Must Have

- A minimum of 5 years of experience in Site Reliability, Systems, or Infrastructure Engineering (including 2+ years in a dedicated SRE role).
- Strong expertise in Linux systems engineering, distributed systems, and networking.
- Proven experience building and running high-availability, mission-critical production systems.
- Hands-on experience with redundancy and failover testing, disaster recovery, and high-availability architecture validation.
- Deep understanding of monitoring, observability, and incident management principles. Experience with Prometheus, Grafana, Loki, Thanos, and OpenTelemetry or similar tools.
- Proficiency in Python, Go, and Bash for automation and reliability tooling.
- Strong knowledge of Kubernetes, container orchestration, and service mesh architectures.
- Experience with AWS (EKS, EC2, VPC) and on-premises infrastructure integration.
- Proficiency in Infrastructure as Code tools such as Terraform.
- Understanding of networking fundamentals (routing, load balancing, BGP, DNS, VXLAN, etc.).
- Excellent analytical and problem-solving skills, with the ability to operate under pressure.
- Strong communication and collaboration skills across distributed and cross-functional teams.

Nice to Have

- Experience in telecom, carrier-grade, or large-scale distributed systems environments.
- Hands-on experience with chaos engineering and automated failure-scenario validation (e.g., simulating link or node failures).
- Strong understanding of high-availability networking concepts.
- Background in capacity planning, traffic engineering, and multi-region failover.
- Experience building reliability dashboards and integrating SRE metrics into business KPIs or compliance reports.
- Familiarity with security and resilience standards (ISO 27001, NIST SP 800-53).
