Site Reliability Engineer
Grab
Job Description
Come join the team that literally serves Grab - Grab’s Sentry team. The systems we oversee process billions of network messages for Grab's consumers every day, and our service orchestration and discovery platforms serve hundreds of Grab services without fail. We work closely with the infrastructure, security, and product teams, and are seeking talented engineers to join our platform team!
Your Role
You specialize in building, operating, and maintaining leading-edge solutions that Just Work™. Built on world-class technology stacks, the software you develop brings our unique on-demand services experience to South-East Asia every day, whether transport, payments, food, or "the awesome things to come". Millions of people depend on the stability and efficiency your solutions provide, which is demanding in terms of design and quality but also incredibly rewarding.
You will
Build and own Grab's gateway systems, connecting millions of consumer devices with hundreds of backend services via Grab's service mesh.
Work at Grab's Viet Nam headquarters, closely integrated with software engineering and product teams to help build rock-solid and secure solutions with the right interface, technology and practices.
Be at home in a multi-cloud environment and build scalable, zero-downtime network traffic routing and analytics solutions for Grab's "public edge", serving billions of requests from millions of Grab customers and partners every day.
Your Daily Routine
Independently drive projects across teams end to end, from inception to rollout
Find and troubleshoot issues in Grab's entire infrastructure and code base
Develop, maintain, and operate control- and data-plane components, and resolve production incidents
Implement quality solutions using Go and Lua, and maintain a high bar for code reviews and deployment processes. You mentor peers and promote best practices for development and operational excellence while delivering an excellent user experience
Qualifications
Your Experience
You are a habitual problem solver, and naturally assume ownership of your team’s systems and software components. You know how to be responsible for mission-critical systems, and you:
Have a very good understanding of TCP/IP, HTTP, network routing, and the internet
Experience/certification in AWS
Solid experience with automation & provisioning tools (e.g., Jenkins/GitLab CI, Ansible/Chef/Puppet)
Strong experience in system troubleshooting in the Linux environment.
Strong experience in using service monitoring, logging, and alerting tools and environments.
Know how to build highly available distributed systems.
Are fluent in English
Are fluent in Bash, Python, Terraform or Go, and have an understanding of common patterns and algorithms to confidently navigate 3rd-party code bases for debugging and troubleshooting
Your Advantage
HTTP/2, QUIC, and gRPC expertise
Dealing with massive concurrency and designing resilient algorithms
Experience with building monitoring and alerting systems
Hands-on experience with Terraform and large-scale Docker / Kubernetes deployments
Experience with Consul and/or Envoy (envoyproxy.io), their code base, and their community
Experience in a startup