Lead Site Reliability Engineer

Grab

This job is no longer accepting applications

Software Engineering

Ho Chi Minh City, Vietnam

Posted on Wednesday, April 24, 2024

Company Description

Life at Grab

At Grab, every Grabber is guided by The Grab Way, which spells out our mission, how we believe we can achieve it, and our operating principles - the 4Hs: Heart, Hunger, Honour and Humility. These principles guide and help us make decisions as we work to create economic empowerment for the people of Southeast Asia.

Job Description

Get to know the Team

The Business Ecosystem SRE team is a longstanding team responsible for the stable operation of the core Grab systems. We make an impact by contributing to Business & Transaction Platform, Search & Personalization, Demand and Ads systems, Enterprise and the company's stability and operational excellence. Our team is made up of passionate Site Reliability Engineers. If you are looking for an opportunity to work in a large-scale cloud environment and use your sharp ideas to make engineers’ lives better, then you should join our team!

Get to know the Role

We are looking for a Lead Site Reliability Engineer to provide better stability and operational excellence for the Business Ecosystem tech families in Grab. We believe a successful candidate has professional sysops/infrastructure knowledge and the ability to build comprehensive systems, but if you believe you have what it takes then we’d love to hear from you either way. This role is required because stability and operational excellence are critical to our services. In return, you will get an opportunity to make an impact on Grab’s core systems.

The Day-to-Day Activities

Engage in and improve the whole lifecycle of services - from design, through deployment, operation and refinement.
Work with engineering teams to design and write code to create systems which are highly available and able to scale seamlessly.
Help engineering teams solve reliability, stability and scalability challenges.
Get involved in deep diagnosis of incidents, and engage with multiple highly skilled engineering teams on resolutions.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Contribute to a culture of learning and responsibility by guiding teams to write detailed postmortem reports.
Identify and resolve problems relating to critical service operations, and prevent their recurrence using automation.
Be part of a cool team, responsible for one of the largest cloud-based services in Southeast Asia.
Mentor other engineers, define our technical culture, set high engineering bars and help build a fast-growing team.
Lead other engineers to conquer challenging projects and deliver high-quality results.
Contribute initiatives to improve the tech family’s stability and operational excellence.

Qualifications

The Must-Haves

Bachelor's or Master's degree in Computer Science, Software Engineering, Information Technology or related technical field involving coding.
Preferably at least 5 years of relevant experience in a similar role.
Strong experience with algorithms, data structures, complexity analysis and software design.
Strong experience in one or more of the following: Go, Python, C, C++, Java, Perl or Ruby.
Strong experience in using service monitoring, log, and alarm-related environments and tools.
Strong experience in system troubleshooting in Linux environment.
Solid experience in using Linux commands and shell scripting, and the ability to automate routine tasks.
Solid experience with automation & provisioning tools (e.g. Jenkins, Ansible/Chef/SaltStack/Puppet).
Possess analytical skills, mental resilience and the ability to think systematically under stressful conditions.
Highly accountable and takes ownership. Outstanding work ethic, high-integrity, team player, and a lifelong learner.
Proficiency in verbal and written English.

The Nice-to-Haves

Experience in Go.
Experience with cloud-based large-scale infrastructure from vendors such as Amazon Web Services, Azure or Google Cloud Platform.
Experience with containerization technologies (e.g. Docker) and container orchestration platforms (e.g. Kubernetes).
Experience in building high-throughput streaming services, and knowledge of stream processing frameworks such as Flink.
Contributions to open source projects, and experience with performance analysis and debugging tools.

Additional Information

Our Commitment

We recognize that with these individual attributes come different workplace challenges, and we will work with Grabbers to address them in our journey towards creating inclusion at Grab for all Grabbers.