Lead Software Engineer, Infrastructure
Grab
Company Description
About Grab and Our Workplace
Grab is Southeast Asia's leading superapp. From getting your favourite meals delivered to helping you manage your finances and getting around town hassle-free, we've got your back with everything. At Grab, purpose gives us joy and habits build excellence, and we harness the power of technology and AI to deliver our mission of driving Southeast Asia forward by economically empowering everyone, with heart, hunger, honour, and humility.
Job Description
Get to Know the Team
The Robotics Technology team is a core part of Grab's long-term vision to build urban embodied AI. Our engineers take full ownership of the product lifecycle: designing and manufacturing hardware in-house, developing control and machine‑learning systems, and rigorously testing in real-world conditions and production fleet operations. This is a fast-moving, multidisciplinary environment where software, hardware and data science experts collaborate closely to solve practical challenges at scale. We are executing an ambitious growth plan to expand our robotics fleet across cities over the coming years, and we are focused on delivering highly productive, safe and efficient robot delivery services that help address current delivery labor shortages.
Based in Singapore and China, our team offers opportunities to work on cutting-edge autonomy, deploy solutions in complex urban environments, and directly influence the future of last‑mile logistics. If you're excited by tangible impact, large-scale systems and cross-functional engineering, you'll find meaningful challenges and rapid career growth here.
Get to Know the Role
We are seeking an experienced and passionate Lead Software Engineer, Infrastructure (Lead Site Reliability Engineer) to join our robotics platform team in Singapore. You will work with a small, highly experienced engineering group building and maintaining the cloud and backend infrastructure that powers our robot fleet, simulation environments, data pipelines and ML model training/deployment. You will own the reliability, scalability and observability of systems used both in production robot operations and engineering/data science workflows. Initiative, strong problem‑solving skills and an operator-first mindset are critical — you'll proactively identify problems, build robust solutions, and help the broader team run and scale complex distributed systems.
You will report to the Senior Principal Engineer.
Work Type: 5-day onsite.
The Critical Tasks You Will Perform
- Design, build and operate scalable cloud infrastructure and IaC for fleet management, telemetry ingestion, data lakes, model training, and CI/CD for embedded and cloud services.
- Implement and maintain observability (metrics, tracing, logging) and alerting for robot operations, simulation, and platform services.
- Lead incident response, root cause analysis and postmortems, and implement corrective engineering to prevent recurrence.
- Automate operational tasks: deployments, rollbacks, canary releases, capacity scaling, backups and failover.
- Collaborate with software, hardware, ML, and operations teams to define SLOs/SLIs and ensure systems meet availability and performance targets.
- Build and scale data and model training pipelines (batch/streaming), and provide reliable compute/resource platforms for training and inference.
- Harden systems for security, access control and compliance in production and test environments.
- Mentor and grow SRE and platform engineers; drive platform best practices, runbooks and runbook automation.
- Support and improve simulation and mapping infrastructure used by engineers and data scientists.
Qualifications
What Essential Skills You Will Need
- At least 6 years of experience in cloud engineering with Alibaba Cloud (Aliyun) and/or AWS, designing and operating production services at scale.
- Experience implementing Infrastructure as Code primarily with Terraform.
- Expertise with containerization and orchestration: Docker and Kubernetes.
- Experience with Infrastructure as Code patterns and CI/CD/GitOps practices.
- Knowledge of observability tooling and practices: Prometheus/Grafana, OpenTelemetry, ELK/EFK, distributed tracing.
- Programming skills in C++ and Python; Go is a plus.
- Experience building and operating streaming systems and data pipelines (Kafka, Pub/Sub, Dataflow, Spark or equivalent).
- Solid networking, security and distributed systems fundamentals (load balancing, DNS, TLS, VPCs, IAM).
- Prior experience owning production incidents, performing postmortems and driving follow-through engineering fixes.
- Track record of mentoring engineers and leading cross-functional technical initiatives.
Soft skills
- Ability to explain technical trade-offs to both engineers and non-technical stakeholders.
- Collaborative mindset: work across software, hardware and data teams.
- Experience solving problems with an analytical, systematic approach.
- Experience prioritizing and making trade-offs that balance reliability, speed and cost.
- Willingness to be on-call and support distributed operations including field/test site coordination.
Nice-to-haves
- Prior experience with robotics, ROS/ROS2, real-time systems or embedded deployment.
- Familiarity with simulation tools — we primarily use CARLA and NVIDIA Isaac Sim.
- Exposure to ML/AI infrastructure (Kubeflow, MLflow, model serving frameworks).
- Experience with fleet management, telemetry design and edge/cloud hybrid architectures.
- Background in hardware-in-the-loop testing or remote device management.
- Experience working across APAC sites or with distributed, cross-border teams.
Additional Information
Life at Grab
We care about your well-being at Grab. Here are some of the global benefits we offer:
- We have your back with Term Life Insurance and comprehensive Medical Insurance.
- With GrabFlex, create a benefits package that suits your needs and aspirations.
- Celebrate moments that matter in life with loved ones through Parental and Birthday leave, and give back to your communities through Love-all-Serve-all (LASA) volunteering leave.
- We have a confidential Grabber Assistance Programme to guide and uplift you and your loved ones through life's challenges.
- Balancing personal commitments and life's demands is made easier with our FlexWork arrangements, such as differentiated hours.
What We Stand For at Grab
We are committed to building an inclusive and equitable workplace that enables diverse Grabbers to grow and perform at their best. As an equal opportunity employer, we consider all candidates fairly and equally regardless of nationality, ethnicity, religion, age, gender identity, sexual orientation, family commitments, physical and mental impairments or disabilities, and other attributes that make them unique.