Site Reliability Engineer
CloudFactory
At CloudFactory, we are a mission-driven team passionate about unlocking the potential of AI to transform the world. By combining advanced technology with a global network of talented people, we make unusable data usable, driving real-world impact at scale.
More than just a workplace, we’re a global community founded on strong relationships and the belief that meaningful work transforms lives. Our commitment to earning, learning, and serving fuels everything we do as we strive to connect one million people to meaningful work and build leaders worth following.
Our Culture
At CloudFactory, we believe in building a workplace where everyone feels empowered, valued, and inspired to bring their authentic selves to work. We are:
- Mission-Driven: We focus on creating economic and social impact.
- People-Centric: We care deeply about our team’s growth, well-being, and sense of belonging.
- Innovative: We embrace change and find better ways to do things together.
- Globally Connected: We foster collaboration between diverse cultures and perspectives.
If you’re passionate about innovation, collaboration, and making a real impact, we’d love to have you on board!
Role Summary
As a Site Reliability Engineer, you will ensure the reliability, availability, and security of production systems. You’ll collaborate closely with engineers and operators to apply engineering best practices, automation, and operational excellence across infrastructure, reliability, and platform security in a mission-driven environment.
Key Responsibilities
- Design, build, and maintain scalable, resilient infrastructure that enables developer productivity and platform reliability.
- Establish and maintain Infrastructure as Code (IaC) standards, best practices, and reusable templates.
- Deploy, support, monitor, and maintain new and existing services, platforms, and application stacks.
- Troubleshoot production issues, perform rollbacks and service restorations, and create dashboards to ensure high availability.
- Create, maintain, and enhance runbooks for on-call and incident resolution.
- Define and manage availability targets and SLAs for platform products.
- Ensure production readiness across performance, availability, security, and compliance before go-live.
- Build and improve monitoring, alerting, logging, and debugging tools.
- Manage environment capacity planning and performance optimization.
- Partner with engineering teams to drive performance improvements using metrics such as latency and CPU utilization.
Must-Have Knowledge
- Cloud Architecture: Strong expertise in AWS-based cloud infrastructure and microservices (serverless and containerized).
- Infrastructure as Code (IaC): Proven experience provisioning and managing infrastructure via code.
- CI/CD & DevSecOps: Solid understanding of CI/CD pipelines, web security, and DevSecOps practices.
- Operational Excellence: Experience with monitoring, alerting, incident management, and 24/7 operational support.
Nice-to-Have Knowledge
- Broader web security principles beyond standard DevSecOps practices.
Skills & Experience
Must-Have Skills
- AWS Services: Hands-on experience with EC2, CloudFormation, ECS Fargate, Lambda, SQS, SNS, S3, ECR, RDS, and Route 53.
- IaC Tools: Terraform, CloudFormation, Serverless Framework; scripting with Bash, Python, or Go.
- Monitoring & Logging: Experience with Grafana, ELK stack, CloudWatch, and/or Prometheus.
- Containerization & Scripting: Proficiency with Docker and shell scripting.
- CI/CD Tools: Experience using GitHub Actions.
Nice-to-Have Skills
- Programming experience in Go, Node.js, or Python.
- Advanced troubleshooting skills for complex production and customer-facing issues.
General Requirements
- Ability to collaborate effectively across global teams and time zones.
- Strong problem-solving skills with the ability to simplify complex issues into actionable solutions.
- High ownership mindset with the drive to meet deadlines and support team success.
- Willingness to participate in 24/7 operational support processes.
What We Offer
- Great mission and culture
- Meaningful work
- Market-competitive salary
- Quarterly variable compensation
- Remote and home working
- Comprehensive medical cover
- Group life insurance
- Personal development and growth opportunities
- Office snacks and lunch
- Periodic team building and social events
At CloudFactory, we believe that work should be more than just a job—it should be a platform for growth, impact, and community. Here, you’ll earn with purpose, learn every day, and serve a mission that truly matters. If you're looking for a career where you can develop professionally, contribute meaningfully, and be part of a global movement, we’d love to have you on this journey!
Join us today and be part of our mission to connect people and technology for a better world! Apply now and bring your whole, authentic self to work—we can’t wait to meet you!