Lead Site Reliability Engineer - Azure - Engineering

Posted 3 hours 44 minutes ago by Mentmore Recruitment

£120,000 Annual
Permanent
Not Specified
Other
London, United Kingdom
Job Description

Lead Site Reliability Engineer - Azure/AWS - Terraform - Engineering - London

My financial services client is looking for a Lead Site Reliability Engineer who will be responsible for ensuring the reliability and scalability of their infrastructure and services. This is a senior role requiring technical expertise, leadership, and a commitment to continuous improvement. You must have team lead/mentoring experience and be able to balance technical delivery, team productivity, performance measurement, and collaboration across teams and stakeholders.

Duties & Responsibilities:

Hands-On Engineering & Technical Leadership:

  • Design, develop, and maintain cloud infrastructure (Azure/AWS) using Terraform and automation.
  • Lead troubleshooting, performance optimisation, and incident resolution to enhance reliability.
  • Ensure best practices in CI/CD pipelines, observability, and infrastructure deployment.
  • Promote Transparency, Inspection, and Adaptation by making both system and team health data accessible and actionable.
  • Work with engineering leads, business stakeholders, and the Head of Platform Operations to define and enforce SLAs, SLOs, and engineering standards that support scalability, reliability, and operational efficiency.
  • Design solutions with a systems-thinking approach, ensuring infrastructure, observability, and automation strategies support sustainable growth.
  • Improve deployment pipelines, automation, and operational workflows across squads, fostering consistency and best practices.
  • Support capacity planning, scalability, and security best practices, proactively identifying risks and opportunities to enhance platform resilience.

Team Productivity, Performance & Agile Ways of Working:

Experience Required:

  • Proven leadership experience in technical teams, with a focus on mentoring, professional development, and fostering a culture of innovation, reliability, and engineering excellence.
  • Proven experience in Site Reliability Engineering, DevOps, or Systems Engineering, with hands-on experience in both Azure and AWS environments.
  • Demonstrable expertise in high-performance, scalable, and highly available systems, with experience in optimising reliability, capacity planning, and system performance.
  • Deep expertise in DevOps principles, including automation, infrastructure as code (Terraform, Ansible, or Chef), GitOps workflows, CI/CD best practices (GitHub Actions, GitLab CI/CD, Azure DevOps), and collaborative ways of working.
  • Strong background in containerisation (Docker) and orchestration (Kubernetes), with a focus on scalability and resilience.
  • Hands-on experience with monitoring, observability, and incident management tools (Prometheus, Grafana, ELK, Azure Monitor, Application Insights, Kusto) and a data-driven approach to improving system reliability.
  • Strategic mindset, able to align technical initiatives with business goals, drive scalability and performance improvements, and proactively tackle complex challenges.
  • Strong understanding of regulatory and security requirements, such as ISO 27001, PCI DSS, Cyber Essentials Plus (CE+), and SOX, with experience implementing compliance-driven engineering practices.
  • Advocate for modern DevOps and SRE best practices, championing collaboration, transparency, automation, continuous learning, and continuous improvement across teams.
  • Excellent communication skills, able to engage stakeholders, collaborate cross-functionally, and drive alignment on reliability and operational priorities.