SRE Lead
Posted 2 hours 8 minutes ago by Project Recruit
Our client, a leading global supplier of IT services, requires an experienced SRE Lead to be based in their client's office in Edinburgh, UK.
This is a hybrid role - you can work remotely in the UK and attend the Edinburgh office 2 days per week.
This is a 6-month temporary contract (with a possible extension), to start ASAP.
Day rate: competitive market rate.
Role Description:
The client is looking for an experienced Site Reliability Engineer (SRE) to join their Everyday Banking Platform team. In this role, you will be responsible for ensuring the reliability, scalability and performance of their cloud infrastructure and applications on GCP. You will focus on automating infrastructure deployments, optimising CI/CD pipelines, monitoring the health of the GCP environment and driving performance improvements. Collaborating closely with development and support teams, you will help maintain high availability, resilience and a seamless experience for their customers' journeys on cloud. Additionally, you will focus on cost optimisation, incident management and enhancing observability across cloud-based systems.
Key Responsibilities:
Cloud Infrastructure Automation
- Design, develop and maintain cloud infrastructure on GCP
- Implement infrastructure as code using tools like Terraform, Ansible, etc. to automate cloud resource provisioning
- Optimise and maintain CI/CD pipelines for efficient application deployment and rollbacks using Jenkins, GitHub Actions or GitLab CI/CD
- Automate operational tasks to improve efficiency using scripting languages like Python, shell scripting, etc.
- Ensure security best practices, cost optimisation and performance tuning, adhering to governance standards for cloud workloads
Site Reliability & Production Support
- Continuously monitor the health and ensure the stability of GCP clusters, e.g. Google Kubernetes Engine (GKE), Compute Engine, etc.
- Implement automated health checks, capacity planning and optimise performance to ensure high availability and reliability
- Enhance observability and alerting using tools like Google Cloud Operations Suite (Stackdriver), Dynatrace, Splunk, etc.
- Define and maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs) and error budgets to improve system reliability
- Monitor, troubleshoot and resolve production incidents ensuring minimal downtime and disruption
- Conduct root cause analysis (RCA) to drive continuous improvement
Collaboration and Operational Excellence
- Work closely with development and support teams and Cloud Engineers to enhance application resilience and overall system reliability
- Improve monitoring, alerting and self-healing capabilities to minimise manual intervention
- Support infrastructure upgrades, cloud migrations and platform optimisation
- Develop and maintain playbooks, runbooks, automation scripts for streamlined operations
- Ensure compliance with security policies, IAM configurations and group standards across all cloud infrastructure and processes
- Maintain stability and reliability for essential production systems supporting the company's Investment Strategy and Execution teams
- Enhance the company's technology infrastructure to align with ambitious and dynamic business goals
- Address user-reported issues promptly and effectively, diagnosing root causes and implementing measures to avoid recurrence
- Assist in advancing the firm's integration with cloud technologies (e.g. AWS, Docker, Kubernetes) to deliver scalable and flexible resources for research
- Actively improve monitoring and alerting frameworks to ensure system health visibility and enable swift action during performance issues
- Set up and take part in an on-call rotation to provide after-hours support for critical global operations
- Support and ensure the stability of the applications
- Change, Incident and Problem management
- Participate in change planning, deployment and review
- Perform proactive maintenance activities, engage in automation activities, and perform root cause analysis and remediation
- Write and maintain scripts to monitor system functionality and performance
- Write scripts to automate multiple manual tasks
- Monitor incident queues, troubleshoot issues and lead conference calls with other groups to mitigate impacts
- Apply code and operation break fixes and other proactive maintenance activities
- Evaluate the issues reported by their clients in order to provide ongoing solutions
- Participate in the on-call rotation as assigned by the manager
- Provide extended support if needed and work with offshore teams to facilitate around-the-clock support for applications and services
Key Requirements:
- Strong expertise in GCP services such as Google Kubernetes Engine (GKE), Cloud Run, Compute Engine, Pub/Sub, etc.
- Hands-on experience with Terraform, Ansible or similar tools for infrastructure automation and provisioning
- Proven experience in building and optimising CI/CD pipelines for RTL environments using tools like Jenkins, GitHub Actions, GitLab CI/CD etc
- Experience with monitoring and observability tools like Google Cloud Operations Suite (Stackdriver), Dynatrace and Splunk, and a strong background in incident response and troubleshooting
- Solid understanding of networking, identity and access management (IAM), cloud security and best practices
- Demonstrated ability to work in Agile teams and collaborate effectively with development and operations teams to ensure system resilience and high availability
- Client-facing, assertive engineering leader
Due to the volume of applications received, unfortunately we cannot respond to everyone.
If you do not hear back from us within 7 days of sending your application, please assume that you have not been successful on this occasion.
Project Recruit