SRE Lead

Posted 2 hours 8 minutes ago by Project Recruit

Contract

Not Specified

Temporary Jobs

Midlothian, Edinburgh, United Kingdom, EH120

Job Description

Our client, a leading global supplier of IT services, requires an experienced SRE Lead to be based at their client's office in Edinburgh, UK.

This is a hybrid role - you can work remotely in the UK and attend the Edinburgh office 2 days per week.

This is a 6-month temporary contract (with the possibility of extension), starting ASAP.

Day rate: Competitive Market rate.

Role Description:

The client is looking for an experienced Site Reliability Engineer (SRE) to join their Everyday Banking Platform team. In this role, you will be responsible for ensuring the reliability, scalability and performance of their cloud infrastructure and applications on GCP. You will focus on automating infrastructure deployments, optimising CI/CD pipelines, monitoring the health of the GCP environment and driving performance improvements. Collaborating closely with development and support teams, you will help maintain high availability, resilience and a seamless experience for their customer journeys on cloud. Additionally, you will focus on cost optimisation, incident management and enhancing observability across cloud-based systems.

Key Responsibilities:

Cloud Infrastructure Automation

Design, develop and maintain cloud infrastructure on GCP
Implement infrastructure as code using tools like Terraform, Ansible etc to automate cloud resource provisioning
Optimise and maintain CI/CD pipelines for efficient application deployment and rollbacks using Jenkins, GitHub Actions or GitLab CI/CD
Automate operational tasks to improve efficiency using scripting languages like Python, Shell etc
Ensure security best practices, cost optimisation and performance tuning, adhering to governance standards for cloud workloads

Site Reliability & Production Support

Continuously monitor the health and stability of GCP services, e.g. Google Kubernetes Engine (GKE) clusters, Compute Engine etc
Implement automated health checks and capacity planning, and optimise performance to ensure high availability and reliability
Enhance observability and alerting using tools like Google Cloud Operations Suite (Stackdriver), Dynatrace, Splunk etc
Define and maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs) and error budgets to improve system reliability
Monitor, troubleshoot and resolve production incidents ensuring minimal downtime and disruption
Conduct root cause analysis (RCA) to drive continuous improvement.

Collaboration and Operational Excellence

Work closely with development teams, support teams and cloud engineers to enhance application resilience and overall system reliability
Improve monitoring, alerting and self-healing capabilities to minimise manual intervention
Support infrastructure upgrades, cloud migrations and platform optimisation
Develop and maintain playbooks, runbooks and automation scripts for streamlined operations
Ensure compliance with security policies, IAM configurations and group standards across all cloud infrastructure and processes
Maintain stability and reliability for essential production systems supporting the company's Investment Strategy and Execution teams
Enhance the company's technology infrastructure to align with ambitious and dynamic business goals
Address user-reported issues promptly and effectively, diagnosing root causes and implementing measures to avoid recurrence
Assist in advancing the firm's integration with cloud technologies (e.g. AWS, Docker, Kubernetes) to deliver scalable and flexible resources for research
Actively improve monitoring and alerting frameworks to ensure system health visibility and enable swift action during performance issues
Set up and take part in an on-call rotation to provide after-hours support for critical global operations
Support and ensure the stability of the applications

Change, Incident and Problem Management

Participate in change planning, deployment and review
Perform proactive maintenance activities, engage in automation activities, and perform root cause analysis and remediation
Write and maintain scripts to monitor system functionality and performance
Write scripts to automate multiple manual tasks
Monitor incident queues, troubleshoot issues and lead conference calls with other groups to mitigate impacts
Apply code and operation break fixes and other proactive maintenance activities
Evaluate issues reported by clients in order to provide lasting solutions
Participate in the on-call rotation as assigned by your manager
Provide extended support if needed and work with offshore teams to facilitate round-the-clock support for applications and services

Key Requirements:

Strong expertise in GCP services such as Google Kubernetes Engine (GKE), Cloud Run, Compute Engine, Pub/Sub etc
Hands-on experience with Terraform, Ansible or similar tools for infrastructure automation and provisioning
Proven experience in building and optimising CI/CD pipelines for RTL environments using tools like Jenkins, GitHub Actions, GitLab CI/CD etc
Experience with monitoring and observability tools like Google Cloud Operations Suite (Stackdriver), Dynatrace and Splunk, and a strong background in incident response and troubleshooting
Solid understanding of networking, identity and access management (IAM), cloud security and best practices
Demonstrated ability to work in Agile teams and collaborate effectively with development and operations teams to ensure system resilience and high availability
Client-facing, assertive engineering leader

Due to the volume of applications received, unfortunately we cannot respond to everyone.

If you do not hear back from us within 7 days of sending your application, please assume that you have not been successful on this occasion.
