Senior Site Reliability Engineer

Posted 6 hours 26 minutes ago by Parker Shaw

£80,000 Annual

Permanent

Not Specified

Other

London, United Kingdom

Job Description

We are currently recruiting for a Senior Site Reliability Engineer to work on a permanent basis, 100% on-site in London. The successful candidate will be responsible for delivering end-to-end self-healing automation solutions to reduce manual effort and toil.

Nature of the Job:

Collaborate with the Production Support team to identify existing manual activities and automate them.
Identify areas of toil that can be automated to avoid manual intervention.
Build monitoring systems and an observability platform that provide richer stack traces, alerts, and dashboards.
Define SLAs, SLOs, and SLIs, and implement them for better monitoring.
Improve scalability, reliability, and observability, with the primary goal of reducing MTTD and MTTR.

Core Competencies:

Datadog, Splunk, Dynatrace, Grafana, Prometheus, ThousandEyes, Gremlin, etc.
Proficiency in creating dashboards for infrastructure, APM, and E2E workflows.
Monitoring, logging, alerting, and error budgets (against 99.9%, 99.99%, and 99.999% targets) for software, operations, and business.
Define SLOs, SLIs, and SLAs with business, operations, and engineering teams.
Automation and auto-healing using Python, shell scripting, JavaScript, etc., including developing custom monitoring services.
Experience with logging, monitoring, and event detection on Cloud or Distributed platforms.
ITIL incident and change management; proficient in problem management, including blameless postmortems, findings, applying permanent fixes, and documenting lessons learned.
Experience in technical operations and application support, with a focus on stability, reliability, and resiliency.
DevOps, Ansible, Terraform, Docker, AWS (Atlas), Jenkins CI/CD pipelines.
Unix/Linux, Windows Server, Oracle, MSSQL, MongoDB.

If you feel you have the skills and experience needed for this role, please apply now.