Senior Site Reliability Engineer

Posted 6 hours 26 minutes ago by Parker Shaw

£80,000 Annual
Permanent
Not Specified
Other
London, United Kingdom
Job Description

We are currently recruiting for a Senior Site Reliability Engineer to work on a permanent basis, 100% on-site in London. The successful candidate will be responsible for delivering end-to-end self-healing automation solutions that reduce manual effort and toil.

Nature of the Job:

  • Collaborate with the Production Support team to identify existing manual activities and automate them.
  • Identify areas of toil that can be automated to avoid manual intervention.
  • Build monitoring systems and an observability platform that provide richer stack traces, alerts, and dashboards.
  • Define SLAs, SLOs, and SLIs and implement them for better monitoring.
  • Treat scalability, reliability, and observability as the primary goals, with the aim of reducing MTTD and MTTR.

Core Competencies:

  • Datadog, Splunk, Dynatrace, Grafana, Prometheus, ThousandEyes, Gremlin, etc.
  • Proficiency in creating dashboards for infrastructure, APM, and end-to-end workflows.
  • Monitoring, logging, alerting, and error budgets (99.9%, 99.99%, 99.999%) for software, operations, and business.
  • Defining SLOs, SLIs, and SLAs with business, operations, and engineering teams.
  • Automation and auto-healing: Python, shell scripting, JavaScript, etc.; developing custom monitoring services.
  • Experience with logging, monitoring, and event detection on Cloud or Distributed platforms.
  • ITIL: incident and change management; proficiency in problem management, including blameless postmortems, findings, applying permanent fixes, and lessons-learned documentation.
  • Experience in technical operations and application support, with a focus on stability, reliability, and resiliency.
  • DevOps, Ansible, Terraform, Docker, AWS (Atlas), Jenkins CI/CD pipelines.
  • Unix/Linux, Windows Server, Oracle, MSSQL, MongoDB.
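
For candidates less familiar with the error-budget figures mentioned above (99.9%, 99.99%, 99.999%), the following is a minimal illustrative sketch of how an availability target translates into allowed downtime. The function name and the 30-day window are assumptions chosen for illustration only; they are not part of the role's actual tooling.

```python
# Illustrative sketch: convert an availability SLO into an error budget.
# The 30-day window and the function name are assumptions, not a standard.

def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the minutes of downtime allowed by an availability SLO
    over a window of `window_days` days."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    total_minutes = window_days * 24 * 60
    # The error budget is the fraction of the window NOT covered by the SLO.
    return total_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = error_budget_minutes(target)
        print(f"{target}% over 30 days allows about {budget:.3f} minutes of downtime")
```

Under these assumptions, the three targets allow roughly 43.2, 4.32, and 0.43 minutes of downtime per 30 days respectively, which is why each additional "nine" demands substantially more automation and faster detection.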

If you feel you have the skills and experience needed for this role, please apply now.