Senior Site Reliability Engineer
Posted 6 hours 26 minutes ago by Parker Shaw
£80,000 Annual
Permanent
Not Specified
Other
London, United Kingdom
Job Description
We are currently recruiting for a Senior Site Reliability Engineer to work on a permanent basis, 100% on-site in London. The successful candidate will be responsible for delivering end-to-end self-healing automation solutions that reduce manual effort (toil).
Nature of the Job:
- Collaborate with the Production Support team to identify existing manual activities and automate them.
- Identify areas of toil that can be automated to avoid manual intervention.
- Build monitoring systems and an observability platform that provide richer stack traces, alerts, and dashboards.
- Define SLAs, SLOs, and SLIs and implement them for better monitoring.
- Improve scalability, reliability, and observability, with the primary goal of reducing MTTD and MTTR.
Core Competencies:
- Datadog, Splunk, Dynatrace, Grafana, Prometheus, ThousandEyes, Gremlin, etc.
- Proficiency in creating dashboards for infrastructure, APM, and end-to-end workflows.
- Monitoring, logging, alerting, and error budgets (99.9%, 99.99%, 99.999%) for software, operations, and business.
- Defining SLOs, SLIs, and SLAs with business, operations, and engineering teams.
- Automation and auto-healing: Python, shell scripting, JavaScript, etc.; developing custom monitoring services.
- Experience with logging, monitoring, and event detection on Cloud or Distributed platforms.
- ITIL: incident and change management; proficient in problem management, including blameless postmortems, findings, applying permanent fixes, and documenting lessons learned.
- Experience in technical operations, application support, stability, reliability, and resiliency.
- DevOps, Ansible, Terraform, Docker, AWS (Atlas), Jenkins CI/CD pipelines.
- Unix/Linux, Windows Server, Oracle, MSSQL, MongoDB.
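For candidates less familiar with error budgets, the availability targets mentioned above (99.9%, 99.99%, 99.999%) can be translated into permitted downtime. The following is only an illustrative sketch: the 30-day window and the function name `error_budget_minutes` are assumptions made for this example, not part of the role's actual tooling.

```python
# Illustrative sketch: convert an availability target into allowed downtime.
# The 30-day measurement window is an assumption made for this example.
MINUTES_PER_30_DAYS = 30 * 24 * 60


def error_budget_minutes(slo_percent: float,
                         window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Return the downtime, in minutes, that an availability SLO permits."""
    return window_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = error_budget_minutes(target)
        print(f"{target}% availability -> {budget:.2f} minutes per 30 days")
```

Under these assumptions a 99.9% target leaves roughly 43 minutes of downtime per 30 days, and each additional nine cuts that budget by a factor of ten.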
If you feel you have the skills and experience needed for this role, please apply now.