Manager, Site Reliability Engineering: Job Description
Position Title: Manager, Site Reliability Engineering
Reports to: Director of Systems Engineering
Position Summary
This position is responsible for managing the overall stability of the customer engineering organization, leading a team of dedicated engineers while coordinating with stakeholders in development, infrastructure, product, and leadership. The manager oversees the stability of the website and store fleet when incidents occur and identifies how to improve stability going forward. The manager of the Site Reliability Engineering team has the opportunity to develop processes and technological solutions that address site stability, and will have full control over the direction of the stability roadmap.
Responsibilities
• Manage the Site Reliability Engineering roadmap, backlog, and active triages to ensure the team delivers on both the proactive and reactive stability needs of the customer engineering organization
• Deliver on tactical decisions while maintaining the quality of day-to-day activities through effective management of full-time and contract resources
• Define day-to-day tasks and projects for team members, and track and manage the delivery of work
• Communicate effectively, verbally and in writing, with leadership, cross-functional partners, and individual contributors regarding incidents, follow-up, and team deliverables
• Maintain and enhance stability benchmarks that reflect the overall stability of the site through KPIs, SLAs, SLOs, and SLIs, and report on these metrics regularly
• Identify opportunities for process, people, and technology improvements in the stability organization and formalize plans to execute on them
• Reduce manual tasks through automation, process improvement, training, or elimination of the manual need
• Mentor individual contributors to achieve technical maturity and personal growth
• Participate in business-critical incidents and facilitate coordination, communication, and resolution, as well as incident follow-up and prevention
• Partner with the development team to understand how applications and features will impact the overall stability of the site, and introduce or modify monitoring and operational processes to meet these needs
• Partner with cross-functional teams to identify and mitigate risks to system reliability and ensure application stability
Qualifications
• Experience as an Engineering Lead / Manager (Infrastructure, SRE, DevOps, Development, Incident Management)
• Experience in business-critical technical incident triage and troubleshooting
• Expertise in monitoring tools and technologies (New Relic, Datadog, Dynatrace, Splunk, ELK, Google Observability) and their use in triage and problem investigation
• Experience with automation tools (Ansible, Chef, Puppet, Terraform)
• Understanding of cloud platforms (AWS, GCP, Azure)
• Effective verbal and written communication with technical and non-technical audiences
• Demonstrated hands-on experience with, and understanding of, software development, testing, deployments, and project management methodologies
• Experience in developing and executing plans, meeting deadlines, and operating under tight time constraints
• Demonstrated ability to anticipate, mitigate, and resolve technical challenges across numerous disciplines