Manager of site reliability at FTSi.Tech in Pittsburgh, Pennsylvania

Posted in Other 2 days ago.

Type: full-time





Job Description:

Manager Site Reliability Engineering Job Description

Position Title: Manager Site Reliability Engineering

Reports to: Director of Systems Engineering

Position Summary

This position is responsible for managing the overall stability of the customer engineering organization, leading a team of dedicated engineers while coordinating with stakeholders in development, infrastructure, product, and leadership. This position is responsible for managing the stability of the website and store fleet when incidents occur, as well as identifying how we can improve in the future. The manager of the Site Reliability Engineering team has the opportunity to develop processes and technological solutions to address site stability, and will have full control over the direction of the stability roadmap.

Responsibilities
• Manage the Site Reliability Engineering roadmap, backlog, and active triages to ensure the team is delivering on both the proactive and reactive stability needs of the customer engineering organization
• Deliver on tactical decisions while maintaining the quality of day-to-day activities through effective management of full-time and contract resources
• Define day-to-day tasks and projects for team members, and track and manage the delivery of work
• Communicate effectively with leadership, cross-functional partners, and individual contributors through verbal and written communication regarding incidents, follow-up, and team deliverables
• Maintain and enhance stability benchmarks that reflect overall stability of the site through KPIs, SLAs, SLOs, and SLIs and report on these metrics regularly
• Identify opportunities for process, people, and technology improvements in the stability organization and formalize plans to execute on these improvements
• Reduce manual tasks through automation, process improvement, training, or eliminating the need for manual work
• Mentor individual contributors to achieve technical maturity and personal growth
• Participate in business-critical incident events and facilitate coordination, communication, and resolution, as well as incident follow-up and prevention
• Partner with the development team to understand how applications and features will impact the overall stability of the site, and introduce or modify monitoring and operational processes to meet these needs
• Partner with cross-functional teams to identify and mitigate risks to system reliability and ensure application stability

Qualifications
• Experience as an Engineering Lead / Manager (Infrastructure, SRE, DevOps, Development, Incident Management)
• Experience in business-critical technical incident triage and troubleshooting
• Expertise in monitoring tools and technologies (New Relic, Datadog, Dynatrace, Splunk, ELK, Google Observability) and their usage in triage and problem investigation
• Experience in automation tools (Ansible, Chef, Puppet, Terraform)
• Understanding of cloud platforms (AWS, GCP, Azure)
• Effective verbal/written communication to technical and non-technical audiences
• Demonstrated hands-on experience with, and understanding of, software development, testing, deployments, and project management methodologies
• Experience in developing and executing plans, meeting deadlines and operating under tight time constraints
• Demonstrated ability to anticipate, mitigate, and resolve technical challenges across numerous disciplines