We are currently searching for a Senior Site Reliability Engineer to join Mueller's Smart Water Infrastructure team. This role will be based in Atlanta, GA, on a hybrid office/remote schedule.
The Senior Site Reliability Engineer (SRE) is responsible for deploying and monitoring software products and for ensuring their availability, reliability, scalability, and performance against operational targets. They are also responsible for the design, implementation, and maintenance of the infrastructure required to support those products.
Key responsibilities
Collaborate with software development teams to ensure that services are designed with availability, security, scalability, reliability, and performance in mind from the outset.
Monitor and manage live production environments, identifying and resolving issues as they arise and implementing long-term solutions to prevent their recurrence.
Develop and maintain automation tools for system health, performance monitoring, and incident response to ensure rapid detection and resolution of issues.
Resolve support issues that call for your experience to diagnose the problem quickly and find an appropriate resolution.
Lead root cause analysis of critical outages, contributing to a culture of learning and continuous improvement.
Provide SRE/DevOps/Infrastructure services and guidance to the Software Team.
Support services that are not managed by a vendor, such as databases.
Coordinate internal and external security and penetration tests, and manage the prioritization and resolution of any findings.
Produce well-written documentation and architecture diagrams.
Be available out of hours if required to complete specific tasks and to support customers in emergency or disaster scenarios. This is not a regular occurrence.
Mentor junior engineers, fostering a culture of technical excellence and collaborative problem-solving.
Key competencies
Strong technical competency in software product operations.
Strong collaboration skills to work effectively with cross-functional teams.
Excellent communication skills, both verbal and written, to effectively articulate technical and product information.
Ability to prioritize and manage multiple tasks simultaneously and work under tight deadlines.
Exceptional problem-solving abilities and a systematic approach to root cause analysis.
Experience required
Bachelor's or Master's degree in a computing or scientific/engineering discipline, or equivalent demonstrable experience.
5+ years of experience as a Site Reliability Engineer.
Operational experience with AWS serverless technologies.
Linux and Windows system administration.
CI/CD pipelines.
Database administration.
Patch management and disaster recovery.
Advanced monitoring knowledge.
Automation scripting in a mainstream programming language.
Security fundamentals, including Snyk, TFSec, and other security tools.
We are an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other category protected by law.