Site Reliability Engineering Manager at Sharp Decisions in Dallas, Texas

Posted in Other 2 days ago.

Type: full-time

Job Description:

NO 3RD PARTIES, NO C2C, NO H1B, NO RELOCATION

Job Title: Manager, Site Reliability

Job Summary: As Manager, Site Reliability, you will lead a team of Site Reliability Engineers (SREs) responsible for the availability, performance, and scalability of our services. You will work closely with development, operations, and product teams to build and maintain reliable systems, implement best practices, and ensure seamless deployment processes. Your leadership will be pivotal in fostering a culture of reliability and continuous improvement.

Key Responsibilities:
  • Team Leadership:
    • Manage and mentor a team of SREs, providing guidance, performance feedback, and professional development opportunities.
    • Foster a collaborative and inclusive team environment, encouraging innovation and knowledge sharing.
  • System Reliability:
    • Design, implement, and maintain scalable, resilient, and high-performance systems.
    • Develop and enforce reliability standards, best practices, and processes across the organization.
    • Monitor and analyze system performance and reliability metrics, identifying areas for improvement.
  • Incident Management:
    • Lead incident response efforts, ensuring timely resolution of production issues.
    • Conduct root cause analysis and post-mortems to prevent recurrence and improve system robustness.
    • Develop and maintain incident response plans, including documentation and communication protocols.
  • Automation and Tooling:
    • Drive automation initiatives to reduce manual intervention, improve efficiency, and minimize downtime.
    • Implement and maintain monitoring, alerting, and logging tools to ensure visibility into system health.
    • Develop and maintain CI/CD pipelines to streamline deployment processes.
  • Collaboration and Communication:
    • Work closely with development teams to design and implement reliable and scalable applications.
    • Collaborate with product teams to understand requirements and ensure reliability considerations are integrated into the development process.
    • Communicate effectively with stakeholders, providing regular updates on system reliability and performance.
  • Security and Compliance:
    • Ensure systems adhere to security best practices and compliance requirements.
    • Conduct regular security assessments and audits, implementing necessary improvements.
    • Stay informed about emerging security threats and technologies, adapting practices as needed.

Qualifications:
  • Education and Experience:
    • Bachelor's degree in Computer Science, Engineering, or a related field; Master's degree preferred.
    • 7+ years of experience in Site Reliability Engineering, DevOps, or related roles.
    • 3+ years of experience in a leadership or management position.
  • Technical Skills:
    • Proficiency in cloud platforms (AWS, Google Cloud Platform, Azure) and in containerization and container orchestration (Docker, Kubernetes).
    • Strong scripting and programming skills (Python, Go, Bash, etc.).
    • Experience with infrastructure as code (Terraform, Ansible, etc.) and configuration management tools.
    • Knowledge of networking, security, and database management.
  • Soft Skills:
    • Excellent leadership and team management abilities.
    • Strong problem-solving and analytical skills.
    • Effective communication and interpersonal skills.
    • Ability to work in a fast-paced, dynamic environment and manage multiple priorities.
