[Remote] Principal Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Zayo Group is a provider of mission-critical bandwidth to impactful companies, with a vast network across North America and Europe. They are seeking a Principal Site Reliability Engineer to ensure the uptime, performance, and scalability of their critical infrastructure, focusing on automation, monitoring, and collaboration with various teams.
Responsibilities
- Develop and implement automation solutions to streamline operations and reduce manual effort
- Design and implement effective monitoring and alerting systems to proactively identify and address issues
- Own the incident lifecycle, from leading root cause analysis and resolution to implementing preventative measures. Be on call to diagnose and resolve critical service outages
- Proactively identify and mitigate potential system risks, focusing on automation, monitoring, and tooling to ensure high service availability
- Design and implement solutions to ensure our infrastructure can handle ever-growing demands while maintaining optimal application performance
- Work closely with developers, product managers, and other engineers to translate business needs into robust and reliable technical solutions. Become the beacon for best practices and efficient processes throughout the organization
Skills
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience)
- Minimum of ten (10) years of experience in a Site Reliability Engineering or related role
- Strong understanding of system administration, Linux, and scripting languages (Python and various shells)
- Expert at developing automation tools for monitoring, alerting, and deployment to ensure efficient and reliable operations
- Expert at designing and implementing monitoring systems at scale
- Expert at container orchestration (Kubernetes and Docker)
- Experience with monitoring platforms such as SevOne, Assure1, and Nagios, as well as various vendor NMS platforms
- Previous work in large-scale distributed production environments
- Experience with a variety of cloud platforms and tools (AWS, Google, etc.)
- Experience with a variety of monitoring and alerting tools (Prometheus, Grafana, Cacti, etc.)
- Strong working knowledge of networking concepts and application protocols, especially TCP/IP, BGP, DNS, TLS, and HTTP/S
- Experience with infrastructure management tools such as Ansible, Terraform, and Puppet to deploy and manage infrastructure at scale
- Proven leadership skills, with the ability to mentor and inspire others
- Excellent problem-solving, analytical, and critical thinking skills
- A passion for automation and building efficient systems
- Experience working with various vendor APIs (or NETCONF), including Nokia, Juniper, Fujitsu, Infinera, Cisco, and Ciena
- Experience with various network orchestration platforms such as Ciena Blue Planet MDSO, Cisco NSO, Nokia NSP, or others
Benefits
- Health insurance
- Life insurance
- Disability retirement plans
- Paid time off
- Excellent Health, Dental & Vision Insurance
- Retirement 401(k) Savings Plan
- Generous paid time off policy including paid parental leave
Company Overview