Key Responsibilities:
- Implement and maintain site reliability processes and systems.
- Respond to service outage escalations and provide guidance, working alongside software engineers.
- Review monitoring metrics and assess what they indicate about current system behavior.
- Research and implement new tools and technologies to solve problems more efficiently.
- Conduct root cause analysis of production issues, including complex backend troubleshooting and debugging.
- Collaborate with cross-functional teams to achieve reliability excellence.
Preferred Experience:
- 5+ years of proven work experience as a Site Reliability Engineer or in a similar role, particularly in the retail industry.
- Hands-on experience supporting consumer-facing applications.
- In-depth knowledge of AWS services and best practices for cloud infrastructure.
- Proficiency with Grafana for monitoring and observability.