Site Reliability Engineering Lead Responsibilities:
• Engage in and constantly improve the whole lifecycle of both new and existing services, from onboarding through deployment, operation, and refinement.
• Support services before they go live through activities such as architecture review and support grounded in SRE best practices, development of software platforms and frameworks, capacity planning, and launch reviews.
• Triage site availability incidents and proactively work toward reducing MTTR for customer-impacting incidents.
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
• Establish design patterns for observability.
• Scale systems sustainably through mechanisms like automation; evolve systems by pushing for changes that improve reliability and velocity.
• Lead and practice sustainable incident response and blameless postmortems across the application portfolio.
• Partner with service owners to implement Service Level Metrics and Service Level Objectives that act as service-level health indicators.
• Develop and maintain technical documentation, network diagrams, runbooks, and procedures.
Qualifications:
• 8+ years of overall experience as an enterprise-class Site Reliability Engineer.
• 4+ years of experience managing production container environments in AWS (i.e., ECS or EKS).
• Track record of monitoring and analyzing system performance and isolating issues or bottlenecks that could impact reliability, performance, and scalability.
• Strong experience with observability tools such as Grafana, Prometheus, and Zabbix.
• Full-stack debugging and performance optimization ability, including knowledge of cloud systems (load balancing, caching, content distribution, etc.), continuous integration/build systems, Java, and SQL and NoSQL databases.