Vacancy expired!
- Actively participate in the on-call rotation to resolve incidents affecting the platform and any dependent components that the product engineering teams rely on. On-call work will take no more than 25% of your time; the rest will be spent automating and developing quality and operational improvements.
- Maintain and improve operational tooling and frameworks, and perform chaos engineering activities.
- Perform root cause analysis and deliver resolution for tools and automation failures.
- Build frameworks that test the performance and resiliency of our platform services/tools
- Build/integrate/administer systems and tools that enable engineering teams to observe their applications in production with autonomy (Dashboards, APMs)
- Automate alerts for metrics on performance, cost, vulnerabilities, risk, compliance violations
- Identify and measure SLOs, SLAs and SLIs
- Improve processes/runbooks and champion automation of any manual items around support
- 3+ years developing cloud-native applications in one or more languages (TypeScript and .NET Core (C#) preferred)
- 3+ years deploying and operating cloud-native applications in a public cloud (Azure preferred)
- 2+ years supporting software and/or cloud infrastructure in an on-call rotation, identifying and remediating technical problems at the root cause
- Strong, proactive communication skills around the status of projects and issues
- Experience with Docker and Kubernetes in production (Azure Kubernetes Service preferred)
- Strong Git skills
- Experience using centralized logging solutions (Splunk preferred, ELK, etc.)
- Experience using active monitoring systems (Datadog, New Relic, etc.)
- 3+ years implementing dashboards that help teams visualize logs, instrumentation, and other data to ensure optimal performance of the platform services, infrastructure, and deployed applications (Grafana preferred)
- Experience creating runbooks, processes, and test plans covering the reliability and performance of infrastructure and applications
- Experience planning and supporting 99.99%+ availability for critical applications in production