Job Details

ID #54079496
State Arizona
City Phoenix
Job type Full-time
Salary USD TBD
Source Nexthink
Shown 2025-06-27
Date 2025-06-27
Deadline 2025-08-26
Category Et cetera

Platform Site Reliability Engineer

Phoenix, Arizona 85001, USA

Nexthink is looking for a strong Platform Engineer with SRE operations experience to strengthen our infrastructure and accelerate our ability to deploy, monitor, and scale systems effectively. As a SaaS provider, our customers rely on us to deliver a seamless, reliable, and scalable experience 24/7. This role must be located in the West or Mountain Time Zone.

Join Nexthink's vibrant team, where cutting-edge technology meets innovation. Be a part of Nexthink's Digital Employee Experience technological revolution, ensuring our global customers enjoy a seamless user experience. Embrace the future with Nexthink in the US; apply now and become a key player in our dynamic Platform Engineering/SRE organization.

What You'll Do:

Design, build, and maintain the infrastructure powering our multi-tenant SaaS platform with reliability, security, and scalability in mind.
Implement and manage cloud-native systems (AWS) using best-in-class tools and automation.
Operate and enhance Kubernetes clusters, deployment pipelines, and service meshes to support continuous delivery.
Establish and enforce SLOs, SLAs, and error budgets, and proactively address availability and performance issues.
Develop infrastructure as code (Terraform or similar) for repeatable and auditable provisioning.
Experience programming solutions for platform tooling, such as automation, monitoring, and provisioning.
Solid understanding of the network stack (TCP/IP, VPN, HTTP, SSL, routing, etc.), cloud topologies (VPC, virtual subnets, NACLs, NSG, ILB, ELB, etc.), and storage (S3, EBS, Azure Files, etc.).
Monitor system health, application performance, and user-facing SLAs using tools like Datadog, Prometheus, and Grafana.
Take a leading role in improving incident response practices and help reduce mean time to detect (MTTD) and mean time to recover (MTTR).
Experience coordinating teams and individuals to maintain an SLA.
Ability to troubleshoot, narrow down, and fix incidents with minimal intervention from other functions.
Participate in a shared on-call rotation, responding to incidents, troubleshooting outages, and driving timely resolution and communication.
Work closely with software engineers to embed reliability and observability into every service.
Develop automated runbooks, health checks, and alerting to support reliable operations with minimal manual intervention.
Support automated testing, canary deployments, and rollback strategies to ensure safe, fast, and reliable releases.
Contribute to security best practices, compliance automation, and cost optimization.
