Vacancy expired!
SRE (Python) - Trading Systems Runtime

Who are we?

Our team supports the Bloomberg Trading Solutions (TS) platform, which provides hosted Buyside (AIM) and Sellside (TOMS) Order Management Systems (OMS) to some of the largest institutional broker-dealers, asset managers, asset owners and hedge funds in the world. Each OMS instance is composed of services and databases that are developed, owned, and managed by hundreds of TS engineers across the world. Our clients love our hosted OMS offering, which allows them to reduce their total cost of ownership while benefiting from continuous upgrades in both software and hardware.

To offer these products to our clients, the TS Runtime SRE team designs and manages the hardware and software behind the highly available distributed architecture that hosts these OMSs. In the TS platform, any downtime can have significant financial consequences for our clients. Reliability is what we strive for, and that includes observability, so that we can confirm our systems are reliable and, when issues arise, address them before they become major problems. To ensure this, some of our key responsibilities include:
- Managing the orchestration layer that runs thousands of processes on behalf of our clients
- Ensuring sufficient capacity for critical shared resources that include both system-level and application-level resources
- Automating systems to ensure the platform is used fairly, as well as to facilitate triaging when issues arise

You'll need to have:
- 3+ years, or the equivalent, of professional work experience in a software engineering or SRE role
- 3+ years, or the equivalent, of experience in programming and scripting using Python and any shell variant (ksh and bash preferred)
- A solid understanding of object-oriented design, data structures, and algorithms
- Strong Unix or Linux fundamentals (or basic knowledge and a strong desire to learn)
- Ability to troubleshoot and triage production issues with distributed systems
- Excellent communication and collaboration skills for daily interaction with other engineering stakeholders
- The ability to identify opportunities for automation, and to develop and test the resulting solution

We'd love to see experience with:
- Continuous integration and deployment tools such as Jenkins
- Automated testing tools and frameworks
- Configuration management tools (like Chef, Puppet, Ansible, or Salt)
- Containerization and orchestration technologies (like Docker, Kubernetes, Mesos)
- Compiled languages (C, C++, etc.)
- SQL for performing queries at a basic level
- Grafana, Splunk, Humio

We'll trust you to:
- Own, manage, monitor and optimize the reliability and overall health of our development and production environments
- Configure newly allocated clusters and hosts, in addition to streamlining and automating the quality control pipeline
- Monitor current capacity, conduct regular capacity testing, and predict future capacity needs
- Manage the collection and analysis of availability metrics for the management of shared resources (both system and application resources)
- Collaborate on the future design and implementation of our platform, ensuring optimal resource usage while maintaining client isolation