Head of SRE

4 years ago

About the Team

Tens of millions of customers play our games each day, which turns into quite the infrastructure challenge. We receive billions of requests across API calls and RPCs, and we serve billions of database queries as well. This means we process many terabytes of logs and other data. All the while our growth continues, so these numbers aren’t stagnant and the challenges keep mounting. As we continue to grow, scale and reliability are ever more paramount to the success of the company and the support of our gamers.

The SRE team is one of the newer parts of our Infrastructure team. We have a core team in Sao Paulo, but as we look to grow and scale we want to build out the team, in terms of both leadership and geography.

About the Role

Our Irish office is looking for someone who wants the autonomy to make game-changing improvements to our infrastructure, who can operate at very high scale, and who is dedicated to the reliability of our end-to-end infrastructure. You will roll up your sleeves and dive into the issues that affect reliability, whether at the systems, software, process, or operational level. You will be relentless in finding the best ways to automate, freeing up time for our SREs and software engineers to focus on the next big challenges. You will be equally relentless in working out how best to build playbooks and seamless processes for handling all sorts of challenges (cost and efficiency, migrations, integrations and rollouts, outages, etc.).

As this is a management position for a team that doesn't exist yet, you will have full autonomy to build it. First and foremost, you will source, attract, and staff the local team, and help onboard and integrate its members with the Sao Paulo SRE team and the broader engineering organisation. You will need to be an experienced manager who has built out a team before and knows how to staff one effectively with limited resources and creative planning. Second, you will need to design and implement the model for SRE engagement, working with engineering leadership on assimilation and support. This will then open up to developing a set of shared processes and playbooks through which not only the SRE org but the whole engineering org thinks about and designs for scale, reliability, and performance.

More about you

You have solid experience with Postgres, Redis, Cassandra and MongoDB databases on-prem or in the cloud
You have a solid understanding of systems and application design, including the operational trade-offs of various designs
You have advanced knowledge of scripting and programming languages
You have an expert understanding of Linux systems, services, optimization, storage subsystems, and file systems
You have solid experience with cluster management systems (Kubernetes, Mesos) and configuration management software, like Salt
You know how network services (DNS, TLS/SSL, HTTP) and network fundamentals (DHCP, subnetting, routing, firewalls, IPv6, BGP) work
You have strong experience designing and managing multi-tenant database solutions (PostgreSQL)
You are confident in your knowledge of load balancers (Nginx, HAProxy)
You have proven ability to collaborate and effect change with multiple stakeholders and organizations
You are comfortable making tradeoffs and knowing what to prioritize
You can navigate constraints and like finding creative solutions in ambiguous environments
You have excellent written and social communication and documentation skills

What you'll do

Build and maintain a good relationship and coordination with our SRE team in Brazil
Understand our whole, highly distributed stack
Work closely with engineering teams to design, build, and maintain systems, helping them with database use, schema design, and query tuning
Manage and develop large, cutting-edge clusters, using and implementing innovative technologies
Troubleshoot issues and look at our systems with an eye toward scalable and efficient architectures
Create a shared understanding within SRE and engineering on how to approach reliability (from monitoring to troubleshooting to redesign) and performance optimization
Design and serve in 24x7 on-call services for our major systems
Represent SRE to engineering leadership and push for ever-increasing automation, coaching and influencing other managers
Hire at least 6 other engineers
Guide, mentor, and train team members, helping to enhance our infrastructure and grow our team

What you'll need

You have at least 5 years of experience managing a team
You have a minimum of 5 years of experience handling services in a large-scale environment
Bachelor's or Master's degree in a technical field such as Computer Science or Information Technology Engineering, or equivalent work experience
Fluent English is a requirement
Brazilian Portuguese is a plus
Interest in gaming is also a plus

We welcome people from all backgrounds who seek the opportunity to help build the best gaming company, where everyone thrives
