Lead Site Reliability Engineer

Gearbox
Frisco/Remote Texas
1 year ago

The Gearbox Entertainment Company is an award-winning creator and distributor of entertainment for people around the world. Gearbox Entertainment develops and publishes products through its subsidiaries, Gearbox Software and Gearbox Publishing. Gearbox Entertainment has become widely known for successful game franchises including Brothers in Arms and Borderlands, as well as acquired properties Duke Nukem and Homeworld. Gearbox’s ambition is to entertain the world and its key driving objectives include the pursuit of happiness for our talent, partners and customers, the prioritization of entertainment and creativity and a measured respect for profitability. For more information, visit www.Gearbox.com.

To further drive our vision of premier stability and rapid feature delivery, we are looking for a Lead Site Reliability Engineer (SRE) to join our team. As a Lead SRE, you should feel exceptionally comfortable bringing architectural design proposals to the table for consideration among your colleagues on our platform and infrastructure development teams. You will be one of the principal technical designers helping push our cloud-native platform toward the future.

You will be responsible for driving the implementation of flexible cloud architectures with an automation-first emphasis; manual user intervention likely makes you uneasy and maybe even a little twitchy. We would expect a successful candidate for this position to be a self-starter with the ability to complete tasks independently. Though you will have access to technical leadership and senior engineers, you should be comfortable tackling complex problems without significant oversight.

Observability is paramount. If we can't measure it, we can't prove it works; if we can't prove it works, it must be assumed it doesn't work. This is a philosophy you hopefully love (and preferably obsess over). If we can't observe how a new feature is behaving, our SRE team is excited to dive into the application code and make the necessary improvements.

Typical Day

TL;DR: You will be leading and managing a team of SREs, driving the ownership of observability libraries and the implementation of flexible AWS cloud architectures with an automation-first emphasis, collaborating with other teams, and working on solutions to technical challenges in microservice availability for our online services.

This is a people management role with hands-on lead engineering expectations. Your days will primarily be filled with leading a team of seasoned engineers, empowering them to build solutions to technical challenges in the observability and availability of our SHiFT online services. You will evangelize for and be obsessed with user experience as it relates to the services you support. You will help manage and orchestrate each of these services by leaning heavily on technologies like Go, Terraform, Docker, and Bash. On any given day, you should expect to spend at least 25% of your time actively engineering and developing solutions; the remaining time should be a mixture of work planning, team mentoring and pair programming, reviewing code from engineers on your team, participating in design meetings, documentation, and self-development.

This position will eventually require you to carry a company-paid mobile device and participate in 24/7 on-call rotations alongside your engineering colleagues. Don't worry though, our on-call experience doesn't suck.

Core Responsibilities:

  • Lead and manage the day-to-day operations of a team of 3-5 SREs, including road-mapping, task assignments, and performance evaluations.
  • Mentor and train your team in observability best practices and foster a culture of continuous learning and improvement.
  • Lead incident response efforts and troubleshoot critical issues to minimize downtime and maintain high availability of systems.
  • Design and implement solutions for monitoring, alerting, and incident response to proactively identify and resolve issues.
  • Be a trusted voice in the evangelism of reliability engineering throughout the team with an eagerness for mentoring.
  • Work with technical leadership to help define and oversee short- and mid-term project roadmaps.
  • Participate in after-hours on-call support rotations.

Must Have (the non-negotiable parts):

  • Experience leading and managing teams in a Site Reliability Engineering or related role.
  • Minimum of 4 years professional software development experience instrumenting complex observability stacks, preferably in Go.
  • Minimum of 2 years experience with containers in a professional setting, preferably Docker.
  • Strong understanding of microservices architecture and its associated challenges.
  • Proficiency in AWS container management, orchestration, and observability features (ECS, Fargate, Aurora, AppConfig, CloudWatch, etc.)
  • Professional experience in Terraform and/or CloudFormation
  • Adept understanding of observability stack management (OTel, tracing, monitoring, alerting, structured logging, APM, etc.)
  • Strong leadership and communication skills, able to lead and mentor other engineers, clearly detail designs and implementations, and effectively communicate with cross-functional teams.
  • Demonstrated experience in driving and leading incident response, incident management, and post-incident review processes.

Should Have (some wiggle room):

  • Extensive hands-on experience with OpenTelemetry
  • Hands-on experience developing and maintaining CI/CD pipelines, preferably in Git/GitLab
  • Understanding of RESTful and WebSocket-based APIs
  • Bachelor's degree in computer science, related field, or equivalent training and professional experience

Now you're just showing off:

  • Familiarity with Datadog
  • Familiarity with Atlassian products (Opsgenie, Jira, Confluence)
  • Experience working with developers in an agile environment
  • Experience in the games industry, preferably launching multiple online-enabled AAA titles
  • Knowledge about Gearbox-owned IPs

Gearbox Entertainment believes that all team members should be able to enjoy a work environment free from all forms of discrimination and harassment. We are committed to reflecting the diversity of the world we strive to entertain. As an Equal Opportunity Employer, we provide fair and equal treatment to all team members and applicants. We do not discriminate on the basis of race, color, religion, sex, sexual orientation, gender identity or expression, national origin, disability, genetic information, pregnancy or maternity, veteran status, or any other status protected by applicable national, federal, state or local law.
