Site Reliability Engineer (SRE)

1 year ago

At AccelByte, our mission is to empower game creators by providing them with the backend platform and tools required to make scalable, reliable AAA-quality games. The company was founded in 2016 by industry veterans who have engineered online systems for some of the largest game and distribution platforms in the world including Fortnite, Epic Store, Xbox Live, PlayStation Network, and EA Origin. We are backed by top investors including Softbank, Sony Interactive Entertainment, Galaxy Interactive, NetEase, and Krafton. Our latest Series B funding has firmly solidified our place as a top player in the gaming industry. AccelByte’s talent has decades of experience building and shipping some of the largest game and distribution platforms in the world.

We believe that the best companies empower employees to make decisions, obsess about the best user experience, and are not afraid to make and learn from their mistakes. Our culture is based on humility, openness to feedback, drive, and collaboration, which we feel results in the best performing teams. As a company that values diversity, inclusion, and employee growth, our employees have opportunities to work with and learn from teams all over the world. We offer competitive salaries, a full range of health benefits, social activities, career growth opportunities, and an amazing team. Come join us!

Position Summary

AccelByte is building a 24x7 operations team for AAA multiplayer video games. For this position, we need a driven Site Reliability Engineer who can actively participate in the day-to-day combat of maintaining high reliability of our services and drive the prioritization of fixes for what is broken today. The engineer should also be able to envision, design, and implement processes and technologies that improve our ability to identify, isolate, correlate, and mitigate service-impacting problems in the system. The Site Reliability Engineer must also know enough coding to automate routine tasks in gathering, correlating, organizing, and presenting service metrics, in addition to performing detailed, in-depth root cause analysis.

Essential Functions/Responsibilities

Design, implement, and maintain infrastructure for applications
Architect, implement, and maintain a highly scalable deployment framework that improves the stability, reliability, and availability of our products.
Build and run service deployment using K8s and other CNCF projects
Provide a secure, highly scalable, and cost-effective cloud platform
Build effective systems to monitor the health of our systems and applications and to handle outages
Solve problems occurring in all our environments and create solutions to prevent them from happening again
Produce automation and innovative tools to assist the product development teams and to deliver operational excellence
Create and maintain infrastructure-related documentation and SRE runbooks
Collaborate with other stakeholders to provide cost-effective, operationally excellent, and performance-efficient infrastructure solutions to improve our products.
Identify technology, process gaps, and opportunities for improvement
Liaise, communicate, and work directly with our clients
Perform any other design-related duties as required

Qualifications/Experience Required

3+ years of Linux administration
Degree in Computer Science or equivalent experience
Prior experience helping design, manage, and run large-scale applications in the cloud
Experience with monitoring systems and strategies (System Admin)
Solid performance and troubleshooting skills
A solid foundation in distributed systems
Robust knowledge of and experience with at least one cloud provider (AWS or GCP preferred)
Experience with containerization principles and frameworks such as Docker, Container, Kubernetes, etc
Proven track record with infrastructure as code (Terraform is a must), configuration management, and package managers (e.g., Helm charts)
Proven experience with automation, CI/CD, and GitOps tools such as Jenkins, GitLab, GitHub, Flux, and/or ArgoCD
Experience with monitoring and alerting tools such as Prometheus, Grafana, ELK/EFK, Splunk, Datadog, OpsGenie, PagerDuty, etc
Experience within a greenfield environment, building infrastructure from scratch
Software development and scripting experience with Bash, Python, and/or Golang
Ability to work with clients on tight deadlines and fluid requirements
Good communication skills (escalation, explaining the incident)
Fluent in English both spoken and written
Flexibility in working with people in different time zones

Qualifications/Experience Preferred

Contributions to open-source projects and participation in technical communities
Experience working for or with AAA game studios
JVM tuning and troubleshooting
Experience with web services
Experience in Networking, Security, or Storage
Experience managing SQL and NoSQL databases
Familiar with Perforce version control

AccelByte Inc is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, religion, gender, national origin, sexual orientation, marital status, age, or disability. Our culture is innovative and inclusive, and we value our people above all.

Please visit our careers page for a complete listing of our open positions: https://accelbyte.io/careers
