Site Reliability Engineer-Tools (Logging)
4 years ago
PlayStation isn’t just the Best Place to Play; it’s also the Best Place to Work. We’ve thrilled gamers since 1994, when we launched the original PlayStation. Today, we’re recognized as a global leader in interactive and digital entertainment. The PlayStation brand falls under Sony Interactive Entertainment, a wholly owned subsidiary of Sony Corporation.
It is an exciting time to be part of SIE’s Site Reliability Engineering (SRE) team. SREs operate right at the intersection of Software Engineering and Infrastructure Engineering. The SRE team strives to make the Platform a highly reliable, scalable, operable and secure product and service.
The Site Reliability Tools team within SIE’s Platform Hosting Engineering organization provides critical services, used across all platform teams, that give visibility into the performance and availability of PlayStation Network services to our players, partners, and other customers. SREs on the Site Reliability Tools team work closely with developers, operations teams, and leadership to ensure we have the right set of tools to generate, collect, analyze, visualize, and alert on operational data. That way we know exactly what is happening across the PlayStation ecosystem, can see problems before they occur, and can address them as quickly as possible.
Responsibilities
- Build, deploy, and operate a combination of open-source, custom-written, and vendor-provided software that delivers log aggregation and analysis capabilities across the PlayStation Network platform
- Collaborate with multiple software engineering teams to integrate logging tools across the entire infrastructure
- Build automation to provide self-service capabilities for onboarding new services into the logging pipeline
- Participate in an on-call rotation to ensure 24/7/365 availability of the tools and services delivered by the team
Key Qualifications
- Equally adept at software development and systems engineering/operations
- Expert-level experience building, deploying, and operating services at scale
- Hands-on experience working with distributed systems and the “-ilities” (availability, reliability, scalability, etc.) of their services
- Excellent troubleshooting skills that span code, system, and network (TCP/IP). Ability to zoom in from code to a JVM garbage collection problem to packet loss in the network
- Ability to design and provide operational and infrastructural requirements that promote uptime, speed, and security at all phases of the software lifecycle on a global scale
Required skills
- Fluency in running high-performance distributed services at scale
- Demonstrated experience following software engineering best practices
- In-depth understanding of Unix/Linux systems internals and networking
- Experience with automation and configuration management tools
- Experience with public cloud services and deployment (AWS preferred)
- Experience deploying and supporting a logging pipeline in a large enterprise environment
- Experience with one or more of these logging technologies: Splunk, Logstash, or syslog-ng
- Experience with AWS CloudWatch
- Strong software development experience in one of these languages: Go, Perl, Python or Java
- Knowledge of the software development lifecycle with experience integrating Open Source tools
- Familiarity with backend data collection tools such as Kafka and AWS Kinesis
- Preferred: experience capturing logs from containerized environments (Kubernetes or AWS EKS)
- Strong ability to troubleshoot complex issues ranging from system resources to application stack traces
- Experienced user of one or more source code management tools
- Strong hands-on experience building and maintaining infrastructure for microservices
- Experience with Continuous Integration and Continuous Delivery/Deployment tools like Jenkins, Bamboo, or similar
- Experience developing tools for system configuration, deployment, and monitoring
- Strong belief in driving operational excellence, with efficiency and automation at the core of operations
- OBSESSIVE desire to automate and improve everything, including processes and the standardization of tools and technologies!
Required Soft Skills
- Methodical and systematic problem-solving approach
- Complete ownership of end-to-end solutions and management of their lifecycle
- Execution-oriented and results-driven
- Customer and peer relationship focused with strong interpersonal and communication skills
- Ability to thrive in a fast-paced team environment
- Ability to learn new skills/technologies quickly and independently
Experience
- BS in Computer Science, Software Engineering, or equivalent experience
- 7+ years of professional experience at scale
- 3+ years of experience operating monitoring technologies at scale (Splunk preferred)
Sony is an Equal Opportunity Employer. All persons will receive consideration for employment without regard to race, color, religion, gender, pregnancy, national origin, ancestry, citizenship, age, legally protected physical or mental disability, covered veteran status, status in the U.S. uniformed services, sexual orientation, marital status, genetic information or membership in any other legally protected category.
We strive to create an inclusive environment, empower employees and embrace diversity. We encourage everyone to respond.
We sincerely appreciate the time and effort you spent in contacting us and we thank you for your interest in PlayStation.