Site Reliability Engineer I - IT
4 years ago
Monitoring & Incident Management:
- Improve the studio’s reliability through monitoring, rapid response, communication and coordination.
- Develop and manage the deployment architecture for the application, develop the monitoring architecture and implement monitoring agents, dashboards, escalations and alerts.
- Routinely identify operational problems by observing and studying system architecture, functionality and performance results; troubleshoot against the overall studio architecture, investigate surfaced issues, and handle incidents.
- Identify operational priorities by assessing operational objectives; determining project objectives such as efficiency, cost savings, energy conservation, operator convenience, safety, and environmental quality; and estimating relevance, time, and costs.
Development & Data Analysis:
- Develop operational solutions by defining, studying, estimating, and screening alternative solutions; calculating economics; determining impact on total system.
- Create new tools to facilitate automated monitoring of the studio’s operational environment.
- Anticipate operational problems by studying operating targets, modes of operation, and unit limitations, and by monitoring unit performance.
- Improve operational quality by studying, evaluating, and recommending process re-architecting; implementing changes; and contributing information and opinions to unit design and modification teams.
- Provide operational management information by collecting, analyzing, and summarizing operating and engineering data and trends.
- Update job knowledge by participating in educational opportunities, reading professional publications, maintaining personal networks, and participating in professional organizations.
- Accomplish the engineering and organization mission by completing related results as needed.
Operations Engineer Skills and Qualifications:
Mastery of Linux systems and network administration
- Strong systems engineering and troubleshooting skills
- Scripting (Bash & PHP)
- Strong TCP/IP understanding and ability to produce detailed documentation
- Write new technical documentation and maintain existing documentation
- Ability to administer networking firewalls, routers, and switches
- S3 maintenance, Apache maintenance, load balancer management
- Puppet Management
Cloud Management
- AWS expertise (VPC, RDS, Route 53 DNS integration)
Database fundamentals
- Administer and maintain MySQL and other open-source databases
- Write and perform basic queries to evaluate database stability, integrity and performance
- Large/Big Data Management
- Administer and maintain Aurora infrastructure
Monitoring Systems
- System Level (Nagios, Munin, Check_MK)
- Writing checks & scripts
- Log/Application Level (Splunk, Elasticsearch, Apache)
- Ability to diagnose infrastructure as a whole!