Main Responsibilities and Required Skills for Site Reliability Engineer

A Site Reliability Engineer (SRE) is a software professional who ensures that a software system is able to handle the expected load. They implement strategies to minimize downtime and improve the overall user experience. In this blog post we describe the primary responsibilities and the most in-demand hard and soft skills for Site Reliability Engineers.

Main Responsibilities of Site Reliability Engineer

The following list describes the typical responsibilities of a Site Reliability Engineer:

Address

Address production issues outside of working hours in on-call capacity.

Adhere to

Adhere to backup / DR requirements and assist with regular testing.

Advocate

Advocate and implement reliable and resilient design patterns.

Assist

  • Assist engineering and PM with definition and design of features.

  • Assist in the creation and refinement of operational documentation.

  • Assist technical team members with their work (e.g., systems testers, test plans).

  • Assist with modifying system management tools for monitoring and alerting across client environments.

Automate

  • Automate common, repeatable tasks at large scale to streamline operational activities and procedures.

  • Automate manual tasks to support exponential growth.

Build

  • Build and maintain software modules for use and re-use in cloud systems automation.

  • Build and maintain tools for re-use in cloud and on-premises systems automation.

  • Build and maintain tools for deployment, monitoring and operations.

  • Build and manage Kubernetes clusters.

  • Build, maintain, and improve automation that manages our infrastructure as code.

  • Build next level relationships with your peers through scheduled companywide team building events.

  • Build services and tools to ensure the stability of SaaS offering.

  • Build software used by people all around the world.

  • Build tools and automation that eliminate repetitive tasks and prevent incident occurrence.

Built

Built and ran critical production services packaged or custom (Java / PHP) on Windows or Linux.

Coach

Coach new joiners to grow a solid DevOps team.

Collaborate with

  • Collaborate with customers and functional experts to understand technical requirements.

  • Collaborate with Engineering teams, influencing and contributing to product design.

  • Collaborate with internal teams such as.

  • Collaborate with Product / Support / Engineering teams to plan and deploy product releases.

  • Collaborate with the DevOps teams to ensure that the CI / CD pipelines are efficient.

Contribute to

  • Contribute / Develop tools for metrics gathering, introspection, monitoring and orchestration.

  • Contribute to end-to-end system architecture, working with back-end engineers.

  • Contribute to system architecture documentation and runbooks.

  • Contribute to the future state of the business through the annual strategic planning process.

Create

  • Create and maintain documentation including workflows, procedures, and troubleshooting.

  • Create and maintain operational runbooks and documentation.

  • Create and update our network standards and ensure that the network is deployed to these standards.

  • Create automation mechanisms which respond to ML outcomes to optimise how the platform scales.

  • Create new automation mechanisms to build the foundations for a sustainable, scalable system.

  • Create new internal tools to enable efficient developer operations.

Debug

Debug and fix build, tool, infrastructure, and process issues.

Define

Define / Assemble Incident Response Processes for the public cloud.

Deploy

  • Deploy and maintain production cloud environments that require 24 / 7 availability.

  • Deploy and scale our platform infrastructure on cloud providers (currently AWS).

Design

  • Design and manage infrastructure-as-code.

  • Design procedures for system troubleshooting, maintenance and logging.

Develop

  • Develop an admin model that leverages automation of server tasks across a high volume of servers.

  • Develop automation solutions for application adoption.

  • Develop automation through scripts for sophisticated test integrations and deployments.

  • Develop, maintain and optimize automated deployment, certification and testing infrastructure.

  • Develop metrics, critical success factors and key indicators to monitor and assess results.

  • Develop processes, tools, automation, and software changes to address operational issues.

  • Develop technical standards.

Diagnose

Diagnose and troubleshoot problems.

Dive

Dive deep to understand every issue that occurs, and own each one completely through end-to-end closure.

Document

  • Document application delivery, support environments, processes, and procedures.

  • Document best practices, guides, systems design, reference architectures and implementations.

  • Document every action so that your findings turn into repeatable actions, and then into automation.

Drive

  • Drive automation initiatives for incident resolution to mitigate recurrent issues.

  • Drive the company's migration from co-located own hardware to AWS.

Enhance

Enhance software rollouts by injecting CI / CD automation in key areas.

Ensure

  • Ensure enterprise changes are tracked and controlled; document standards, best practices and policies.

  • Ensure infrastructure security compliance.

  • Ensure proper backup of systems and test backups periodically.

  • Ensure service reliability and uptime for Cloud services.

  • Ensure SLAs are met, maintaining high availability and performance of enterprise imaging applications.

  • Ensure suitable levels of service personnel and activity during problem resolution at all locations.

Establish

Establish credibility with the quality of your technical execution.

Evaluate

  • Evaluate and benchmark new solutions, establishing capacity and growth plans.

  • Evaluate new tools and technologies through POCs and propose solutions for implementation.

Evangelize

  • Evangelize best practices for building and operating highly secure and reliable systems.

  • Evangelize reliability culture and share expertise to support service owners.

Focus on

Focus on infrastructure automation, testing, and deployments.

Follow

Follow organizational change processes during implementations.

Handle

Handle seamless upgrades of service components.

Help

  • Help drive continual improvements to our engineering standards, practices, and tools.

  • Help drive transformation by continuously looking for ways to automate existing processes.

  • Help lead operational runbook creation and maintenance.

  • Help resolve complex IT issues.

  • Help run and create games which bring joy to millions of players all over the world.

Hold

Hold team members' code to a high standard during reviews to maintain a quality codebase.

Identify

  • Identify and validate estimates of effort to complete engineering work streams.

  • Identify and create efficiencies in operational engineering through automation.

  • Identify and drive opportunities to improve automation for the company.

  • Identify performance bottlenecks and come up with novel ways to solve them.

  • Identify performance issues, measure system performance, and monitor app performance.

  • Identify, receive, triage and act upon events and incidents coming from various SaaS services.

Implement

  • Implement AI / ML based predictive, preventative and self-healing full stack monitoring.

  • Implement centralized logging to enable incident triage and AI / ML.

  • Implement high-impact automation, replacing slow, error-prone manual processes.

  • Implement self-service tooling and enhance the developer experience.

Improve

  • Improve observability of software by implementing the right monitoring, tracing and logging.

  • Improve the reliability, quality, performance and scalability of our suite of software solutions.

Install

  • Install, configure, and upgrade custom and packaged applications.

  • Install software and maintain HA production systems.

Interact with

Interact with functional peers within the immediate organization, as well as clients or vendors.

Interface with

Interface with development labs to facilitate work between their backlogs and our own.

Introduce

Introduce developer productivity, code quality enhancements and process improvements.

Lead

  • Lead a broad aspect of design, test, deployment, and SaaS operations.

  • Lead product bring up within infrastructure, interfacing multiple teams.

Maintain

Maintain regular communication with Engineering and other teams.

Manage

  • Manage, adapt, plan, and support core monitoring / log analytics platform.

  • Manage and implement monitoring tools with automated alerting.

  • Manage and maintain AWS cloud environments to ensure they are secure and conform to best practice.

  • Manage and maintain reliability and availability of the hardware within the cloud infra.

  • Manage build activities until the software is deployed and delivered to users in production.

  • Manage data replication, disaster recovery, geo-replication and mirroring.

  • Manage Development, QA, and Production environment configuration.

  • Manage IAM of production system.

Mentor

Mentor and provide training to other team members within the engineering and operational teams.

Monitor

Monitor, diagnose, and resolve urgent production issues (can be outside of core business hours).

Optimize

Optimize existing systems, build infrastructure, and eliminate work through automation.

Own

  • Own systems administration and the pipeline from software development to production.

  • Own the CI / CD pipeline and foster a DevOps culture.

Participate

  • Participate as a peer member in the SRE kanban and / or the assigned scrum teams.

  • Participate in a 12 / 7 on-call rotation as part of geographically distributed teams.

  • Participate in an on-call schedule.

  • Participate in design and code reviews.

  • Participate in disaster recovery planning and execution.

  • Participate in maintenance and troubleshooting of the operational environments.

  • Participate in on-call support rotation.

  • Participate in shared on-call schedule [follow-the-sun model] managed across SRE & Engineering.

  • Participate in software deployment / release process.

  • Participate in the development of the annual IP lifecycle management and development plans.

Partner with

  • Partner with engineering and operation teams across the organization to produce and roll out fixes.

  • Partner with software and systems engineers across the organization to produce and roll out fixes.

Perform

  • Perform in-depth data analysis to gauge service trends and drive improvements.

  • Perform operations tasks to support internal customer requests.

  • Perform production application support role including occasional off-hours support.

  • Perform root cause analysis and produce reports.

Plan

Plan, migrate and support computer farm transition to Microsoft Azure where applicable.

Prepare

Prepare incident documentation when needed.

Provide

  • Provide escalation support for configuration and platform issues.

  • Provide expertise, direction, coaching and development to build the team capability.

  • Provide mentoring and knowledge to help train other teams and team members.

  • Provide realistic task and cost estimates.

  • Provide support to a global and diverse organization working in different countries.

  • Provide technical mentorship for fellow engineers.

  • Provide technical support to dispensing and inventory operations teams.

Recommend

Recommend new technologies to ensure quality and productivity.

Reduce

Reduce the time it takes to resolve incidents (MTTR).

Respond to

Respond to production incidents and determine how we can prevent them in the future.

Run

Run fire drills, security audits, and disaster recovery tests.

Set

Set hard goals, ask lots of questions and learn every day.

Set up

Set up and configure new applications and features.

Strive

Strive to achieve both personal and team targets.

Support

  • Support deployments of code into multiple lower environments.

  • Support Infrastructure and partner teams on development initiatives.

  • Support other teams to meet their infrastructural and monitoring needs.

  • Support the deployment of cloud solution software during and off regular office hours.

  • Support the SRE team in bridging their tools, understanding, and experience into the product teams.

  • Support unit's goals to adopt automation solutions for applications in scope.

Take

Take ownership and troubleshoot sophisticated systems under pressure.

Track

Track our cloud customer SLAs and be on-call to ensure total conformity to our customer commitments.

Troubleshoot

Troubleshoot and resolve issues in our dev, stage, and production environments.

Understand

Understand Continuous Delivery methodologies and tooling.

Utilize

  • Utilize automation tools to manage infrastructure as code (Terraform, CloudFormation, GitHub).

  • Utilize log forwarding technology to troubleshoot problems and identify trends.

Work

  • Work closely with our partners to understand their requirements and provide technical solutions.

  • Work closely with Release Engineering.

  • Work in a distributed data center infrastructure and cloud platforms.

  • Work with geographically distributed software engineering teams to support the applications.

  • Work with a global team spread across tech hubs in multiple geographies and time zones.

  • Work with other operational teams on defining and improving SLAs, processes, tools and procedures.

  • Work with other SREs to create deployment and rollback processes.

  • Work with Tier 2 and Tier 3 support as required.

Write

  • Write Ansible automation, perform upgrades and patches.

  • Write clear, concise Test Cases.

  • Write code (Terraform, Python, Bash, Ansible, Node.js, etc.).

  • Write scripts, monitors, and self-healing / auto-remediation tools, and automate processes.

Most In-demand Hard Skills

The following list describes the most required technical skills of a Site Reliability Engineer:

  1. Python

  2. AWS

  3. Kubernetes

  4. Java

  5. Docker

  6. Ansible

  7. Terraform

  8. Go

  9. Ruby

  10. Azure

  11. Linux

  12. Chef

  13. Bash

  14. Jenkins

  15. Networking

  16. GCP

  17. Puppet

  18. DevOps

  19. Automation

  20. MySQL

  21. Git

  22. Perl

  23. JavaScript

  24. Prometheus

  25. Containers

  26. Splunk

  27. Troubleshooting

  28. Grafana

  29. Software Engineering

  30. PowerShell

Most In-demand Soft Skills

The following list describes the most required soft skills of a Site Reliability Engineer:

  1. Written and oral communication skills

  2. Problem-solving attitude

  3. Analytical ability

  4. Interpersonal skills

  5. Sense of ownership

  6. Troubleshooting skills

  7. Attention to detail

  8. Collaborative

  9. Work independently with little direction

  10. Drive out root causes of complex technical problems

  11. Team player

  12. Self-motivated

  13. Curious

  14. Prioritize needs

  15. Leadership

  16. Organizational capacity

  17. Creative

  18. Sense of urgency

  19. Reliable

  20. Self-starter
