Main Responsibilities and Required Skills for Site Reliability Engineer

A Site Reliability Engineer (SRE) is a software professional who ensures that a software system is able to handle the expected load. They implement strategies to minimize downtime and improve the overall user experience. In this blog post we describe the primary responsibilities and the most in-demand hard and soft skills for Site Reliability Engineers.


Main Responsibilities of Site Reliability Engineer

The following list describes the typical responsibilities of a Site Reliability Engineer:

Address

Address production issues outside of working hours in on-call capacity.

Adhere to

Adhere to backup / DR requirements and assist with regular testing.

Advocate

Advocate and implement reliable and resilient design patterns.

Assist

Assist engineering and PM with definition and design of features.
Assist in the creation and refinement of operational documentation.
Assist technical team members with their work (e.g., systems testers, test plans).
Assist with modifying system management tools for monitoring and alerting across client environments.

Automate

Automate common, repeatable tasks at large scale to streamline operational activities and procedures.
Automate manual tasks to support exponential growth.

Build

Build and maintain software modules for use and re-use in cloud systems automation.
Build and maintain tools for re-use in cloud and on-premises systems automation.
Build and maintain tools for deployment, monitoring and operations.
Build and manage Kubernetes clusters.
Build, maintain, and improve automation that manages our infrastructure as code.
Build next-level relationships with your peers through scheduled company-wide team-building events.
Build services and tools to ensure the stability of the SaaS offering.
Build software used by people all around the world.
Build tools and automation that eliminate repetitive tasks and prevent incident occurrence.
Build and run critical production services, packaged or custom (Java / PHP), on Windows or Linux.

Coach

Coach new joiners to grow a solid DevOps team.

Collaborate with

Collaborate with customers and functional experts to understand technical requirements.
Collaborate with Engineering teams, influencing and contributing to product design.
Collaborate with internal teams.
Collaborate with Product / Support / Engineering teams to plan and deploy product releases.
Collaborate with the DevOps teams to ensure that the CI / CD pipelines are efficient.

Contribute to

Contribute to and develop tools for metrics gathering, introspection, monitoring and orchestration.
Contribute to end-to-end system architecture, working with back-end engineers.
Contribute to system architecture documentation and runbooks.
Contribute to the future state of the business through the annual strategic planning process.

Create

Create and maintain documentation including workflows, procedures, and troubleshooting.
Create and maintain operational runbooks and documentation.
Create and update our network standards and ensure that the network is deployed to these standards.
Create automation mechanisms which respond to ML outcomes to optimise how the platform scales.
Create new automation mechanisms to build the foundations for a sustainable, scalable system.
Create new internal tools to enable efficient developer operations.

Debug

Debug and fix build, tool, infrastructure, and process issues.

Define

Define and assemble incident response processes for the public cloud.

Deploy

Deploy and maintain production cloud environments that require 24 / 7 availability.
Deploy and scale our platform infrastructure on cloud providers (currently AWS).

Design

Design and manage infrastructure-as-code.
Design procedures for system troubleshooting, maintenance and logging.

Develop

Develop an admin model that leverages automation of server tasks across a high volume of servers.
Develop automation solutions for application adoption.
Develop automation through scripts for sophisticated test integrations and deployments.
Develop, maintain and optimize automated deployment, certification and testing infrastructure.
Develop metrics, critical success factors and key indicators to monitor and assess results.
Develop processes, tools, automation, and software changes to address operational issues.
Develop technical standards.

Diagnose

Diagnose and troubleshoot problems.

Dive

Dive deep to understand every issue that occurs, and own each one completely through end-to-end closure.

Document

Document application delivery, support environments, processes, and procedures.
Document best practices, guides, systems design, reference architectures and implementations.
Document every action so that your findings turn into repeatable actions, and then into automation.

Drive

Drive resolution automation initiatives to mitigate recurring issues.
Drive the company's migration from its own co-located hardware to AWS.

Enhance

Enhance software rollouts by injecting CI / CD automation in key areas.

Ensure

Ensure enterprise changes are tracked and controlled; document standards, best practices and policies.
Ensure infrastructure security compliance.
Ensure proper backup of systems and test backups periodically.
Ensure service reliability and uptime for Cloud services.
Ensure SLAs are met, maintaining high availability and performance of enterprise imaging applications.
Ensure suitable levels of service personnel and activity during problem resolution at all locations.

Establish

Establish credibility with the quality of your technical execution.

Evaluate

Evaluate and benchmark new solutions, establishing capacity and growth plans.
Evaluate new tools and technologies through POCs and propose solutions for implementation.

Evangelize

Evangelize best practices for building and operating highly secure and reliable systems.
Evangelize reliability culture and share expertise to support service owners.

Focus on

Focus on infrastructure automation, testing, and deployments.

Follow

Follow organizational change processes during implementations.

Handle

Handle seamless upgrades of service components.

Help

Help drive continual improvements to our engineering standards, practices, and tools.
Help drive transformation by continuously looking for ways to automate existing processes.
Help lead operational runbook creation and maintenance.
Help resolve complex IT issues.
Help run and create games which bring joy to millions of players all over the world.

Hold

Hold team members' code to a high standard during reviews to maintain a quality codebase.

Identify

Identify and validate estimates of effort to complete engineering work streams.
Identify and create efficiencies in operational engineering through automation.
Identify and drive opportunities to improve automation for the company.
Identify performance bottlenecks and come up with novel ways to solve them.
Identify performance issues, measure system performance, and monitor application performance.
Identify, receive, triage and act upon events and incidents coming from various SaaS services.

Implement

Implement AI / ML based predictive, preventative and self-healing full stack monitoring.
Implement centralized logging to enable incident triage and AI / ML.
Implement high-impact automation, replacing slow, error-prone manual processes.
Implement self-service tooling and enhance the developer experience.

Improve

Improve observability of software by implementing the right monitoring, tracing and logging.
Improve the reliability, quality, performance and scalability of our suite of software solutions.

Install

Install, configure, and upgrade custom and packaged applications.
Install software and maintain HA production systems.

Interact with

Interact with functional peers within the immediate organization, as well as clients or vendors.

Interface with

Interface with development labs to coordinate work between their backlogs and our own.

Introduce

Introduce developer productivity and code quality enhancements, along with process improvements.

Lead

Lead a broad aspect of design, test, deployment, and SaaS operations.
Lead product bring-up within the infrastructure, interfacing with multiple teams.

Maintain

Maintain regular communication with Engineering and other teams.

Manage

Manage, adapt, plan, and support core monitoring / log analytics platform.
Manage and implement monitoring tools with automated alerting.
Manage and maintain AWS cloud environments to ensure they are secure and conform to best practice.
Manage and maintain the reliability and availability of the hardware within the cloud infrastructure.
Manage build activities until the software is deployed and delivered to users in production.
Manage data replication, disaster recovery, geo-replication and mirroring.
Manage Development, QA, and Production environment configuration.
Manage IAM for the production system.

Mentor

Mentor and provide training to other team members within the engineering and operational teams.

Monitor

Monitor, diagnose, and resolve urgent production issues (can be outside of core business hours).

Optimize

Optimize existing systems, build infrastructure, and eliminate work through automation.

Own

Own systems administration and the pipeline from software development to production.
Own the CI / CD pipeline and foster a DevOps culture.

Participate

Participate as a peer member in the SRE kanban and / or the assigned scrum teams.
Participate in a 12 / 7 on-call rotation as part of geographically distributed teams.
Participate in an on-call schedule.
Participate in design and code reviews.
Participate in disaster recovery planning and execution.
Participate in maintenance and troubleshooting of the operational environments.
Participate in an on-call support rotation.
Participate in a shared on-call schedule (follow-the-sun model) managed across SRE & Engineering.
Participate in the software deployment / release process.
Participate in the development of the annual IP lifecycle management and development plans.

Partner with

Partner with engineering and operation teams across the organization to produce and roll out fixes.
Partner with software and systems engineers across the organization to produce and roll out fixes.

Perform

Perform in-depth data analysis to gauge service trends and drive improvements.
Perform operations tasks to support internal customer requests.
Perform a production application support role, including occasional off-hours support.
Perform root cause analysis and produce reports.

Plan

Plan, migrate and support the computer farm transition to Microsoft Azure where applicable.

Prepare

Prepare incident documentation when needed.

Provide

Provide escalation support for configuration and platform issues.
Provide expertise, direction, coaching and development to build the team capability.
Provide mentoring and knowledge to help train other teams and team members.
Provide realistic task and cost estimates.
Provide support to a global and diverse organization working in different countries.
Provide technical mentorship for fellow engineers.
Provide technical support to dispensing and inventory operations teams.

Recommend

Recommend new technologies to ensure quality and productivity.

Reduce

Reduce the time it takes to resolve incidents (MTTR).

Respond to

Respond to production incidents and determine how we can prevent them in the future.

Run

Run fire drills, security audits, and disaster recovery tests.

Set

Set hard goals, ask lots of questions and learn every day.

Set up

Set up and configure new applications and features.

Strive

Strive to achieve both personal and team targets.

Support

Support deployments of code into multiple lower environments.
Support Infrastructure and partner teams on development initiatives.
Support other teams to meet their infrastructural and monitoring needs.
Support the deployment of cloud solution software during and outside regular office hours.
Support the SRE team in bridging their tools, understanding, and experience into the product teams.
Support the unit's goals to adopt automation solutions for applications in scope.

Take

Take ownership and troubleshoot sophisticated systems under pressure.

Track

Track our cloud customer SLAs and be on-call to ensure total conformity to our customer commitments.

Troubleshoot

Troubleshoot and resolve issues in our dev, stage, and production environments.

Understand

Understand Continuous Delivery methodologies and tooling.

Utilize

Utilize automation tools to manage infrastructure as code (Terraform, CloudFormation, GitHub).
Utilize log forwarding technology to troubleshoot problems and identify trends.

Work

Work closely with our partners to understand their requirements and provide technical solutions.
Work closely with Release Engineering.
Work in a distributed data center infrastructure and cloud platforms.
Work with geographically distributed software engineering teams to support the applications.
Work with a global team spread across tech hubs in multiple geographies and time zones.
Work with other operational teams on defining and improving SLAs, processes, tools and procedures.
Work with other SREs to create deployment and rollback processes.
Work with Tier 2 and Tier 3 support as required.

Write

Write Ansible automation, perform upgrades and patches.
Write clear, concise test cases.
Write code (Terraform, Python, Bash, Ansible, Node.js, etc.).
Write scripts, monitors, and self-healing / auto-remediation tools, and automate processes.

Most In-demand Hard Skills

The following list describes the most required technical skills of a Site Reliability Engineer:

Python
AWS
Kubernetes
Java
Docker
Ansible
Terraform
Go
Ruby
Azure
Linux
Chef
Bash
Jenkins
Networking
GCP
Puppet
DevOps
Automation
MySQL
Git
Perl
JavaScript
Prometheus
Containers
Splunk
Troubleshooting
Grafana
Software Engineering
PowerShell

Most In-demand Soft Skills

The following list describes the most required soft skills of a Site Reliability Engineer:

Written and oral communication skills
Problem-solving attitude
Analytical ability
Interpersonal skills
Sense of ownership
Troubleshooting skills
Attention to detail
Collaborative
Work independently with little direction
Drive out root causes of complex technical problems
Team player
Self-motivated
Curious
Prioritize needs
Leadership
Organizational capacity
Creative
Sense of urgency
Reliable
Self-starter