Main Responsibilities and Required Skills for Site Reliability Engineer
A Site Reliability Engineer (SRE) is a software professional who ensures that a software system is able to handle the expected load. They implement strategies to minimize downtime and improve the overall user experience. In this blog post we describe the primary responsibilities and the most in-demand hard and soft skills for Site Reliability Engineers.
Main Responsibilities of Site Reliability Engineer
The following list describes the typical responsibilities of a Site Reliability Engineer:
Address
Address production issues outside of working hours in on-call capacity.
Adhere to
Adhere to backup / DR requirements and assist with regular testing.
Advocate
Advocate and implement reliable and resilient design patterns.
Assist
Assist engineering and PM with definition and design of features.
Assist in the creation and refinement of operational documentation.
Assist technical team members with their work (e.g., systems testers, test plans).
Assist with modifying system management tools for monitoring and alerting across client environments.
Automate
Automate common, repeatable tasks at large scale to streamline operational activities and procedures.
Automate manual tasks to support exponential growth.
Build
Build and maintain software modules for use and re-use in cloud systems automation.
Build and maintain tools for re-use in cloud and on-premises systems automation.
Build and maintain tools for deployment, monitoring and operations.
Build and manage Kubernetes clusters.
Build, maintain, and improve automation that manages our infrastructure as code.
Build next level relationships with your peers through scheduled companywide team building events.
Build services and tools to ensure the stability of SaaS offering.
Build software used by people all around the world.
Build tools and automation that eliminate repetitive tasks and prevent incident occurrence.
Build and run critical production services, packaged or custom (Java / PHP), on Windows or Linux.
Coach
Coach new joiners to grow a solid DevOps team.
Collaborate with
Collaborate with customers and functional experts to understand technical requirements.
Collaborate with Engineering teams, influencing and contributing to product design.
Collaborate with internal teams.
Collaborate with Product / Support / Engineering teams to plan and deploy product releases.
Collaborate with the DevOps teams to ensure that the CI / CD pipelines are efficient.
Contribute to
Contribute to or develop tools for metrics gathering, introspection, monitoring and orchestration.
Contribute to end-to-end system architecture, working with back-end engineers.
Contribute to system architecture documentation and runbooks.
Contribute to the future state of the business through the annual strategic planning process.
Create
Create and maintain documentation including workflows, procedures, and troubleshooting.
Create and maintain operational runbooks and documentation.
Create and update our network standards and ensure that the network is deployed to these standards.
Create automation mechanisms which respond to ML outcomes to optimise how the platform scales.
Create new automation mechanisms to build the foundations for a sustainable, scalable system.
Create new internal tools to enable efficient developer operations.
Debug
Debug and fix build, tool, infrastructure, and process issues.
Define
Define / Assemble Incident Response Processes for the public cloud.
Deploy
Deploy and maintain production cloud environments that require 24 / 7 availability.
Deploy and scale our platform infrastructure on cloud providers (currently AWS).
Design
Design and manage infrastructure-as-code.
Design procedures for system troubleshooting, maintenance and logging.
Develop
Develop an admin model to leverage automation of server tasks across a high volume of servers.
Develop automation solutions for application adoption.
Develop automation through scripts for sophisticated test integrations and deployments.
Develop, maintain and optimize automated deployment, certification and testing infrastructure.
Develop metrics, critical success factors and key indicators to monitor and assess results.
Develop processes, tools, automation, and software changes to address operational issues.
Develop technical standards.
Diagnose
Diagnose and troubleshoot problems.
Dive
Dive deep to understand every issue that occurs and own each one completely through end-to-end closure.
Document
Document application delivery, support environments, processes, and procedures.
Document best practices, guides, systems design, reference architectures and implementations.
Document every action so that your findings turn into repeatable actions, and then into automation.
Drive
Drive resolution automation initiatives to mitigate recurring issues.
Drive the company's migration from its own co-located hardware to AWS.
Enhance
Enhance software rollouts by injecting CI / CD automation in key areas.
Ensure
Ensure enterprise changes are tracked and controlled; document standards, best practices and policies.
Ensure infrastructure security compliance.
Ensure proper backup of systems and test backups periodically.
Ensure service reliability and uptime for Cloud services.
Ensure SLAs are met, maintaining high availability and performance of enterprise imaging applications.
Ensure suitable levels of service personnel and activity during problem resolution at all locations.
Establish
Establish credibility with the quality of your technical execution.
Evaluate
Evaluate and benchmark new solutions, establishing capacity and growth plans.
Evaluate new tools and technologies through POCs and propose solutions for implementation.
Evangelize
Evangelize best practices for building and operating highly secure and reliable systems.
Evangelize reliability culture and share expertise to support service owners.
Focus on
Focus on infrastructure automation, testing, and deployments.
Follow
Follow organizational change processes during implementations.
Handle
Handle seamless upgrades of service components.
Help
Help drive continual improvements to our engineering standards, practices, and tools.
Help drive transformation by continuously looking for ways to automate existing processes.
Help lead operational runbook creation and maintenance.
Help resolve complex IT issues.
Help run and create games which bring joy to millions of players all over the world.
Hold
Hold team members' code to a high standard during reviews to maintain a quality codebase.
Identify
Identify and validate estimates of effort to complete engineering work streams.
Identify and create efficiencies in operational engineering through automation.
Identify and drive opportunities to improve automation for the company.
Identify performance bottlenecks and come up with novel ways to solve them.
Identify performance issues, measure system performance, and monitor app performance.
Identify, receive, triage and act upon events and incidents coming from various SaaS services.
Implement
Implement AI / ML based predictive, preventative and self-healing full stack monitoring.
Implement centralized logging to enable incident triage and AI / ML.
Implement high-impact automation, replacing slow, error-prone manual processes.
Implement self-service tooling and enhance the developer experience.
Improve
Improve observability of software by implementing the right monitoring, tracing and logging.
Improve reliability, quality, performance and scalability of our suite of software solutions.
Install
Install, configure, and upgrade custom and packaged applications.
Install software and maintain HA production systems.
Interact with
Interact with functional peers within the immediate organization, as well as clients or vendors.
Interface with
Interface with development labs to facilitate work between their backlogs and our own.
Introduce
Introduce developer productivity, code quality enhancements and process improvements.
Lead
Lead a broad aspect of design, test, deployment, and SaaS operations.
Lead product bring-up within infrastructure, interfacing with multiple teams.
Maintain
Maintain regular communication with Engineering and other teams.
Manage
Manage, adapt, plan, and support core monitoring / log analytics platform.
Manage and implement monitoring tools with automated alerting.
Manage and maintain AWS cloud environments to ensure they are secure and conform to best practice.
Manage and maintain reliability and availability of the hardware within the cloud infrastructure.
Manage build activities until the software is deployed and delivered to users in production.
Manage data replication, disaster recovery, geo-replication and mirroring.
Manage Development, QA, and Production environment configuration.
Manage IAM of production system.
Mentor
Mentor and provide training to other team members within the engineering and operational teams.
Monitor
Monitor, diagnose, and resolve urgent production issues (can be outside of core business hours).
Optimize
Optimize existing systems, build infrastructure, and eliminate work through automation.
Own
Own systems administration and the pipeline from software development to production.
Own the CI / CD pipeline and foster a DevOps culture.
Participate
Participate as a peer member in the SRE kanban and / or the assigned scrum teams.
Participate in a 12 / 7 on-call rotation as part of geographically distributed teams.
Participate in an on-call schedule.
Participate in design and code reviews.
Participate in disaster recovery planning and execution.
Participate in maintenance and troubleshooting of the operational environments.
Participate in on-call support rotation.
Participate in a shared on-call schedule [follow-the-sun model] managed across SRE & Engineering.
Participate in software deployment / release process.
Participate in the development of the annual IP lifecycle management and development plans.
Partner with
Partner with engineering and operation teams across the organization to produce and roll out fixes.
Partner with software and systems engineers across the organization to produce and roll out fixes.
Perform
Perform in-depth data analysis to gauge service trends and drive improvements.
Perform operations tasks to support internal customer requests.
Perform a production application support role, including occasional off-hours support.
Perform root cause analysis and produce reports.
Plan
Plan, migrate and support computer farm transition to Microsoft Azure where applicable.
Prepare
Prepare incident documentation when needed.
Provide
Provide escalation support for configuration and platform issues.
Provide expertise, direction, coaching and development to build the team capability.
Provide mentoring and knowledge to help train other teams and team members.
Provide realistic task and cost estimates.
Provide support to a global and diverse organization working in different countries.
Provide technical mentorship for fellow engineers.
Provide technical support to dispensing and inventory operations teams.
Recommend
Recommend new technologies to ensure quality and productivity.
Reduce
Reduce the time it takes to resolve incidents (MTTR).
Respond to
Respond to production incidents and determine how we can prevent them in the future.
Run
Run fire drills, security audits, and disaster recovery tests.
Set
Set hard goals, ask lots of questions and learn every day.
Set up
Set up and configure new applications and features.
Strive
Strive to achieve both personal and team targets.
Support
Support deployments of code into multiple lower environments.
Support Infrastructure and partner teams on development initiatives.
Support other teams to meet their infrastructural and monitoring needs.
Support the deployment of cloud solution software during and off regular office hours.
Support the SRE team in bridging their tools, understanding, and experience into the product teams.
Support unit's goals to adopt automation solutions for applications in scope.
Take
Take ownership and troubleshoot sophisticated systems under pressure.
Track
Track our cloud customer SLAs and be on-call to ensure total conformity to our customer commitments.
Troubleshoot
Troubleshoot and resolve issues in our dev, stage, and production environments.
Understand
Understand Continuous Delivery methodologies and tooling.
Utilize
Utilize automation tools to manage infrastructure as code (Terraform, CloudFormation, GitHub).
Utilize log forwarding technology to troubleshoot problems and identify trends.
Work
Work closely with our partners to understand their requirements and provide technical solutions.
Work closely with Release Engineering.
Work in a distributed data center infrastructure and cloud platforms.
Work with geographically distributed software engineering teams to support the applications.
Work with a global team spread across tech hubs in multiple geographies and time zones.
Work with other operational teams on defining and improving SLAs, processes, tools and procedures.
Work with other SREs to create deployment and rollback processes.
Work with Tier 2 and Tier 3 support as required.
Write
Write Ansible automation, perform upgrades and patches.
Write clear, concise Test Cases.
Write code (Terraform, Python, Bash, Ansible, Node, etc.).
Write scripts, monitors, self-healing / auto remediation tools and automate the processes.
Most In-demand Hard Skills
The following list describes the most required technical skills of a Site Reliability Engineer:
Go
Ruby
Azure
Linux
Chef
Bash
Jenkins
Networking
Puppet
DevOps
Automation
MySQL
Git
Perl
JavaScript
Prometheus
Containers
Splunk
Troubleshooting
Grafana
Software Engineering
PowerShell
Most In-demand Soft Skills
The following list describes the most required soft skills of a Site Reliability Engineer:
Written and oral communication skills
Problem-solving attitude
Analytical ability
Interpersonal skills
Sense of ownership
Troubleshooting skills
Attention to detail
Collaborative
Work independently with little direction
Drive out root causes of complex technical problems
Team player
Self-motivated
Curious
Prioritize needs
Leadership
Organizational capacity
Creative
Sense of urgency
Reliable
Self-starter