Main Responsibilities and Required Skills for a Reliability Engineer

A Reliability Engineer identifies and manages asset reliability issues that could have a negative impact on plant or business operations. They help companies prevent product recalls and ensure that their products perform as needed. In this blog post we describe the primary responsibilities and the most in-demand hard and soft skills for Reliability Engineers.


Main Responsibilities of a Reliability Engineer

The following list describes the typical responsibilities of a Reliability Engineer:

Adhere to

Adhere to backup / DR requirements and assist with regular testing.
Adhere to Change Control policy requirements and availability mandates / requirements.

Adopt

Adopt automation solutions for applications.

Analyze

Analyze and recommend preventive maintenance replacement programs.
Analyze failure reports and recommend corrective action to prevent reoccurrence of problems.
Analyze root cause issues, problem determination, and continuous improvement.
Analyze, acquire, modify and support operating systems, database or utilities software.
Analyze preliminary plans and develop reliability.

Assess

Assess asset health to prioritize maintenance activities and capital replacement.
Assess team processes and make recommendations to further streamline them.

Assist in / Assist with

Assist director with roadmap planning and development goals.
Assist in incident management and problem management for applications.
Assist in troubleshooting and root cause analysis for environmental issues as they arise.
Assist our team to understand our business using your analytical and troubleshooting skills.
Assist technical team members with their work (e.g., systems testers, test plans).
Assist with commissioning new equipment.
Assist with incident / problem RCA and identification of trends.

Author

Author new tools and automation to streamline the DevOps pipeline.

Automate

Automate everything, from infrastructure down to day-to-day tasks.
Automate manual tasks to support exponential growth.
Automate the provisioning and management of our Cloud based Platform (in AWS) via Terraform.
Automate workflows through tools and scripting.

Build

Build and contribute to tooling where the developer experience touches databases.
Build and execute on our team's roadmap in terms of technologies, process and team enablement.
Build and maintain development, staging, sandbox, and production environments.
Build and maintain software modules for use and re-use in cloud and on-premise systems automation.
Build Big Data Technology stack clusters to do the reliability assessment and production stability.
Build, configure and tune highly available client-facing systems.
Build code to specifications and standards.
Build software used by people all around the world.
Build productive internal / external working relationships.
Build sustainable services environments and development cycles through automation.
Build systems, tools and services to measure and monitor all aspects of the platform.
Build, test, and deploy tools and best practices.
Build tools and automate to ease provisioning and scaling of the Ad Cloud Analytics infrastructure.
Build your internal network across all departments.

Champion

Champion a culture of shared service ownership within your development team.
Champion and drive the adoption of Infrastructure as Code (IaC) practices and mindset.
Champion Cvent standards and best practices.
Champion new approaches to automate testing.

Coach

Coach and mentor engineering or other SRE team members.
Coach or manage teams as applicable.
Coach technicians on troubleshooting technical problems.

Codify

Codify all environments and ensure version controlled immutable infrastructure.
Codify all possible and practical configuration of all appropriate services.

Collaborate

Collaborate directly with customers, both via video and in writing.
Collaborate effectively with various internal teams to accomplish project goals.
Collaborate independently with Engineering Change Management personnel to facilitate execution.
Collaborate on highly functional, large, matrixed project teams.
Collaborate with application engineers and train developers as needed.
Collaborate with customers and functional experts to understand technical requirements.
Collaborate with various internal teams to provide a high-quality customer experience.

Communicate

Communicate effectively and present team progress to upper management.
Communicate the reasons and consequences of infrastructural decisions effectively to other engineers.
Communicate with Users, Support, and Development teams in the event of an incident.

Complete

Complete alignment - Reverse Dial, Laser, Optical Methodologies.
Complete risk management files of new design and updates to released products.
Complete tasks with minimal guidance and oversight, and review work of others.

Conduct

Conduct annual and mid-year reviews by reviewing individual development plans and team feedback.
Conduct detailed analysis on issue investigation and determine the best path to resolution.
Conduct innovative use of new analytical tools, equipment and methodologies.
Conduct periodic on call duties.
Conduct recruitment related meetings as needed.
Conduct testing, including functionality, technical limitations, and security.
Conduct timely post-mortems of production infrastructure incidents.

Configure

Configure new appliance with same IP address as legacy deployment.
Configure and verify regular test equipment to fit special tests.

Consult

Consult in system design to meet reliability and capacity requirements.
Consult to business, business leaders, and partners.

Contribute to

Contribute to development of operational automation and self-service frameworks.
Contribute to our culture, and help make sure TrueLayer remains an exceptional place to work.
Contribute to the development of your own and team's technical acuity.

Create

Create and maintain knowledge base articles to prevent & resolve similar issues quickly.
Create and maintain new features for our Kubernetes orchestration platform.
Create, deploy, and share best practices at facility and share with other sites.
Create diagrams, including technical topology.
Create job planning and kit processes.
Create programmatic processes in Ansible.
Create reliability test protocols (ALT, HALT, RVT, RGT, etc.).
Create scripts to automate operational tasks & incorporate the solutions into infrastructure.
Create tags and event types.
Create technical design specifications and assist in scaling technical requirements.
Create telemetry and dashboards to visualize farm health.
Create visibility of our myriad of data and help our partners understand this data.

Define

Define and implement change management procedure.
Define and review analytical equipment for both production and development.
Define / Assemble Incident Response Processes for the public cloud.
Define, develop and deploy system monitoring as well as corrective actions.
Define, implement and improve instrumentation and monitoring of Big Data / AI applications.
Define, review and update SOPs / run books and ensure essential procedures are followed.

Deliver

Deliver high quality outcomes across our engineering organization.

Design

Design and deploy monitoring, metrics, and logging systems.
Design and deploy systems that are scalable, resilient and highly available.
Design and develop solutions which provide more proactive views into potential customer problems.
Design and enforce calibration procedures.
Design and implement part of the new system.
Design & implement automated solutions for continuous integration and delivery (CI / CD).
Design / implement infrastructure as code templates.
Design multi-cloud networking including Direct Connect, IP address schemes, DNS, and access control.
Design self-healing and resiliency patterns.
Design solutions to provide high-availability.

Develop

Develop and apply testing processes for new and existing products.
Develop and document best practices in developing and deploying VMware solutions.
Develop and improve internal deployment and orchestration tools.
Develop and validate statistical methodologies used to support reliability analysis.
Develop automation and integrations to deliver the custom monitoring requirements.
Develop automation for the deployment, monitoring, and observability of services.
Develop automation, optimize and drive efficiency in processes, tools and communication across teams.
Develop custom and enterprise tools and services to advance internal platforms.
Develop, establish and enforce policies, standards and guidelines for site reliability.
Develop harmonized (inter and intra) regional plans and budgets.
Develop, implement and expand the Abbott global (PdM) program.
Develop Key Performance Indicators for our Maintenance team.
Develop & maintain CI / CD pipelines, Release automation workflows, Repositories, Access Control.
Develop new services leveraging the native technologies of AWS.
Develop software and participate in developer code reviews.
Develop solutions for complex operational and reliability issues.
Develop tooling to enable continuous delivery of our applications.
Develop, update, and maintain testing standards and procedures.

Document

Document processes, procedures and coordinate appropriate training / knowledge transfer.

Drive

Drive Application rationalization to reduce local applications to the absolute minimum.
Drive mechanical assets to 100% availability to meet the unit's daily production goals.
Drive performance as measured by client's Key Performance Indicators.
Drive reduction in deviations by increasing awareness of cGMP practices and compliance requirements.
Drive results and set priorities for self and pizza-sized team independently.
Drive mechanical asset availability for their assigned production unit.
Drive reliability into systems across the enterprise.
Drive the successful implementation of Reliability improvement initiatives.
Drive to get results and not let anything get in your way.
Drive to proactively identify opportunities for improvement in our systems and propose solutions.
Drive tri-annual release planning for teams within span of control.

Ensure

Ensure 24 / 7 technical support and Service Level Agreement for customers is met.
Ensure alignment and coordination with your Software Engineering leadership peers.
Ensure appropriate follow up recommendations and actions from the inspection program of assets.
Ensure availability and uptime of applications in scope, as per service level objectives.
Ensure continued smooth operation of the global network infrastructure.
Ensure execution of action plans from the shift handover meeting.
Ensure high uptime and adherence to SLA.
Ensure our services meet stability, performance and availability requirements.
Ensure quality and effectiveness of assigned tasks.
Ensure spare parts, equipment and bill of materials are updated in CMMS (SAP).
Ensure suitable level of service personnel and activity during problem resolution at all locations.
Ensure team has the tools, resources, and information they need to be successful.
Ensure that post mortem actions are top priorities for the teams.
Ensure that production SLAs are defined, measured, monitored and maintained.
Ensure the implementation of Engineering and Maintenance Best Practices across the site.
Ensure the overall system reliability, uptime, health, and performance of SaaS Protection offering.
Ensure the Vault platform meets the scalability and reliability needs of our customers.

Establish

Establish relationships and good communication channels with customers.
Establish robust and well documented CI / CD pipelines for business applications.

Evaluate

Evaluate and deploy new technology, tools, and processes to the team.
Evaluate new technologies and processes that enhance security capabilities.
Evaluate, act and communicate within SLA time.
Evaluate the effectiveness of the organization's existing infrastructure technology.

Evangelize

Evangelize best practices for building and operating highly reliable systems.
Evangelize SRE mindset and solve problems through systematization.

Find

Find and maintain the right balance between standardized approaches and local, flexible solutions.
Find universal solutions to common problems and mentor and support junior staff.

Generate

Generate qualification plans that meet customer and Broadcom requirements.
Generate support plans to resolve complex service related problems.

Handle

Handle customer support requests and drive them to resolution in a timely manner.
Handle multiple cases involving a variety of technologies, protocols, and equipment.
Handle seamless upgrades of infrastructure and services through automation.

Help

Help invest in infrastructure using Terraform.
Help create and write support and procedure manuals and keep track of updates to these documents.
Help set OSH priorities for his / her unit in line with corporate objectives and follow up on them.
Help teams define and measure SLI / SLOs for their applications.
Help us continually improve the way we respond to incidents and handle our on call processes.
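
As an illustration of the SLI / SLO item above, here is a minimal sketch of an availability SLI and the error budget it leaves against an SLO. It assumes success is measured per request; the function names and the sample numbers are hypothetical and not taken from any particular monitoring tool.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent (negative if the SLO is missed)."""
    budget = 1 - slo  # allowed failure fraction, e.g. 0.001 for a 99.9% SLO
    if budget <= 0:
        raise ValueError("slo must be below 1.0")
    return 1 - (1 - sli) / budget


if __name__ == "__main__":
    # Hypothetical month: 100,000 requests, 50 of them failed.
    sli = availability_sli(99_950, 100_000)
    print(f"SLI: {sli:.4%}")
    print(f"Error budget remaining: {error_budget_remaining(sli, 0.999):.0%}")
```

With a 99.9% SLO, 50 failures out of 100,000 requests use about half of the budget, which is the kind of signal teams can use to decide between shipping features and doing reliability work.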

Identify

Identify and validate estimates of effort to complete engineering work streams.
Identify and implement needed maintenance technician training and procedure development.
Identify needs and provide training to maintenance and production related to improving asset care.
Identify opportunities to optimize existing services and infrastructure.
Identify problems and make recommendations to the decision-making bodies concerned.

Implement

Implement and manage public cloud compute and storage systems.
Implement and optimize a CI / CD pipeline using industry best practices and tools.
Implement automation (CICD, IaC) and promote best practices for our CI / CD processes.
Implement automation solutions.
Implement best practices for Observability and Operation excellence (metrics).
Implement established reliability and PdM strategies.
Implement features in a Scrum / Agile environment.
Implement high-impact automation, replacing slow, error-prone manual processes.
Implement monitoring, Logging, alerting and SLA Reporting (observability).

Improve

Improve customer experience by delivering new service monitoring, alarming and scripting.
Improve production reliability of multiple Fortanix products via automation.
Improve reliability, quality, and efficiency of our suite of software solutions.
Improve reliability, quality, performance and scalability of our suite of software solutions.

Incorporate

Incorporate automation and improvement of failure analysis process.
Incorporate security standards and best practices.
Incorporate the three-year maintenance plan into priority setting for improving asset reliability.

Integrate

Integrate deeply with the engineering and product teams to understand core goals and drive execution.
Integrate third party products.

Interface

Interface with other technical teams to develop company standards for network architecture.

Investigate

Investigate, report, and resolve farm / infrastructure team issues.
Investigate and analyze relevant variables potentially affecting product and processes.

Lead

Lead cross-functional teams in reviewing, updating and developing new Engineering Standards.
Lead for problem identification and resolution, change and release.
Lead internal and external audits for platforms.
Lead load testing, capacity planning, proactive monitoring, and performance optimization efforts.
Lead and take responsibility for the execution of the reliability excellence / maintenance program.
Lead the Product Safety activities related to the management of incidents, accidents and near-misses.

Load

Load-balance the application, including proxies and CDN.
Load-balance with tools such as HAProxy, Nginx or F5.

Maintain

Maintain and develop custom scripts to improve our ability to further automate deployment.
Maintain and improve existing monitoring and alerting system.
Maintain and manage multiple environments for SaaS applications.
Maintain and manage platform systems.
Maintain application runbook.
Maintain lab equipment and inventory.
Maintain & manage availability of the VMware on AWS - SaaS Service platform with 99.99% uptime.
Maintain technical competency through self-guided study, and regular training.
Maintain the release repository and manage key information such as build and release procedures.
Maintain the reliability, scalability, availability, and security of our platform and applications.
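
An uptime target such as the 99.99% mentioned above translates into a concrete downtime allowance. The following is a minimal sketch of that arithmetic, assuming a 30-day month and a 365-day year; the function name `allowed_downtime_minutes` is hypothetical and used only for illustration.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float) -> float:
    """Return the downtime (in minutes) permitted by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Assumed periods: a 30-day month and a 365-day year.
    print(f"per month: {allowed_downtime_minutes(99.99, 30):.2f} min")
    print(f"per year:  {allowed_downtime_minutes(99.99, 365):.2f} min")
```

Under these assumptions, 99.99% leaves roughly 4.3 minutes of downtime per month, or about 52.6 minutes per year.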

Make

Make recommendations on actions that can be taken to reverse adverse trends.
Make timely, practical, effective decisions.

Manage

Manage Aging CMMS work order backlog to target, work with peers on site.
Manage and evolve observability tools and documentation used by engineers.
Manage and tune cloud infrastructure, micro-service applications, and continuous integration systems.
Manage customer expectations and outcomes for all activities.
Manage data replication, disaster recovery, geo-replication, and mirroring.
Manage large farm of servers / domains with vendors.
Manage our on and off-prem Kubernetes clusters to support our growing workloads.
Manage product bring up within infrastructure, interfacing multiple teams.
Manage production and developer environments spanning multiple cloud platforms.
Manage projects to completion, as assigned.
Manage reliability projects.
Manage shift schedules and provide ongoing feedback as part of the Infrastructure Management team.
Manage SRE team members across Experian offices in the US and India.
Manage the proper staging, masking and custody of testing data across all involved systems.
Manage subcontractors to deliver goods and services while exceeding contract expectations.
Manage VPN, Firewall, and other systems supporting the security of Xencall.

Mentor

Mentor and train other SREs on the team as the need arises.
Mentor less senior Design Assurance Engineers.
Mentor other SREs on methodology for SRE related activities (monitoring, troubleshooting issues).

Monitor

Monitor alignment to cloud strategies, standards, designs, guidelines and policies.
Monitor all systems through various tools applications and consoles.
Monitor, detect and troubleshoot issues during code rollouts on the live site.
Monitor Maintenance and calibration weekly Metrics for compliance issues.
Monitor results to continually refine and improve asset strategies.
Monitor, troubleshoot and resolve Production grade issues for SaaS platform and applications.

Oversee

Oversee hardware quality for one of the most advanced consumer appliances ever built.
Oversee all site reliability functions related to Costco's SAP Commerce Solution.
Oversee the teams that manage the quarterly enterprise releases.

Own

Own and drive key performance indicators as well as ongoing reliability test execution.
Own and manage the technical resolution and mitigation plans of mine mobile equipment bad actors.
Own Permanent Corrective Action log and drive compliance for the site.
Own systems administration and the pipeline from software development to production.
Own the scope, roadmap, implementation and tracking of services & tools offered by the team.

Participate in

Participate in and contribute to partner / client business review meetings.
Participate in and organize design reviews, ensure documentation and follow-up.
Participate as a technical leader, providing solutions to operations problems.
Participate in a 24x7 rotation for second-tier escalations.
Participate in a paid on-call rotation.
Participate in balanced on-call rotation, after hours and weekends.
Participate in code and design reviews with engineering teams.
Participate in code reviews with the DevOps team and Embedded SREs on delivery teams.
Participate in daily team communication and travel (as needed).
Participate in maintenance projects.
Participate in on-call monitoring response.
Participate in on-call rotation for production services.
Participate in on-call rotations with platform team for platform related services.
Participate in on-call rotation with other members.
Participate in RCAs and take action-items to mitigate issues and come up with a roadmap.
Participate in requirements identification, definition, prioritization, documentation, and analysis.
Participate in shared on-call schedule [follow-the-sun model] managed across SRE & Engineering.
Participate in software deployment / release process.
Participate in the annual DR drills.
Participate in the creation and review of Maintenance procedures and policies.
Participate in Tier 3 support activities and BAU as needed.

Perform

Perform a cost saving evaluation (VCP) based on previous maintenance activities.
Perform application upgrades for multiple NCR SaaS products.
Perform code reviews for teammates.
Perform database back-up / recovery.
Perform fault analysis and resolution.
Perform ongoing design and code reviews.
Perform other duties as necessary.

Plan

Periodically upgrade our underlying infrastructure.
Plan and execute system upgrades.
Plan, migrate and support computer farm transition to Microsoft Azure where applicable.

Prepare

Prepare the reliability analysis and test summary.
Prepare the supplier reliability requirements and ensure that they are met.
Prepare written communication.

Promote

Promote a culture of safety.
Promote DevOps / SRE mindset.

Prototype

Prototype and build new capabilities to increase the leverage of platform operations and security.

Provide

Provide 3rd line troubleshooting of cloud issues.
Provide all engineering services in a legal and ethical manner.
Provide an overall integrated understanding of turbine operation and protection.
Provide expertise on delivery methods and represent the team on and lead small projects.
Provide failure analysis / root cause analysis when required.
Provide feedback, coach and mentor team members to improve individual and team effectiveness.
Provide governance oversight for troubleshooting complex problems.
Provide leadership and work guidance to less experienced personnel.
Provide leadership, mentoring, and coaching to Software Engineers.
Provide "level 4" support for incidents and tickets escalated from Platform Operations team.
Provide operational support of systems and build automation to remediate and address the root cause.
Provide regular executive level communication updates of program status including ok2ship summary.
Provide risk management plans.
Provide support for Maintenance reports / logs, related deviations, investigations, and CAPAs.
Provide support to drive the maturity of the software development lifecycle.
Provide technical assistance and support.
Provide technical support to production, maintenance management and technical personnel.
Provide technical triage and troubleshooting by understanding and analyzing financial data systems.
Provide Tier 1 maintenance and support for all of our products.

Recommend

Recommend and bring to life new solutions to improve OS reliability and robustness.
Recommend and create new solutions to improve OS reliability and robustness.
Recommend and implement new security technologies and policies.
Recommend more efficient ways to sort through and categorize and standardize the above items.
Recommend upgrades and improvements to maintain an optimized network infrastructure.

Reduce

Reduce and eliminate equipment breakdowns.
Reduce the time it takes to resolve incidents (MTTR).
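
MTTR is commonly computed as the total time spent resolving incidents divided by the number of incidents. A minimal sketch, assuming each incident is recorded as a (detected, resolved) pair of timestamps; the sample data and the name `mean_time_to_resolve` are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average the detected-to-resolved duration over all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: (detected, resolved)
sample_incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),      # 30 minutes
    (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 15, 30)),  # 90 minutes
    (datetime(2024, 2, 1, 22, 15), datetime(2024, 2, 1, 23, 15)),   # 60 minutes
]

if __name__ == "__main__":
    print(mean_time_to_resolve(sample_incidents))  # 1:00:00
```

Tracking this value per month or per service is one way to check whether automation and runbook improvements are actually shortening incidents.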

Respond

Respond positively to pressure.
Respond to issues identified via vulnerability scans, network security scans and PEN tests.

Review

Review AutoCAD layout and make updates as needed to drawings.
Review engineering contractor's drawings, calculations and specifications.
Review fleet key performance indicators, highlighting opportunities and challenges.
Review Medical device complaints and trends in support of new product.
Review operational and equipment risks to determine appropriate mitigations.
Review PM frequency to reduce 'wasted' activities (supports cost reduction initiatives).

Share

Share knowledge of tools and techniques with your wider team.

Solve

Solve performance and stability issues using a wide variety of tools.
Solve problems for engineers and customers on this critical growth initiative.

Stay up-to-date with

Stay up-to-date with new testing methods, testing tools, and test strategies.
Stay up-to-date with new testing tools and strategies.

Support

Support 24x7 team of Cloud Operation and Development Specialists.
Support and drive process improvement.
Support a world class marketing, business-to-consumer platform.
Support capital and repair projects.
Support our product development team by testing release candidates in production.
Support the deployment of cloud solution software during and off regular office hours.
Support the design and configuration of complex system landscapes.
Support the sustainable and responsible development of the unit.
Support utility distribution, master planning, and business continuity planning.
Support validation efforts of mechanical and controls interactions for complex systems.

Take

Take on ownership of major components of the system and drive the engineering team's practices.

Triage

Triage and troubleshoot complex production issues to ensure reliability and performance.
Triage problems across the stack to help address production issues.

Troubleshoot

Troubleshoot minor incidents and contribute to resolution through post-mortems.
Troubleshoot performance and stability issues using a wide variety of tools.
Troubleshoot root cause of application performance.

Understand

Understand how to build solutions with an agile approach.
Understand Kubernetes inside and out.
Understand network services and protocols, e.g. HTTPS, TLS, SSH, TCP / IP, etc.
Understand power generation, transmission and distribution of electrical energy.
Understand the risks involved in a startup (previous startup experience preferred).

Update

Update maintenance plans based on feedback from mechanics and maintenance associates.
Update PM plans according to feedback from PM mechanics or maintenance associates.
Update technical and technological maintenance best practices within the unit.

Use

Use and maintain version control for application infrastructure.
Use Container platforms such as Kubernetes for large scale deployment of microservices.
Use issue resolution as an opportunity to improve supportability of the custom software.
Use orchestration tools such as Terraform, Ansible or CloudFormation.

Utilize

Utilize computer systems (e.g. work instructions, labor charging, prints & standards).
Utilize Agile and Lean practices to identify and solve systemic issues.

Work with

Work with different groups to develop and improve monitors for products and infrastructure.
Work with engineers and system architects to deliver non-functional business requirements.
Work with engineers to investigate recurring issues.
Work with field technicians to provide direction and support equipment troubleshooting.
Work with Maintenance Planning to have them implemented into Maximo.
Work with maintenance professionals and process equipment to improve reliability.
Work with multiple software development teams to address issues and improve quality.
Work with Product to understand useful metrics and alerts for each of our products.
Work with software development teams to design reliability and scalability into solutions.
Work with Sr. Managers and Owners.
Work with the best technology and the best technologists.
Work with the larger team and collaborate towards effective delivery of team objectives.
Work with vendors and partners for the successful implementation of critical tooling and platforms.
Work with VMware InfoSec team on security aspects of Kubernetes, docker and AWS.

Write

Write documentation, including technical standards and processes.
Write well-crafted, high-quality, self-documented, and easy to maintain code.

Most In-demand Hard Skills

The following list describes the most required technical skills of a Reliability Engineer:

Python
AWS
Kubernetes
Java
Ansible
Docker
Terraform
Go
Azure
Ruby
Linux
Bash
Chef
Jenkins
GCP
Puppet
Networking
DevOps
Automation
MySQL
Git
Perl
JavaScript
Troubleshooting
Prometheus
Splunk
Containers
Operations
PowerShell
Software Engineering
Grafana
Golang
Cloud
Redis
Security
Elasticsearch
SQL
Kafka
SRE
Monitoring
Programming Languages
Datadog
DNS
C
Software Development
Scripting
CloudFormation
Configuration Management
C#
Continuous Integration

Most In-demand Soft Skills

The following list describes the most required soft skills of a Reliability Engineer:

Written and oral communication skills
Problem-solving attitude
Analytical ability
Interpersonal skills
Sense of ownership
Leadership
Troubleshooting skills
Attention to detail
Collaborative
Organizational capacity
Work independently with little direction
Self-motivated
Drive out root causes of complex technical problems
Team player
Curious
Teamwork
Creative
Prioritize needs
Sense of urgency
Presentation
Self-starter
Planning
Take on progressively greater accountabilities
Prioritize tasks
Reliable
Time-management
Initiative
Autonomous
Integrity
Flexible