Main Responsibilities and Required Skills for Data Engineer
A Data Engineer leverages data science to understand how data flows through an organization and to make sure that it is securely managed, stored, and analyzed. In this blog post we describe the primary responsibilities and the most in-demand hard and soft skills for Data Engineers.
Main Responsibilities of Data Engineer
The following list describes the typical responsibilities of a Data Engineer:
Add
Add regression test cases for changes to existing feeds.
Advocate
Advocate working in the open by demonstrating your work.
Architect
Architect and implement high quality analytical applications and data products.
Architect, build and launch new data models that provide intuitive analytics.
Assemble
Assemble large, complex data sets that meet business requirements.
Assess
Assess the effectiveness and accuracy of new data sources and data gathering techniques.
Assist in
Assist in technology innovation activities such as research, evaluations, and prototyping.
Assist in the validation and analysis of large data sets and complex data models.
Assist with data-related technical issues and support their data infrastructure needs.
Assist with publishing as needed to Tableau and / or ArcGIS servers or other web platforms.
Assist with recognising and upholding digital security systems to protect private data.
Assist with the design and create automated applications and reporting solutions.
Attend
Attend regular scrum events or equivalent and provide updates on the deliverables.
Automate
Automate test coverage (90+%) for data pipelines.
Automate the deployment of scrapers on a cloud architecture.
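Items like "add regression test cases" and "automate test coverage for data pipelines" often boil down to pinning known-good pipeline output in code. A minimal sketch in Python, where the feed format and the `normalize_record` transformation are hypothetical:

```python
def normalize_record(raw: dict) -> dict:
    """Hypothetical feed transformation: trim strings, coerce amounts to cents."""
    return {
        "id": raw["id"].strip(),
        "amount_cents": round(float(raw["amount"]) * 100),
    }

def test_normalize_record_regression():
    # Pin the expected output for a known input, so a change to the feed
    # logic cannot silently alter downstream data.
    assert normalize_record({"id": " a1 ", "amount": "12.50"}) == {
        "id": "a1",
        "amount_cents": 1250,
    }

test_normalize_record_regression()
```

In practice such checks would live in a test suite (pytest or similar) and run in CI on every change to the feed.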
Build
Build a high-quality BI and Data Warehousing team and design the team to scale.
Build and execute integration testing for procedures that impact the state of asset data.
Build and help maintain our cloud data ecosystem.
Build and maintain consistent, secure and performant data pipelines.
Build and maintain physical and logical data models.
Build and support complex ETL infrastructure to deliver clean and reliable data to the organization.
Build continuous integration, test-driven development and production deployment frameworks.
Build, develop and lead a new Data Engineering team referred to as our Trek 3 team.
Build ETL solutions with SQL based technologies like Oracle, Snowflake and Spark.
Build queries and views to facilitate delivery of data to stakeholders.
Build, run, own and validate data pipelines in the RRPS data and analytics platform.
Build 'security agents' through plugins to develop and expand their surface.
Build systems that allow for validation of our data product, ensuring accuracy of these data sets.
Build documentation for the data dictionary, data flows, and governance.
Build tools and automation capabilities for data pipelines.
Build tools to reduce occurrences of errors and improve customer experience.
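"Build queries and views to facilitate delivery of data to stakeholders" can be illustrated with a small sketch. Here an in-memory SQLite database stands in for a warehouse, and the table and view names are made up:

```python
import sqlite3

# In-memory database stands in for a warehouse; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("east", 10.0), ("east", 5.0), ("west", 7.5)])

# A view gives stakeholders a stable, query-ready interface over the raw table.
conn.execute("""
    CREATE VIEW revenue_by_region AS
    SELECT region, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
""")

print(dict(conn.execute("SELECT * FROM revenue_by_region ORDER BY region")))
```

The same pattern (raw tables kept private, curated views exposed) scales up to warehouse platforms like Snowflake or Redshift.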
Champion
Champion an innovation culture that creates results and positively impacts the company's business.
Champion Data Science principles throughout the wider business.
Coach
Coach and provide guidance to junior team members.
Coach team members and grow the skill sets.
Coach the BI team on good development practices.
Collaborate with
Collaborate with all data science team members and other colleagues on data analysis projects.
Collaborate with a Project Manager to bill and forecast time for customer solutions.
Collaborate with business and technical stakeholders while defining solutions.
Collaborate with business partners to understand processes and identify improvement areas.
Collaborate with business teams and understand how data needs to be structured for consumption.
Collaborate with data scientists and visualization experts in order to provide adequate data.
Collaborate with Data Scientists integrating prediction and machine learning projects.
Collaborate with engineers / analysts / stakeholders on cross team projects.
Collaborate with Platform R&D teams to ensure smooth and reliable integration.
Collaborate with product and engineering teams to take requirements from prototype to production.
Collaborate with stakeholders to produce actionable requirements.
Collaborate with teams / partners to improve overall operational maturity of the data ecosystem.
Collaborate with various internal teams to identify, process, and analyze new data sets.
Collect
Collect data requirement from business and translate it into data sets.
Collect feedback from internal teams to fine-tune software deployments.
Communicate
Communicate cross-functionally with product, design, engineering, and executives.
Communicate / work effectively in a team environment.
Complete
Complete projects on time and within time budgets.
Conceptualize
Conceptualize and lead global research initiatives.
Conduct
Conduct Graph analysis using Datastax Graph, Cassandra, Gremlin.
Conduct the data operations necessary to ingest and process the customers data.
Conduct timely structured code reviews to ensure standards and systems interoperability.
Conduct training and knowledge transfer.
Conduct user experience data analysis and create dynamic visualizations.
Connect
Connect and support data sources that flow into data warehouses.
Connect with Pipeline Enablement and its functions to prioritize and implement these strategies.
Contribute to
Contribute code to production systems.
Contribute to the design of experiments focused on assay improvement.
Contribute to data science projects as appropriate.
Contribute to overall solution, integration and enterprise architectures.
Contribute to product strategy & future tech roadmap with an outcomes driven mindset.
Contribute to team's content & IP development.
Contribute to the building of a high performing team with a high degree of trust.
Contribute to the design and architecture of services across the data landscape.
Contribute towards shaping the architecture, design and scalability of our processes and pipelines.
Coordinate
Coordinate with release manager for deployment.
Create
Create accurate test plans, conditions, and data.
Create and maintain optimal data pipeline architecture.
Create and maintain optimal data pipeline architecture for our clients to improve performance.
Create and manage a data dictionary for different data sources.
Create and use effective metrics and monitoring processes.
Create performant data-serving layers.
Create data systems and pipelines.
Create data tools for analytics and data scientist team members.
Create, maintain and optimize our batch and streaming data pipeline architectures.
Create new analyses that augment our data asset like tobacco use status or medication history.
Create real-time data flows to meet the needs of operational systems.
Create scripts, tools, or other software solutions to help enforce data governance policies.
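The "create and manage a data dictionary" items can be partly automated. A hedged sketch that infers column types and example values from sample records; the column names are made up, and real column semantics would still be documented by hand:

```python
def build_data_dictionary(records):
    """Infer a simple data dictionary (column -> type name and example value)
    from a list of records."""
    dictionary = {}
    for record in records:
        for column, value in record.items():
            # Keep the first occurrence of each column as its example.
            dictionary.setdefault(column, {
                "type": type(value).__name__,
                "example": value,
            })
    return dictionary

rows = [{"patient_id": 101, "smoker": True}, {"patient_id": 102, "smoker": False}]
print(build_data_dictionary(rows))
```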
Debug
Debug complex production scenarios.
Define
Define and implement key metrics for PatientIQ's data warehouse.
Deploy
Deploy implemented machine learning solutions in public and private cloud environments.
Deploy, test, validate and maintain data solutions.
Design
Design all program specifications and perform required tests.
Design and build solutions that are aligned with Fairstone's information security policies.
Design and code (Java, Scala, Spark) solutions to support common and strategic data sourcing needs.
Design and develop a cloud-based data architecture that is optimal for data science.
Design and develop ETL processes for data integration.
Design and develop event-driven ingestion and egression framework to consume and publish data.
Design and develop for quick deployment with assistance from the Devops team.
Design and develop robust data pipelines.
Design and execute data migrations.
Design and implement API layers with appropriate security mechanisms for data ingest and consumption.
Design and implement ETL / ELT data pipelines.
Design and implement functional tests, integration tests and automate based on ROI.
Design and implement high-volume data ingestion and streaming pipelines using Open Source frameworks.
Design and implement scalable and durable data models for our data.
Design and implement solutions to improve data platform.
Design and implement technical solutions.
Design and implement the data layer for data driven applications and business intelligence solutions.
Design and manage processing pipelines via AWS Glue and / or EMR clusters.
Design and plan integration for all data warehouse technical components.
Design and scale databases and pipelines across multiple physical locations on cloud.
Design and support the new and evolving sources of data being brought into the data warehouse.
Design conceptual and logical data models and flowcharts.
Design reliable, efficient software that solves real business problems.
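Many of the design responsibilities above center on ETL pipelines. A toy extract/transform/load sketch, using a CSV string and an in-memory SQLite database as stand-ins for real sources and targets:

```python
import csv
import io
import sqlite3

RAW = "name,score\nada,91\ngrace,88\n"  # stand-in for an extracted file

def extract(text):
    """Parse the raw source into dict records."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Coerce types and drop records that fail a basic validity check."""
    return [(r["name"], int(r["score"])) for r in rows if r["score"].isdigit()]

def load(rows, conn):
    """Write the cleaned records to the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS scores (name TEXT, score INTEGER)")
    conn.executemany("INSERT INTO scores VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
print(conn.execute("SELECT COUNT(*), SUM(score) FROM scores").fetchone())
```

A production pipeline would add the concerns listed above: scheduling (e.g. Airflow), monitoring, retries, and data-quality checks between the stages.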
Develop
Develop a clear understanding of information needs from business requirements.
Develop a data dictionary along with supporting technical documentation.
Develop and exercise cross-functional data delivery to support business needs.
Develop and implement tools and techniques which will help make better data-based decisions.
Develop and maintain a data warehouse.
Develop and maintain critical data pipelines.
Develop and maintain ETL infrastructure to support the ingestion of external data sources.
Develop an implementation plan for attribution project.
Develop baseline AI models with the guidance of Data Scientist (s).
Develop compliance guidelines.
Develop, construct, test and maintain large-scale data processing systems.
Develop ETL jobs to populate subject-oriented data marts.
Develop, install and integrate computer-based systems.
Develop, maintain code, and integrate software into a fully functional software system.
Develop & maintain data publication and synchronisation processes, supporting Tableau.
Develop, manage, and maintain an up to date map of all the data sources.
Develop metrics to measure the outcome / impact of your introduced solutions.
Develop multi-dimensional data models.
Develop our custom in-house ETL framework.
Develop reports that support analysis and key business decisions on an ad-hoc and scheduled basis.
Develop streaming data pipelines using Lambda, Python, and Amazon Kinesis.
Develop the overall layout and production design for website wireframes and site maps.
Develop robust and monitorable data pipelines and related services.
Develop tools that will enhance our vehicle import pipeline.
Develop tools to retrieve data using 3rd party APIs.
Discover
Discover, analyze and validate new data sets to add value for our customers.
Discover opportunities for data acquisition from other data sources.
Document
Document and communicate standard methods and tools used.
Drive
Drive both organizational and personal change and innovation in the Global People Data team.
Drive data quality across the product vertical and related business areas.
Drive design, building, and launching of new data models and data pipelines in production systems.
Drive key initiatives including re-platforming, data engineering projects, etc.
Drive the development of cloud based and hybrid data warehouses & business intelligence platforms.
Educate
Educate and advise internal teams on how to leverage available data.
Educate and federate VSC community around data.
Educate software engineers on application security best practices and secure coding techniques.
Employ
Employ best practices in continuous integration and delivery.
Empower
Empower data engineers and data scientists to solve real world problems by improving their workflow.
Engage
Engage in daily interactions with internal team members to develop data-driven solutions.
Engage with various internal cross-functional departments including Data Management, Technology.
Ensure
Ensure a proper communication of data documents.
Ensure database servers are backed up and meet the Enterprise Recovery Time Objectives.
Ensure execution and alignment to architectural standards and blueprints.
Ensure our data is kept to the highest quality and integrity standards.
Ensure proper escalation, prioritization and remediation of data quality issues.
Ensure security and privacy are first-class citizens in all solutions.
Ensure standards are met so that solutions deployed have technical integrity and stability.
Ensure that controls to verify the accuracy and consistency of data are implemented and monitored.
Ensure the best performance, quality, and responsiveness of applications and games.
Ensure the deliveries are on time and of the required quality.
Escalate
Escalate issues to the Manager, Data Science & Engineering.
Establish
Establish a comprehensive set of data models for company and client data.
Establish and enhance technical guidelines and best practices for cloud integration development team.
Establish, meet and monitor SLAs for support issues in conjunction with the rest of the teams.
Evaluate
Evaluate algorithms and create prototypes.
Evaluate and audit data quality of systems.
Evaluate business needs and objectives to find suitable solutions.
Evaluate status and resource utilization, and implement changes to improve the teams' effectiveness.
Evangelise
Evangelise the use of customer data to better understand our customers across the organisation.
Expand
Expand the existing pipeline to include new data sources or generate new derived views of the data.
Extend, improve, or, when needed, build solutions to address architectural gaps or technical debt.
Extract
Extract, analyze & interpret large, complex datasets for use in predictive modelling.
Extract, transform, load, and maintain data from various sources in the data warehouse.
Gather
Gather and document business requirements and translate it into technical design documents.
Gather data and design data platform for data calls received from various sources.
Guide
Guide and mentor team members.
Handle
Handle changes in application logic and develop supporting data transformations.
Handle large scale databases.
Help
Help design and develop structured and unstructured data acquisition solutions.
Help develop and align CXO data capabilities with Enterprise Analytics initiative roadmap.
Help develop a system that migrates data from legacy sources to new ones.
Help employees to locate and understand data of interest by preparing metadata and documentation.
Help non-technical audiences understand technical requirements.
Help on-board entry level engineers.
Help to guide the team to meet their targets.
Help unify software development and operations seamlessly, efficiently, and cost effectively.
Help us stay current on the latest data processing tools and trends.
Identify
Identify and act upon product improvement opportunities.
Identify and address issues with data sets from multiple vendors.
Identify and share best practices for key topics.
Identify and suggest or implement remediation of cases where we diverge from industry best practices.
Identify opportunities for improvement & innovation across all data engineering work.
Identify opportunities within existing account for extension or expansion.
Identify rejected responses and report to Encounter research team.
Implement
Implement and deploy highly scalable big data analytics system in the AWS environment.
Implement approved designs following industry best practices and to a high quality standard.
Implement ETL / data pipelines with a 'best of breed' / 'right tool for the job' mentality.
Implement gamification features to stimulate user adoption.
Implement necessary team structures, cadences and best practices for optimal development cycles.
Implement security measures by encrypting sensitive data.
Import
Import new data sources and transform data into data warehouse.
Improve
Improve the data availability by acting as a liaison between Lab teams and source systems.
Install
Install / update system and component fault-tolerant procedures.
Integrate
Integrate data from various resources, and manage the big data as a key enterprise asset.
Keep up to date with
Keep up to date with emerging technologies and recognise the potential value they bring to Confused.
Keep up to date with latest developments in ML / Deep Learning / Data Science / Data Engineering.
Lead
Lead and contribute in development, documentation, deployment and automation of delivery.
Lead and mentor less experienced developers.
Lead in creating all departmental policies and best practices.
Lead technical efforts, including design and code reviews, and mentor staff appropriately.
Leverage
Leverage cutting-edge research and techniques with continuous improvement in mind.
Leverage Kinesis, Glue, Lambda.
Leverage new technologies and approaches to innovating with increasingly large data sets.
Maintain
Maintain and support multiple projects and deadlines in an agile environment.
Maintain ETL processes on an ongoing basis.
Maintain knowledge of business unit policies and procedures, systems, and requirements.
Maintain knowledge of existing technology documents.
Manage
Manage and maintain the data warehouse.
Manage a team of two Data Engineers, their careers, way of working with the other teams, etc.
Manage automated unit and integration test suites.
Manage code merges and deployments to preprod in support of QA activities.
Manage data across multiple platforms, including Azure SQL and Snowflake.
Manage regular Airflow deployments and updates.
Measure
Measure progress toward goals; and evaluate productivity and team efficiency.
Mentor
Mentor and assist team members with solution implementation and technology adoption.
Mentor and lead members of the Data Engineering team.
Mentor the team of data engineers.
Model
Model and transform classes of data into these schemas.
Model, Design, Build and maintain Data Engineering applications using Talend and Snowflake.
Model development including, for example, Machine Learning.
Monitor
Monitor and analyze information and data systems, and their performance.
Monitor data load jobs and perform root cause analysis of production issues.
Monitor new deployments and services.
Monitor the best practices of general microservices and microbatch architecture.
Optimize
Optimize cloud and data storage costs.
Optimize data access for speed, reliability, and velocity.
Optimize infrastructure with consideration for pricing efficiency.
Optimize solution designs for performance, scalability, and costs.
Optimize the use of OBIEE in Network Expense operations.
Oversee
Oversee and manage staff members in the daily use of data systems.
Oversee the quality, delivery, performance, cost and scope for data engineering initiatives.
Own
Own and deliver key product features (generally as part of a team).
Own and deliver major project initiatives which span multiple sprints.
Own costs of the Databricks infrastructure.
Own the development, training, and optimizing of machine learning systems.
Own the development, training, optimizing, and deployment of machine learning systems.
Participate
Participate and lead in architecture and design discussions.
Participate in all phases of the software development cycle as part of a Scrum team.
Participate in collaborative software development and implementation of new solutions.
Participate in design, code, and test Inspections throughout life cycle to identify issues.
Participate in on-call support of Denodo products.
Participate in resolution of production issues and lead efforts toward solutions.
Participate in the data quality process in collaboration with the QA team.
Perform
Perform ad-hoc analysis to identify trends and support immediate tactical business needs.
Perform ad-hoc analysis to independently handle business questions from customers.
Perform ETL with massive data from multiple applications and 300M+ customers.
Perform extensive data profiling and analysis based on the client's data.
Perform other duties as assigned.
Perform other related duties as required.
Perform root cause analysis of issues that hinder the data quality.
Perform tasks such as writing scripts, web scraping, getting data from APIs, etc.
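For "writing scripts, web scraping, getting data from APIs", Python's standard library already covers the simple cases. A sketch that extracts link targets from sample HTML markup:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

page = '<p><a href="/a">one</a> <a href="/b">two</a></p>'  # sample markup
parser = LinkExtractor()
parser.feed(page)
print(parser.links)
```

For real scraping work, dedicated libraries (requests, BeautifulSoup, Scrapy) are the usual choice, and terms of service and rate limits apply.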
Plan
Plan and Execute secure, best practice data strategies and approaches.
Prepare
Prepare and complete warehouse orders for delivery.
Prepare and optimize AWS redshift table for business analytics.
Prepare presentations and reports based on the results of work / analysis.
Prepare technical specifications and documentation for projects.
Prepare test data and test environment for both manual and automated test cases.
Present
Present the results and lead the business workshops to guide developments.
Produce
Produce requirements specifications, design deliverables, status reports, project plans.
Propose
Propose and approve Software Engineering process improvement recommendations.
Propose design solutions and recommend best practices for large scale data analysis.
Propose new data structures and workflows to bring efficiency to internal projects.
Provide
Provide coaching & learning opportunities to fellow team members to promote growth.
Provide counseling / coaching, oversight, and support for delivery teams and staff.
Provide data analysis guidance as required.
Provide design specifications for, and develop, new products and services or their components.
Provide direction to team members as assigned by management.
Provide GitLab peer review and code merges.
Provide guidance and work leadership to less-experienced software engineers.
Provide leadership and oversight to small technical team.
Provide leadership, mentoring, and coaching.
Provide Level 2 technical support.
Provide mentorship and guidance to junior team members.
Provide or coordinate troubleshooting support for data warehouses.
Provide recommendations to management on behalf of the team.
Provide strong troubleshooting and problem solving support.
Provide technical support and usage guidance to the users of our platform's services.
Provide thought leadership and architectural expertise and manage cross-team integration.
Provide training or coaching for junior members.
Provide UAT support, triage, troubleshooting, and explanations.
Refine
Refine user stories into tasks.
Release
Release the feature to Production for use by our customers.
Report
Report generation via SQL based tools.
Report your findings with data visualizations that are easy to understand.
Research
Research and employ cutting edge techniques to move well beyond internet scale data.
Research and implement tooling to support the build out of intelligence tools.
Research and prototype data acquisition strategy for scientific instruments used in the lab.
Research emerging technologies to determine impact on application execution.
Set
Set and promote data management culture and process.
Set metrics and identify quality trends.
Set up
Set up and deploy cloud-based data services such as blob services, databases, and analytics.
Set up Spark clusters and Airflow instance.
Share
Share best practices and make sure they are enacted in the Delivery teams.
Share knowledge and assist others in understanding technical topics.
Share technical expertise with team and provide recommendations for best practices.
Support
Support a complex and time-critical suite of ACH applications.
Support on-call shift as needed to support the team.
Support Redshift cluster management including monitoring, performance tuning, and optimization.
Support the ongoing Production jobs as and when required.
Track
Track record of negotiations with suppliers.
Track record working with data from multiple sources.
Transform
Transform complex analytical models into production-ready and highly-scalable solutions.
Translate
Translate business requirements into strategy.
Translate existing SAS code into Python code.
Translate raw data to actionable insights.
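The "translate existing SAS code into Python" item can be sketched as follows. The SAS data step is illustrative, and the Python version uses plain lists of dicts rather than any particular dataframe library:

```python
# SAS (illustrative):
#   data high_value;
#     set orders;
#     where amount > 100;
#     discounted = amount * 0.9;
#   run;

orders = [{"id": 1, "amount": 150.0}, {"id": 2, "amount": 80.0}]

# Python equivalent: filter on the WHERE condition, derive the new column.
high_value = [
    {**row, "discounted": row["amount"] * 0.9}
    for row in orders
    if row["amount"] > 100
]
print(high_value)
```

In a real migration, pandas is the more common target, since its DataFrame operations map closely onto SAS data steps and PROC SQL.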
Troubleshoot
Troubleshoot operational data-quality issues.
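Troubleshooting data-quality issues usually starts with automated checks. A minimal sketch that flags missing required fields and out-of-range values; the schema (required fields and valid ranges) is illustrative:

```python
def find_quality_issues(rows, required, ranges):
    """Return (row index, field, problem) tuples for basic quality violations."""
    issues = []
    for i, row in enumerate(rows):
        for field in required:
            if row.get(field) in (None, ""):
                issues.append((i, field, "missing"))
        for field, (lo, hi) in ranges.items():
            value = row.get(field)
            if value is not None and not (lo <= value <= hi):
                issues.append((i, field, "out of range"))
    return issues

rows = [{"id": 1, "age": 34}, {"id": None, "age": 210}]
print(find_quality_issues(rows, required=["id"], ranges={"age": (0, 130)}))
```

Frameworks such as Great Expectations or dbt tests formalize the same idea for production pipelines.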
Understand
Understand and refine requirements.
Understand different file formats and processing.
Understand how to profile code, queries, programming objects and optimize performance.
Understand and apply new industry perspectives to our existing business and data models.
Understand the strategic direction set by senior management as it relates to team goals.
Use
Use and continuously develop best practices, standards & frameworks.
Use previous experience to maintain the current loyalty and marketing databases.
Validate
Validate all standardized queries, reports, and metrics used for quality purposes.
Work with
Work with an entrepreneurial sense of urgency.
Work with billions of rows of data.
Work with data and analytics experts to strive for greater functionality in our data systems.
Work with data and analytics experts to strive for greater functionality in our Hadoop data systems.
Work with data engineers to troubleshoot application problems and effectively resolve issues.
Work with data lake architects on the ecosystem's road map, automation and orchestration.
Work with Docker for containerization, Git for version control, and Kubernetes for deployment.
Work with internal clients to provide a solution to process and store billions of events.
Work with other members to implement and integrate into our existing systems.
Work with our client's seriously large volumes of analytics data.
Work with our Data Science teams to implement predictive analytics pipelines.
Work with our vendor partners, such as Wipro, to ensure delivery against agreed-upon SLAs.
Work with product managers and data scientists to understand the objectives of the data platform.
Work with product managers to understand the business objectives.
Work with small, cross-functional teams to define vision, establish team culture and processes.
Work with the SMEs to implement and build data flows.
Write
Write code for data enrichment and normalization and deploy it to OpenShift.
Write programs, appropriate test artifacts, ad hoc queries, and reports.
Write technical specifications and documentation.
Write thorough documentation on work committed to projects.
Most In-demand Hard Skills
The following list describes the most required technical skills of a Data Engineer:
Scala
Hadoop
Kafka
Azure
Hive
Data Engineering
Redshift
ETL
Snowflake
Docker
Airflow
Tableau
GCP
Machine Learning
Statistics
Big Data
Kubernetes
Git
Cassandra
Data Warehousing
MySQL
NoSQL
Cloud
Data Modeling
EMR
Mathematics
NoSQL Databases
S3
R
Oracle
PostgreSQL
HBase
Big Data Technologies
Relational Databases
Data Science
Most In-demand Soft Skills
The following list describes the most required soft skills of a Data Engineer:
Written and oral communication skills
Analytical ability
Problem-solving attitude
Interpersonal skills
Organizational capacity
Attention to detail
Collaborative
Leadership
Team player
Self-starter
Creative
Self-motivated
Detail-oriented
Flexible
Curious
Work independently with little direction
Self-directed
Time-management
Critical thinker
Multi-task
Adaptable to changes
Presentation
Teamwork
Reliable
Identify opportunities for improvement
Bilingualism
Initiative
Proactive
Troubleshooting skills
Self-disciplined