Main Responsibilities and Required Skills for Big Data Engineer


A Big Data Engineer is responsible for designing and developing big data applications and data visualization tools, collecting and presenting data for reporting, and building data pipelines. In this blog post, we describe the primary responsibilities and the most in-demand hard and soft skills for Big Data Engineers.


Main Responsibilities of Big Data Engineer

The following list describes the typical responsibilities of a Big Data Engineer:

Access

Access and act on the right cross-channel KPIs, dashboards, reports and AI-powered insights.

Address

Address area-level risks, and provide and implement mitigation plans.

Analyze

  • Analyze and develop data set processes for data ingestion, modeling and mining.

  • Analyze and solve problems at their root, stepping back to understand the broader context.

  • Analyze, recommend and implement improvements to support Corporate initiatives for EDP.

Architect

Architect and build data pipelines for both real-time telemetry and data warehousing.
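
For the real-time telemetry side, here is a minimal sketch of a consumer feeding a warehouse load, assuming the kafka-python client and a hypothetical telemetry topic; names and batch sizes are placeholders, not a prescribed design:

```python
# Minimal sketch: consume real-time telemetry from Kafka and buffer it
# in warehouse-friendly batches. Assumes the kafka-python client and a
# hypothetical "telemetry" topic; adapt names to your environment.
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "telemetry",                           # hypothetical topic name
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 1000:                 # flush in batches, not per event
        # load_to_warehouse(batch)         # hypothetical loader, target-specific
        batch.clear()
```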

Assemble

Assemble large, complex data sets that meet functional / non-functional business requirements.

Assist in

  • Assist in building a sustainable big-data platform.

  • Assist with prototyping emerging technologies.

Author

Author clear technical documentation.

Automate

  • Automate CI and deployment processes and best practices for the production data pipelines.

  • Automate test coverage for data pipelines, as sketched below.
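
A minimal sketch of such a test, assuming a hypothetical cleanse_records transformation and the pytest runner; wiring it into the CI pipeline is what turns it into automated coverage:

```python
# Minimal sketch of automated test coverage for one pipeline step.
# cleanse_records is a hypothetical transformation; run with pytest.
def cleanse_records(records):
    """Drop rows with a missing id and normalize email casing."""
    return [
        {**r, "email": r["email"].lower()}
        for r in records
        if r.get("id") is not None
    ]

def test_cleanse_records_drops_missing_ids_and_normalizes_email():
    raw = [
        {"id": 1, "email": "A@Example.COM"},
        {"id": None, "email": "b@example.com"},
    ]
    assert cleanse_records(raw) == [{"id": 1, "email": "a@example.com"}]
```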

Benchmark

Benchmark the performance in line with the non-functional requirements.

Build

  • Build an AI / ML model-based alert mechanism and anomaly detection system for the product (a simple rule-based sketch follows this list).

  • Build and maintain a framework.

  • Build a product to process large amounts of data / events for AI / ML and data consumption.

  • Build complex Data Engineering workflows.

  • Build complex queries using MongoDB, Oracle, SQL Server, MariaDB, and MySQL.

  • Build data lake on Azure cloud.

  • Build data products that reduce friction to enable our marketing initiatives to pivot quickly.

  • Build high-performance algorithms, prototypes, and proof of concepts.

  • Build large-scale data processing systems using cloud computing technologies.

  • Build out strong development unit-test practices, with a goal of automated regression testing.

  • Build the infrastructure required for optimal extraction, transformation, and loading of data.
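
For the alerting responsibility above, here is a deliberately simple rule-based sketch; a production system would train an actual ML model, and the baseline window and threshold here are assumptions:

```python
# Rule-based anomaly check: flag a new reading whose z-score against a
# recent baseline exceeds the threshold. Values are illustrative only.
from statistics import mean, stdev

def is_anomaly(baseline, new_value, threshold=3.0):
    """Return True if new_value is more than `threshold` standard
    deviations away from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return new_value != mu
    return abs(new_value - mu) / sigma > threshold

print(is_anomaly([10, 11, 9, 10, 12, 10], 11))  # False: within normal range
print(is_anomaly([10, 11, 9, 10, 12, 10], 95))  # True: flag and alert
```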

Collaborate with

  • Collaborate in the design, development, test and maintenance of scalable data management solutions.

  • Collaborate with IT and business area partners on work groups and initiatives.

  • Collaborate with other teams.

Collect

Collect and present data for reporting and planning.

Communicate

  • Communicate systems issues at the appropriate technical level for each audience.

  • Communicate with internal teams and stakeholders to understand project requirements.

Conduct

Conduct root-cause analysis of data issues.

Configure

Configure and manage connection processes.


Continue

Continue to build knowledge of the company, processes and customers.

Contribute to

  • Contribute significantly to architectural decisions around our data.

  • Contribute towards shaping the architecture, design and scalability of our processes and pipelines.

Create

  • Create all necessary documents and communicate to the team in support of the project.

  • Create and develop data pipelines for new sources and uses of data across Mojio.

  • Create and maintain data warehouse schemas and ETL processes.

  • Create and maintain optimal data pipeline architecture.

  • Create and maintain optimal data pipeline architecture to meet business needs.

  • Create a startup mentality to accelerate the introduction of new capabilities and transform teams.

  • Create complex data solutions and build data pipelines.

  • Create data tools for analytics and data scientist team.

  • Create documentation to support knowledge sharing.

Define

  • Define data retention policies.

  • Define metrics for tracking how customers are interacting with products and service.

  • Define standards and best practices for the end-to-end development lifecycle.

Design

  • Design and build data processing solutions, and improve current ones.

  • Design and build the infrastructure for data.

  • Design and code (Java, Scala, Spark) solutions to support common and strategic data sourcing needs.

  • Design and develop big data applications and data visualization tools.

  • Design and develop highly scalable and extensible data pipelines from internal and external sources.

  • Design and implement components of our Next Generation Platform.

  • Design and scale databases and pipelines across multiple physical locations on cloud.

Determine

Determine best course of action for meeting business needs.

Develop

  • Develop a data model around stated use cases to capture client's KPIs and data.

  • Develop and automate data quality checks.

  • Develop and enhance platform best practices.

  • Develop and maintain ETL processes using SSIS, Scripting and data replication technologies.

  • Develop and operate our data pipeline & infrastructure.

  • Develop code using Python, Scala, and R.

  • Develop data models and mappings.

  • Develop data processing scripts using Spark.

  • Develop data profiling, deduping logic, matching logic for analysis.

  • Develop expertise in developing microservices and hosting them on our platform.

  • Develop expertise in Golang / microservices.

  • Develop HA strategies, including replica sets and sharding, for highly available clusters.

  • Develop highly scalable and extensible data pipelines from internal and external sources.

  • Develop innovative solutions to Big Data issues and challenges within the team.

  • Develop parallel algorithms and data processing using the Apache big-data stack (like Hadoop and Kafka).

  • Develop parallel data-intensive systems using Big Data technologies.

  • Develop Python, PySpark, and Spark scripts to filter / cleanse / map / aggregate data (see the sketch after this list).

  • Develop set processes for data mining, data modeling, and data production.

  • Develop solutions that put clients first.

  • Develop robust and monitorable data pipelines and related services.
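
A minimal PySpark sketch of such a filter / cleanse / map / aggregate script, assuming a hypothetical events.json source with user_id, country, and amount fields:

```python
# Minimal PySpark cleanse-and-aggregate sketch; source path, column
# names, and the aggregation are all assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleanse-aggregate").getOrCreate()

events = spark.read.json("events.json")          # hypothetical source

totals_by_country = (
    events
    .filter(F.col("user_id").isNotNull())        # cleanse: drop orphan rows
    .withColumn("country", F.upper("country"))   # map: normalize casing
    .groupBy("country")                          # aggregate
    .agg(F.sum("amount").alias("total_amount"))
)
totals_by_country.show()
```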

Document

Document and communicate product feedback in order to improve user experience.

Drive

Drive and support automation and integration of infrastructure and system processes.

Elevate

Elevate code into the development, test, and production environments on schedule.

Ensure

  • Ensure self and peers are actively seeking ways to objectively measure productivity.

  • Ensure systems meet business requirements and industry practices.

  • Ensure that objects are modeled appropriately.

  • Ensure the Hadoop platform can effectively meet performance & SLA requirements.

Estimate

Estimate engineering work effort and effectively identify and prioritize the high impact tasks.

Evaluate

  • Evaluate and provide feedback on future technologies and new releases / upgrades.

  • Evaluate the efficiency of software / product releases and conduct read outs on results.

Execute

Execute basic to moderately complex functional work tracks for the team.

Expand

Expand and grow data platform capabilities to solve new data problems and challenges.

Explain

Explain technical considerations at related meetings, including those with internal clients.

Explore

Explore and evaluate new ideas and technologies.

Follow

  • Follow architecture standards.

  • Follow build and automation practices to support continuous integration and improvement.

  • Follow industry-standard agile software design methodology for development and documentation.

  • Follow software development methodology.

Help

Help design and implement components of Next Generation Platform.

Identify

  • Identify and develop Big Data sources & techniques to solve business problems.

  • Identify and communicate technical problems, process and solutions.

  • Identify and resolve issues, bugs, and impediments.

  • Identify, design, and implement internal processes.

Implement

  • Implement and manage large scale ETL jobs on Hadoop / Spark clusters in Amazon AWS / Microsoft Azure.

  • Implement security measures by encrypting sensitive data, as sketched below.
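
A minimal sketch of field-level encryption using the cryptography package's Fernet recipe; the record layout is hypothetical, and a real deployment would load the key from a secrets manager rather than generate it inline:

```python
# Encrypt a sensitive field before it lands in storage.
from cryptography.fernet import Fernet

key = Fernet.generate_key()    # in practice, fetch from a KMS / vault
cipher = Fernet(key)

record = {"user_id": 42, "ssn": "123-45-6789"}   # hypothetical record
record["ssn"] = cipher.encrypt(record["ssn"].encode("utf-8"))

# Decrypt only where the plaintext is genuinely needed.
plaintext = cipher.decrypt(record["ssn"]).decode("utf-8")
```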

Improve

Improve database tables, views, processes and storage to be more efficient and save costs.

Influence

Influence the team on using Big Data systems effectively to solve business problems.

Initiate

Initiate and conduct code reviews, create code standards, conventions and guidelines.

Integrate

  • Integrate platform into the existing enterprise data warehouse and various operational systems.

  • Integrate third-party products.

  • Integrate these solutions with the architecture used across the company.

Interface with

Interface with customers, understanding their requirements and delivering complete data solutions.

Investigate

  • Investigate and integrate up-and-coming big data technologies into existing requirements.

  • Investigate issues reported by testing teams to determine impact, root cause, and solve them.

Lead

  • Lead functional and architectural design of assigned areas.

  • Lead in prototyping emerging technologies.

  • Lead others to solve complex problems.

  • Lead technical efforts, including design and code reviews, and mentor staff appropriately.

  • Lead work and deliver elegant and scalable solutions.

Learn

  • Learn from deep subject matter experts through mentoring and on the job coaching.

  • Learn how to use our application platform.

Maintain

Maintain and incrementally improve existing solutions.

Make

  • Make a significant contribution towards Infoblox's big data pipeline.

  • Make our data lake run like a core service.

  • Make significant contributions towards design and development.

  • Make sure design decisions on the project meet architectural and design requirements.

Manage

  • Manage and implement data processes (Data Quality reports).

  • Manage own learning and contribute to technical skill building of the team.

  • Manage system / application environment and ongoing operations.

Optimize

Optimize queries, data models, and storage formats to support common usage patterns.
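
One common example of such an optimization, sketched below with hypothetical paths and a hypothetical event_date column: writing columnar Parquet partitioned by date, so date-bounded queries scan only the partitions and columns they need:

```python
# Rewrite raw JSON as date-partitioned Parquet to match a common
# usage pattern (date-bounded queries). Paths/columns are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("optimize-storage").getOrCreate()

events = spark.read.json("raw/events.json")

(events
    .repartition("event_date")        # co-locate rows for each partition
    .write
    .partitionBy("event_date")        # one directory per date
    .parquet("curated/events", mode="overwrite"))
```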

Own

Own one or more key components of the infrastructure.

Participate in

  • Participate in an on-call support rotation.

  • Participate in development of data marts for reports and data visualization solutions.

  • Participate in infrastructure and system design of the NCR Data Lake.

  • Participate in periodic team on-call rotations supporting all our Big Data platforms.

  • Participate in strategic planning discussions with technical and non-technical partners.

Perform

  • Perform a range of assignments related to job discipline.

  • Perform code reviews and support SQL optimization and tuning.

  • Perform on-call activities as needed for the environment and technologies.

  • Perform optimization, debugging and capacity planning of a Big Data cluster.

  • Perform security remediation, automation, and self-healing as required.

  • Perform tasks such as writing scripts, writing SQL queries, etc.

Plan

Plan / schedule tasks, lead small development teams, and mentor junior colleagues.

Present

Present ideas and recommendations to management on the best use of Hadoop and other technologies.

Process

Process unstructured data into structured data, manage schema of new data.
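
A minimal sketch of that step, parsing a hypothetical log format into records with named fields; real inputs would need a richer pattern and error handling:

```python
# Turn unstructured log lines into structured records with named fields.
import re

LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)$"  # assumed format
)

def parse_line(line):
    """Return a structured dict, or None for lines that do not match."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None

print(parse_line("2024-05-01T12:00:00Z ERROR connection reset by peer"))
# {'timestamp': '2024-05-01T12:00:00Z', 'level': 'ERROR',
#  'message': 'connection reset by peer'}
```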

Provide

  • Provide follow-up production support.

  • Provide leadership by mentoring junior DBAs and by leading internal projects and initiatives.

  • Provide ongoing operations and support for production systems to meet defined SLAs.

  • Provide oversight and guidance to our Data Engineering development team.

  • Provide RDBMS support for MasterCard applications.

  • Provide support, on-going maintenance, and required modifications to multiple Hadoop environments.

  • Provide technical assistance to junior team members and to colleagues across the organization.

  • Consistently search for improved methods of providing customer service.

  • Provide verifiable technical solutions to support operations at scale and with high availability.

Recommend

  • Recommend and implement solutions to improve performance, resource consumption, and resiliency.

  • Recommend technological application programs to accomplish long-range objectives.

  • Recommend ways to improve data reliability, efficiency and quality.

Recruit

Recruit, mentor, build and motivate the IT teams that will positively impact our business.

Research

  • Research, design, implement and test technology solutions.

  • Research modern technologies to solve unique challenges.

  • Research new uses for existing data.

  • Research opportunities for data acquisition and new uses for existing data.

Resolve

Resolve alerts and perform remediation activities.

Review

  • Review and test code changes in lower environments.

  • Review code and provide feedback relative to best practices and improving performance.

Seek

Seek to understand the data being worked with, as it often consists of unstructured data sets.

Specialize

  • Specialize in data egestion (from the Enterprise Data Lake to analytical and operational systems).

  • Specialize in data governance and security of data assets.

  • Specialize in making trusted data available and accessible to the users.

Submit

Submit change control requests and documents.

Suggest

Suggest technical and functional improvements to add value to the product.

Support

  • Support Cloud Initiatives.

  • Support data pipelines with bug fixes and additional enhancements.

  • Support enterprise Big Data platforms in AWS including EMR, Presto, Spark, and Ranger.

  • Support IaaS and DevOps initiatives for infrastructure delivery transformation.

  • Support MercuryPlus Data delivery effort.

  • Support storage retention and disposition of data.

  • Support TMX internal / external users for application related inquiries.

Take

Take ownership of design and implementation of scalable and fault tolerant projects.

Test

Test deliverables against a user story's acceptance tests.

Train

Train and mentor staff with less experience.

Transform

Transform the data to create a consumable data layer for various application uses.

Understand

  • Understand deeply how to build data warehouses and data marts.

  • Understand merging medically coded data across coding types such as SNOMED CT, ICD-10, CPT, CCS, etc.

Use

Use Spark to implement truly scalable ETL processes.
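
A minimal sketch of such a job, with hypothetical paths and table names: extract from CSV, transform, and load into a warehouse table. Spark parallelizes each stage across the cluster, which is where the scalability comes from:

```python
# Minimal Spark ETL sketch: extract, transform, load. All names are
# illustrative; the target assumes an existing "warehouse" database.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-etl").getOrCreate()

orders = spark.read.csv("raw/orders.csv", header=True, inferSchema=True)

cleaned = (
    orders
    .dropDuplicates(["order_id"])
    .withColumn("order_total", F.col("quantity") * F.col("unit_price"))
)

cleaned.write.mode("overwrite").saveAsTable("warehouse.orders")
```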

Work with

  • Work closely with other teams to ensure that features meet business needs.

  • Work closely with team members from across Mastercard to identify functional and system requirements.

  • Work closely with the engineering team.

  • Work closely with various cross-functional product teams.

  • Work in a small agile team to deliver highly optimized batch and real-time data processing.

  • Work on Data and Analytics Tools in the Cloud.

  • Work on deployment and making sure products are production-ready and function smoothly.

  • Work on a geographically dispersed team embracing Agile and DevOps principles.

  • Work on performance tuning and increasing operational efficiency on a continuous basis.

  • Work to establish Hadoop efficiencies on our Cloudera stack.

  • Work to identify gaps and improve the platform's quality, robustness, maintainability, and speed.

  • Work with infrastructure, security, and other partners.

  • Work with senior stakeholders to develop a clear understanding of requirement drivers.

Write

  • Write programs, develop code, test artifacts, and produce reports.

  • Write the system / technical portion of assigned deliverables.

Most In-demand Hard Skills

The following list describes the most in-demand technical skills of a Big Data Engineer:

  1. Python

  2. Spark

  3. Java

  4. Hive

  5. Scala

  6. AWS

  7. Kafka

  8. Hadoop

  9. SQL

  10. Azure

  11. HBase

  12. HDFS

  13. Cassandra

  14. Big Data Technologies

  15. Sqoop

  16. Pig

  17. CS

  18. Git

  19. NoSQL databases

  20. CE

  21. EE

  22. GCP

  23. Design

  24. Designing

  25. ETL

  26. Oozie

  27. Docker

  28. Jenkins

  29. Big Data

  30. MongoDB

  31. Storm

  32. Cloud

  33. Hadoop Ecosystem

  34. Kubernetes

  35. Microservices Architecture

  36. Batch

  37. Data Warehousing

  38. DynamoDB

  39. EMR

Most In-demand Soft Skills

The following list describes the most in-demand soft skills of a Big Data Engineer:

  1. Written and oral communication skills

  2. Problem-solving attitude

  3. Analytical ability

  4. Organizational capacity

  5. Interpersonal skills

  6. Collaborative

  7. Curious

  8. Leadership

  9. Innovation

  10. Attention to detail

  11. Creative

  12. Passion for deep technical excellence

  13. Personal qualities

  14. Tenacity

  15. Multi-task

  16. Passion for learning

  17. Adaptable to changes

  18. Time-management

  19. Flexible

  20. Presentation

  21. Team player

  22. Teamwork

  23. Troubleshooting skills
