Main Responsibilities and Required Skills for Big Data Engineer

A Big Data Engineer is responsible for designing and developing big data applications and data visualization tools. They collect and present data for reporting and build data pipelines. In this blog post, we describe the primary responsibilities and the most in-demand hard and soft skills for Big Data Engineers.
Main Responsibilities of a Big Data Engineer
The following list describes the typical responsibilities of a Big Data Engineer:
Access
Access and act on the right cross-channel KPIs, dashboards, reports and AI-powered insights.
Address
Address area-level risks, and provide and implement mitigation plans.
Analyze
Analyze and develop data set processes for data ingestion, modeling and mining.
Analyze and solve problems at their root, stepping back to understand the broader context.
Analyze, recommend and implement improvements to support Corporate initiatives for EDP.
Architect
Architect and build data pipelines for both real-time telemetry and data warehousing.
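To make the telemetry side concrete, here is a minimal PySpark Structured Streaming sketch that reads events from Kafka and lands them in a Parquet zone for downstream warehousing. The broker address, topic name, schema, and paths are illustrative assumptions, and the Spark Kafka connector package is assumed to be available.

```python
# Minimal sketch: Kafka telemetry -> Parquet landing zone (all names hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("telemetry-pipeline").getOrCreate()

# Assumed telemetry event shape.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("metric", StringType()),
    StructField("value", DoubleType()),
    StructField("ts", TimestampType()),
])

telemetry = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
    .option("subscribe", "telemetry")                   # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("event"))
    .select("event.*")
)

# Append each micro-batch to a Parquet landing zone for the warehouse to pick up.
query = (
    telemetry.writeStream.format("parquet")
    .option("path", "/data/lake/telemetry")             # hypothetical path
    .option("checkpointLocation", "/data/checkpoints/telemetry")
    .start()
)
query.awaitTermination()
```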
Assemble
Assemble large, complex data sets that meet functional / non-functional business requirements.
Assist in
Assist in building a sustainable big-data platform.
Assist with prototyping emerging technologies.
Author
Author clear technical documentation.
Automate
Automate CI and deployment processes and best practices for the production data pipelines.
Automate test coverage for data pipelines.
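As a concrete example of automated test coverage, a pipeline transformation can be unit-tested with pytest. The clean_records function below is a hypothetical stand-in for a real pipeline step.

```python
# Sketch: pytest coverage for a pipeline transformation (clean_records is hypothetical).

def clean_records(records):
    """Drop rows missing an id and normalize the name field."""
    return [
        {**r, "name": r["name"].strip().lower()}
        for r in records
        if r.get("id") is not None
    ]

def test_drops_rows_missing_ids():
    assert clean_records([{"id": None, "name": "A"}]) == []

def test_normalizes_names():
    out = clean_records([{"id": 1, "name": "  Ada "}])
    assert out[0]["name"] == "ada"
```

Wiring tests like these into CI (see the automation items above) turns them into regression coverage for every pipeline change.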
Benchmark
Benchmark the performance in line with the non-functional requirements.
Build
Build an AI / ML model-based alerting mechanism and anomaly detection system for the product.
Build and maintain a framework.
Build a product to process large amounts of data / events for AI / ML and data consumption.
Build complex Data Engineering workflows.
Build complex SQL and NoSQL queries using MongoDB, Oracle, SQL Server, MariaDB, and MySQL.
Build data lake on Azure cloud.
Build data products that reduce friction to enable our marketing initiatives to pivot quickly.
Build high-performance algorithms, prototypes, and proof of concepts.
Build large-scale data processing systems using cloud computing technologies.
Build out strong development unit-test practices, with a goal of automated regression testing.
Build the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of sources.
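On the infrastructure side, ETL steps are commonly wired together with an orchestrator. Here is a minimal sketch assuming Apache Airflow 2.x; the DAG id and the three task bodies are placeholders.

```python
# Sketch of a daily ETL DAG, assuming Apache Airflow 2.x (task bodies are placeholders).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    pass  # pull raw records from a source system

def transform():
    pass  # cleanse and conform the extracted records

def load():
    pass  # write the conformed records to the warehouse

with DAG(
    dag_id="daily_etl",                 # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> transform_task >> load_task  # run the steps in order
```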
Collaborate with
Collaborate in the design, development, test and maintenance of scalable data management solutions.
Collaborate with IT and business area partners on work groups and initiatives.
Collaborate with other teams.
Collect
Collect and present data for reporting and planning.
Communicate
Communicate systems issues at the appropriate technical level for each audience.
Communicate with internal teams and stakeholders to understand project requirements.
Conduct
Conduct Root-cause analysis of data issues.
Configure
Configure and manage connection process.
Continue
Continue to build knowledge of the company, processes and customers.
Contribute to
Contribute significantly to architectural decisions around our data.
Contribute towards shaping the architecture, design and scalability of our processes and pipelines.
Create
Create all necessary documents and communicate to the team in support of the project.
Create and develop data pipelines for new sources and uses of data across Mojio.
Create and maintain data warehouse schemas and ETL processes (see the sketch after this list).
Create and maintain optimal data pipeline architecture to meet business needs.
Create a startup mentality to accelerate the introduction of new capabilities and transform teams.
Create complex data solutions and build data pipelines.
Create data tools for analytics and data scientist team.
Create documentation to support knowledge sharing.
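For the warehouse-schema item above, a minimal PySpark sketch of a star-schema load might look like the following; the staging paths, table names, and columns are illustrative.

```python
# Sketch: conform a dimension and build a fact table (names and paths hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("warehouse-etl").getOrCreate()

orders = spark.read.parquet("/data/staging/orders")
customers = spark.read.parquet("/data/staging/customers")

# Dimension: one conformed row per customer.
dim_customer = (
    customers.select("customer_id", "name", "region")
    .dropDuplicates(["customer_id"])
)

# Fact: measures keyed by the dimension's business key.
fact_orders = (
    orders.join(dim_customer, "customer_id")
    .select("order_id", "customer_id", "order_ts", col("amount").alias("order_amount"))
)

dim_customer.write.mode("overwrite").parquet("/data/warehouse/dim_customer")
fact_orders.write.mode("overwrite").parquet("/data/warehouse/fact_orders")
```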
Define
Define data retention policies.
Define metrics for tracking how customers are interacting with products and service.
Define standards and best practices for the end-to-end development lifecycle.
Design
Design and build data processing solutions, and improve current ones.
Design and build the infrastructure for data.
Design and code (Java, Scala, Spark) solutions to support common and strategic data sourcing needs.
Design and develop big data applications and data visualization tools.
Design and develop highly scalable and extensible data pipelines from internal and external sources.
Design and implement components of our Next Generation Platform.
Design and scale databases and pipelines across multiple physical locations on cloud.
Determine
Determine the best course of action for meeting business needs.
Develop
Develop a data model around stated use cases to capture client's KPIs and data.
Develop and automate data quality checks.
Develop and enhance platform best practices.
Develop and maintain ETL processes using SSIS, Scripting and data replication technologies.
Develop and operate our data pipeline and infrastructure.
Develop code using Python, Scala, and R.
Develop data models and mappings.
Develop data processing scripts using Spark.
Develop data profiling, deduplication, and matching logic for analysis.
Develop expertise in developing microservices and hosting them on our platform.
Develop expertise in Golang / microservices.
Develop HA strategies, including replica sets and sharding, for highly available clusters.
Develop highly scalable and extensible data pipelines from internal and external sources.
Develop innovative solutions to Big Data issues and challenges within the team.
Develop parallel algorithms and data processing using the Apache big-data stack (e.g., Hadoop, Kafka).
Develop parallel data-intensive systems using Big Data technologies.
Develop Python, PySpark, and Spark scripts to filter, cleanse, map, and aggregate data (see the sketch after this list).
Develop set processes for data mining, data modeling, and data production.
Develop solutions that put clients first.
Develop robust and monitorable data pipelines and related services.
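The filter / cleanse / aggregate item above might look like this minimal PySpark sketch; the input path, columns, and business key are assumptions.

```python
# Sketch: cleanse raw events and aggregate per country (all names hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim, lower, count, avg

spark = SparkSession.builder.appName("cleanse-aggregate").getOrCreate()

raw = spark.read.parquet("/data/raw/events")

cleaned = (
    raw.filter(col("user_id").isNotNull())                # filter incomplete rows
    .withColumn("country", lower(trim(col("country"))))   # cleanse a messy field
    .dropDuplicates(["event_id"])                         # dedupe on the business key
)

# Aggregate for downstream consumers.
summary = cleaned.groupBy("country").agg(
    count("*").alias("events"),
    avg("duration_ms").alias("avg_duration_ms"),
)
summary.write.mode("overwrite").parquet("/data/curated/event_summary")
```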
Document
Document and communicate product feedback in order to improve user experience.
Drive
Drive and support automation and integration of infrastructure and system processes.
Elevate
Elevate code into the development, test, and production environments on schedule.
Ensure
Ensure self and peers are actively seeking ways to objectively measure productivity.
Ensure systems meet business requirements and industry practices.
Ensure that objects are modeled appropriately.
Ensure the Hadoop platform can effectively meet performance & SLA requirements.
Estimate
Estimate engineering work effort and effectively identify and prioritize the high impact tasks.
Evaluate
Evaluate and provide feedback on future technologies and new releases / upgrades.
Evaluate the efficiency of software / product releases and conduct readouts on results.
Execute
Execute basic to moderately complex functional work tracks for the team.
Expand
Expand and grow data platform capabilities to solve new data problems and challenges.
Explain
Explain technical considerations at related meetings, including those with internal clients.
Explore
Explore and evaluate new ideas and technologies.
Follow
Follow architecture standards.
Follow build and automation practices to support continuous integration and improvement.
Follow industry-standard agile software design methodology for development and documentation.
Follow software development methodology.
Help
Help design and implement components of the Next Generation Platform.
Identify
Identify and develop Big Data sources & techniques to solve business problems.
Identify and communicate technical problems, process and solutions.
Identify and resolve issues, bugs, and impediments.
Identify, design, and implement internal process improvements.
Implement
Implement and manage large-scale ETL jobs on Hadoop / Spark clusters in Amazon AWS / Microsoft Azure.
Implement security measures by encrypting sensitive data.
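As an illustration of field-level encryption, the sketch below uses the cryptography package's Fernet recipe. Key handling is deliberately simplified: in practice the key would come from a secrets manager rather than being generated inline.

```python
# Sketch: encrypt a sensitive field before storage (key handling simplified).
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # stand-in for a key fetched from a secrets manager
fernet = Fernet(key)

def encrypt_field(value: str) -> bytes:
    """Encrypt a single sensitive value, e.g. an email address."""
    return fernet.encrypt(value.encode("utf-8"))

def decrypt_field(token: bytes) -> str:
    return fernet.decrypt(token).decode("utf-8")

token = encrypt_field("jane.doe@example.com")
assert decrypt_field(token) == "jane.doe@example.com"
```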
Improve
Improve database tables, views, processes and storage to be more efficient and save costs.
Influence
Influence the team on the effective use of Big Data systems to solve business problems.
Initiate
Initiate and conduct code reviews, create code standards, conventions and guidelines.
Integrate
Integrate platform into the existing enterprise data warehouse and various operational systems.
Integrate third party products.
Integrate these solutions with the architecture used across the company.
Interface with
Interface with customers, understanding their requirements and delivering complete data solutions.
Investigate
Investigate and integrate up-and-coming big data technologies into existing requirements.
Investigate issues reported by testing teams to determine impact, root cause, and solve them.
Lead
Lead functional and architectural design of assigned areas.
Lead the prototyping of emerging technologies.
Lead others to solve complex problems.
Lead technical efforts, including design and code reviews, and mentor staff appropriately.
Lead work and deliver elegant and scalable solutions.
Learn
Learn from deep subject matter experts through mentoring and on the job coaching.
Learn how to use our application platform.
Maintain
Maintain and incrementally improve existing solutions.
Make
Make a significant contribution towards Infoblox's big data pipeline.
Make our data lake run like a core service.
Make significant contributions towards design and development.
Make sure design decisions on the project meet architectural and design requirements.
Manage
Manage and implement data processes (Data Quality reports).
Manage own learning and contribute to technical skill building of the team.
Manage system / application environment and ongoing operations.
Optimize
Optimize queries, data models, and storage formats to support common usage patterns.
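Two common examples of this kind of optimization are broadcasting a small lookup table to avoid a shuffle join, and partitioning columnar output by the column most queries filter on. A minimal PySpark sketch, with hypothetical paths and columns:

```python
# Sketch: broadcast join plus partitioned Parquet output (names hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("optimize").getOrCreate()

events = spark.read.parquet("/data/curated/events")     # large fact table
countries = spark.read.parquet("/data/ref/countries")   # small lookup table

# Broadcasting the small side keeps the large table from being shuffled.
enriched = events.join(broadcast(countries), "country_code")

# Partitioning by the usual filter column enables partition pruning on reads.
(enriched.write.mode("overwrite")
    .partitionBy("event_date")
    .parquet("/data/curated/events_by_date"))

# Queries filtering on event_date now touch only the matching partitions.
recent = (spark.read.parquet("/data/curated/events_by_date")
          .where(col("event_date") == "2024-01-01"))
```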
Own
Own one or more key components of the infrastructure.
Participate in
Participate in an on-call support rotation.
Participate in development of datamarts for reports and data visualization solutions.
Participate in infrastructure and system design of the NCR Data Lake.
Participate in periodic team on call rotations supporting all our Big Data platforms.
Participate in strategic planning discussions with technical and non-technical partners.
Perform
Perform a range of assignments related to job discipline.
Perform code reviews and support SQL optimization and tuning.
Perform on-call activities as needed for the environment and technologies.
Perform optimization, debugging and capacity planning of a Big Data cluster.
Perform security remediation, automation, and self-healing as required.
Perform tasks such as writing scripts, writing SQL queries, etc.
Plan
Plan / schedule tasks, lead small development teams, and mentor junior colleagues.
Present
Present ideas and recommendations to management on the best use of Hadoop and other technologies.
Process
Process unstructured data into structured data, and manage the schema of new data.
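A typical instance is parsing raw log lines into typed columns. The log format and regular expression below are illustrative assumptions; writing the result as Parquet records the schema alongside the data.

```python
# Sketch: unstructured log lines -> structured table (log format hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.appName("structure-logs").getOrCreate()

lines = spark.read.text("/data/raw/app.log")  # single string column named "value"

pattern = r'^(\S+) \[(.+?)\] "(\w+) (\S+)'    # host [timestamp] "METHOD path
structured = lines.select(
    regexp_extract("value", pattern, 1).alias("host"),
    regexp_extract("value", pattern, 2).alias("timestamp"),
    regexp_extract("value", pattern, 3).alias("method"),
    regexp_extract("value", pattern, 4).alias("path"),
)

structured.write.mode("overwrite").parquet("/data/structured/app_logs")
```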
Provide
Provide follow up Production support.
Provide leadership by mentoring junior DBAs and by leading internal projects and initiatives.
Provide ongoing operations and support for production systems to meet defined SLAs.
Provide oversight and guidance to our Data Engineering development team.
Provide RDBMS support for MasterCard applications.
Provide support, on-going maintenance, and required modifications to multiple Hadoop environments.
Provide technical assistance to junior team members and to colleagues across the organization.
Provide customer service, consistently searching for improved methods.
Provide verifiable technical solutions to support operations at scale and with high availability.
Recommend
Recommend and implement solutions to improve performance, resource consumption, and resiliency.
Recommend technological application programs to accomplish long-range objectives.
Recommend ways to improve data reliability, efficiency and quality.
Recruit
Recruit, mentor, build and motivate the IT teams that will positively impact our business.
Research
Research, design, implement and test technology solutions.
Research modern technologies to solve unique challenges.
Research new uses for existing data.
Research opportunities for data acquisition and new uses for existing data.
Resolve
Resolve alerts and perform remediation activities.
Review
Review and test code changes in lower environments.
Review code and provide feedback relative to best practices and improving performance.
Seek
Seek to understand the data being worked with, as it often consists of unstructured data sets.
Specialize
Specialize in data egress (from the Enterprise Data Lake to analytical and operational systems).
Specialize in data governance and security of data assets.
Specialize in making trusted data available and accessible to the users.
Submit
Submit change control requests and documents.
Suggest
Suggest technical and functional improvements to add value to the product.
Support
Support cloud initiatives.
Support data pipelines with bug fixes and additional enhancements.
Support enterprise Big Data platforms in AWS, including EMR, Presto, Spark, and Ranger.
Support IaaS and DevOps initiatives for infrastructure delivery transformation.
Support the MercuryPlus data delivery effort.
Support storage retention and disposition of data.
Support TMX internal / external users with application-related inquiries.
Take
Take ownership of design and implementation of scalable and fault tolerant projects.
Test
Test deliverables against a user story's acceptance tests.
Train
Train and mentor staff with less experience.
Transform
Transform the data to create a consumable data layer for various application uses.
Understand
Understand deeply how to build data warehouses and data marts.
Understand merging medically coded data across coding types such as SNOMED CT, ICD10, CPT, CCS, etc.
Use
Use Spark to implement truly scalable ETL processes.
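One Spark technique that keeps large ETL jobs both scalable and safely re-runnable is dynamic partition overwrite: re-processing a day replaces only that day's partition rather than the whole table. A minimal sketch with hypothetical paths:

```python
# Sketch: idempotent, partition-scoped ETL write in Spark (paths hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scalable-etl").getOrCreate()

# Only the partitions present in the incoming DataFrame are overwritten.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

daily = (
    spark.read.parquet("/data/raw/transactions")
    .where("txn_date = '2024-01-01'")   # process one day at a time
)

(daily.write.mode("overwrite")
    .partitionBy("txn_date")
    .parquet("/data/warehouse/transactions"))
```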
Work with
Work closely with other teams to ensure that features meet business needs.
Work closely with team members from across Mastercard to identify functional and system requirements.
Work closely with the engineering team.
Work closely with various cross-functional product teams.
Work in a small agile team to deliver highly optimized batch and real-time data processing.
Work on Data and Analytics Tools in the Cloud.
Work on deployment and making sure products are production-ready and function smoothly.
Work on a geographically dispersed team embracing Agile and DevOps principles.
Work on performance tuning and increase operational efficiency on a continuous basis.
Work to establish Hadoop efficiencies on our Cloudera stack.
Work to identify gaps and improve the platform's quality, robustness, maintainability, and speed.
Work with infrastructure, security, and other partners.
Work with senior stakeholders to develop a clear understanding of requirement drivers.
Write
Write programs, develop code, test artifacts, and produce reports.
Write the system / technical portion of assigned deliverables.
Most In-demand Hard Skills
The following list describes the most required technical skills of a Big Data Engineer:
Azure
HBase
HDFS
Cassandra
Big Data Technologies
Sqoop
Pig
Git
NoSQL databases
GCP
Design
ETL
Oozie
Docker
Jenkins
Big Data
MongoDB
Storm
Cloud
Hadoop Ecosystem
Kubernetes
Microservices Architecture
Batch
Data Warehousing
DynamoDB
EMR
Most In-demand Soft Skills
The following list describes the most required soft skills of a Big Data Engineer:
Written and oral communication skills
Problem-solving attitude
Analytical ability
Organizational capacity
Interpersonal skills
Collaborative
Curious
Leadership
Innovation
Attention to detail
Creative
Passion for deep technical excellence
Personal qualities
Tenacity
Multitasking
Passion for learning
Adaptable to changes
Time-management
Flexible
Presentation
Team player
Teamwork
Troubleshooting skills