The core advanced data engineering skillset combines deep technical expertise, platform knowledge, and problem-solving ability to build, maintain, and optimize robust, scalable, and efficient data systems.
Data Architecture and Design
– Data Modeling:
o Create normalized and denormalized schemas (3NF, star, snowflake).
o Design data lakes, warehouses, and marts optimized for analytical or transactional workloads.
o Incorporate modern paradigms like data mesh, lakehouse, and Delta architecture.
– ETL/ELT Pipelines:
o Develop end-to-end pipelines for extracting, transforming, and loading data (a minimal sketch follows this list).
o Optimize pipelines for real-time and batch processing.
– Metadata Management:
o Implement data lineage, cataloging, and tagging for better discoverability and governance.
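
As a rough illustration of the pipeline-development point above, a minimal PySpark batch ETL might look like the sketch below; the bucket paths, column names, and the orders dataset are hypothetical.

```python
# Minimal batch ETL sketch with PySpark. Paths, columns, and the "orders"
# dataset are illustrative placeholders, not a prescribed implementation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_etl").getOrCreate()

# Extract: read raw CSV landed by an upstream system (path is an assumption).
raw = spark.read.option("header", True).csv("s3://raw-bucket/orders/")

# Transform: type casting, de-duplication, and a derived partition column.
orders = (
    raw.withColumn("order_ts", F.to_timestamp("order_ts"))
       .withColumn("amount", F.col("amount").cast("double"))
       .dropDuplicates(["order_id"])
       .withColumn("order_date", F.to_date("order_ts"))
)

# Load: write a partitioned, columnar table to the curated zone.
(orders.write
       .mode("overwrite")
       .partitionBy("order_date")
       .parquet("s3://curated-bucket/orders/"))
```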

Distributed Computing and Big Data Technologies
– Proficiency with big data platforms:
o Apache Spark (PySpark, sparklyr).
o Hadoop ecosystem (HDFS, Hive, MapReduce).
o Apache Iceberg or Delta Lake for versioned data lake storage.
– Manage large-scale, distributed datasets efficiently (see the aggregation sketch after this list).
– Utilize query engines like Presto, Trino, or Dremio for federated data access.
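
A sketch of a distributed aggregation in PySpark, showing partition-aware filtering and explicit shuffle tuning; the dataset path, column names, and partition count are assumptions.

```python
# Distributed aggregation sketch over a large Parquet dataset with PySpark.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("events_rollup")
         .config("spark.sql.shuffle.partitions", "400")  # tune for data volume
         .getOrCreate())

events = spark.read.parquet("s3://lake/events/")

daily = (events
         .filter(F.col("event_date") >= "2024-01-01")    # prune partitions early
         .groupBy("event_date", "country")
         .agg(F.countDistinct("user_id").alias("active_users"),
              F.sum("revenue").alias("revenue")))

daily.write.mode("overwrite").parquet("s3://lake/rollups/daily_activity/")
```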

Data Storage Systems
– Expertise in working with different types of storage systems:
o Relational Databases (RDBMS): SQL Server, PostgreSQL, MySQL, etc.
o NoSQL Databases: MongoDB, Cassandra, DynamoDB.
o Cloud Data Warehouses: Snowflake, Google BigQuery, Azure Synapse, AWS Redshift.
o Object Storage: Amazon S3, Azure Blob Storage, Google Cloud Storage.
– Optimize storage strategies for cost and performance:
o Partitioning, bucketing, indexing, and compaction.
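
One way to apply partitioning and bucketing when writing a table in PySpark; the table name, column names, and bucket count below are illustrative assumptions.

```python
# Sketch of laying out a table with partitioning and bucketing to cut scan
# and shuffle costs. Bucketed writes require saveAsTable (a metastore table).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage_layout").enableHiveSupport().getOrCreate()

clicks = spark.read.parquet("s3://lake/raw/clicks/")

(clicks.write
       .mode("overwrite")
       .partitionBy("event_date")   # filters on event_date prune partitions
       .bucketBy(32, "user_id")     # co-locates each user's rows, helping joins
       .sortBy("user_id")
       .saveAsTable("analytics.clicks_bucketed"))
```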

Programming and Scripting
– Advanced knowledge of programming languages:
o Python (pandas, PySpark, SQLAlchemy).
o SQL (window functions, CTEs, query optimization; a sketch follows this list).
o R (data wrangling, sparklyr for data processing).
o Java or Scala (for Spark and Hadoop customizations).
– Proficiency in scripting for automation (e.g., Bash, PowerShell).
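
A sketch of a CTE combined with a window function, run through Spark SQL from Python; the "orders" view and its columns are assumptions.

```python
# CTE + window function example executed via Spark SQL from Python.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql_examples").getOrCreate()
spark.read.parquet("s3://lake/curated/orders/").createOrReplaceTempView("orders")

latest_orders = spark.sql("""
    WITH ranked AS (
        SELECT customer_id,
               order_id,
               amount,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY order_ts DESC) AS rn
        FROM orders
    )
    SELECT customer_id, order_id, amount
    FROM ranked
    WHERE rn = 1          -- most recent order per customer
""")
latest_orders.show(5)
```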

Real-Time and Streaming Data
– Expertise in real-time data processing:
o Apache Kafka, Amazon Kinesis, or Azure Event Hubs for event streaming.
o Apache Flink or Spark Streaming for real-time ETL (see the streaming sketch after this list).
o Implement event-driven architectures using message queues.
– Handle time-series data and process live feeds for real-time analytics.
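
A sketch of real-time ingestion with Spark Structured Streaming reading from Kafka; the broker address, topic, schema, and paths are assumptions, and the spark-sql-kafka connector package must be available on the cluster.

```python
# Streaming ETL sketch: Kafka -> parse JSON -> parquet files in the lake.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("payments_stream").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", TimestampType()),
])

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
          .option("subscribe", "payments")                     # assumed topic
          .load())

parsed = (stream.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("e"))
          .select("e.*"))

# Checkpointing makes the file sink fault-tolerant across restarts.
query = (parsed.writeStream
         .format("parquet")
         .option("path", "s3://lake/streams/payments/")
         .option("checkpointLocation", "s3://lake/checkpoints/payments/")
         .trigger(processingTime="1 minute")
         .start())
```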

Cloud Platforms and Services
– Experience with cloud environments:
o AWS: Lambda, Glue, EMR, Redshift, S3, Athena.
o Azure: Data Factory, Synapse, Data Lake, Databricks.
o GCP: BigQuery, Dataflow, Dataproc.
– Manage infrastructure-as-code (IaC) using tools like Terraform or CloudFormation.
– Leverage cloud-native features like auto-scaling, serverless compute, and managed services.
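
A sketch of serverless querying on AWS using boto3 and Athena, illustrating the managed-services point above; the database, table, region, and S3 output location are placeholders.

```python
# Run an Athena query from Python and poll for completion.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")   # region is an assumption

qid = athena.start_query_execution(
    QueryString="SELECT order_date, SUM(amount) AS revenue "
                "FROM analytics.orders GROUP BY order_date",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://query-results-bucket/athena/"},
)["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
```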

DevOps and Automation
– Implement CI/CD pipelines for data workflows:
o Tools: Jenkins, GitHub Actions, GitLab CI, Azure DevOps.
– Monitor and automate tasks using orchestration tools:
o Apache Airflow, Prefect, Dagster (a minimal Airflow DAG is sketched after this list).
o Managed services like AWS Step Functions or Azure Data Factory.
– Containerize workloads with Docker and automate their deployment and scaling with Kubernetes.
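
A minimal Apache Airflow DAG (2.x style) sketching how two tasks can be ordered and scheduled daily; the task bodies and DAG id are placeholders, and older Airflow releases use schedule_interval instead of schedule.

```python
# Orchestration sketch: a daily two-step pipeline in Airflow.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull data from the source system (placeholder)

def transform_and_load():
    ...  # clean, model, and write to the warehouse (placeholder)

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="transform_and_load",
                               python_callable=transform_and_load)

    extract_task >> load_task   # run extract before transform/load
```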

Data Governance, Security, and Compliance
– Data Governance:
o Implement role-based access control (RBAC) and attribute-based access control (ABAC).
o Maintain master data and metadata consistency.
– Security:
o Apply encryption at rest and in transit.
o Secure data pipelines with IAM roles, OAuth, or API keys.
o Implement network security (e.g., firewalls, VPCs).
– Compliance:
o Ensure adherence to regulations like GDPR, CCPA, HIPAA, or SOC 2 (a PII-masking sketch follows this list).
o Track and document audit trails for data usage.
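
One common control that supports both governance and compliance is pseudonymizing PII before data leaves a restricted zone; the sketch below hashes and drops sensitive columns with PySpark, with column names and paths as assumptions.

```python
# Pseudonymize PII columns before publishing a dataset to a broader audience.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pii_masking").getOrCreate()
customers = spark.read.parquet("s3://restricted/customers/")

masked = (customers
          .withColumn("email_hash", F.sha2(F.col("email"), 256))  # one-way hash
          .drop("email", "phone_number"))                          # drop raw PII

masked.write.mode("overwrite").parquet("s3://analytics/customers_masked/")
```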

Performance Optimization
– Optimize query and pipeline performance:
o Query tuning (partition pruning, caching, broadcast joins).
o Reduce I/O costs and bottlenecks with columnar formats like Parquet or ORC.
o Use distributed computing patterns to parallelize workloads.
– Implement incremental data processing to avoid full dataset reprocessing.
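
A sketch of two of the optimizations above, a broadcast join and watermark-based incremental processing, in PySpark; the paths, columns, and watermark value are assumptions.

```python
# Performance patterns: broadcast a small dimension table to avoid a shuffle,
# and process only rows newer than the last successful run.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("perf_patterns").getOrCreate()

facts = spark.read.parquet("s3://lake/facts/sales/")
products = spark.read.parquet("s3://lake/dims/products/")   # small lookup table

# Broadcast join: ship the small table to every executor instead of shuffling facts.
enriched = facts.join(F.broadcast(products), "product_id")

# Incremental processing: only rows past the stored watermark are reprocessed.
last_watermark = "2024-06-01 00:00:00"   # normally read from pipeline state
delta_rows = enriched.filter(F.col("updated_at") > F.lit(last_watermark))

delta_rows.write.mode("append").parquet("s3://lake/marts/sales_enriched/")
```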

Advanced Data Integration
– Work with API-driven data integration:
o Consume and build REST/GraphQL APIs (a paginated client sketch follows this list).
o Implement integrations with SaaS platforms (e.g., Salesforce, Twilio, Google Ads).
– Integrate disparate systems using ETL/ELT tools like:
o Informatica, Talend, dbt (data build tool), or Azure Data Factory.
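
A sketch of consuming a paginated REST API with the requests library; the endpoint URL, auth header, and paging parameters are hypothetical.

```python
# Pull all records from a paginated REST endpoint.
import requests

BASE_URL = "https://api.example.com/v1/invoices"      # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>"}          # placeholder credential

def fetch_all_invoices(page_size=100):
    """Iterate through pages until the API returns an empty batch."""
    page = 1
    while True:
        resp = requests.get(
            BASE_URL,
            headers=HEADERS,
            params={"page": page, "per_page": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        yield from batch
        page += 1

records = list(fetch_all_invoices())
```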

Data Analytics and Machine Learning Integration
– Enable data science workflows by preparing data for ML:
o Feature engineering, data cleaning, and transformations.
– Integrate machine learning pipelines:
o Use Spark MLlib, TensorFlow, or scikit-learn in ETL pipelines (a scikit-learn sketch follows this list).
– Automate scoring and prediction serving using ML models.
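
A sketch of feature preparation and batch scoring with scikit-learn, the kind of step an ETL job might call; the churn dataset, column names, and model choice are assumptions.

```python
# Feature engineering + batch scoring with a scikit-learn pipeline.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.read_parquet("churn_training.parquet")       # hypothetical dataset
X, y = train.drop(columns=["churned"]), train["churned"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["tenure_months", "monthly_spend"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan", "region"]),
])

model = Pipeline([("prep", preprocess),
                  ("clf", RandomForestClassifier(n_estimators=200, random_state=42))])
model.fit(X, y)

# Batch scoring step that a downstream pipeline could run on fresh data.
new_customers = pd.read_parquet("customers_today.parquet")
new_customers["churn_probability"] = model.predict_proba(new_customers)[:, 1]
```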

Monitoring and Observability
– Set up monitoring for data pipelines:
o Tools: Prometheus, Grafana, or ELK stack.
o Create alerts for SLA breaches or job failures.
– Track pipeline and job health with detailed logs and metrics.
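
A sketch of exposing pipeline health metrics for Prometheus to scrape, using the prometheus_client library; the metric names, row count, and port are assumptions.

```python
# Publish simple pipeline metrics on an HTTP endpoint Prometheus can scrape.
import time
from prometheus_client import Counter, Gauge, start_http_server

ROWS_PROCESSED = Counter("pipeline_rows_processed_total",
                         "Rows processed by the nightly orders pipeline")
LAST_SUCCESS = Gauge("pipeline_last_success_timestamp",
                     "Unix time of the last successful run")

start_http_server(8000)    # metrics served at :8000/metrics

def run_pipeline():
    rows = 12_345           # placeholder for the real row count
    ROWS_PROCESSED.inc(rows)
    LAST_SUCCESS.set_to_current_time()

run_pipeline()
time.sleep(60)              # keep the exporter alive long enough to be scraped
```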

Business and Communication Skills
– Translate complex technical concepts into business terms.
– Collaborate with stakeholders to define data requirements and SLAs.
– Design data systems that align with business goals and use cases.

Continuous Learning and Adaptability
– Stay updated with the latest trends and tools in data engineering:
o E.g., data mesh architecture, Fabric, and AI-integrated data workflows.
– Actively engage in learning through online courses, certifications, and community contributions:
o Certifications like Databricks Certified Data Engineer, AWS Data Analytics Specialty, or Google Professional Data Engineer.

Desired Skills:

  • PostgreSQL
  • Apache Spark
  • Hadoop ecosystem
  • Python
  • SQL
  • R
