Data Engineer

Full-Time

India, Remote

2 Openings

About the Role

LakeFusion is seeking an experienced Data Engineer to design and build the data infrastructure that powers our Master Data Management platform, built natively on the Databricks Data Intelligence Platform. In this role, you will develop scalable, high-performance data pipelines that support entity resolution, data quality, and real-time analytics.

You will take a hands-on role in building and optimizing batch and streaming pipelines using modern Lakehouse technologies, including Delta Live Tables and Change Data Capture patterns. This includes ensuring data reliability, consistency, and performance through robust pipeline design, testing, and optimization of Spark workloads.

Working closely with AI/ML Engineers, Product Managers, and Data Scientists, you will translate data requirements into efficient data models and pipelines that enable intelligent features and analytics. You will also contribute to best practices across data engineering, including monitoring, version control, and automated testing.

This is a highly self-directed role suited for someone who thrives in a fast-paced environment where building scalable data systems and ensuring data quality at scale are central to success.

What you'll do

  • Design and Develop Scalable Data Pipelines: Lead the design, development, and optimization of robust, high-performance data pipelines within the Databricks Lakehouse Platform to support LakeFusion's core functionalities, including entity resolution, data quality, and analytical reporting.
  • Implement Real-time Data Ingestion: Build and manage streaming data pipelines using Delta Live Tables (DLT) and other Databricks capabilities for real-time data ingestion, transformation, and processing, leveraging Change Data Capture (CDC) and Change Data Feed (CDF) patterns.
  • Optimize Spark Workloads: Apply advanced PySpark best practices and Spark optimization techniques to ensure efficient processing of large-scale datasets, reducing latency and cost for batch and streaming operations.
  • Ensure Data Reliability and Quality: Develop pipelines with a strong focus on reliability, testability, and data quality. Implement idempotent designs to guarantee data consistency and accuracy across all data flows.
  • Data Modeling and Architecture: Design and implement logical and physical data models for the Lakehouse, including dimensional modeling (e.g., Kimball methodology) and handling Slowly Changing Dimensions (SCDs), to support analytical and operational needs.
  • Collaborate on Data Solutions: Work closely with AI/ML Engineers, Product Managers, and Data Scientists to understand data requirements, integrate new data sources, and provide foundational data infrastructure for LakeFusion's intelligent features.
  • Promote Best Practices: Advocate for and implement best practices in data engineering, including version control, automated testing, monitoring, and alerting for data pipelines.
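To make the idempotent, SCD-aware pipeline design described above concrete, here is a minimal sketch of a Slowly Changing Dimension Type 2 merge in plain Python. The record layout (`key`, `value`, `valid_from`, `valid_to`, `is_current`) is hypothetical; in production on Databricks this logic would typically be expressed as a Delta Lake `MERGE`:

```python
def scd2_merge(dimension, updates, as_of):
    """Apply an SCD Type 2 merge: expire changed rows and append new
    current versions. Re-running with the same inputs is a no-op,
    so a retried pipeline never duplicates history (idempotent)."""
    by_key = {r["key"]: r for r in dimension if r["is_current"]}
    out = list(dimension)
    for upd in updates:
        current = by_key.get(upd["key"])
        if current is None:
            # New entity: insert an open-ended current row.
            out.append({**upd, "valid_from": as_of, "valid_to": None,
                        "is_current": True})
        elif current["value"] != upd["value"]:
            # Changed entity: expire the old row, append the new version.
            current["valid_to"] = as_of
            current["is_current"] = False
            out.append({**upd, "valid_from": as_of, "valid_to": None,
                        "is_current": True})
        # Unchanged entity: no-op, which is what makes retries safe.
    return out
```

The "unchanged means no-op" branch is what gives the merge its idempotence: replaying the same update batch after a partial failure produces the same table state.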

What we're looking for

  • 5+ years of hands-on experience as a Data Engineer or in a similar role, specifically building and managing large-scale data platforms and pipelines in a production environment.
  • Deep expertise with the Databricks Lakehouse Platform, including extensive experience with Delta Lake, Databricks SQL, Unity Catalog, and Databricks Workflows.
  • Proven proficiency in building and optimizing data pipelines using Apache Spark, particularly with PySpark for complex data transformations and processing.
  • Demonstrated experience with streaming data technologies and building real-time pipelines, ideally using Delta Live Tables (DLT).
  • Strong understanding and practical application of Change Data Capture (CDC) and Change Data Feed (CDF) patterns for incremental data loading.
  • Solid foundation in data modeling concepts, including dimensional modeling (Kimball) and techniques for managing Slowly Changing Dimensions (SCDs).
  • Experience in designing and implementing reliable, testable, and idempotent data pipelines, ensuring data quality and consistency.
  • Familiarity with data governance, metadata management, and data cataloging principles.
  • Excellent problem-solving skills and the ability to debug complex data issues across distributed systems.
  • Strong communication skills, capable of articulating complex technical concepts to both technical and non-technical stakeholders.
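As a minimal illustration of the CDC/CDF-style incremental loading called out above, the sketch below folds a change feed of insert/update/delete events into a keyed target. The event shape is hypothetical; on Databricks this pattern maps to reading Delta Change Data Feed and applying it with `MERGE` (or DLT's `APPLY CHANGES`):

```python
def apply_change_feed(target, changes):
    """Fold a CDC change feed into a keyed target table.
    Each change is (op, key, row); later changes win, and replaying
    the same feed leaves the target unchanged."""
    result = dict(target)
    for op, key, row in changes:
        if op in ("insert", "update"):
            result[key] = row        # upsert semantics
        elif op == "delete":
            result.pop(key, None)    # tolerate already-deleted keys
        else:
            raise ValueError(f"unknown CDC op: {op}")
    return result
```

Treating inserts and updates uniformly as upserts, and making deletes tolerant of missing keys, is what makes the load replay-safe after a retry.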

Nice-to-have

  • Specific experience with Entity Resolution or Master Data Management (MDM) systems and their underlying data structures.
  • Experience with cloud platforms (AWS, Azure) for data engineering deployments.
  • Knowledge of MLOps practices and integrating data pipelines with machine learning workflows.
  • Experience with CI/CD for data pipelines and infrastructure as code (e.g., Terraform).

About LakeFusion

LakeFusion is the modern Master Data Management (MDM) company. Global enterprises across industries ranging from retail and manufacturing to financial services rely on the LakeFusion platform to unify, govern, and deliver trusted data entities such as customers, products, suppliers, and employees. Built natively on the Databricks Lakehouse, LakeFusion creates a single source of truth that powers analytics and AI, enabling organizations worldwide to accelerate innovation with trusted, governed data.

Join us

Help build the future of master data

Join a Databricks-native team building the trusted data foundation powering AI-ready enterprises.
