Data Engineer

Data Engineer – Build scalable data pipelines on the Databricks Lakehouse to power entity resolution, data quality, and MDM workflows.

India, Remote
Full-Time
2 Openings
Apply Now
About the Role

LakeFusion is seeking a Data Engineer to design and build the scalable data infrastructure that powers our AI-driven Master Data Management (MDM) platform. In this role, you will be responsible for developing high-performance data pipelines within the Databricks Lakehouse Platform to support core product functionalities such as entity resolution, data quality, and analytical reporting.

You will work on both batch and real-time data ingestion, applying advanced Spark optimization techniques and best practices to ensure efficiency, reliability, and cost-effectiveness at scale. This role involves close collaboration with AI/ML Engineers, Product Managers, and Data Scientists to integrate diverse data sources and provide a solid foundation for LakeFusion’s intelligent features.

As a Data Engineer at LakeFusion, you will play a critical role in shaping the architecture of our data ecosystem: designing reliable and testable pipelines, implementing robust data models, and ensuring data quality and governance across all layers of the Lakehouse. Your contributions will directly enable business-ready, trusted data that fuels analytics, operational workflows, and next-generation AI capabilities.

What you’ll do
  • Design and Develop Scalable Data Pipelines: Lead the design, development, and optimization of robust, high-performance data pipelines within the Databricks Lakehouse Platform to support LakeFusion's core functionalities, including entity resolution, data quality, and analytical reporting.
  • Implement Real-time Data Ingestion: Build and manage streaming data pipelines using Delta Live Tables (DLT) and other Databricks capabilities for real-time data ingestion, transformation, and processing, leveraging Change Data Capture (CDC) and Change Data Feed (CDF) patterns.
  • Optimize Spark Workloads: Apply advanced PySpark best practices and Spark optimization techniques to ensure efficient processing of large-scale datasets, reducing latency and cost for batch and streaming operations.
  • Ensure Data Reliability and Quality: Develop pipelines with a strong focus on reliability, testability, and data quality. Implement idempotent designs to guarantee data consistency and accuracy across all data flows.
  • Data Modeling and Architecture: Design and implement logical and physical data models for the Lakehouse, including dimensional modeling (e.g., Kimball methodology) and handling Slowly Changing Dimensions (SCDs), to support analytical and operational needs.
  • Collaborate on Data Solutions: Work closely with AI/ML Engineers, Product Managers, and Data Scientists to understand data requirements, integrate new data sources, and provide foundational data infrastructure for LakeFusion's intelligent features.
  • Promote Best Practices: Advocate for and implement best practices in data engineering, including version control, automated testing, monitoring, and alerting for data pipelines.
What We're Looking For
  • 5+ years of hands-on experience as a Data Engineer or in a similar role, specifically building and managing large-scale data platforms and pipelines in a production environment.
  • Deep expertise with the Databricks Lakehouse Platform, including extensive experience with Delta Lake, Databricks SQL, Unity Catalog, and Databricks Workflows.
  • Proven proficiency in building and optimizing data pipelines using Apache Spark, particularly with PySpark for complex data transformations and processing.
  • Demonstrated experience with streaming data technologies and building real-time pipelines, ideally using Delta Live Tables (DLT).
  • Strong understanding and practical application of Change Data Capture (CDC) and Change Data Feed (CDF) patterns for incremental data loading.
  • Solid foundation in data modeling concepts, including dimensional modeling (Kimball) and techniques for managing Slowly Changing Dimensions (SCDs).
  • Experience in designing and implementing reliable, testable, and idempotent data pipelines, ensuring data quality and consistency.
  • Familiarity with data governance, metadata management, and data cataloging principles.
  • Excellent problem-solving skills and the ability to debug complex data issues across distributed systems.
  • Strong communication skills, capable of articulating complex technical concepts to both technical and non-technical stakeholders.
Nice-to-Have
  • Specific experience with Entity Resolution or Master Data Management (MDM) systems and their underlying data structures.
  • Experience with cloud platforms (AWS, Azure) for data engineering deployments.
  • Knowledge of MLOps practices and integrating data pipelines with machine learning workflows.
  • Experience with CI/CD for data pipelines and infrastructure as code (e.g., Terraform).
About LakeFusion

LakeFusion is the modern Master Data Management (MDM) company. Global enterprises across industries spanning retail, manufacturing, and financial services rely on the LakeFusion platform to unify, govern, and deliver trusted data entities such as customers, products, suppliers, and employees. Built natively on the Databricks Lakehouse, LakeFusion creates a single source of truth that powers analytics and AI, enabling organizations worldwide to accelerate innovation with trusted and governed data.

Apply Now