From Salesforce CRM to a Hadoop Data Lake: 130 Million Transactions Migrated with Zero Data Loss

Published On: 2 May 2024. By suman yubraj.

Enterprise Data Engineering · Banking · India

How Auriga IT moved a private sector bank's entire Salesforce CRM data into a Hadoop data lake - 130 million+ transactions, millions of documents, and over 10 TB - with zero data loss and a fully automated daily pipeline that runs without any manual intervention.

Apache NiFi · Salesforce · Oracle DB · Hadoop / HDFS · Cloud VMs · Document Management · BI Analytics

Published: 2 May 2024 · Updated: 25 March 2026 · By Auriga IT

Pipeline Overview

End-to-end pipeline: CRM extraction, cloud ETL, private tunnel, on-premise gateway, staging database, data lake, document archival, and analytics layer.

130M+ Transactions Migrated · Daily Sync, Fully Automated · Millions of Documents Archived · 10+ TB Data Processed · 100% Checksum Verified · Zero Data Loss

01 - About the Client

Yes Bank

Yes Bank is one of India's leading private sector banks, established in 2004 and headquartered in Mumbai. It serves retail, corporate, and MSME clients across a broad range of banking and financial services, operating at a scale where reliable data infrastructure is critical to everyday operations.

Every day, the bank generates large volumes of structured records and regulated documents across lending, customer service, and compliance operations. As these volumes grew, storing and querying everything within Salesforce became increasingly expensive and slower to manage, making it difficult to support the bank's long-term archival and analytics requirements.

02 - The Problem

Why Salesforce Could No Longer Be the Data Home

The bank's CRM, loan data, and document archive all lived inside Salesforce - a system designed for customer relationship management, not data warehousing or long-term regulated archival. As volumes grew into hundreds of millions of records, this became both an operational and financial problem.

The issues were concrete. Salesforce storage pricing at banking scale was not viable for long-term data archival. Analytics teams could not run the cross-object queries they needed. Retrieving large volumes of loan documents caused daily delays for dependent teams. And regulatory requirements mandated auditable long-term storage that Salesforce alone could not deliver.

Storage costs too high

Salesforce document and image storage pricing was not sustainable for a bank holding millions of customer records and loan-related files long-term.

Analytics were too limited

Salesforce is not a data warehouse. Complex cross-object queries and BI workloads were slow, costly, and could not deliver the depth the analytics team needed.

Document retrieval was slow

Fetching large volumes of loan-related documents and images from Salesforce caused daily delays for operational teams who depended on those files.

Regulatory mandate unmet

Regulations required long-term, auditable document storage with full metadata traceability - a requirement Salesforce alone could not satisfy.

No central data lake

Analytics and BI teams had no single queryable source of truth. Data was siloed across Salesforce objects, making enterprise reporting difficult.

Real-time and historical sync

Updated Salesforce records needed to be merged cleanly with large historical datasets in Hadoop - without duplication, gaps, or re-processing everything from scratch.

03 - The Solution

A Six-Stage ETL Pipeline Built on Apache NiFi

Auriga IT designed and built a multi-layer data pipeline with Apache NiFi as the central orchestration engine. The architecture connects cloud and on-premise systems through a private network tunnel, handles structured CRM data and unstructured documents through separate purpose-built flows, and reconciles every record end-to-end using checksums and audit logs.

The pipeline was designed around one principle: decouple each stage so a failure in one layer does not cascade into others. Salesforce extraction is independent from staging. Staging is independent from data lake ingestion. Documents follow a completely separate flow with their own retry and error handling. Each stage can be monitored, replayed, or scaled independently.

Architecture

Architecture overview: CRM source, cloud-hosted ETL engine, private tunnel, on-premise API gateway, staging database, document store, and data lake.

Cloud Bridge

The private tunnel bridge: secure, encrypted connectivity between the cloud ETL layer and on-premise infrastructure.

How the Pipeline Works - Step by Step

Extraction from Salesforce

Apache NiFi, hosted on cloud VMs, extracts structured CRM data from Salesforce using Bulk API v2. Extraction is partitioned by object type and date range so large volumes are processed in parallel. A watermark-based incremental strategy means only records updated since the last run are fetched, so historical data is not re-processed on every run. Because the watermark is advanced only after the staging database commit, a failed run is retried from the same point, which is how the pipeline achieves exactly-once delivery.
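As a rough illustration of partitioning by object type and date range, the sketch below splits the interval since the last watermark into fixed windows and builds one incremental query per window. It is not the production NiFi flow. The object name `Account`, the field list, the six-hour window, and the choice of `SystemModstamp` as the watermark field are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def date_partitions(watermark, now, step):
    """Split the half-open interval [watermark, now) into windows of at most `step`."""
    windows = []
    start = watermark
    while start < now:
        end = min(start + step, now)
        windows.append((start, end))
        start = end
    return windows


def build_soql(sobject, fields, start, end):
    """Build one incremental SOQL query for a single window.

    The object and field names passed in are illustrative assumptions.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # SOQL datetime literals are written unquoted
    return (
        f"SELECT {', '.join(fields)} FROM {sobject} "
        f"WHERE SystemModstamp >= {start.strftime(fmt)} "
        f"AND SystemModstamp < {end.strftime(fmt)}"
    )


# Hypothetical watermark from the previous run and the current run time.
watermark = datetime(2024, 5, 1, tzinfo=timezone.utc)
now = datetime(2024, 5, 2, tzinfo=timezone.utc)

queries = [
    build_soql("Account", ["Id", "Name"], start, end)
    for start, end in date_partitions(watermark, now, timedelta(hours=6))
]
```

Half-open windows (`>=` start, `<` end) keep adjacent partitions from overlapping, so each query can be submitted as a separate job and run in parallel without fetching the same record twice.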

Staging in a relational database on-premise

Extracted data is loaded into an on-premise relational database, which acts as an independently queryable staging layer between Salesforce and the data lake. This decoupling means that if data lake ingestion fails, extraction does not need to be re-run from Salesforce. The staging layer also maintains encryption at rest and a full audit trail.

Document and image migration via a custom Java processor

A purpose-built Java processor handles extraction of Salesforce document objects and uploads them to the document management system via the API gateway. It manages binary extraction, metadata mapping to the DMS taxonomy, batched uploads with configurable concurrency, chunked transfer encoding for large files over 100 MB, exponential backoff retry for transient failures, and a dead-letter queue that routes permanent failures for review rather than silently dropping them.

Document Flow

Document flow: binary extraction, metadata mapping, chunked upload, retry handling, and dead-letter routing.
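The retry and dead-letter behaviour described for the document flow can be sketched as follows. The real processor is written in Java against the NiFi API; this Python version only illustrates the control flow. `TransientError`, `upload_with_retry`, the attempt count, and the base delay are hypothetical names and values, not taken from the project.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (for example, timeouts)."""


def upload_with_retry(upload, doc, dead_letters, max_attempts=4,
                      base_delay=1.0, sleep=time.sleep):
    """Call upload(doc), retrying transient failures with exponential backoff.

    Returns True on success. If retries are exhausted, or a non-transient
    error occurs, the document is appended to `dead_letters` for review.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            upload(doc)
            return True
        except TransientError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        except Exception as exc:
            last_error = exc  # permanent failure: retrying would not help
            break
    dead_letters.append((doc, repr(last_error)))
    return False
```

Passing `sleep` as a parameter keeps the function testable without real delays. Routing failures to a list (standing in for a dead-letter queue) means nothing is dropped silently: every document either uploads or ends up somewhere a person can inspect it.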

Ingestion into Hadoop HDFS

Data moves from the staging database into Hadoop HDFS on-premise. Output file sizes are controlled to align with Hadoop block size for efficient distributed storage. NiFi clusters were scaled horizontally during peak extraction windows, with separate process groups per object type so different data flows do not compete for resources.
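To make the block-size point concrete, here is a minimal sketch of grouping records into output files that stay at or below a target size. In NiFi this kind of batching would normally be configured on a merge processor rather than hand-written. The 128 MiB figure is the common HDFS default and is an assumption here, since the article does not state the cluster's configured block size.

```python
HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # common HDFS default; assumed, not from the article


def plan_output_files(record_sizes, target_bytes=HDFS_BLOCK_SIZE):
    """Greedily group consecutive records into files of at most target_bytes.

    Returns a list of files, each a list of record sizes in bytes. A single
    record larger than the target is placed in a file of its own.
    """
    files, current, current_bytes = [], [], 0
    for size in record_sizes:
        # Close the current file if adding this record would exceed the target.
        if current and current_bytes + size > target_bytes:
            files.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        files.append(current)
    return files
```

Files close to one block avoid the two inefficient extremes: many tiny files, which inflate NameNode metadata, and files that spill a small remainder into an extra block.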

BI analytics on top of Hadoop

A BI analytics tool connects directly to Hadoop, giving analytics teams dashboards across loan processing metrics, first-time-right rates, customer behaviour patterns, and loan application outcomes - capabilities that were not possible when data lived inside Salesforce.

Reconciliation and audit trail

Reconciliation tables track every record from source to target: source-to-target counts, MD5/SHA-256 checksum verification, timestamp audit logs, and failed record tracking. Automated end-of-day reports validate daily completeness. Every record and document is fully traceable, satisfying internal governance and external regulatory compliance requirements.
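A minimal sketch of the source-to-target check described above, assuming each record can be rendered as a dictionary with an `Id` key. The canonical string format and the report field names are illustrative choices, not the project's actual reconciliation schema.

```python
import hashlib


def row_checksum(row):
    """SHA-256 over a canonical rendering of the row (keys sorted)."""
    canonical = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source_rows, target_rows, key="Id"):
    """Compare source and target by record key and by checksum."""
    src = {row[key]: row_checksum(row) for row in source_rows}
    tgt = {row[key]: row_checksum(row) for row in target_rows}
    shared = src.keys() & tgt.keys()
    return {
        "source_count": len(src),
        "target_count": len(tgt),
        "missing": sorted(src.keys() - tgt.keys()),     # in source only
        "unexpected": sorted(tgt.keys() - src.keys()),  # in target only
        "mismatched": sorted(k for k in shared if src[k] != tgt[k]),
    }


source = [{"Id": "1", "Amount": 100}, {"Id": "2", "Amount": 200}, {"Id": "3", "Amount": 300}]
target = [{"Id": "1", "Amount": 100}, {"Id": "2", "Amount": 250}]
report = reconcile(source, target)
```

Counts alone would miss a record that arrived with altered content, and checksums alone would miss a record that never arrived, which is why the two checks are used together.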

04 - Technology Stack

What Powered the Pipeline

Category	Technology	Role in the pipeline
Source	Salesforce CRM	Source of structured CRM records, loan documents, and images
Orchestration	Apache NiFi	Core ETL engine - extraction, transformation, routing; hosted on cloud VMs
Custom Dev	Java (NiFi API)	Purpose-built processor for document upload to DMS via API gateway
Database	Relational DB	On-premise staging layer between the ETL engine and the data lake
Data Lake	Hadoop / HDFS	On-premise distributed analytics storage for large-scale querying
DMS	Document Mgmt System	On-premise regulated document archival with full metadata preserved
API Gateway	On-premise Gateway	Secures all ETL-to-on-premise service communication
Connectivity	Private Tunnel	Encrypted, private cloud-to-on-premise network link
Analytics	BI Analytics Tool	Connected to Hadoop for dashboards and business intelligence reporting
Security	SOC 2 / ISO 27001	Encryption at rest and in transit, audit vault, data privacy compliance

05 - Challenges and Solutions

Eight Problems the Team Had to Solve

Challenge	How it was solved
Hundreds of millions of records per Salesforce object	Bulk API v2 with NiFi parallel extraction partitioned by object type and date range, which kept throughput consistently high.
Watermark and timestamp management	Watermark-based incremental extraction. Timestamps updated only after the staging database commit, guaranteeing exactly-once delivery.
Multiple concurrent NiFi flows competing for resources	Separate NiFi process groups per object type and document flow, each with independent scheduling and back-pressure configuration.
Special characters causing document management system failures	A dedicated sanitisation step in the custom processor cleans ampersands, angle brackets, quotes, and non-ASCII characters before every DMS API call.
Salesforce API rate limits under sustained load	Adaptive throttling: Bulk API for large structured sets and REST API for document queries, with traffic balanced dynamically according to load.
Binary files over 100 MB crashing standard upload flows	NiFi streaming content repository combined with chunked transfer encoding for all oversized DMS uploads.
Ensuring no duplication or data loss across stages	Reconciliation tables with MD5/SHA-256 checksums, source-to-target counts, and automated end-of-day audit reports per object type.
Legacy SOAP-only integrations in the on-premise layer	Custom SOAP processors built for loan data flows where legacy systems used SOAP rather than REST.
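The sanitisation step in the table above can be illustrated with a short sketch. The article does not say whether the special characters were escaped or stripped, so the approach here (transliterate accents, drop remaining non-ASCII, escape XML-significant characters) is an assumption, and `sanitise_metadata` is a hypothetical name.

```python
import unicodedata
from xml.sax.saxutils import escape


def sanitise_metadata(value):
    """Return an ASCII-only, XML-safe version of a metadata string."""
    # Decompose accented characters, then drop anything outside ASCII.
    ascii_only = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # escape() handles &, < and > by default; quotes are added explicitly.
    return escape(ascii_only, {'"': "&quot;", "'": "&apos;"})
```

Running the function once, immediately before each API call, avoids double-escaping values that pass through several pipeline stages.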

06 - Inside the Work

Two Flows Worth Looking at Closely

Moving 130 Million+ CRM Records into the Data Lake

The structured migration covered all Salesforce CRM records - customer data, loan records, transaction history - into the on-premise staging database, followed by full ingestion into Hadoop HDFS. Apache NiFi clusters were scaled horizontally during peak windows. Bulk API v2 parallelism partitioned by object type and date range kept extraction throughput high. Watermark-based tracking ensured every run picked up only what had changed since the last run, without re-processing the full historical dataset. After migration, a 100% record match was confirmed across all Salesforce objects via checksum verification and automated reconciliation reports.

130M+ Records · 10+ TB Data · Bulk API v2 · 100% Verified · Zero Data Loss
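The rule that the watermark moves only after the staging commit can be sketched with a single database transaction. SQLite stands in for the on-premise staging database here, and the table names, column names, and the `crm` watermark key are hypothetical.

```python
import sqlite3


def load_batch(conn, rows, new_watermark):
    """Upsert rows and advance the watermark in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.executemany(
            "INSERT OR REPLACE INTO staging (id, payload) VALUES (?, ?)",
            rows,
        )
        conn.execute(
            "UPDATE watermark SET last_modified = ? WHERE name = 'crm'",
            (new_watermark,),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id TEXT PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE watermark (name TEXT PRIMARY KEY, last_modified TEXT)")
conn.execute("INSERT INTO watermark VALUES ('crm', '2024-05-01T00:00:00Z')")
conn.commit()

batch = [("001", "v1"), ("002", "v1")]
load_batch(conn, batch, "2024-05-02T00:00:00Z")
load_batch(conn, batch, "2024-05-02T00:00:00Z")  # replaying a batch is harmless
```

If the load fails partway, the rollback leaves the watermark where it was, so the next run fetches the same window again. Because rows are keyed by record ID, that replay overwrites rather than duplicates. Together these two properties give the effect of exactly-once delivery into staging.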

Archiving Millions of Loan Documents at High Throughput

Millions of PDFs, JPEGs, PNGs, and Word files associated with loan applications were migrated to the document management system at a sustained high daily throughput. The custom Java NiFi processor handled the full flow: binary extraction from Salesforce document objects, metadata schema mapping to the DMS taxonomy, batched uploads with configurable concurrency, and chunked transfer encoding for files over 100 MB. Exponential backoff retry managed transient API failures, while a dead-letter queue captured permanent failures for review rather than silently losing them. Every document was archived with full metadata preserved, satisfying long-term regulatory traceability requirements.

Millions of Docs · High Daily Throughput · 100MB+ Binaries · Exponential Backoff · Full Audit Trail
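For files over 100 MB, the key idea is to stream the binary in pieces rather than hold it in memory. The sketch below shows that pattern in isolation. The 8 MiB chunk size and the `send_chunk` callback are illustrative assumptions; the actual processor relies on NiFi's content repository and HTTP chunked transfer encoding.

```python
import hashlib
import io

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per chunk; an illustrative value


def iter_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield successive chunks from a binary stream without loading it whole."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def stream_upload(stream, send_chunk, chunk_size=CHUNK_SIZE):
    """Send a stream chunk by chunk and return (total_bytes, sha256_hex)."""
    digest = hashlib.sha256()
    total = 0
    for chunk in iter_chunks(stream, chunk_size):
        send_chunk(chunk)  # hypothetical callback that writes to the DMS
        digest.update(chunk)
        total += len(chunk)
    return total, digest.hexdigest()


# Usage with an in-memory stream and a list standing in for the DMS.
received = []
total, checksum = stream_upload(io.BytesIO(b"x" * 25), received.append, chunk_size=10)
```

Computing the checksum while streaming means the digest recorded for reconciliation covers exactly the bytes that were sent, with no second pass over the file.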

07 - Results

What Changed After the Migration

The migration fundamentally changed Yes Bank's data infrastructure. Analytics that were not possible inside Salesforce are now available via the data lake. Storage costs dropped significantly with documents and images moved off Salesforce. The pipeline runs daily without manual intervention. And every migrated record and document is fully traceable, meeting the bank's regulatory archival requirements.

Outcomes

Eight outcomes: transaction scale, document volume, data integrity, automation, compliance, and the analytics layer now running on Hadoop.

Metric	Outcome
Total transactions migrated	130M+ across all Salesforce objects
Daily pipeline	Runs automatically, no manual triggers
Document migration throughput	High daily volume, fully unattended
Total documents archived	Millions, full metadata preserved
Total data volume	10+ TB end-to-end
Data integrity	Zero data loss, 100% checksum-verified
Regulatory compliance	Full audit trail, compliance mandate satisfied
Manual intervention required	None - monitoring and retry are built in

Loan analytics unlocked

First-time-right (FTR) rates, processing times, and rejection patterns now available via BI dashboards on Hadoop

Storage costs reduced

Document and image storage fully moved off Salesforce into on-premise DMS and HDFS

Daily sync automated

Salesforce-to-Hadoop runs daily automatically - no manual triggers or team intervention needed

Decoupled architecture

Staging database lets Hadoop ingest independently - no live Salesforce dependency at query time

Regulatory traceability

Every record and document traceable, satisfying long-term archival and audit requirements

Enterprise BI now possible

Analytics teams have a single queryable source of truth across all customer, loan, and operations data

08 - Frequently Asked Questions

Common Questions About This Project

How did Auriga IT migrate data from Salesforce to Hadoop?

Auriga IT used Apache NiFi as the core ETL engine hosted on cloud VMs. Structured CRM data was extracted from Salesforce via Bulk API v2, loaded into an on-premise relational database as a staging layer, and then ingested into Hadoop HDFS. Documents and images followed a separate flow handled by a custom Java NiFi processor that extracted document objects from Salesforce and uploaded them to a document management system via an API gateway.

How much data was migrated?

The migration covered 130 million+ transactions, millions of documents and images, and over 10 TB of data end-to-end. The live daily pipeline continues to run at high throughput, fully automated, with zero data loss and 100% checksum verification across all records.

What is Apache NiFi and why was it chosen?

Apache NiFi is a data flow automation and ETL platform built for real-time data ingestion, transformation, and routing. It was chosen for its built-in Salesforce API support, scalable parallel processing, back-pressure management, custom processor extensibility via Java, and detailed audit logging - all essential for a migration at this scale and with these compliance requirements.

How was zero data loss ensured?

Reconciliation tables tracked every record with source-to-target counts, MD5/SHA-256 checksum verification, timestamp audit logs, and failed record tracking. Watermark-based incremental extraction guaranteed exactly-once delivery. Automated end-of-day reports validated daily completeness, resulting in 100% verified data across all migrated records.

Why did the bank need to move off Salesforce?

The primary drivers were unsustainable Salesforce storage costs at banking scale, limited analytics capability for complex cross-object queries, document retrieval bottlenecks impacting daily operations, regulatory requirements for long-term auditable document storage, and the need for a centralised data lake to power enterprise analytics and BI.

How long has the pipeline been running?

The project was initiated in May 2024 and the live pipeline has been running since then. As of March 2026, it continues to process large volumes of transactions and documents daily on a fully automated basis with no manual intervention required.
