
From Salesforce CRM to a Hadoop Data Lake: 130 Million Transactions Migrated with Zero Data Loss

Enterprise Data Engineering  ·  Banking  ·  India

How Auriga IT moved a private sector bank's entire Salesforce CRM data into a Hadoop data lake - 130 million+ transactions, millions of documents, and over 10 TB - with zero data loss and a fully automated daily pipeline that runs without any manual intervention.

Apache NiFi  ·  Salesforce  ·  Oracle DB  ·  Hadoop / HDFS  ·  Cloud VMs  ·  Document Management  ·  BI Analytics

Published: 2 May 2024  ·  Updated: 25 March 2026  ·  By Auriga IT

Figure: Pipeline overview. End-to-end pipeline covering CRM extraction, cloud ETL, private tunnel, on-premise gateway, staging database, data lake, document archival, and analytics layer.
130M+ Transactions Migrated
Daily Sync, Fully Automated
Millions of Documents Archived
10+ TB Data Processed
100% Checksum Verified
Zero Data Loss
01 - About the Client

Yes Bank

Yes Bank is one of India's leading private sector banks, established in 2004 and headquartered in Mumbai. It serves retail, corporate, and MSME clients across a broad range of banking and financial services, operating at a scale where reliable data infrastructure is critical to everyday operations.
Yes Bank Limited
Leading Private Sector Bank  ·  Mumbai, India  ·  Established 2004

Every day, the bank generates large volumes of structured records and regulated documents across lending, customer service, and compliance operations. As these volumes grew, storing and querying everything within Salesforce became increasingly expensive and slow, making it difficult to support the bank's long-term archival and analytics requirements.

02 - The Problem

Why Salesforce Could No Longer Be the Data Home

The bank's CRM, loan data, and document archive all lived inside Salesforce - a system designed for customer relationship management, not data warehousing or long-term regulated archival. As volumes grew into hundreds of millions of records, this became both an operational and financial problem.

The issues were concrete. Salesforce storage pricing at banking scale was not viable for long-term data archival. Analytics teams could not run the cross-object queries they needed. Retrieving large volumes of loan documents caused daily delays for dependent teams. And regulatory requirements mandated auditable long-term storage that Salesforce alone could not deliver.

1. Storage costs too high
Salesforce document and image storage pricing was not sustainable for a bank holding millions of customer records and loan-related files long-term.
2. Analytics were too limited
Salesforce is not a data warehouse. Complex cross-object queries and BI workloads were slow, costly, and could not deliver the depth the analytics team needed.
3. Document retrieval was slow
Fetching large volumes of loan-related documents and images from Salesforce caused daily delays for operational teams who depended on those files.
4. Regulatory mandate unmet
Regulations required long-term, auditable document storage with full metadata traceability - a requirement Salesforce alone could not satisfy.
5. No central data lake
Analytics and BI teams had no single queryable source of truth. Data was siloed across Salesforce objects, making enterprise reporting difficult.
6. Real-time and historical sync
Updated Salesforce records needed to be merged cleanly with large historical datasets in Hadoop - without duplication, gaps, or re-processing everything from scratch.
03 - The Solution

A Six-Stage ETL Pipeline Built on Apache NiFi

Auriga IT designed and built a multi-layer data pipeline with Apache NiFi as the central orchestration engine. The architecture connects cloud and on-premise systems through a private network tunnel, handles structured CRM data and unstructured documents through separate purpose-built flows, and reconciles every record end-to-end using checksums and audit logs.

The pipeline was designed around one principle: decouple each stage so a failure in one layer does not cascade into others. Salesforce extraction is independent from staging. Staging is independent from data lake ingestion. Documents follow a completely separate flow with their own retry and error handling. Each stage can be monitored, replayed, or scaled independently.

Figure: Architecture overview. CRM source, cloud-hosted ETL engine, security layer, private network tunnel, on-premise API gateway, staging database, document store, and distributed data lake.
Figure: Cloud bridge. The private tunnel provides secure, encrypted connectivity between the cloud ETL layer and on-premise infrastructure.
How the Pipeline Works - Step by Step
1. Extraction from Salesforce
Apache NiFi, hosted on cloud VMs, extracts structured CRM data from Salesforce using Bulk API v2. Extraction is partitioned by object type and date range so large volumes are processed in parallel. A watermark-based incremental strategy means only records updated since the last run are fetched - ensuring exactly-once delivery without re-processing historical data on every run.
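The partitioning described in this step can be illustrated with a minimal Python sketch. It splits the interval between the last watermark and the current run time into date windows and builds one SOQL query per window, so each window can be submitted as its own extraction job. The object name, field list, window size, and the choice of `SystemModstamp` as the watermark column are assumptions for illustration; the source does not state which field or window size the project used.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, List, Tuple


def date_windows(start: datetime, end: datetime,
                 step: timedelta) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (lo, hi) windows that together cover start..end."""
    lo = start
    while lo < end:
        hi = min(lo + step, end)
        yield lo, hi
        lo = hi


def soql_literal(ts: datetime) -> str:
    """Format a timezone-aware datetime as an unquoted SOQL datetime literal."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_queries(sobject: str, fields: List[str], watermark: datetime,
                  now: datetime, step: timedelta) -> List[str]:
    """Build one incremental query per date window for a single object type.

    Each window is exclusive at the lower bound and inclusive at the upper
    bound, so a record falls into exactly one window.
    """
    cols = ", ".join(fields)
    return [
        f"SELECT {cols} FROM {sobject} "
        f"WHERE SystemModstamp > {soql_literal(lo)} "
        f"AND SystemModstamp <= {soql_literal(hi)}"
        for lo, hi in date_windows(watermark, now, step)
    ]
```

Calling `build_queries` once per object type gives independent work lists that can run in parallel. If the watermark already equals the run time, the list is empty and nothing is fetched.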
2. Staging in a relational database on-premise
Extracted data is loaded into an on-premise relational database, which acts as an independently queryable staging layer between Salesforce and the data lake. This decoupling means that if data lake ingestion has an issue, extraction does not need to be re-run from Salesforce. The staging layer also maintains encryption at rest and a full audit trail.
3. Document and image migration via a custom Java processor
A purpose-built Java processor handles extraction of Salesforce document objects and uploads them to the document management system via the API gateway. It manages binary extraction, metadata mapping to the DMS taxonomy, batched uploads with configurable concurrency, chunked transfer encoding for large files over 100 MB, exponential backoff retry for transient failures, and a dead-letter queue that routes permanent failures for review rather than silently dropping them.
Figure: Document flow. Binary extraction, metadata mapping, chunked upload, retry handling, and dead-letter routing.
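The retry and dead-letter behaviour in the document flow can be sketched as follows. The project's processor was written in Java; this Python version only illustrates the control flow. The exception classes, the `upload` callable, and the attempt and delay values are all hypothetical, and the sketch omits jitter to keep the delays easy to follow.

```python
import time
from typing import Callable, List, Tuple


class TransientError(Exception):
    """Hypothetical: a failure worth retrying, such as a timeout."""


class PermanentError(Exception):
    """Hypothetical: a failure that retrying will not fix."""


def upload_with_retry(doc_id: str,
                      upload: Callable[[str], None],
                      dead_letters: List[Tuple[str, str]],
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      max_delay: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try an upload, doubling the wait after each transient failure.

    Permanent failures, and transient failures that exhaust all attempts,
    are appended to dead_letters for review instead of being dropped.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            upload(doc_id)
            return True
        except PermanentError as exc:
            dead_letters.append((doc_id, str(exc)))
            return False
        except TransientError as exc:
            if attempt == max_attempts:
                dead_letters.append((doc_id, f"retries exhausted: {exc}"))
                return False
            # Delay grows as base, 2x base, 4x base, ... capped at max_delay.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
    return False
```

Passing `sleep` as a parameter keeps the function testable without real waiting. The dead-letter list stands in for whatever queue or table the real flow routes failures to.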
4. Ingestion into Hadoop HDFS
Data moves from the staging database into Hadoop HDFS on-premise. Output file sizes are controlled to align with Hadoop block size for efficient distributed storage. NiFi clusters were scaled horizontally during peak extraction windows, with separate process groups per object type so different data flows do not compete for resources.
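As a rough illustration of aligning output files with the HDFS block size, the sketch below groups records into batches whose combined size stays at or under a target. The 128 MiB target is an assumption based on the common `dfs.blocksize` default; the bank's actual block size and the mechanism used inside NiFi are not stated in the source.

```python
from typing import Iterable, Iterator, List

# Assumed target: 128 MiB, the usual default HDFS block size.
HDFS_BLOCK_SIZE = 128 * 1024 * 1024


def pack_into_files(record_sizes: Iterable[int],
                    target_bytes: int = HDFS_BLOCK_SIZE) -> Iterator[List[int]]:
    """Group consecutive record sizes into batches of at most target_bytes.

    A single record larger than target_bytes becomes a batch on its own.
    """
    batch: List[int] = []
    total = 0
    for size in record_sizes:
        # Close the current batch before it would exceed the target.
        if batch and total + size > target_bytes:
            yield batch
            batch, total = [], 0
        batch.append(size)
        total += size
    if batch:
        yield batch
```

Writing fewer, block-sized files avoids the many-small-files pattern, which is costly for the HDFS NameNode because it holds metadata for every file in memory.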
5. BI analytics on top of Hadoop
A BI analytics tool connects directly to Hadoop, giving analytics teams dashboards across loan processing metrics, first-time-right rates, customer behaviour patterns, and loan application outcomes - capabilities that were not possible when data lived inside Salesforce.
6. Reconciliation and audit trail
Reconciliation tables track every record from source to target: source-to-target counts, MD5/SHA-256 checksum verification, timestamp audit logs, and failed record tracking. Automated end-of-day reports validate daily completeness. Every record and document is fully traceable, satisfying internal governance and external regulatory compliance requirements.
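A source-to-target reconciliation of this kind can be sketched in a few lines. The example compares record counts and per-record SHA-256 checksums between two in-memory snapshots keyed by record ID. The canonicalisation rule, function names, and report fields are assumptions for illustration; the real pipeline stored its results in reconciliation tables whose schema the source does not describe.

```python
import hashlib
from typing import Dict, Mapping


def row_checksum(row: Mapping[str, object]) -> str:
    """SHA-256 over a canonical form of the row: fields sorted by name."""
    canonical = "\x1f".join(f"{key}={row[key]!r}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source: Dict[str, Mapping[str, object]],
              target: Dict[str, Mapping[str, object]]) -> dict:
    """Compare two snapshots keyed by record ID and report any differences."""
    missing = sorted(source.keys() - target.keys())      # in source only
    unexpected = sorted(target.keys() - source.keys())   # in target only
    mismatched = sorted(
        key for key in source.keys() & target.keys()
        if row_checksum(source[key]) != row_checksum(target[key])
    )
    return {
        "source_count": len(source),
        "target_count": len(target),
        "missing": missing,
        "unexpected": unexpected,
        "mismatched": mismatched,
        "clean": not (missing or unexpected or mismatched),
    }
```

Matching counts alone can hide a missing record offset by an unexpected one, which is why the sketch checks IDs and checksums as well as totals.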
04 - Technology Stack

What Powered the Pipeline

Category | Technology | Role in the pipeline
Source | Salesforce CRM | Source of structured CRM records, loan documents, and images
Orchestration | Apache NiFi | Core ETL engine - extraction, transformation, routing; hosted on cloud VMs
Custom Dev | Java (NiFi API) | Purpose-built processor for document upload to DMS via API gateway
Database | Relational DB | On-premise staging layer between the ETL engine and the data lake
Data Lake | Hadoop / HDFS | On-premise distributed analytics storage for large-scale querying
DMS | Document Mgmt System | On-premise regulated document archival with full metadata preserved
API Gateway | On-premise Gateway | Secures all ETL-to-on-premise service communication
Connectivity | Private Tunnel | Encrypted, private cloud-to-on-premise network link
Analytics | BI Analytics Tool | Connected to Hadoop for dashboards and business intelligence reporting
Security | SOC 2 / ISO 27001 | Encryption at rest and in transit, audit vault, data privacy compliance
05 - Challenges and Solutions

Eight Problems the Team Had to Solve

Challenge | How it was solved
Hundreds of millions of records per Salesforce object | Bulk API v2 with NiFi parallel extraction partitioned by object type and date range, which kept throughput consistently high.
Watermark and timestamp management | Watermark-based incremental extraction. Timestamps are updated only after the staging database commit, guaranteeing exactly-once delivery.
Multiple concurrent NiFi flows competing for resources | Separate NiFi process groups per object type and document flow, each with independent scheduling and back-pressure configuration.
Special characters causing document management system failures | A dedicated sanitisation step in the custom processor cleans ampersands, angle brackets, quotes, and non-ASCII characters before every DMS API call.
Salesforce API rate limits under sustained load | Adaptive throttling: Bulk API for large structured sets, REST API for document queries, balanced dynamically according to load.
Binary files over 100 MB crashing standard upload flows | NiFi streaming content repository combined with chunked transfer encoding for all oversized DMS uploads.
Ensuring no duplication or data loss across stages | Reconciliation tables with MD5/SHA-256 checksums, source-to-target counts, and automated end-of-day audit reports per object type.
Legacy SOAP-only integrations in the on-premise layer | Custom SOAP processors built for loan data flows where legacy systems used SOAP rather than REST.
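The sanitisation step for special characters could look something like the sketch below. The source says only that ampersands, angle brackets, quotes, and non-ASCII characters were cleaned. Escaping the markup characters as XML entities, and reducing accented characters to their ASCII base while dropping the rest, is an assumed interpretation; the function name is hypothetical.

```python
import unicodedata
from xml.sax.saxutils import escape


def sanitise_for_dms(value: str) -> str:
    """Return an ASCII-only, XML-safe version of a metadata value.

    Assumed policy: accented letters are reduced to their base letter,
    other non-ASCII characters are dropped, and the characters
    & < > " ' are replaced with XML entities.
    """
    # NFKD splits "ü" into "u" plus a combining mark; the ASCII encode
    # step then discards the mark and anything else outside ASCII.
    ascii_only = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # escape() handles &, < and > itself; quotes are passed as extras.
    return escape(ascii_only, {'"': "&quot;", "'": "&apos;"})
```

Dropping non-ASCII characters loses information for names written in non-Latin scripts, so a production policy would need to be agreed with the DMS owners rather than copied from this sketch.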
06 - Inside the Work

Two Flows Worth Looking at Closely

1. Moving 130 Million+ CRM Records into the Data Lake

The structured migration moved all Salesforce CRM records - customer data, loan records, transaction history - into the on-premise staging database, followed by full ingestion into Hadoop HDFS. Apache NiFi clusters were scaled horizontally during peak windows. Bulk API v2 parallelism, partitioned by object type and date range, kept extraction throughput high. Watermark-based tracking ensured every run picked up only what had changed since the last run, without re-processing the historical dataset. Post-migration, a 100% record match was confirmed across all Salesforce objects via checksum verification and automated reconciliation reports.
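The ordering that makes watermark tracking safe, advancing the watermark only once the staging write has committed, can be shown with a small sketch. It uses an in-process SQLite database as a stand-in for the on-premise staging database, and the table and column names are hypothetical. Rows are written with replace-on-conflict semantics, so a batch that is replayed after a failure overwrites itself rather than creating duplicates.

```python
import sqlite3
from typing import Iterable, Optional, Tuple

Row = Tuple[str, str, str]  # (record id, payload, ISO-8601 UTC modified time)


def init_schema(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS staging "
                 "(id TEXT PRIMARY KEY, payload TEXT, modified TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS watermark "
                 "(sobject TEXT PRIMARY KEY, last_modified TEXT)")
    conn.commit()


def read_watermark(conn: sqlite3.Connection, sobject: str) -> Optional[str]:
    row = conn.execute("SELECT last_modified FROM watermark WHERE sobject = ?",
                       (sobject,)).fetchone()
    return row[0] if row else None


def load_batch(conn: sqlite3.Connection, sobject: str,
               rows: Iterable[Row]) -> None:
    """Write a batch and advance the watermark in one transaction."""
    batch = list(rows)
    if not batch:
        return
    with conn:  # commits on success, rolls back if anything raises
        conn.executemany(
            "INSERT OR REPLACE INTO staging (id, payload, modified) "
            "VALUES (?, ?, ?)", batch)
        # Timestamps share one fixed-width UTC format, so string
        # comparison orders them correctly. Never move the mark backwards.
        newest = max(row[2] for row in batch)
        current = read_watermark(conn, sobject)
        if current is None or newest > current:
            conn.execute(
                "INSERT OR REPLACE INTO watermark (sobject, last_modified) "
                "VALUES (?, ?)", (sobject, newest))
```

If the batch fails part-way, the rollback leaves both the staged rows and the watermark untouched, so the next run fetches the same window again.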

130M+ Records
10+ TB Data
Bulk API v2
100% Verified
Zero Data Loss
2. Archiving Millions of Loan Documents at High Throughput

Millions of PDFs, JPEGs, PNGs, and Word files associated with loan applications were migrated to the document management system at a sustained high daily throughput. The custom Java NiFi processor handled the full flow: binary extraction from Salesforce document objects, metadata schema mapping to the DMS taxonomy, batched uploads with configurable concurrency, and chunked transfer encoding for files over 100 MB. Exponential backoff retry managed transient API failures, while a dead-letter queue captured permanent failures for review rather than silently losing them. Every document was archived with full metadata preserved, satisfying long-term regulatory traceability requirements.
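To show why chunked transfer encoding helps with files over 100 MB, the sketch below streams a binary source in fixed-size pieces and frames each piece the way HTTP/1.1 chunked transfer coding does: the chunk length in hexadecimal, a CRLF, the data, and another CRLF, ending with a zero-length chunk. The chunk size and helper names are assumptions. In practice an HTTP client library produces this framing itself when given a streaming body, so the framing function is here only to make the wire format visible.

```python
import io
from typing import BinaryIO, Iterable, Iterator

CHUNK_SIZE = 8 * 1024 * 1024               # assumed: 8 MiB per chunk
LARGE_FILE_THRESHOLD = 100 * 1024 * 1024   # 100 MB threshold from the text


def should_stream(size_bytes: int) -> bool:
    """Files above the threshold are streamed instead of buffered whole."""
    return size_bytes > LARGE_FILE_THRESHOLD


def iter_chunks(stream: BinaryIO,
                chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Read a binary stream piece by piece so memory use stays bounded."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def encode_chunked(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Frame chunks using HTTP/1.1 chunked transfer coding."""
    for chunk in chunks:
        yield f"{len(chunk):X}\r\n".encode("ascii") + chunk + b"\r\n"
    yield b"0\r\n\r\n"  # zero-length chunk marks the end of the body
```

Because only one chunk is held in memory at a time, the sender never needs to know or buffer the full file before the upload starts.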

Millions of Docs
High Daily Throughput
100MB+ Binaries
Exponential Backoff
Full Audit Trail
07 - Results

What Changed After the Migration

The migration fundamentally changed Yes Bank's data infrastructure. Analytics that were not possible inside Salesforce are now available via the data lake. Storage costs dropped significantly with documents and images moved off Salesforce. The pipeline runs daily without manual intervention. And every migrated record and document is fully traceable, meeting the bank's regulatory archival requirements.
Figure: Outcomes. Eight outcomes: transaction scale, document volume, data integrity, automation, compliance, and the analytics layer now running on Hadoop.
Metric | Outcome
Total transactions migrated | 130M+ across all Salesforce objects
Daily pipeline | Runs automatically, no manual triggers
Document migration throughput | High daily volume, fully unattended
Total documents archived | Millions, full metadata preserved
Total data volume | 10+ TB end-to-end
Data integrity | Zero data loss, 100% checksum-verified
Regulatory compliance | Full audit trail, compliance mandate satisfied
Manual intervention required | None - monitoring and retry are built in
Loan analytics unlocked: First-time-right (FTR) rates, processing times, and rejection patterns are now available via BI dashboards on Hadoop.
Storage costs reduced: Document and image storage has been fully moved off Salesforce into the on-premise DMS and HDFS.
Daily sync automated: The Salesforce-to-Hadoop pipeline runs automatically every day, with no manual triggers or team intervention needed.
Decoupled architecture: The staging database lets Hadoop ingest independently, with no live Salesforce dependency at query time.
Regulatory traceability: Every record and document is traceable, satisfying long-term archival and audit requirements.
Enterprise BI now possible: Analytics teams have a single queryable source of truth across all customer, loan, and operations data.
08 - Frequently Asked Questions

Common Questions About This Project

How did Auriga IT migrate data from Salesforce to Hadoop?
Auriga IT used Apache NiFi as the core ETL engine hosted on cloud VMs. Structured CRM data was extracted from Salesforce via Bulk API v2, loaded into an on-premise relational database as a staging layer, and then ingested into Hadoop HDFS. Documents and images followed a separate flow handled by a custom Java NiFi processor that extracted document objects from Salesforce and uploaded them to a document management system via an API gateway.
How much data was migrated?
The migration covered 130 million+ transactions, millions of documents and images, and more than 10 TB of data end-to-end. The live daily pipeline continues to run at high throughput, fully automated, with zero data loss and 100% checksum verification across all records.
What is Apache NiFi and why was it chosen?
Apache NiFi is a data flow automation and ETL platform built for real-time data ingestion, transformation, and routing. It was chosen for its built-in Salesforce API support, scalable parallel processing, back-pressure management, custom processor extensibility via Java, and detailed audit logging - all essential for a migration at this scale and with these compliance requirements.
How was zero data loss ensured?
Reconciliation tables tracked every record with source-to-target counts, MD5/SHA-256 checksum verification, timestamp audit logs, and failed record tracking. Watermark-based incremental extraction guaranteed exactly-once delivery. Automated end-of-day reports validated daily completeness, resulting in 100% verified data across all migrated records.
Why did the bank need to move off Salesforce?
The primary drivers were unsustainable Salesforce storage costs at banking scale, limited analytics capability for complex cross-object queries, document retrieval bottlenecks impacting daily operations, regulatory requirements for long-term auditable document storage, and the need for a centralised data lake to power enterprise analytics and BI.
How long has the pipeline been running?
The project was initiated in May 2024 and the live pipeline has been running since then. As of March 2026, it continues to process large volumes of transactions and documents daily on a fully automated basis with no manual intervention required.
