
From Salesforce CRM to a Hadoop Data Lake: 130 Million Transactions Migrated with Zero Data Loss

How Auriga IT moved a private sector bank's entire Salesforce CRM data into a Hadoop data lake - 130 million+ transactions, millions of documents, and over 10 TB - with zero data loss and a fully automated daily pipeline that runs without any manual intervention.
Published: 2 May 2024 · Updated: 25 March 2026 · By Auriga IT
Yes Bank
Every day, the bank generates large volumes of structured records and regulated documents across lending, customer service, and compliance operations. As these volumes grew, storing and querying everything within Salesforce became increasingly expensive and slow, making it difficult to support the bank's long-term archival and analytics requirements.
Why Salesforce Could No Longer Be the Data Home
The issues were concrete. Salesforce storage pricing at banking scale was not viable for long-term data archival. Analytics teams could not run the cross-object queries they needed. Retrieving large volumes of loan documents caused daily delays for dependent teams. And regulatory requirements mandated auditable long-term storage that Salesforce alone could not deliver.
A Six-Stage ETL Pipeline Built on Apache NiFi
The pipeline was designed around one principle: decouple each stage so a failure in one layer does not cascade into others. Salesforce extraction is independent from staging. Staging is independent from data lake ingestion. Documents follow a completely separate flow with their own retry and error handling. Each stage can be monitored, replayed, or scaled independently.
What Powered the Pipeline
| Category | Technology | Role in the pipeline |
|---|---|---|
| Source | Salesforce CRM | Source of structured CRM records, loan documents, and images |
| Orchestration | Apache NiFi | Core ETL engine - extraction, transformation, routing; hosted on cloud VMs |
| Custom Dev | Java (NiFi API) | Purpose-built processor for document upload to DMS via API gateway |
| Database | Relational DB | On-premise staging layer between the ETL engine and the data lake |
| Data Lake | Hadoop / HDFS | On-premise distributed analytics storage for large-scale querying |
| DMS | Document Mgmt System | On-premise regulated document archival with full metadata preserved |
| API Gateway | On-premise Gateway | Secures all ETL-to-on-premise service communication |
| Connectivity | Private Tunnel | Encrypted, private cloud-to-on-premise network link |
| Analytics | BI Analytics Tool | Connected to Hadoop for dashboards and business intelligence reporting |
| Security | SOC 2 / ISO 27001 | Encryption at rest and in transit, audit vault, data privacy compliance |
Eight Problems the Team Had to Solve
| Challenge | How it was solved |
|---|---|
| Hundreds of millions of records per Salesforce object | Bulk API v2 with NiFi parallel extraction, partitioned by object type and date range, kept extraction throughput consistently high. |
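| Watermark and timestamp management | Watermark-based incremental extraction. Timestamps are updated only after the staging database commit, so an interrupted run re-reads from the last committed watermark instead of skipping records. Together with the reconciliation checks, this gives effectively exactly-once delivery. |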
| Multiple concurrent NiFi flows competing for resources | Separate NiFi process groups per object type and document flow, each with independent scheduling and back-pressure configuration. |
| Special characters causing document management system failures | A dedicated sanitisation step in the custom processor cleans ampersands, angle brackets, quotes, and non-ASCII characters before every DMS API call. |
| Salesforce API rate limits under sustained load | Adaptive throttling: Bulk API for large structured sets, REST API for document queries, dynamically balanced per load. |
| Binary files over 100 MB crashing standard upload flows | NiFi streaming content repository combined with chunked transfer encoding for all oversized DMS uploads. |
| Ensuring no duplication or data loss across stages | Reconciliation tables with MD5/SHA-256 checksums, source-to-target counts, and automated end-of-day audit reports per object type. |
| Legacy SOAP-only integrations in the on-premise layer | Custom SOAP processors built for loan data flows where legacy systems used SOAP rather than REST. |
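The sanitisation step in the table above can be illustrated with a short Python sketch. The production processor is written in Java and its exact rules are not published, so everything here is an assumption: the function name `sanitise_metadata_value` is hypothetical, and "cleaning" is interpreted as transliterating accented letters to ASCII, dropping other non-ASCII characters, and replacing ampersands, angle brackets, and quotes with spaces.

```python
import re
import unicodedata

# Characters the article names as problematic: ampersands, angle brackets, quotes.
_UNSAFE = re.compile(r"[&<>\"']")


def sanitise_metadata_value(value: str) -> str:
    """Return an ASCII-only string containing no &, <, > or quote characters.

    Hypothetical helper: the real cleaning rules of the DMS processor are unknown.
    """
    # Decompose accented letters ("é" becomes "e" plus a combining accent) so the
    # base letter survives the ASCII filter on the next line.
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace each unsafe character with a space, then collapse whitespace runs.
    cleaned = _UNSAFE.sub(" ", ascii_only)
    return re.sub(r"\s+", " ", cleaned).strip()


if __name__ == "__main__":
    print(sanitise_metadata_value('Smith & Sons <Loan> "Café"'))
```

Replacing rather than escaping is a lossy choice. A DMS that accepts XML or JSON escaping would normally be better served by escaping, which preserves the original value.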
Two Flows Worth Looking at Closely
Moving 130 Million+ CRM Records into the Data Lake
The structured migration moved all Salesforce CRM records (customer data, loan records, and transaction history) into the on-premise staging database, followed by full ingestion into Hadoop HDFS. Apache NiFi clusters were scaled horizontally during peak windows. Bulk API v2 parallelism, partitioned by object type and date range, kept extraction throughput high. Watermark-based tracking ensured every run picked up only what had changed since the last run, without re-processing historical records. Post-migration, a 100% record match was confirmed across all Salesforce objects via checksum verification and automated reconciliation reports.
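As a rough illustration of watermark-based tracking, the Python sketch below uses an in-memory SQLite database as a stand-in for the staging layer. The table names, the `run_incremental` function, and the `fetch_changed_since` callback are all hypothetical. The real pipeline runs in NiFi against Salesforce's Bulk API, which is not shown here.

```python
import sqlite3
from typing import Callable, Iterable, Tuple

# (record_id, payload, last_modified as an ISO-8601 UTC string)
Record = Tuple[str, str, str]
EPOCH = "1970-01-01T00:00:00Z"


def init_staging(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS staging "
                 "(id TEXT PRIMARY KEY, payload TEXT, last_modified TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS watermark "
                 "(object_type TEXT PRIMARY KEY, value TEXT)")
    conn.commit()


def get_watermark(conn: sqlite3.Connection, object_type: str) -> str:
    row = conn.execute("SELECT value FROM watermark WHERE object_type = ?",
                       (object_type,)).fetchone()
    return row[0] if row else EPOCH


def run_incremental(conn: sqlite3.Connection, object_type: str,
                    fetch_changed_since: Callable[[str], Iterable[Record]]) -> int:
    """Stage records changed since the last watermark; return how many were staged."""
    since = get_watermark(conn, object_type)
    records = list(fetch_changed_since(since))
    if not records:
        return 0
    # Step 1: upsert into staging and commit. Keying on the record id makes a
    # re-run after a crash overwrite rows instead of duplicating them.
    conn.executemany("INSERT OR REPLACE INTO staging (id, payload, last_modified) "
                     "VALUES (?, ?, ?)", records)
    conn.commit()
    # Step 2: only now advance the watermark. Same-format ISO-8601 strings
    # compare correctly as plain strings.
    new_mark = max(r[2] for r in records)
    conn.execute("INSERT OR REPLACE INTO watermark (object_type, value) VALUES (?, ?)",
                 (object_type, new_mark))
    conn.commit()
    return len(records)
```

Advancing the watermark after the commit gives at-least-once behaviour on its own: a crash between the two steps causes the same window to be extracted again. The idempotent upsert in step 1 is what makes that repeat harmless in this sketch.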
Archiving Millions of Loan Documents at High Throughput
Millions of PDFs, JPEGs, PNGs, and Word files associated with loan applications were migrated to the document management system at a sustained high daily throughput. The custom Java NiFi processor handled the full flow: binary extraction from Salesforce document objects, metadata schema mapping to the DMS taxonomy, batched uploads with configurable concurrency, and chunked transfer encoding for files over 100 MB. Exponential backoff retry managed transient API failures, while a dead-letter queue captured permanent failures for review rather than silently losing them. Every document was archived with full metadata preserved, satisfying long-term regulatory traceability requirements.
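The retry and dead-letter behaviour described above might look roughly like the following Python sketch. The exception classes, the `upload_with_retry` function, and the default attempt count and delay are assumptions made for illustration. The article does not state the actual limits, and the production processor is written in Java.

```python
import time
from typing import Callable, List, Tuple


class TransientError(Exception):
    """Failure worth retrying, such as a timeout or an HTTP 503 (hypothetical)."""


class PermanentError(Exception):
    """Failure that a retry cannot fix, such as rejected metadata (hypothetical)."""


def upload_with_retry(doc_id: str,
                      upload: Callable[[str], None],
                      dead_letters: List[Tuple[str, str]],
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try an upload with exponential backoff; record unrecoverable failures."""
    for attempt in range(max_attempts):
        try:
            upload(doc_id)
            return True
        except PermanentError as exc:
            # No point retrying: keep the document id and reason for review.
            dead_letters.append((doc_id, str(exc)))
            return False
        except TransientError as exc:
            if attempt == max_attempts - 1:
                dead_letters.append((doc_id, f"retries exhausted: {exc}"))
                return False
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return False
```

Passing `sleep` as a parameter keeps the function testable without real waiting. A production version would usually add random jitter to the delay so that many failing uploads do not retry in lockstep.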
What Changed After the Migration
| Metric | Outcome |
|---|---|
| Total transactions migrated | 130M+ across all Salesforce objects |
| Daily pipeline | Runs automatically, no manual triggers |
| Document migration throughput | High daily volume, fully unattended |
| Total documents archived | Millions, full metadata preserved |
| Total data volume | 10+ TB end-to-end |
| Data integrity | Zero data loss, 100% checksum-verified |
| Regulatory compliance | Full audit trail, compliance mandate satisfied |
| Manual intervention required | None - monitoring and retry are built in |
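For readers wondering what "checksum-verified" can mean in practice, here is a minimal Python sketch of a source-to-target reconciliation using SHA-256. The function names and the report layout are hypothetical. The bank's actual reconciliation tables and audit report format are not described in this case study.

```python
import hashlib
from typing import Dict, Tuple

Row = Tuple[str, ...]


def row_digest(fields: Row) -> str:
    """SHA-256 over the row's fields, in order."""
    h = hashlib.sha256()
    for field in fields:
        data = field.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def reconcile(source: Dict[str, Row], target: Dict[str, Row]) -> dict:
    """Compare two {record_id: row} snapshots and summarise every discrepancy."""
    missing = sorted(set(source) - set(target))      # in source, not in target
    unexpected = sorted(set(target) - set(source))   # in target, not in source
    mismatched = sorted(
        key for key in source.keys() & target.keys()
        if row_digest(source[key]) != row_digest(target[key])
    )
    return {
        "source_count": len(source),
        "target_count": len(target),
        "missing": missing,
        "unexpected": unexpected,
        "mismatched": mismatched,
        "match": not (missing or unexpected or mismatched),
    }
```

Matching counts alone would miss a record whose content changed in transit, which is why the sketch compares per-row digests as well as the key sets.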






