
Analytics & Compliance with Salesforce-Hadoop Integration

Published On: 2 May 2024 · By Auriga IT
Enterprise Data Engineering  ·  Banking  ·  India

Yes Bank —
Salesforce to Hadoop
Data Migration

How Auriga IT architected a multi-layer, enterprise-grade ETL pipeline to migrate 13 Crore+ transactions, 1 Crore+ documents, and 10+ TB of data from Salesforce CRM into a Hadoop-based data lake — with zero data loss and full regulatory compliance.

Apache NiFi
Salesforce
Oracle DB
Hadoop / HDFS
Azure VMs
Newgen DMS
Power BI

Published: 2 May 2024  ·  Updated: 25 March 2026  ·  By Auriga IT

13 Cr+
Total Transactions
4 Lakh+
Daily Transactions
30,000+
Documents per Day
1 Cr+
Documents Migrated
10+ TB
Data Processed
100%
Checksum Verified
01 — About the Client

Yes Bank

Yes Bank Limited
Leading Private Sector Bank — Mumbai, India — Est. 2004

Yes Bank is one of India's fastest-growing private sector banks, serving retail, corporate, and MSME clients across the full spectrum of banking and financial services. With millions of customer records, loan applications, and regulated financial documents, Yes Bank requires enterprise-grade data infrastructure to support deep analytics, operational efficiency, and regulatory compliance at scale.

02 — Problem Statement

Why Yes Bank Needed to Move Off Salesforce

Yes Bank's entire CRM, loan data, and document archive lived within Salesforce — a system not designed for data warehousing, large-scale analytics, or long-term regulated archival. As data volumes grew into the crores of records, these limitations became both operational and financial liabilities.

01
Unsustainable Storage Costs
Salesforce licensing and storage pricing for documents and images at banking scale was no longer financially viable for long-term archival.
02
Limited Analytics Capability
Salesforce is not a data warehouse. Complex cross-object queries and BI workloads were slow, costly, and lacked the analytical depth teams needed.
03
Document Retrieval Bottlenecks
Retrieving large volumes of loan-related documents and images from Salesforce caused daily operational delays for dependent teams.
04
Regulatory Compliance
RBI regulations required auditable, long-term document storage with full metadata traceability — mandates Salesforce alone could not satisfy.
05
No Centralised Data Lake
BI and analytics teams had no single queryable source of truth. Data was siloed across Salesforce objects, making enterprise reporting impossible.
06
Real-Time and Historical Sync
Updated Salesforce records needed seamless merging with large historical datasets in Hadoop — without duplication, gaps, or full re-processing.
03 — Solution Architecture

A Multi-Layer ETL Pipeline with Apache NiFi

Auriga IT designed and built an enterprise-grade, multi-layered data pipeline with Apache NiFi as the central orchestration engine. The architecture bridges cloud (Azure) and on-premise systems via ExpressRoute — secured through DataPower API gateway — handling structured CRM data and unstructured documents through separate, purpose-optimised flows.

Architecture Diagram
Salesforce (Source) → Apache NiFi on Azure → Oracle DB (Staging) → Hadoop HDFS (Data Lake)  |  Documents → Newgen DMS via DataPower
Pipeline — Step by Step
1
Extraction via Salesforce Bulk API v2
Apache NiFi (hosted on Azure VMs) extracts structured CRM data from Salesforce using Bulk API v2, partitioned by object type and date range. A watermark-based incremental strategy ensures only updated records are fetched per run — guaranteeing exactly-once delivery without re-processing historical data. The "last modified date" approach keeps the Hadoop environment continuously current with Salesforce.
2
Oracle DB — On-Premise Staging Layer
Extracted structured data is loaded into Oracle DB on-premise, serving as a trusted, independently queryable staging layer. This decouples the Salesforce extraction phase from the Hadoop ingestion phase — the Hadoop team can ingest from Oracle independently, with full retry capability that does not re-touch Salesforce. Oracle TDE and Audit Vault ensure data security and auditability at the staging layer.
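The retry safety the staging layer provides comes down to idempotent, key-based loading: re-running a failed batch cannot create duplicates. A minimal sketch, using SQLite as a stand-in for Oracle and an assumed stg_account schema:

```python
import sqlite3

def stage_batch(conn, rows):
    """Upsert a batch into staging, keyed by Salesforce record Id."""
    conn.execute("""CREATE TABLE IF NOT EXISTS stg_account (
        sfdc_id TEXT PRIMARY KEY, payload TEXT, last_modified TEXT)""")
    conn.executemany(
        "INSERT OR REPLACE INTO stg_account (sfdc_id, payload, last_modified) VALUES (?, ?, ?)",
        rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
batch = [("001A", '{"name": "Acme"}', "2024-05-01T00:00:00Z")]
stage_batch(conn, batch)
stage_batch(conn, batch)  # retry of the same batch: still one row
count = conn.execute("SELECT COUNT(*) FROM stg_account").fetchone()[0]
```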
3
Custom Java NiFi Processor — Document and Image Upload
A purpose-built Java processor (AbstractProcessor, Maven-packaged) handles extraction of Salesforce ContentDocument and ContentVersion objects and uploads them to Newgen DMS via DataPower. Key capabilities: binary extraction via REST API, metadata mapping to DMS taxonomy, batched uploads with configurable concurrency, chunked transfer encoding for binaries over 100MB, exponential backoff retry (initial 1s, max 60s, max 5 retries) for HTTP 429 and 5xx errors, dead-letter routing for permanent failures, and sanitization for special characters (&, <, >, quotes, non-ASCII) before all DMS API calls.
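Two of the processor's behaviours can be illustrated in a short sketch: the backoff schedule (1s initial, 60s cap, 5 retries) and the pre-upload sanitization. The exact escaping and non-ASCII rules shown here are assumptions, not the production Java code:

```python
import html
import unicodedata

def backoff_delays(initial=1.0, cap=60.0, retries=5):
    """Exponential backoff delays in seconds: 1, 2, 4, 8, 16 (capped at 60)."""
    return [min(initial * 2 ** i, cap) for i in range(retries)]

def sanitize_for_dms(value):
    """Escape &, <, >, quotes and strip non-ASCII before a DMS API call."""
    escaped = html.escape(value, quote=True)
    # NFKD-decompose accented characters, then drop what ASCII cannot carry.
    return unicodedata.normalize("NFKD", escaped).encode("ascii", "ignore").decode()
```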
4
Hadoop HDFS — Data Lake Ingestion
Data is ingested from Oracle DB into Hadoop HDFS on-premise. Output file sizes are controlled relative to the Hadoop block size for optimum distributed storage utilisation, and JSON-to-CSV transformation is applied for Hadoop compatibility. NiFi clusters scaled horizontally during peak extraction windows, using separate process groups per object type — each with independent scheduling and back-pressure configuration.
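The block-size-aware file sizing and the JSON-to-CSV step can be sketched as follows, assuming a 128 MB HDFS block and hypothetical field names:

```python
import csv
import io

HDFS_BLOCK_BYTES = 128 * 1024 * 1024  # assumed default block size

def json_to_csv(records, fieldnames):
    """Flatten a list of JSON-like dicts into one CSV string."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

def rows_per_file(avg_row_bytes, block_bytes=HDFS_BLOCK_BYTES):
    """Rows per output file so one file fills roughly one HDFS block."""
    return max(1, block_bytes // avg_row_bytes)

out = json_to_csv([{"Id": "1", "Amount": "500"}], ["Id", "Amount"])
```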
5
Power BI Analytics Layer
Power BI connects directly to Hadoop, enabling the analytics team to build dashboards across loan processing metrics, First-Time Right (FTR) rates, customer behaviour patterns, loan application rejection ratios, and more — leveraging the full processing power of the distributed data lake for actionable business insights.
6
Reconciliation and Full Audit Trail
Reconciliation tables track every record end-to-end: source-to-target counts, MD5/SHA-256 checksum verification, timestamp audit logs, failed record tracking, and automated end-of-day reports. Every record and document is fully traceable — satisfying both internal data governance requirements and external RBI regulatory compliance mandates.
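A minimal sketch of checksum-based reconciliation, assuming records keyed by Salesforce Id and a canonical JSON serialisation (the production reconciliation tables and digest inputs may differ):

```python
import hashlib
import json

def record_digest(record):
    """SHA-256 over a canonical serialisation, so both sides hash identically."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def reconcile(source_records, target_records):
    """Compare counts and per-record digests between source and target."""
    src = {r["Id"]: record_digest(r) for r in source_records}
    tgt = {r["Id"]: record_digest(r) for r in target_records}
    return {
        "source_count": len(src),
        "target_count": len(tgt),
        "missing": sorted(set(src) - set(tgt)),
        "mismatched": sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k]),
    }

report = reconcile(
    [{"Id": "1", "Amount": 500}, {"Id": "2", "Amount": 900}],
    [{"Id": "1", "Amount": 500}, {"Id": "2", "Amount": 901}],
)
```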
04 — Technology Stack

What Powered the Pipeline

Category | Technology | Role in the Pipeline
Source | Salesforce CRM | Source of structured CRM records, loan documents, and images
Orchestration | Apache NiFi | Core ETL engine — extraction, transformation, routing; hosted on Azure VMs
Custom Dev | Java (NiFi API) | Purpose-built AbstractProcessor for Newgen DMS document upload via DataPower
Database | Oracle DB | On-premise staging layer between NiFi and Hadoop — decouples pipeline stages
Data Lake | Hadoop / HDFS | On-premise distributed analytics storage for large-scale querying
DMS | Newgen DMS | On-premise document management for long-term regulated document archival
API Gateway | DataPower | Secures all NiFi-to-on-premise service communication
Cloud | Azure + ExpressRoute | NiFi hosting and private, low-latency connectivity to on-premise infrastructure
Analytics | Power BI | Connected to Hadoop for business intelligence and loan analytics reporting
Security | SOC 2 / ISO 27001 | Oracle TDE, Audit Vault, NiFi TLS, data privacy compliance throughout
05 — Challenges and Solutions

Technical Challenges and How They Were Solved

Challenge | Solution
Crores of records per Salesforce object | Bulk API v2 with NiFi parallel extraction partitioned by object type and date range. Sustained 30,000+ docs/day.
Timestamp and watermark management | Watermark-based incremental extraction. Timestamps updated only after Oracle commit — guaranteeing exactly-once delivery.
Multiple concurrent NiFi flows | Separate NiFi process groups per object type and document flow, each with independent scheduling and back-pressure configuration.
Special characters causing DMS failures | Dedicated sanitization layer in the custom processor for &, <, >, double quotes, and non-ASCII before every DMS API call.
Salesforce API rate limits | Adaptive throttling with Bulk API for large sets and REST API for document queries — dynamically balanced per load.
Large binaries over 100MB | NiFi streaming content repository combined with chunked transfer encoding for all oversized DMS uploads.
Data duplication and reconciliation | Reconciliation tables with MD5/SHA-256 checksums, source-to-target counts, and automated end-of-day audit reports.
Legacy SOAP-only integrations | Custom SOAP processors built to handle loan data migration flows where legacy systems used SOAP instead of REST.
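Several of these mitigations reduce to simple streaming patterns. For example, the chunked handling of oversized binaries can be sketched as a generator that never loads the full file into memory (the 8 MB default chunk size is an assumption):

```python
import io

def iter_chunks(stream, chunk_bytes=8 * 1024 * 1024):
    """Yield fixed-size chunks from a binary stream without loading it fully."""
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:
            break
        yield chunk

# Demo on an in-memory stream: 100 bytes in 32-byte chunks.
chunks = list(iter_chunks(io.BytesIO(b"x" * 100), chunk_bytes=32))
```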
06 — Scalability and Performance

Case Studies

1

High-Volume CRM Record Migration

Bulk migration of over 1 Crore Salesforce CRM records and more than 10 TB of structured data into Oracle DB staging, followed by full ingestion into Hadoop HDFS. Apache NiFi clusters were scaled horizontally during peak extraction windows, with Bulk API v2 parallelism partitioned by object type and date range. Watermark-based tracking ensured exactly-once delivery throughout. 100% record match was validated post-migration via checksum verification and reconciliation reports — with zero data loss recorded across all Salesforce objects.

1 Crore+ Records
10+ TB Data
Bulk API v2
100% Verified
Zero Data Loss
2

Document and Image Migration to Newgen DMS

Over 1 Crore documents and images — PDFs, JPEGs, PNGs, and Word files — migrated to Newgen DMS at a sustained rate of 30,000+ documents per day. The custom Java NiFi processor handled binary extraction from Salesforce ContentVersion, metadata schema mapping to DMS taxonomy, batched uploads with configurable concurrency, and chunked transfer encoding for large binaries over 100MB. Exponential backoff (1s to 60s, max 5 retries) managed HTTP 429 and 5xx transient failures, while a dead-letter queue captured permanent failures for review. A full, uninterrupted audit trail was maintained for every document migrated.

1 Crore+ Docs
30K docs/day
100MB+ Binaries
Exponential Backoff
Full Audit Trail
07 — Results and Business Impact

Key Outcomes

The migration fundamentally transformed Yes Bank's data infrastructure — unlocking analytics capabilities that were previously impossible, significantly reducing storage costs, and placing the bank in full regulatory compliance across every migrated record and document.

Metric | Outcome
Total transactions migrated | 13 Crore (130M) across all Salesforce objects
Daily transaction volume | 4 Lakh+ per day, fully automated
Document migration throughput | 30,000+ documents/day — unattended
Total documents migrated | 1 Crore+ with full metadata preserved
Total data volume | 10+ TB end-to-end
Data integrity | Zero data loss — 100% checksum-verified
Regulatory compliance | Full audit trail — RBI mandate satisfied
Pipeline reliability | Fully automated, monitoring + retry, zero manual intervention
Loan Analytics Unlocked
FTR rates, processing times, and rejection ratios now accessible via Power BI on Hadoop
Storage Cost Reduction
Salesforce document and image storage fully offloaded to on-premise Newgen DMS and HDFS
Daily Automated Sync
Salesforce-to-Hadoop runs daily, automatically — no manual triggers or interventions required
Decoupled Architecture
Oracle DB staging allows Hadoop to ingest independently with no live Salesforce dependency
Regulatory Compliance
Every record and document traceable — satisfying RBI archival and auditability requirements
Enterprise BI Layer
Power BI delivers actionable insights on customer behaviour, loans, and operations at scale
08 — Frequently Asked Questions

Questions About This Project

How did Auriga IT migrate data from Salesforce to Hadoop for Yes Bank?
Auriga IT used Apache NiFi as the core ETL engine hosted on Azure VMs. Structured CRM data was extracted from Salesforce via Bulk API v2, loaded into Oracle DB on-premise as a staging layer, and then ingested into Hadoop HDFS. Documents and images were handled separately by a custom Java NiFi processor that extracted Salesforce ContentVersion objects and uploaded them to Newgen DMS via a DataPower API gateway.
How much data was migrated from Salesforce to Hadoop for Yes Bank?
The migration covered 13 Crore+ (130 million) total transactions, over 10 TB of structured data, and 1 Crore+ (10 million) documents and images. The live pipeline sustains 4 Lakh+ daily transactions and 30,000+ documents per day — all checksum-verified with zero data loss.
What is Apache NiFi and why was it chosen for this migration?
Apache NiFi is a data flow automation and ETL platform that provides real-time data ingestion, transformation, and routing. It was chosen for its built-in Salesforce API support, scalable parallel processing, back-pressure management, custom processor extensibility via Java, and robust audit logging — essential for a migration of this scale and compliance requirement.
How was data integrity and zero data loss ensured during the Yes Bank migration?
Reconciliation tables tracked every record with source-to-target counts, MD5/SHA-256 checksum verification, timestamp audit logs, and failed record tracking. Watermark-based incremental extraction guaranteed exactly-once delivery. Automated end-of-day reports validated daily completeness — resulting in 100% verified data across all 13 Crore transactions.
Why did Yes Bank migrate from Salesforce to Hadoop?
The primary drivers were unsustainable Salesforce storage costs at banking scale, limited analytics and complex query capability, slow document retrieval impacting operations, RBI regulatory requirements for long-term auditable document storage, and the need for a centralised data lake to power enterprise analytics via Power BI.
How long did the Yes Bank Salesforce to Hadoop migration take?
The case study was first published in May 2024. The live pipeline continues to operate as of March 2026, processing 4 Lakh+ transactions and 30,000+ documents daily on a fully automated basis with no manual intervention required.

Planning a Similar Data Migration?

Auriga IT has deep expertise in enterprise data engineering, Apache NiFi pipelines, and large-scale Salesforce migrations. Let us help with your next data challenge.

Talk to Our Team →

Case Study — Yes Bank Data Migration  ·  © Auriga IT 2024–2026
