
Analytics & Compliance with Salesforce-Hadoop Integration
Yes Bank — Salesforce to Hadoop Data Migration
How Auriga IT architected a multi-layer, enterprise-grade ETL pipeline to migrate 13 Crore+ transactions, 1 Crore+ documents, and 10+ TB of data from Salesforce CRM into a Hadoop-based data lake — with zero data loss and full regulatory compliance.
Published: 2 May 2024 · Updated: 25 March 2026 · By Auriga IT
Yes Bank
Yes Bank is one of India's fastest-growing private sector banks, serving retail, corporate, and MSME clients across the full spectrum of banking and financial services. With millions of customer records, loan applications, and regulated financial documents, Yes Bank requires enterprise-grade data infrastructure to support deep analytics, operational efficiency, and regulatory compliance at scale.
Why Yes Bank Needed to Move Off Salesforce
Yes Bank's entire CRM, loan data, and document archive lived within Salesforce — a system not designed for data warehousing, large-scale analytics, or long-term regulated archival. As data volumes grew into crores of records, these limitations became both operational and financial liabilities.
A Multi-Layer ETL Pipeline with Apache NiFi
Auriga IT designed and built an enterprise-grade, multi-layered data pipeline with Apache NiFi as the central orchestration engine. The architecture bridges cloud (Azure) and on-premise systems via ExpressRoute — secured through DataPower API gateway — handling structured CRM data and unstructured documents through separate, purpose-optimised flows.
What Powered the Pipeline
| Category | Technology | Role in the Pipeline |
|---|---|---|
| Source | Salesforce CRM | Source of structured CRM records, loan documents, and images |
| Orchestration | Apache NiFi | Core ETL engine — extraction, transformation, routing; hosted on Azure VMs |
| Custom Dev | Java (NiFi API) | Purpose-built AbstractProcessor for Newgen DMS document upload via DataPower |
| Database | Oracle DB | On-premise staging layer between NiFi and Hadoop — decouples pipeline stages |
| Data Lake | Hadoop / HDFS | On-premise distributed analytics storage for large-scale querying |
| DMS | Newgen DMS | On-premise document management for long-term regulated document archival |
| API Gateway | DataPower | Secures all NiFi-to-on-premise service communication |
| Cloud | Azure + ExpressRoute | NiFi hosting and private, low-latency connectivity to on-premise infrastructure |
| Analytics | Power BI | Connected to Hadoop for business intelligence and loan analytics reporting |
| Security | SOC 2 / ISO 27001 | Oracle TDE, Audit Vault, NiFi TLS, data privacy compliance throughout |
Technical Challenges and How They Were Solved
| Challenge | Solution |
|---|---|
| Crores of records per Salesforce object | Bulk API v2 with NiFi parallel extraction partitioned by object type and date range, sustaining 4 Lakh+ records per day. |
| Timestamp and watermark management | Watermark-based incremental extraction. Timestamps updated only after Oracle commit — guaranteeing exactly-once delivery. |
| Multiple concurrent NiFi flows | Separate NiFi process groups per object type and document flow, each with independent scheduling and back-pressure configuration. |
| Special characters causing DMS failures | Dedicated sanitization layer in the custom processor for &, <, >, double quotes, and non-ASCII before every DMS API call. |
| Salesforce API rate limits | Adaptive throttling with Bulk API for large sets and REST API for document queries — dynamically balanced per load. |
| Large binaries over 100MB | NiFi streaming content repository combined with chunked transfer encoding for all oversized DMS uploads. |
| Data duplication and reconciliation | Reconciliation tables with MD5/SHA-256 checksums, source-to-target counts, and automated end-of-day audit reports. |
| Legacy SOAP-only integrations | Custom SOAP processors built to handle loan data migration flows where legacy systems used SOAP instead of REST. |
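The watermark pattern from the table above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `fetch_since` and `stage_to_oracle` are hypothetical stand-ins for the Salesforce Bulk API v2 query and the Oracle staging write.

```python
def run_increment(fetch_since, stage_to_oracle, watermark):
    """One incremental pull. `fetch_since` queries records modified
    after the watermark; `stage_to_oracle` writes and commits them
    to the staging layer. The watermark advances only after a
    successful commit, so a failed batch is re-read on the next run
    instead of being silently skipped."""
    records = fetch_since(watermark)
    if not records:
        return watermark          # nothing new; keep the old watermark
    if stage_to_oracle(records):
        # New watermark = latest modification stamp actually committed.
        return max(r["SystemModstamp"] for r in records)
    return watermark              # commit failed: same window is retried next run
```

Because the watermark only moves on commit, a crash between extraction and staging re-delivers the same window; downstream reconciliation then filters any duplicates.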
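The sanitization layer for DMS uploads can be sketched along these lines, assuming the DMS API rejects raw XML metacharacters and non-ASCII input; the exact escape set would follow the Newgen API contract, so treat this as illustrative.

```python
# Escape XML metacharacters and strip non-ASCII before a DMS API call.
XML_ESCAPES = {"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;"}

def sanitize_for_dms(value):
    """Escape &, <, >, and double quotes, then drop any non-ASCII
    characters the DMS API cannot accept."""
    # '&' must be escaped first so other escapes are not double-escaped.
    out = value.replace("&", "&amp;")
    for ch, esc in XML_ESCAPES.items():
        if ch != "&":
            out = out.replace(ch, esc)
    return out.encode("ascii", errors="ignore").decode("ascii")
```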
Case Studies
High-Volume CRM Record Migration
Bulk migration of over 1 Crore Salesforce CRM records and more than 10 TB of structured data into Oracle DB staging, followed by full ingestion into Hadoop HDFS. Apache NiFi clusters were scaled horizontally during peak extraction windows, with Bulk API v2 parallelism partitioned by object type and date range. Watermark-based tracking ensured exactly-once delivery throughout. 100% record match was validated post-migration via checksum verification and reconciliation reports — with zero data loss recorded across all Salesforce objects.
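The source-to-target reconciliation described above could look roughly like this. Field names and the choice of MD5 are illustrative (the pipeline used MD5/SHA-256 checksums); the real audit compared counts and checksums per Salesforce object.

```python
import hashlib

def row_checksum(row):
    """Deterministic MD5 over a record's fields, keys sorted so
    column order cannot change the checksum."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def reconcile(source_rows, target_rows):
    """Compare source and target by Id + checksum; return
    (missing_ids, mismatched_ids) for the end-of-day audit report."""
    src = {r["Id"]: row_checksum(r) for r in source_rows}
    tgt = {r["Id"]: row_checksum(r) for r in target_rows}
    missing = sorted(set(src) - set(tgt))
    mismatched = sorted(i for i in src if i in tgt and src[i] != tgt[i])
    return missing, mismatched
```

A clean run is simply `([], [])` plus matching row counts, which is what "100% record match" amounts to in report form.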
Document and Image Migration to Newgen DMS
Over 1 Crore documents and images — PDFs, JPEGs, PNGs, and Word files — migrated to Newgen DMS at a sustained rate of 30,000+ documents per day. The custom Java NiFi processor handled binary extraction from Salesforce ContentVersion, metadata schema mapping to DMS taxonomy, batched uploads with configurable concurrency, and chunked transfer encoding for large binaries over 100MB. Exponential backoff (1s to 60s, max 5 retries) managed HTTP 429 and 5xx transient failures, while a dead-letter queue captured permanent failures for review. A full, uninterrupted audit trail was maintained for every document migrated.
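The retry policy above (1s to 60s backoff, max 5 retries, dead-letter queue for failures) can be sketched as follows. `upload` is a hypothetical callable returning an HTTP status code, and the actual sleep is left as a comment so the sketch stays self-contained.

```python
import random

MAX_RETRIES = 5
BASE_DELAY, MAX_DELAY = 1.0, 60.0   # seconds, per the 1s-60s window

def backoff_delay(attempt):
    """Exponential backoff capped at 60s, with jitter to avoid
    synchronized retries across concurrent flows."""
    return min(MAX_DELAY, BASE_DELAY * (2 ** attempt)) * random.uniform(0.5, 1.0)

def upload_with_retry(upload, doc, dead_letter):
    """Retry transient HTTP 429/5xx failures up to MAX_RETRIES times;
    park exhausted or permanently failed documents on the
    dead-letter queue for manual review."""
    for attempt in range(MAX_RETRIES):
        status = upload(doc)
        if status < 400:
            return True
        if status == 429 or status >= 500:
            # Transient: retry (production code sleeps backoff_delay(attempt) here).
            continue
        break                        # other 4xx: permanent, do not retry
    dead_letter.append(doc)
    return False
```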
Key Outcomes
The migration fundamentally transformed Yes Bank's data infrastructure — unlocking analytics capabilities that were previously impossible, significantly reducing storage costs, and placing the bank in full regulatory compliance across every migrated record and document.
| Metric | Outcome |
|---|---|
| Total transactions migrated | 13 Crore (130M) across all Salesforce objects |
| Daily transaction volume | 4 Lakh+ per day, fully automated |
| Document migration throughput | 30,000+ documents/day — unattended |
| Total documents migrated | 1 Crore+ with full metadata preserved |
| Total data volume | 10+ TB end-to-end |
| Data integrity | Zero data loss — 100% checksum-verified |
| Regulatory compliance | Full audit trail — RBI mandate satisfied |
| Pipeline reliability | Fully automated, monitoring + retry, zero manual intervention |
Planning a Similar Data Migration?
Auriga IT has deep expertise in enterprise data engineering, Apache NiFi pipelines, and large-scale Salesforce migrations. Let us help with your next data challenge.
Talk to Our Team →