One of India’s largest logistics and supply chain companies generates massive volumes of operational and transactional data daily. Over time, this resulted in the accumulation of over XX TB of historical data across ~XXX Snowflake tables. While Snowflake had served the company's analytical needs well initially, it began to pose challenges.
To address these challenges, the client undertook a strategic migration of its historical data from Snowflake to Amazon Redshift and Amazon S3, with a future-ready architecture designed to enable both cost-efficiency and advanced analytics.
Why Migrate? (Business Drivers)
| Business Need | Challenge with Snowflake | Redshift/S3 Benefit |
| --- | --- | --- |
| Cost Optimization | Snowflake's compute/storage model was proving expensive for data that is infrequently queried | Redshift plus S3 lifecycle storage significantly cuts costs for cold data |
| Audit & Regulatory Compliance | Regulatory mandates required historical data availability and traceability | S3-based versioned backups plus Redshift Spectrum provide a reliable audit trail |
| Unified Analytics Platform | A fragmented setup made cross-system queries harder | Redshift and Spectrum allow integrated querying across recent and historical data |
| ML & GenAI Readiness | Snowflake limited flexibility in ML pipeline integration | An S3-based data lake supports downstream ML/GenAI workloads on AWS |
| Cloud Strategy Alignment | Snowflake was a silo outside the AWS-native stack | Redshift, EC2, and S3 bring tighter integration with the existing AWS ecosystem |
A two-level S3 staging approach was implemented to ensure flexibility, data integrity, and recoverability. The steps are summarized below; illustrative code sketches for the key steps follow the table:
| Step | Description |
| --- | --- |
| 1. Schema Conversion | Used AWS Schema Conversion Tool (SCT) to translate the Snowflake schema into Redshift-compatible DDL |
| 2. Data Extraction | Exported data from Snowflake to Amazon S3 as STRING-typed Parquet files to avoid format/type issues |
| 3. S3 Level 1 - Raw Staging | Raw backup of all exported data (STRING Parquet), version-controlled for audit and rollback |
| 4. S3 Level 2 - Transformed Staging | Converted STRING columns to their target types using the SCT mapping plus EC2 batch scripts (Python) |
| 5. Load into Redshift | Redshift COPY commands used to batch-load 20–30 tables per day |
| 6. Retention & Archival | S3 Lifecycle Policies applied: older data moved to Glacier or Intelligent-Tiering |
| 7. Query Optimization | Redshift Spectrum used to query archived data directly from S3 |
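For Step 2, the unload from Snowflake can be driven by a small Python script using the snowflake-connector-python driver and a `COPY INTO <location>` statement that casts every column to VARCHAR before writing Parquet. This is a minimal sketch: the connection parameters, storage integration name (`s3_migration_int`), bucket, and table names are assumptions, not the client's actual configuration.

```python
import snowflake.connector

# Assumed connection details -- replace with the real account, warehouse, and credentials.
conn = snowflake.connector.connect(
    account="EXAMPLE_ACCOUNT",
    user="MIGRATION_USER",
    password="***",
    warehouse="MIGRATION_WH",
    database="PROD_DB",
    schema="PUBLIC",
)

def export_table_as_string_parquet(table: str, columns: list[str]) -> None:
    """Unload one table to the Level-1 raw staging prefix, casting every
    column to VARCHAR so the Parquet files carry only STRING columns."""
    select_list = ", ".join(f"{col}::VARCHAR AS {col}" for col in columns)
    copy_sql = f"""
        COPY INTO 's3://example-raw-staging/snowflake/{table}/'
        FROM (SELECT {select_list} FROM {table})
        STORAGE_INTEGRATION = s3_migration_int   -- assumed integration name
        FILE_FORMAT = (TYPE = PARQUET)
        HEADER = TRUE
        OVERWRITE = TRUE
    """
    conn.cursor().execute(copy_sql)

export_table_as_string_parquet("ORDERS", ["ORDER_ID", "ORDER_DATE", "AMOUNT", "STATUS"])
```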
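Step 4's STRING-to-typed conversion on EC2 can be sketched with pyarrow: read a Level-1 file, cast each column according to the SCT-derived mapping, and write the typed copy to the Level-2 prefix. The column names and target types below are hypothetical placeholders for whatever the SCT mapping actually produced.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical excerpt of an SCT-derived type mapping for one table.
# A decimal target could be used for monetary columns; float64 keeps the sketch simple.
TYPE_MAP = {
    "order_id": pa.int64(),
    "order_date": pa.timestamp("us"),
    "amount": pa.float64(),
    "status": pa.string(),
}

def cast_string_parquet(src_path: str, dst_path: str) -> None:
    """Read a Level-1 STRING Parquet file and write the typed Level-2 copy."""
    table = pq.read_table(src_path)
    typed_columns = [
        table.column(name).cast(TYPE_MAP.get(name, pa.string()))
        for name in table.column_names
    ]
    pq.write_table(pa.table(typed_columns, names=table.column_names), dst_path)

cast_string_parquet(
    "raw/orders/part-0001.parquet",     # Level-1 (STRING) input
    "typed/orders/part-0001.parquet",   # Level-2 (typed) output
)
```

In practice such a script would be batched over every object under a table's prefix, but the per-column cast is the core of the Level-1 to Level-2 transformation.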
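Step 5's daily batch of COPY commands can likewise be orchestrated from Python, for example with the redshift_connector driver. The cluster endpoint, credentials, IAM role ARN, and table names below are placeholders only.

```python
import redshift_connector

# Assumed cluster endpoint and credentials -- placeholders only.
conn = redshift_connector.connect(
    host="example-cluster.abc123.ap-south-1.redshift.amazonaws.com",
    database="analytics",
    user="migration_user",
    password="***",
)
conn.autocommit = True

def copy_table(table: str) -> None:
    """Bulk-load one table's typed Parquet prefix from the Level-2 staging area."""
    copy_sql = f"""
        COPY {table}
        FROM 's3://example-typed-staging/snowflake/{table}/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole'
        FORMAT AS PARQUET
    """
    cursor = conn.cursor()
    cursor.execute(copy_sql)

# A day's batch (20-30 tables in the actual migration) is a loop over a table list.
for table_name in ["orders", "shipments", "invoices"]:
    copy_table(table_name)
```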
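The bucket-level controls behind Steps 3 and 6 (versioning on the raw staging bucket for rollback, lifecycle transitions for archival) can be applied with boto3. The bucket name, prefix, and transition windows are assumptions; the actual retention policy would drive the real values.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-raw-staging"  # assumed bucket name

# Step 3: keep object versions so any export can be rolled back or audited.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Step 6: transition older exports to cheaper storage classes over time.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-snowflake-exports",
                "Filter": {"Prefix": "snowflake/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```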
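For Step 7, archived data stays queryable through Redshift Spectrum by mapping the S3 Parquet into an external schema. The statements below (run through the same redshift_connector connection as the COPY sketch) use placeholder names for the external schema, Glue catalog database, IAM role, bucket, and table columns.

```python
# Reuses the `conn` object from the Redshift COPY sketch above.
spectrum_setup = """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS snowflake_archive
    FROM DATA CATALOG
    DATABASE 'snowflake_archive'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

external_table = """
    CREATE EXTERNAL TABLE snowflake_archive.orders_2019 (
        order_id   BIGINT,
        order_date TIMESTAMP,
        amount     DOUBLE PRECISION,
        status     VARCHAR(32)
    )
    STORED AS PARQUET
    LOCATION 's3://example-typed-staging/snowflake/orders_2019/';
"""

# Recent data in Redshift and archived data in S3 can then be queried together.
federated_query = """
    SELECT status, COUNT(*) AS order_count
    FROM (
        SELECT status FROM public.orders                   -- recent data in Redshift
        UNION ALL
        SELECT status FROM snowflake_archive.orders_2019   -- archived data in S3
    ) AS combined
    GROUP BY status;
"""

cursor = conn.cursor()
cursor.execute(spectrum_setup)
cursor.execute(external_table)
cursor.execute(federated_query)
print(cursor.fetchall())
```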
Fig: Historical Data Migration from Snowflake to Redshift
Fig: Overall Solution Architecture for Incremental Data
This migration was not just a technical re-platforming. It delivers: