Snowflake and Databricks are two of the most widely used platforms in modern data architectures. Snowflake is optimized for scalable SQL-based analytics, while Databricks is built for machine learning, streaming, and working with open file formats like Delta Lake. Many organizations use both to handle different stages of the data lifecycle.
As teams adopt both tools, a new challenge emerges: how to move data from Snowflake to Databricks efficiently. This isn't just about migration. It's often about enabling continuous data sync so that machine learning models, real-time dashboards, or downstream analytics in Databricks always have access to fresh data from Snowflake.
Manual exports and batch ETL processes can work in simple scenarios. But as data volume grows or freshness becomes critical, these methods start to fall short. What’s needed is a scalable, low-latency approach that reliably bridges the gap between warehouse and lakehouse environments.
This article breaks down two practical methods for moving Snowflake data into Databricks. The first is a manual approach using file exports and uploads. The second shows how real-time data movement is possible using modern pipeline tools.
Method 1. Manual Transfer Using Export and Delta Lake Upload
If you're just getting started or only need to transfer a limited amount of data, a manual export from Snowflake and upload to Databricks can be a simple solution. This method works well for one-time migrations or small-scale testing, though it doesn't support automation or real-time updates.
Step 1. Export Data from Snowflake
To export data, you can use the COPY INTO command in Snowflake to unload your table into a file format like CSV or Parquet. The resulting files are typically written to a cloud storage location such as Amazon S3 or Azure Blob Storage.
COPY INTO 's3://your-bucket/snowflake-export/'
FROM your_database.your_schema.your_table
STORAGE_INTEGRATION = your_integration  -- placeholder name for a storage integration with access to this bucket
FILE_FORMAT = (TYPE = PARQUET)
OVERWRITE = TRUE;
Make sure the storage integration is properly configured in your Snowflake account. You'll also want to confirm that the data types are compatible with your target environment in Databricks.
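If a storage integration doesn't exist yet, an account administrator can create one. Here is a minimal sketch, assuming an S3 bucket and an IAM role you've already set up for Snowflake access; the integration name, role ARN, and bucket path are placeholders:
CREATE STORAGE INTEGRATION your_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/your-snowflake-export-role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://your-bucket/snowflake-export/');
After creating it, run DESC INTEGRATION your_integration to retrieve the values needed to complete the IAM trust relationship on the AWS side.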
Step 2. Load Data into Databricks
Once your files are in storage, you can load them into Databricks using the Data Explorer UI, a notebook, or a simple SQL command.
# Read the Parquet files exported from Snowflake
df = spark.read.parquet("s3://your-bucket/snowflake-export/")

# Write the data out as a Delta Lake table
df.write.format("delta").save("/mnt/datalake/your-delta-table")
Alternatively, you can register the data in Unity Catalog or use SQL to create a table over the data:
CREATE TABLE your_table
USING DELTA
LOCATION '/mnt/datalake/your-delta-table';
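If your workspace uses Unity Catalog, the table is usually registered under a catalog and schema, and an external table must point at a cloud storage URI covered by a configured external location rather than a DBFS mount path. A hedged variant, with placeholder catalog, schema, and path names:
CREATE TABLE your_catalog.your_schema.your_table
USING DELTA
LOCATION 's3://your-bucket/delta/your-delta-table';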
When to Use This Approach
This manual method is suitable when:
- You're moving a static snapshot of data
- Real-time sync isn't required
- You're experimenting with new workflows in Databricks
It’s not ideal for production pipelines where updates, deletions, and schema changes happen regularly. In those cases, a real-time solution is more appropriate.
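To see why, consider what keeping a manually loaded table current involves: each new export has to be merged into the existing Delta table rather than simply overwritten. A minimal PySpark sketch of that merge, assuming the table has an id key column (adjust to your actual primary key):
from delta.tables import DeltaTable

# Latest snapshot exported from Snowflake
updates = spark.read.parquet("s3://your-bucket/snowflake-export/")

# Existing Delta table created in Step 2
target = DeltaTable.forPath(spark, "/mnt/datalake/your-delta-table")

(target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")  # 'id' is an assumed key column
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
Even this sketch ignores deletes and schema changes, both of which require additional logic on every run. That ongoing maintenance is exactly what a managed real-time pipeline is designed to remove.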
Method 2. Real-Time Sync Using a Managed Data Pipeline
For use cases that demand fresh data, a real-time pipeline between Snowflake and Databricks is often more efficient than manual or batch-based methods. This approach captures changes in Snowflake as they occur and streams them into Databricks, where they can be used immediately in Delta Lake tables for analytics, machine learning, or reporting.
Estuary Flow is one platform that supports this kind of real-time integration. It lets you set up a pipeline without writing code or managing infrastructure, using change data capture (CDC) to track updates in Snowflake and materialize those changes directly into Databricks.
Step 1. Connect Snowflake as the Source
- Log in to the Estuary Dashboard and create a new data capture.
- Choose the Snowflake connector and enter connection details like host URL, database, user credentials, and warehouse.
- Select one or more tables to capture. Estuary will begin tracking all inserts, updates, and deletes in real time using CDC.
Step 2. Configure Databricks as the Destination
- In the dashboard, create a new materialization and select Databricks.
- Provide your Databricks workspace details, including SQL warehouse endpoint, HTTP path, catalog name, and a personal access token.
- Link the captured Snowflake tables to this Databricks destination. Estuary handles schema mapping and ensures the data is written in Delta Lake format.
Step 3. Activate the Pipeline
- Save and publish the configuration to start streaming.
- Estuary will begin syncing data from Snowflake to Databricks in real time.
- You can monitor latency, row throughput, and sync health from the UI.
This method is suitable for production-grade pipelines that require minimal latency, automatic schema handling, and consistent delivery. It removes the need for managing Spark jobs, orchestrators, or manual exports, making it easier to scale and maintain over time.
Considerations When Choosing an Approach
Before deciding how to move data from Snowflake to Databricks, it's important to assess the needs of your use case. Not every scenario requires real-time streaming, and not every workload can tolerate the limitations of a manual export. The right approach often depends on factors like data freshness, scale, complexity, and operational overhead.
Here are a few key considerations:
- Data freshness: If your downstream workloads rely on near real-time data, such as live dashboards or ML pipelines, streaming is the better choice. Manual exports work better for static or infrequent transfers.
- Data volume and frequency: Large or frequently updated datasets are harder to manage manually. A managed pipeline can reduce repeated export overhead.
- Schema changes: Evolving source schemas are easier to handle with tools that support automatic schema propagation.
- Team and tooling: Consider the effort needed to build and maintain custom scripts or jobs. A no-code tool may be more efficient.
- Security and deployment: Ensure the solution supports private networking and access control if you're working with sensitive data.
- Cost and performance: Offloading compute from Snowflake to Databricks may help reduce cost, but only if data sync is reliable and timely.
Conclusion
Moving data from Snowflake to Databricks enables a variety of advanced use cases. These include training machine learning models on up-to-date data, supporting real-time analytics, and integrating structured and unstructured data within a unified architecture.
A manual export and import method can work well for small datasets or one-time transfers. It is straightforward to implement and does not require any additional tools. However, this approach becomes difficult to maintain as data volume increases or when frequent updates are needed.
For teams that depend on data freshness and reliability, a real-time sync using a managed pipeline is often the better fit. It reduces operational overhead, handles schema changes automatically, and ensures that Databricks always has access to current data from Snowflake.
Each method has its place. The right choice depends on the complexity of your use case, your team's resources, and how critical data latency is to your workflow.