Azure Synapse Analytics - Loading Dataset To Warehouse From Azure Blob Storage

In this article, we’ll learn how to load a dataset from Azure Blob Storage into the data warehouse we created using Azure Synapse Analytics, with the help of Azure Data Studio. This article is a part of the Azure Synapse Analytics Articles Series. You can check out the other articles in the series from the following links. 

  1. Azure Synapse Analytics 
  2. Azure Synapse Analytics - Create Dedicated SQL Pool 
  3. Azure Synapse Analytics - Creating Firewall at Server-level 
  4. Azure Synapse Analytics - Connect, Query and Delete Data Warehouse SQL Pool 
  5. Azure Synapse Analytics – Load Dataset to Warehouse from Azure Blob Storage 

Data Warehouse 

A data warehouse is a central repository of large volumes of data that supports an organization's decision-making. Data warehouses assist organizations with business intelligence and analytics that help in decision-making. A data warehouse is different from a database, a data lake, and a data mart. Data warehousing is enabled by Azure Synapse, which can fetch data from an on-premises network or the cloud into blob storage in order to perform the required operations and analysis on the data. 

Azure Synapse Analytics  

Azure Synapse is a limitless enterprise analytics service that enables us to gain insight through data analytics and data warehousing. Data can be queried using dedicated resources or a serverless architecture, and the service scales as the size of the data grows. You can learn more about it in the previous article, Azure Synapse Analytics.

Azure Blob Storage 

Azure Blob Storage is commonly used to store the static content of web applications, such as HTML, CSS, and JS files. Server-side code is not required to load such pages, and all interactions are performed dynamically with JS code that communicates with back-end APIs. Azure Blob Storage is mainly used to store huge amounts of unstructured data that can later be used for analytics purposes.  

Let us learn in detail the process of loading a dataset into the warehouse from Azure Blob Storage.  

Step 1 

First of all, create a dedicated SQL pool – our data warehouse – in Azure by following the article, Azure Synapse Analytics - Create Dedicated SQL Pool.  

Step 2 

Open Azure Data Studio and connect it to your data warehouse. You can read the article Azure Synapse Analytics - Connect, Query and Delete Data Warehouse SQL Pool to understand the process in brief. 

Start the connection process by clicking on New Connection.  

Next, in the Connection Details, fill in the Server, Authentication type, User name, and Password.  

The server name is copied from Azure and pasted here. The Authentication type must be SQL Login; type in the username and password you set up while creating the data warehouse in Azure.  

Finally, once all the details are filled in, click on Connect.  

Step 3 

Now, right-click on your server and select New Query.  

Here, my server is ojashserver.database.windows.net. 

A new empty query page is now open.  

Creating User to Load Data 

Step 4

Now, under master, type in the following query to create a new login and a corresponding user.  

CREATE LOGIN LoaderRC30 WITH PASSWORD = 'X963XTREMESTRONGpassword@!~';
CREATE USER LoaderRC30 FOR LOGIN LoaderRC30;

Here, we’ve taken LoaderRC30 as the user and a complex password. You can choose a different username, such as LoaderRC60, as per your preference.  

Now, as we hit Run, the query executes and the details are shown in Messages.  
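
If you want to verify that the login and user were created, a quick check against the catalog views while still connected to master can be run. This is a minimal optional sketch, assuming the standard sys.sql_logins and sys.database_principals views.

-- Optional check: confirm the login and the master user exist
SELECT name, create_date FROM sys.sql_logins WHERE name = 'LoaderRC30';
SELECT name, type_desc FROM sys.database_principals WHERE name = 'LoaderRC30';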

Step 5 

Next, right-click on the data warehouse you created.  

Here, we do it on ojash-pool-datawarehouse and open a new query window.  

Here, see that ojash-pool-datawarehouse is selected.  

Next, type in the following commands to create the user in the database, grant it control of the database, and add it to a resource class role – all for the user LoaderRC30.  

CREATE USER LoaderRC30 FOR LOGIN LoaderRC30;
GRANT CONTROL ON DATABASE::[ojash-pool-datawarehouse] to LoaderRC30;
EXEC sp_addrolemember 'staticrc30', 'LoaderRC30';

Once the query is run, the messages are shown below.  
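
If you would like to confirm the role assignment, a small catalog query (a sketch assuming the standard sys.database_role_members and sys.database_principals views) lists the database roles that LoaderRC30 belongs to.

-- Optional check: list database roles assigned to LoaderRC30
SELECT  r.name AS role_name,
        m.name AS member_name
FROM    sys.database_role_members rm
JOIN    sys.database_principals r ON rm.role_principal_id = r.principal_id
JOIN    sys.database_principals m ON rm.member_principal_id = m.principal_id
WHERE   m.name = 'LoaderRC30';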

Connecting to Server as User 

Step 6 

Now, click on New Connection and add in the details. 

Here, the server is the same server we created in Azure that contains ojash-pool-datawarehouse. Under the Authentication type, select SQL Login; the username must be the one we just created – LoaderRC30 (or LoaderRC60, if that is what you chose) – and enter the password you set. Next, under Database, select the database in which you created the user. Here, we’ll select ojash-pool-datawarehouse.  

Once filled in, click on Connect.  

Step 7 

Now, we can see the database is connected with the new user LoaderRC30. 

Next, right-click on it and select New Query.  

Creating Tables to Load Dataset 

Step 8

Now, before we load our dataset from Azure Blob Storage, we need tables to hold the data. Thus, in the new query window we opened under the user LoaderRC30, we run the query to create the tables.  

CREATE TABLE [dbo].[Date]
(
    [DateID] int NOT NULL,
    [Date] datetime NULL,
    [DateBKey] char(10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [DayOfMonth] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [DaySuffix] varchar(4) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [DayName] varchar(9) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [DayOfWeek] char(1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [DayOfWeekInMonth] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [DayOfWeekInYear] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [DayOfQuarter] varchar(3) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [DayOfYear] varchar(3) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [WeekOfMonth] varchar(1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [WeekOfQuarter] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [WeekOfYear] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [Month] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [MonthName] varchar(9) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [MonthOfQuarter] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [Quarter] char(1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [QuarterName] varchar(9) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [Year] char(4) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [YearName] char(7) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [MonthYear] char(10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [MMYYYY] char(6) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [FirstDayOfMonth] date NULL,
    [LastDayOfMonth] date NULL,
    [FirstDayOfQuarter] date NULL,
    [LastDayOfQuarter] date NULL,
    [FirstDayOfYear] date NULL,
    [LastDayOfYear] date NULL,
    [IsHolidayUSA] bit NULL,
    [IsWeekday] bit NULL,
    [HolidayUSA] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE [dbo].[Geography]
(
    [GeographyID] int NOT NULL,
    [ZipCodeBKey] varchar(10) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [County] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [City] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [State] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [Country] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [ZipCode] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE [dbo].[HackneyLicense]
(
    [HackneyLicenseID] int NOT NULL,
    [HackneyLicenseBKey] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [HackneyLicenseCode] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE [dbo].[Medallion]
(
    [MedallionID] int NOT NULL,
    [MedallionBKey] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [MedallionCode] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE [dbo].[Time]
(
    [TimeID] int NOT NULL,
    [TimeBKey] varchar(8) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [HourNumber] tinyint NOT NULL,
    [MinuteNumber] tinyint NOT NULL,
    [SecondNumber] tinyint NOT NULL,
    [TimeInSecond] int NOT NULL,
    [HourlyBucket] varchar(15) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [DayTimeBucketGroupKey] int NOT NULL,
    [DayTimeBucket] varchar(100) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE [dbo].[Trip]
(
    [DateID] int NOT NULL,
    [MedallionID] int NOT NULL,
    [HackneyLicenseID] int NOT NULL,
    [PickupTimeID] int NOT NULL,
    [DropoffTimeID] int NOT NULL,
    [PickupGeographyID] int NULL,
    [DropoffGeographyID] int NULL,
    [PickupLatitude] float NULL,
    [PickupLongitude] float NULL,
    [PickupLatLong] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [DropoffLatitude] float NULL,
    [DropoffLongitude] float NULL,
    [DropoffLatLong] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [PassengerCount] int NULL,
    [TripDurationSeconds] int NULL,
    [TripDistanceMiles] float NULL,
    [PaymentType] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [FareAmount] money NULL,
    [SurchargeAmount] money NULL,
    [TaxAmount] money NULL,
    [TipAmount] money NULL,
    [TollsAmount] money NULL,
    [TotalAmount] money NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE [dbo].[Weather]
(
    [DateID] int NOT NULL,
    [GeographyID] int NOT NULL,
    [PrecipitationInches] float NOT NULL,
    [AvgTemperatureFahrenheit] float NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
);

Once the query is written, we hit Run to execute it. A message is shown when the query runs successfully.  
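
To confirm that all seven tables were created, you can optionally query the sys.tables catalog view in the same window; a minimal sketch follows.

-- Optional check: list the tables just created in the dbo schema
SELECT  schema_name(schema_id) AS schema_name,
        name AS table_name
FROM    sys.tables
ORDER BY name;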

Loading Data from Azure Blob Storage to our Data Warehouse 

Step 9 

Finally, we are ready to load the dataset from Azure Blob Storage into our data warehouse.  

Here, we use the New York Taxi dataset available in blob storage at https://nytaxiblob.blob.core.windows.net, which is several gigabytes in size.  

With the following queries, we load the dataset into our data warehouse.  

COPY INTO [dbo].[Date]
FROM 'https://nytaxiblob.blob.core.windows.net/2013/Date'
WITH
(
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = ',',
    FIELDQUOTE = ''
)
OPTION (LABEL = 'COPY : Load [dbo].[Date] - Taxi dataset');

COPY INTO [dbo].[Geography]
FROM 'https://nytaxiblob.blob.core.windows.net/2013/Geography'
WITH
(
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = ',',
    FIELDQUOTE = ''
)
OPTION (LABEL = 'COPY : Load [dbo].[Geography] - Taxi dataset');

COPY INTO [dbo].[HackneyLicense]
FROM 'https://nytaxiblob.blob.core.windows.net/2013/HackneyLicense'
WITH
(
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = ',',
    FIELDQUOTE = ''
)
OPTION (LABEL = 'COPY : Load [dbo].[HackneyLicense] - Taxi dataset');

COPY INTO [dbo].[Medallion]
FROM 'https://nytaxiblob.blob.core.windows.net/2013/Medallion'
WITH
(
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = ',',
    FIELDQUOTE = ''
)
OPTION (LABEL = 'COPY : Load [dbo].[Medallion] - Taxi dataset');

COPY INTO [dbo].[Time]
FROM 'https://nytaxiblob.blob.core.windows.net/2013/Time'
WITH
(
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = ',',
    FIELDQUOTE = ''
)
OPTION (LABEL = 'COPY : Load [dbo].[Time] - Taxi dataset');

COPY INTO [dbo].[Weather]
FROM 'https://nytaxiblob.blob.core.windows.net/2013/Weather'
WITH
(
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = ',',
    FIELDQUOTE = '',
    ROWTERMINATOR = '0X0A'
)
OPTION (LABEL = 'COPY : Load [dbo].[Weather] - Taxi dataset');

COPY INTO [dbo].[Trip]
FROM 'https://nytaxiblob.blob.core.windows.net/2013/Trip2013'
WITH
(
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = '|',
    FIELDQUOTE = '',
    ROWTERMINATOR = '0X0A',
    COMPRESSION = 'GZIP'
)
OPTION (LABEL = 'COPY : Load [dbo].[Trip] - Taxi dataset');

The dataset is huge and the process will take some time. We can see the messages as we click on Run.  

The query executed and 170,261,325 rows were loaded, which took over 3 minutes and 19 seconds.  
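
Once the loads complete, a quick row count over a few of the loaded tables helps confirm that the data actually landed. This is an optional sketch; COUNT_BIG is used because the Trip table holds more rows than a 32-bit integer can represent.

-- Optional check: row counts for some of the loaded tables
SELECT 'Trip' AS table_name, COUNT_BIG(*) AS row_count FROM [dbo].[Trip]
UNION ALL
SELECT 'Date', COUNT_BIG(*) FROM [dbo].[Date]
UNION ALL
SELECT 'Weather', COUNT_BIG(*) FROM [dbo].[Weather];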

Viewing Data 

Step 10 

As the data is being loaded (or after it has loaded), type in the following query, which uses Dynamic Management Views (DMVs) to show the loading status.  

-- Summarize bytes and rows processed for each COPY load
SELECT  r.[request_id]
,       r.[status]
,       r.resource_class
,       r.command
,       sum(bytes_processed) AS bytes_processed
,       sum(rows_processed) AS rows_processed
FROM    sys.dm_pdw_exec_requests r
        JOIN sys.dm_pdw_dms_workers w
            ON r.[request_id] = w.request_id
WHERE (
        [label] = 'COPY : Load [dbo].[Date] - Taxi dataset' OR
        [label] = 'COPY : Load [dbo].[Geography] - Taxi dataset' OR
        [label] = 'COPY : Load [dbo].[HackneyLicense] - Taxi dataset' OR
        [label] = 'COPY : Load [dbo].[Medallion] - Taxi dataset' OR
        [label] = 'COPY : Load [dbo].[Time] - Taxi dataset' OR
        [label] = 'COPY : Load [dbo].[Weather] - Taxi dataset' OR
        [label] = 'COPY : Load [dbo].[Trip] - Taxi dataset'
      )
      AND session_id <> session_id() AND type = 'WRITER'
GROUP BY r.[request_id]
,       r.[status]
,       r.resource_class
,       r.command;

We can see the results below.  

Viewing the Query 

Step 11 

With the following command, we select all the system requests and view the request details. From the request ID to the time elapsed, we can check numerous details of the loading process.  

SELECT * FROM sys.dm_pdw_exec_requests;
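
If you only want the details of a specific load, you can filter the same DMV by the label set in the COPY statement. The following is a minimal sketch, assuming the Trip load's label from the earlier query.

-- Optional: inspect a single load by its COPY label
SELECT  request_id,
        [status],
        submit_time,
        total_elapsed_time,
        command
FROM    sys.dm_pdw_exec_requests
WHERE   [label] = 'COPY : Load [dbo].[Trip] - Taxi dataset';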

Step 12 

Finally, we can see the Tables node on the left-hand side and observe the data.  
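
You can also sample a few rows of a table directly in the query window; for example, a quick peek at the Trip table (a sketch) looks like this.

-- Optional: preview a few rows of the loaded Trip data
SELECT TOP 10 *
FROM [dbo].[Trip];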

Deleting Resource 

Step 13 

With this, we’ve learned to load a dataset from Azure Blob Storage into our data warehouse. Utilizing these resources costs money. To avoid any charges that may be incurred, we now delete the resources in Azure by following the steps below. 

Visit the Azure portal and select the resources we created for the data warehouse and server.  

Here, we select the ojash-pool-datawarehouse under ojash-server.  

Click on Delete.  

A dialog box will pop up where we need to type the name of the database. This added security measure prevents unwanted deletion, as the process is irreversible.  

Once done, we click on Delete. 

We are then notified of the deletion and can check by refreshing the Azure portal to make sure the resources are deleted.  
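
As an aside, if you prefer to stay in Azure Data Studio, dropping the database while connected to master also removes the dedicated SQL pool; this is a sketch of that alternative, and note that it does not delete the logical server or any other resources, which still need to be removed in the portal.

-- Optional alternative: drop the dedicated SQL pool from the master database
DROP DATABASE [ojash-pool-datawarehouse];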

Conclusion 

Thus, in this article, we learned, step by step, how to load a dataset from Azure Blob Storage into the data warehouse we created in Azure Synapse Analytics. This will help us explore the power of Azure Synapse Analytics and allow us to use and benefit from the features and functionality of a data warehouse in our projects.