Azure Data Factory - Datasets and Linked Services

Welcome back to my overview of Azure Data Factory. In our previous post we went through the process of creating an Azure Data Factory pipeline. In today's blog post I would like to talk about the next component of ADF (and there are several) which is Datasets and Linked Services.

If you stumbled upon this post and are wondering how to even create a pipeline, please see my previous post on the topic: Azure Data Factory - Create a pipeline


What are Datasets and Linked Services

Datasets and Linked Services are an integral part of Azure Data Factory and while the two are linked, they provide 2 different services.

Datasets can be considered as the source and target of a pipeline. A pipeline can have multiple Datasets, sometimes extracting a file, transforming it, and then writing it to a different folder within the same storage account. Thus a Dataset is just a term to describe tables, file storages, or ERP systems.

Linked Services can be through of as the connection string, or the authorization that the pipeline uses to connect to a Dataset. A linked service can be the key value of a blob storage account, the username and password of a database, or a Service account that is used to log into a BW system. While it is possible to have more than one Linked Service attached to the same Dataset, usually you will see a one to one relation ship.


Datasets/Data store - Source and Sink

A Datasets can be used as a Source or Sink, only as a Source, or only as a Sink. For example, when using Blob storage, Azure Data Factory can use that "Dataset" as both a Source and Sink. When searching online you can come across a different term for this called "Data Store". To me both terms mean the same thing, the difference being a on what you are trying to describe.

I am not exactly sure why Microsoft decided to use different terms for the two, but when searching online on specific Dataset restrictions (ex: Teradata, SAP) use the term "Data store". This will help you find the page that lists all the different Datasets that are possible with Azure Data Factory and how they can be used.


Another thing to pay attention to is if a Dataset is available through Azure IR or Self-Hosted IR (IR stands for Integration Runtime). For example, if you follow the above link you will see that SAP can only be used a Dataset if you have a Self-Hosted IR. Reason being is that SAP requires additional drivers that can only be instead on a hosted VM to allow communication between ADF and SAP.


Dataset Demo

Terminology out of the way let us create a new Dataset in our pipeline pointing to our blob storage account.

If you need a refresher on how to create a pipeline in ADF or how to create a blob storage account follow the below posts where I listed out all the steps.


By this point I am assuming that you are an ADF pro and know how to get to log into your ADF and select a pipeline.

There are two places where you can start creating your Dataset. You can either create it from your Copy Activity, or you can click on the three dots next to "Datasets" and select "Create new Datasets".

New Dataset in Azure Data Factory

In either case when you click new you see a pane appear on the right hand side that asks you what type of Dataset you want to create. In the search box type in storage and it will automatically limit your options to those Datasets that contain the word storage. Select "Azure Blob Storage" and click on "Continue".

ADF Blob Storage Dataset



Then you have to select what type of files your will either be extracting from this Dataset, or what type of files you will be storing into this Dataset. In total you will have 8 type of file formats that you can select and this will largely depend on what is your use case. For most of the time I used Blob storage I have select csv, but the files that I have stored in my blob storage account are excel so I will choose that for now.

Select "Excel" and then click on "Continue"

Excel format in Azure Blob Storage


You will now name this Dataset. The name can just be a description of what this Dataset is (ex: Blob storage), the business function (ex: finance) or anything else that you would like. Just make sure you name it something that you will be able to recognize. In future posts I will even show how with parameterization you can have just one Dataset and use it for multiple reads and writes.

In this case name it "Dataset_demo". The next important part is to now select a Linked Service. 


Linked Service Demo

Now you are at a point where you have to select a Linked Service to be able to save your Dataset. To do so click on the Linked Service drop down menu and select "New".

Creating a new Linked Service
This will open up a new pane on the right hand side titled "New linked service (Azure Blob Storage)". Most of the information should be pre-filled (for example the integration run time will be set to Auto). The main thing that you will have to decide is the Authentication method. By default it is set to "Key", which means that you will need the specific key value (kinda like a password) from a blob storage account. This value would be hard coded, which is definitely not something that you want to have as an enterprise, but for a home lab this is fine.

The more secure method would be to set up either Managed Identity or Service Principal, but due to the complexity of setting that up it is outside the scope of this post. 

Now in my case since I am the owner of the subscription and resource group if I click on Account key, then select the subscription group, and then select the storage account it will automatically pick up the Account key and shows that my connection is successful. 


Linked Service Key authentication

However, you might not have that luxury so another way  to set this up is to select "Enter Manually". This will require you two write in two things, the name of the storage account and also copy and paste the storage account key. To find the storage account key pop on over to the azure portal and find your storage account.

Once there, look on the left pane for "Access Key". 

Getting access key from storage account

Once there you will be able to see all of the different storage keys that are assigned to this blob storage account. I hid mine, but you should be able to see them (and if you need to hide them click on Hide keys and to show them click on Show keys).

Access keys in blog storage account

Copy one of the keys and paste that into the "Storage account key" field in the Linked Service pane along with typing the name of the blob storage account into the "Storage account name" field. When ready click on test connection.

Blob storage account Linked Service

Since the connection tested successfully click on the "Create" button. This will take you to the DataSet properties page, but now it shows additional options like selecting which folder or directory/container you want the files to be extracted from.

To select the container click on the folder icon in the file path section.

File Path in blob storage account


This will bring up the pane where you can click through and select the container, directory/folder, and down to file level that you want this Dataset to point to. This can be changed later on, but in my case I only have a blob container so I will just select that. Once that loads you will see all the files that are contained in that container.

Files in blob container


Depending on the folder you select click on "Okay" and now you will see the container listed in the file path. Alternatively you can type that info in.


Updated file path for blob storage in ADF

Click on "Okay" and now you will see a new entry under the Dataset category under your pipelines. You will also see a nice excel icon to signify that this is an Excel Dataset.

Dataset under Pipelines section in ADF




To use that new Dataset go back to your demo pipeline and then click on "Source". Click on the drop down and now you will see "demo_dataset" as one of the options that you can select.

Excel source dataset in ADF



Summary

In this post we went through to explain what a Datasets and Linked Services are. We also looked at the definitive list of all Datasets (called Data Stores and linked at the top of this post) that ADF has access.

Finally we went through and added a Dataset and Linked Service to our demo pipeline.

In my next blob post I will show how to Publish (aka save) this pipeline and run it. In addition I will also try to do a blog post on the difference between Azure IR and Self Hosted IRs and how to set a home lab Self Hosted IR.

Hope you found this post useful and happy learnings.




Comments