Azure Open Datasets - A Curated List of Public Datasets


Azure Open Datasets is a curated collection of public datasets that are readily available and accessible on Azure. In this blog, we explore Azure Open Datasets and how they can be used for different purposes while working with Azure.

What is Azure Open Datasets?

  • A curated list of public data sources that speeds up the machine learning process
  • Less time spent on data discovery and search
  • Can help improve the accuracy of data science and machine learning models by supplementing your own data
  • Datasets shared in a central location, which makes collaboration between the data science community and developers easier
  • Advanced analytics and insights enabled by readily available open datasets

Azure Open Datasets are readily available and integrate with Azure Machine Learning, Azure Databricks, Power BI, and Azure Data Factory.
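As an illustration of the Databricks integration, here is a minimal sketch of how the storage path for one dataset might be built. The storage account `azureopendatastorage`, the container `nyctlc`, and the folder `yellow` are assumptions about where the NYC yellow taxi data is published; check the dataset's catalog page before relying on them. The helper name `wasbs_path` is hypothetical and used only for this example.

```python
def wasbs_path(account, container, relative_path):
    """Build a wasbs:// URL for a folder in an Azure Blob Storage container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative_path}"


# Assumed location of the NYC yellow taxi trip records (verify in the catalog).
NYC_YELLOW_PATH = wasbs_path("azureopendatastorage", "nyctlc", "yellow")
print(NYC_YELLOW_PATH)
```

In a Databricks notebook, a path like this could then be passed to `spark.read.parquet(NYC_YELLOW_PATH)`, possibly after some Spark storage configuration, to get a Spark data frame.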

You can find the full list of available datasets in the Azure Open Datasets catalog.

Catalog of Azure Open Datasets

Azure Open Datasets offers data sources in several categories, for example:


  • TartanAir: AirSim Simulation Dataset
  • NYC Taxi & Limousine Commission - yellow taxi trip records
  • NYC Taxi & Limousine Commission - green taxi trip records


  • COVID-19 Data Lake
  • COVID-19 Open Research Dataset

Likewise, Azure Open Datasets also includes datasets for sectors such as labor and economics, and safety and population, as well as common datasets used for machine learning.

Access Azure Open Datasets using the opendatasets package and Jupyter notebooks

  • The opendatasets Python package (published as azureml-opendatasets) lets users load a dataset directly into a data frame, which makes the data easier to work with.
  • Various reference Jupyter notebooks are available that show how to use these open data sources.
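To make the data frame access concrete, here is a minimal sketch that loads a few days of NYC yellow taxi trips. It assumes the `azureml-opendatasets` package is installed and that its `NycTlcYellow` class accepts `start_date` and `end_date`; the helper names `date_window` and `load_yellow_taxi` are hypothetical and exist only for this example.

```python
from datetime import datetime, timedelta


def date_window(start, days):
    """Return (start, end) datetimes for a window of `days` days from an ISO date string."""
    start_dt = datetime.fromisoformat(start)
    return start_dt, start_dt + timedelta(days=days)


def load_yellow_taxi(start_date, end_date):
    """Load NYC yellow taxi trips between two dates into a pandas data frame."""
    # Imported here so this module can be loaded without the package installed.
    from azureml.opendatasets import NycTlcYellow

    trips = NycTlcYellow(start_date=start_date, end_date=end_date)
    return trips.to_pandas_dataframe()
```

For example, `load_yellow_taxi(*date_window("2018-05-01", 7))` would download one week of trips. Keeping the date window small is a sensible starting point, since the full dataset is large.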

GitHub URL

Pricing for Azure Open Datasets can be found at this link:


In this blog, we explored how to use Azure Open Datasets and which categories of datasets are available for machine learning and data science. Because the datasets are curated, a data science team doesn't have to spend a lot of time finding, preparing, and cleaning the data.