Yahoo Announces Largest Machine Learning Dataset For Researchers

Yahoo has announced the public release of what it describes as the largest-ever machine learning dataset made available to the research community. According to the company, the dataset contains about 110 billion events (13.5 TB uncompressed) of anonymized user-news item interaction data, collected from about 20 million users between February 2015 and May 2015.
 
In its official blog post, Yahoo states:
 
“Today, we are proud to announce the public release of the largest-ever machine learning dataset to the research community.”
 
The company described its goals:
 
“Our goals are to promote independent research in the fields of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research. The dataset is available as part of the Yahoo Labs Webscope data-sharing program, which is a reference library of scientifically-useful datasets comprising anonymized user data for non-commercial use.”
 
Along with the interaction data, Yahoo is also providing categorized demographic information (age range, gender, and generalized geographic data) for a subset of the anonymized users. It is also releasing the title, summary, and key phrases of the relevant news articles. According to Yahoo, the interaction data is timestamped with the relevant local time and also contains partial information about the devices on which users accessed the news feeds.
 
The company concluded by saying:
 
“We hope that this data release will similarly inspire our fellow researchers, data scientists, and machine learning enthusiasts in academia, and help validate their models on an extensive, ‘real-world’ dataset. We strongly believe that this dataset can become the benchmark for large-scale machine learning and recommender systems, and we look forward to hearing from the community about their applications of our data. Happy (large-scale) machine learning in 2016!”