Yahoo announced what it calls the largest-ever public release of a machine learning dataset, some 13.5 terabytes' worth of user interactions on the news feeds of properties such as Yahoo News, Sports, Finance, Movies, and Real Estate. But digital marketers shouldn’t get too excited just yet. For now, Yahoo is offering the huge data dump only to the academic research community (such as MIT’s Computer Science and Artificial Intelligence Lab).
“Many academic researchers and data scientists don’t have access to truly large-scale datasets because it is traditionally a privilege reserved for large companies,” said Suju Rajan, director of research at Yahoo Labs, in a press release. “We are releasing this dataset for independent researchers because we value open and collaborative relationships with our academic colleagues, and are always looking to advance the state-of-the-art in machine learning and recommender systems.”
The data consists of details on 20 million users and 110 billion events occurring between February and May 2015. Information such as age, gender, and region for a subset of the anonymous users is provided. On the item side, the title, summary, and key phrases of the news articles in question are also included. Events are time-stamped and contain some information about the types of devices used for access.
“The release of this large Yahoo News Feed dataset will be a tremendous asset for the academic research community, and for us at UMass particularly, given our major research activities in natural language processing, information retrieval, databases and computational social science,” said Andrew McCallum, director of UMass’s Center for Data Science.
Whether and when the embattled Yahoo will share such treasured data with commercial enterprises remains to be seen.