Bill Hartzer

Yandex Drops a 5B-Event Dataset to Fix What’s Broken in AI Recommendations

Posted on May 29, 2025 Written by Bill Hartzer

Yandex has released what it’s calling the largest publicly available dataset for training and evaluating recommender systems. The dataset, dubbed Yambda (Yandex Music Billion-Interactions Dataset), contains nearly 5 billion anonymized user interactions collected over ten months from its music streaming service, Yandex Music.

The data includes listens, likes, dislikes, and the timing of each event. Researchers, developers, and startups can now study user behavior at scale, something that was previously possible only for tech giants with locked-down data.

Why This Matters

Strong recommendation models depend on training data that mirrors actual user behavior. Most existing public datasets don’t come close.

Spotify’s dataset contains playlists, but not enough behavioral detail. Netflix’s well-known dataset records only the date of each rating, not the time, and has been outdated for years. Criteo’s click logs focus narrowly on ads and are poorly documented.

That leaves researchers developing models in conditions that don’t reflect the real world. When those models are applied commercially, they fail to deliver. Yandex’s release aims to change that.

What’s Inside Yambda

Yambda draws from the listening habits of roughly one million users on Yandex Music. It features 4.79 billion user interactions across 9.39 million tracks. These include:

  • Implicit feedback like listening activity
  • Explicit actions such as likes, dislikes, and even the removal of a reaction
  • Audio embeddings, meaning vector data created from the music itself using neural networks
  • A flag that marks whether a user found a track on their own or through a recommendation
  • Timestamps on every action, allowing for behavioral sequence analysis

The data is anonymized. Both users and tracks are assigned numeric identifiers to comply with privacy standards.
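
To make the list above concrete, here is a minimal Python sketch of how a single interaction might be represented and how timestamps allow per-user sequences to be rebuilt. The field names, the event labels, and the timestamp unit are assumptions chosen for illustration; they are not the dataset's actual column names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # All field names below are hypothetical, not the real schema.
    user_id: int      # anonymized numeric user identifier
    track_id: int     # anonymized numeric track identifier
    timestamp: int    # event time; the unit is an assumption
    kind: str         # e.g. "listen", "like", "dislike" (illustrative labels)
    is_organic: bool  # True if the user found the track on their own


def build_sequences(events):
    """Group events by user and order each user's events by time."""
    by_user = defaultdict(list)
    for event in events:
        by_user[event.user_id].append(event)
    return {
        user: sorted(user_events, key=lambda e: e.timestamp)
        for user, user_events in by_user.items()
    }


def organic_share(events):
    """Fraction of events that did not come from a recommendation."""
    events = list(events)
    if not events:
        return 0.0
    return sum(e.is_organic for e in events) / len(events)


events = [
    Event(1, 10, 300, "like", False),
    Event(1, 11, 100, "listen", True),
    Event(2, 10, 200, "listen", True),
    Event(1, 12, 200, "dislike", False),
]
sequences = build_sequences(events)
```

With toy data like this, `sequences[1]` holds user 1's events in time order, which is the kind of input a sequence-aware model would consume.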

How It’s Packaged

Researchers can access the dataset in three different sizes:

  • 5 billion events for enterprise-grade experimentation
  • 500 million events for mid-level projects
  • 50 million events for lightweight models or limited computing power

The files come in Apache Parquet format, compatible with tools like Spark, Hadoop, Pandas, and Polars. That flexibility makes it easier to work across platforms without reshaping the data.

How It’s Evaluated

Yandex didn’t stop at releasing data. The company also provided baseline models for testing. These include algorithms like ItemKNN, iALS, BPR, SANSA, and SASRec. To measure results, the accompanying benchmark uses common evaluation metrics:

  • NDCG@k (ranking quality)
  • Recall@k (how many relevant items are retrieved)
  • Coverage@k (variety of items recommended)
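
For readers who want to see what these three metrics compute, the sketch below implements common textbook versions in plain Python. It assumes binary relevance (an item is either relevant or not) and recommendation lists without duplicates. Definitions vary between papers, so this is not necessarily the exact formulation used in the Yambda benchmark.

```python
import math


def recall_at_k(recommended, relevant, k):
    """Share of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank 0 gets a discount of log2(2) = 1
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def coverage_at_k(all_recommendations, catalog_size, k):
    """Share of the catalog that appears in at least one user's top k."""
    seen = set()
    for recommended in all_recommendations:
        seen.update(recommended[:k])
    return len(seen) / catalog_size


recommended = ["a", "b", "c"]
relevant = {"a", "c"}
recall = recall_at_k(recommended, relevant, k=2)
ndcg = ndcg_at_k(recommended, relevant, k=3)
coverage = coverage_at_k([["a", "b"], ["b", "c"]], catalog_size=4, k=2)
```

Recall and NDCG are computed per user and usually averaged, while coverage is computed once over every user's list.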

Instead of artificially chopping off each user's history for testing, Yambda uses a method called Global Temporal Split (GTS), which divides the data at a single point in time. This keeps time-sequence data intact and mirrors how real-world systems work, where future data isn’t known in advance.
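
The idea of a global temporal split can be sketched in a few lines of Python. This is a simplified illustration under assumed conventions (events as `(user, item, timestamp)` tuples and one hand-picked cutoff), not Yandex's actual implementation.

```python
def global_temporal_split(events, cutoff):
    """Split (user, item, timestamp) events at one shared point in time.

    Everything before the cutoff is training data; everything at or
    after it is held out for testing, regardless of which user it
    belongs to.
    """
    train = [e for e in events if e[2] < cutoff]
    test = [e for e in events if e[2] >= cutoff]
    return train, test


def has_no_future_leak(train, test):
    """Check that no training event happens after the earliest test event."""
    if not train or not test:
        return True
    return max(e[2] for e in train) < min(e[2] for e in test)


events = [
    (1, "a", 10),
    (2, "b", 20),
    (1, "c", 30),
    (2, "d", 40),
    (2, "e", 50),
]
train, test = global_temporal_split(events, cutoff=35)
```

Because the cutoff is shared by all users, a model trained on `train` never sees an event that happened after the test period began.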

Making Big Data Useful Again

Yandex’s release gives smaller teams and academic labs access to data they couldn’t otherwise get. Startups can test recommender models without scraping together their own datasets. Researchers can try out new techniques without relying on stale data from 2010.

And because the dataset includes both what users liked and what they skipped or rejected, it supports a broader set of experiments—spanning music, retail, and content recommendations.

Where to Get It

Yambda is now available on Hugging Face, the popular open-access platform for machine learning models and datasets. Researchers can begin working with the dataset immediately.

Final Thoughts

This release marks a rare move in an industry where big data often stays locked behind company doors. By opening up billions of real-world interactions, Yandex is giving the research community a real chance to improve how recommendations are built—and tested.

For anyone working on AI that suggests music, videos, products, or anything personalized, this dataset changes the playing field. It’s one of the few times the public gets a peek behind the curtain of a major streaming platform—and the chance to build something better with it.

Filed Under: Search Engines

About Bill Hartzer

Bill Hartzer is the CEO of Hartzer Consulting and founder of DNAccess, a domain name protection and recovery service. A recognized authority in digital marketing and domain strategy, Bill is frequently called upon as an Expert Witness in internet-related legal cases. He's been sharing insights and research here on BillHartzer.com for over two decades.
