GitHub · GitHub Blog
Accelerating researchers and developers building multilingual AI with a new open dataset
Compiled by KHAO Editorial — aggregated from 1 source. See llms.txt for citation guidance.
Software may be written in programming languages, but human language is at the heart of developer collaboration.
Key facts
- The dataset covers over 80 million classification rows across more than 40 million repositories
- Portuguese tops the non-English README list with more than 3 million repositories
- The dataset, and the broader importance of open data for multilingual AI, will be discussed at the Open Innovation Dialogue Hub in Strasbourg on June 16
- The GitHub Multilingual Repositories Dataset is intentionally not a dump of repository content
Summary
As AI becomes a bigger part of how developers build software, multilingual developer content matters more than ever. Today, GitHub is publishing the GitHub Multilingual Repositories Dataset, a repository-level metadata dataset designed to help researchers and developers discover public GitHub repositories with evidence of non-English natural-language content. The dataset is now available. It is intentionally not a dump of repository content. Instead, it contains language classifications of each repository's README, most-commented issue, and most-commented pull request, with the first 150 characters of each used as the input sample.
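To make the sampling step concrete, here is a minimal Python sketch of how one repository could yield up to three classification rows, one per text source, each carrying only the first 150 characters. The names ClassificationRow, build_rows, and every field name are hypothetical and are not taken from the published dataset schema. The sketch also assumes that "characters" means Unicode code points, which the source does not specify, and it leaves the language classifier itself out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

SAMPLE_CHARS = 150  # sample length stated in the summary

# The three text sources named in the summary (labels are hypothetical).
SOURCES = ("readme", "issue", "pull_request")


@dataclass
class ClassificationRow:
    """One row per (repository, source) pair; field names are illustrative."""
    repository: str
    source: str
    sample: str                      # first 150 characters of the source text
    language: Optional[str] = None   # would be set by a classifier, not shown


def build_rows(repository: str, texts: Dict[str, str]) -> List[ClassificationRow]:
    """Build one row for each source that has text, truncating the sample."""
    rows = []
    for source in SOURCES:
        text = texts.get(source)
        if not text:
            continue  # skip sources the repository does not have
        rows.append(ClassificationRow(repository, source, text[:SAMPLE_CHARS]))
    return rows


# Usage: a made-up repository with a long README and a short issue.
rows = build_rows(
    "octo/example",
    {"readme": "Olá! " * 100, "issue": "Short issue text"},
)
for row in rows:
    print(row.source, len(row.sample))
```

Storing a short sample instead of the full text is consistent with the dataset's stated aim of providing metadata instead of repository content.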