Accelerating researchers and developers building multilingual AI with a new open dataset

Mon, Jun 15 · 7:17 PM UTC · 2 min read

Compiled by KHAO Editorial — aggregated from 1 source. See llms.txt for citation guidance.

Software may be written in programming languages, but human language is at the heart of developer collaboration.

Key facts

The dataset covers over 80 million classification rows across more than 40 million repositories
Portuguese tops the non-English README list with more than 3 million repositories
The dataset, and the broader importance of open data for multilingual AI, will be discussed at the Open Innovation Dialogue Hub in Strasbourg on June 16
The GitHub Multilingual Repositories Dataset is intentionally not a dump of repository content

Summary

As AI becomes a bigger part of how developers build software, multilingual developer content matters more than ever. Today, GitHub is publishing the GitHub Multilingual Repositories Dataset, a repository-level metadata dataset designed to help researchers and developers discover public GitHub repositories with evidence of non-English natural-language content. The dataset is now.0. The GitHub Multilingual Repositories Dataset is intentionally not a dump of repository content. Instead, it provides language classifications of the README, the most-commented issue, and the most-commented pull request, with the first 150 characters of each used as the input sample.
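
The sampling step described in the summary can be illustrated with a minimal Python sketch. It is not GitHub's pipeline: the field names, the source labels, and the `detect` callable are hypothetical stand-ins, and the only detail taken from the summary is that the first 150 characters of each text serve as the classifier input.

```python
from dataclasses import dataclass
from typing import Callable

# From the summary: the first 150 characters are used as the input sample.
SAMPLE_LIMIT = 150

# Hypothetical labels for the three classified texts; the published
# dataset's actual column names and values may differ.
SOURCES = ("readme", "most_commented_issue", "most_commented_pull_request")


@dataclass
class ClassificationRow:
    """One illustrative classification row (assumed shape, not the real schema)."""
    repository: str
    source: str
    input_sample: str
    language: str


def make_input_sample(text: str, limit: int = SAMPLE_LIMIT) -> str:
    """Return the first `limit` characters (Unicode code points) of `text`."""
    return text[:limit]


def classify_repository(
    repository: str,
    texts: dict[str, str],
    detect: Callable[[str], str],
) -> list[ClassificationRow]:
    """Build one row per available text, classifying only the truncated sample.

    `detect` is a placeholder for any language-identification function that
    maps a string to a language code.
    """
    rows = []
    for source in SOURCES:
        text = texts.get(source)
        if not text:
            continue  # skip sources the repository does not have
        sample = make_input_sample(text)
        rows.append(ClassificationRow(repository, source, sample, detect(sample)))
    return rows
```

Passing the classifier in as a callable keeps the sketch independent of any particular language-identification library. Because only the truncated sample is stored, a row carries evidence of the language without reproducing the repository content itself, which matches the "not a dump" framing above.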

Read full article at GitHub Blog →

#GitHub