Following the strong response to the Google Code Archive nyuuzyou/google-code-archive (thanks!), this release preserves another major historical repository: the Microsoft CodePlex Archive.
CodePlex served as Microsoft’s primary open-source hosting platform from 2006 to 2017. This dataset captures the distinct .NET and Windows-centric development ecosystem that flourished before the industry standardized on GitHub.
Key Stats:
- 5,043,730 files from 38,087 repositories
- 3.6 GB compressed Parquet
- 91 programming languages (heavily featuring C#, ASP.NET, and C++)
- Cleaned of binaries, build artifacts, and vendor directories (node_modules, packages)
- Includes platform-specific license metadata (Ms-PL, Ms-RL)
Here's something different from the code datasets: 20+ years of public discussion archives from NNTP newsgroups. Clean Parquet format, but this time it's conversations instead of code.
The data is messy in the way real discussions are messy. Spam wasn't filtered out - you get the advertisements, the arguments, the off-topic threads, all of it. If you want sanitized text, this isn't it. If you want to see how people actually talked online before Discord and Reddit took over, here you go.
Processing was kept simple: convert everything to UTF-8, remove exact duplicates, strip binary attachments, redact emails. Legacy character encodings were a nightmare - I had to handle Windows-1252, ISO-8859 variants, KOI8-R, Shift-JIS, GBK, and others just to get readable text. At least it was fun to do, and I think the result turned out pretty well. I hope someone else will also be able to have fun or gain something useful from this project.
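For anyone curious what that kind of cleanup looks like, here is a rough Python sketch of three of the steps (decode to text, redact emails, drop exact duplicates). It is not the actual pipeline: the function names, the fallback order, the email regex, and the `[email]` placeholder are all assumptions made up for illustration.

```python
import hashlib
import re
from typing import Iterable, List, Optional

# Assumed fallback order, tried only when the declared charset is missing or wrong.
# Guessing is unreliable: permissive codecs can "succeed" with the wrong text.
FALLBACK_ENCODINGS = ("utf-8", "cp1252")

# Deliberately simple pattern; real-world address matching is messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def decode_message(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode raw bytes, trying the declared charset first, then fallbacks."""
    candidates = ([declared] if declared else []) + list(FALLBACK_ENCODINGS)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers unknown or misspelled charset names.
            continue
    # latin-1 maps every byte to a code point, so this never raises.
    return raw.decode("latin-1")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[email]", text)


def dedupe_exact(messages: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each message, comparing by SHA-256 hash."""
    seen = set()
    unique = []
    for message in messages:
        digest = hashlib.sha256(message.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(message)
    return unique
```

Trusting the declared charset first matters because single-byte encodings like KOI8-R and Windows-1252 will happily decode each other's bytes into nonsense without raising an error.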
Rain-100M is a raw base model (not instruction-tuned or safety-aligned), aimed at small-scale research, debugging training pipelines, and CPU/edge experiments. If you run evaluations, finetunes, or visualizations with it, I would be very interested in your results!