nyuuzyou PRO

https://ducks.party/donate

New activity in nyuuzyou/archiveofourown about 12 hours ago

Source code, I guess

🔥 🤗 7

#231 opened 9 months ago by nyuuzyou

updated a dataset about 14 hours ago

nyuuzyou/google-code-archive

Viewer • Updated about 14 hours ago • 65.8M • 1.35k • 59

New activity in nyuuzyou/google-code-archive about 15 hours ago

Question about data collection.

#1 opened 1 day ago by sud0luke

posted an update 1 day ago

Post

2145

🏛️ Microsoft CodePlex Archive Dataset - nyuuzyou/ms-codeplex-archive

Following the strong response to the Google Code Archive nyuuzyou/google-code-archive (thanks!), this release preserves another major historical repository: the Microsoft CodePlex Archive.

CodePlex served as Microsoft’s primary open-source hosting platform from 2006 to 2017. This dataset captures the distinct .NET and Windows-centric development ecosystem that flourished before the industry standardized on GitHub.

Key Stats:

- 5,043,730 files from 38,087 repositories
- 3.6 GB compressed Parquet
- 91 programming languages (heavily featuring C#, ASP.NET, and C++)
- Cleaned of binaries, build artifacts, and vendor directories (node_modules, packages)
- Includes platform-specific license metadata (Ms-PL, Ms-RL)
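The cleaning step in the list above (dropping binaries, build artifacts, and vendor directories) can be pictured as a path filter. The following is only a minimal sketch, not the dataset's actual pipeline: the post names `node_modules` and `packages`, while the `bin`/`obj` folders, the extension list, and the `keep_file` helper are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Named in the post as removed vendor directories.
VENDOR_DIRS = {"node_modules", "packages"}
# Assumption: typical .NET build output folders, not confirmed by the post.
BUILD_DIRS = {"bin", "obj"}
# Assumption: a few example binary extensions, not an exhaustive list.
BINARY_EXTS = {".dll", ".exe", ".pdb", ".zip"}


def keep_file(path: str) -> bool:
    """Return True if a repository-relative path looks like source worth keeping."""
    # Normalize Windows separators so both path styles are handled.
    p = PurePosixPath(path.replace("\\", "/"))
    # Only directory components are checked, so a file such as
    # "packages.config" is not mistaken for the "packages" folder.
    directories = {part.lower() for part in p.parts[:-1]}
    if directories & (VENDOR_DIRS | BUILD_DIRS):
        return False
    return p.suffix.lower() not in BINARY_EXTS


paths = [
    "src/App/Program.cs",
    "web/node_modules/left-pad/index.js",
    "src\\bin\\Debug\\App.dll",
    "packages.config",
]
kept = [p for p in paths if keep_file(p)]
```

Matching on whole directory names rather than substrings keeps the filter from discarding unrelated files whose names merely contain a vendor folder name.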

New activity in nyuuzyou/ms-codeplex-archive 2 days ago

[bot] Conversion to Parquet

#1 opened 2 days ago by parquet-converter

reacted to raincandy-u's post with 😎 3 days ago

Post

2785

Introducing Rain-v2: Democratizing LLM training on gaming GPUs! ⚡

Following Rain-100M, we’re scaling up. Rain-v2 features a larger training dataset.

We’ve published a comprehensive blog covering the end-to-end journey—from raw data collection to rigorous evaluation and safety testing.

HF Repo: 🤗 raincandy-u/Rain-v2

Blog: 📚 https://angelkawaii.xyz/2026/01/29/rain-v2/

Special thanks to the open-source community and the SmolLM2 team for their foundational work! 🚀

HuggingFaceTB
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model (2502.02737)

updated a dataset 3 days ago

nyuuzyou/ms-codeplex-archive

Viewer • Updated 3 days ago • 5.04M • 14 • 2

published a dataset 3 days ago

nyuuzyou/ms-codeplex-archive

Viewer • Updated 3 days ago • 5.04M • 14 • 2

liked a model 3 days ago

moonshotai/Kimi-K2.5

Image-Text-to-Text • Updated 2 days ago • 96.2k • 1.43k

upvoted a collection 3 days ago

Foundation Text-Generation Models Below 360M Parameters

Collection

Great candidates for fine-tuning targeting Wllama and Transformers.js for mobile devices, ordered by number of parameters. • 42 items • Updated 8 days ago • 40

posted an update 8 days ago

Post

1872

🌐 NNTP Discussion Archives - 387M Messages from Public Newsgroups - nyuuzyou/nntp-text-387m

Here's something different from the code datasets: 20+ years of public discussion archives from NNTP newsgroups. Clean Parquet format, but this time it's conversations instead of code.

Key Stats:
- 386,629,949 messages from 159,345 newsgroups
- 191 GB compressed Parquet storage
- Spans 2002-2026
- Multilingual: English, German, French, Italian, Dutch, Polish, Russian, and others
- Email addresses redacted for privacy

The data is messy in the way real discussions are messy. Spam wasn't filtered out - you get the advertisements, the arguments, the off-topic threads, all of it. If you want sanitized text, this isn't it. If you want to see how people actually talked online before Discord and Reddit took over, here you go.

Processing kept it simple: convert everything to UTF-8, remove exact duplicates, strip binary attachments, redact emails. Legacy character encodings were a nightmare - had to handle Windows-1252, ISO-8859 variants, KOI8-R, Shift-JIS, GBK, and others just to get readable text. At least it was fun to do, and I think the result turned out pretty well. I hope someone else will also be able to have fun or gain something useful from this project.
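The processing steps described above (convert to UTF-8, remove exact duplicates, redact emails) could look roughly like the sketch below. It is an illustration under stated assumptions, not the code used for the dataset: the decode order (declared charset, then UTF-8, then Windows-1252 with replacement), the email regex, the `[email redacted]` placeholder, and all function names are hypothetical. Attachment stripping is left out.

```python
import hashlib
import re
from typing import Iterable, Iterator, Optional, Tuple

# Assumption: a deliberately simple email pattern; real-world matching is messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PLACEHOLDER = "[email redacted]"  # hypothetical replacement token


def decode_message(raw: bytes, declared: Optional[str]) -> str:
    """Decode bytes to str, trying the declared charset first."""
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name or bytes that do not fit: try the next one.
            continue
    # Last resort: Windows-1252, replacing bytes it cannot map.
    return raw.decode("cp1252", errors="replace")


def redact_emails(text: str) -> str:
    return EMAIL_RE.sub(PLACEHOLDER, text)


def clean_messages(messages: Iterable[Tuple[bytes, Optional[str]]]) -> Iterator[str]:
    """Yield decoded, deduplicated, email-redacted message bodies."""
    seen = set()
    for raw, declared in messages:
        text = decode_message(raw, declared)
        # Exact-duplicate removal: hash the decoded text and skip repeats.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield redact_emails(text)


sample = [
    ("Привет".encode("koi8_r"), "koi8-r"),
    (b"mail me: bob@example.com", None),
    (b"mail me: bob@example.com", "utf-8"),  # exact duplicate after decoding
    ("café".encode("latin-1"), "x-unknown"),  # bogus charset label
]
cleaned = list(clean_messages(sample))
```

Hashing the decoded text rather than the raw bytes means the same message stored in two different encodings still counts as one duplicate.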

New activity in nyuuzyou/nntp-text-387m 8 days ago

[bot] Conversion to Parquet

#1 opened 8 days ago by parquet-converter

updated a dataset 9 days ago

nyuuzyou/nntp-text-387m

Viewer • Updated 9 days ago • 387M • 772 • 3

liked a model 9 days ago

raincandy-u/Rain-100M

Text Generation • 97.2M • Updated 9 days ago • 215 • 17

reacted to raincandy-u's post with 🔥 9 days ago

Post

5282

🤗 Just released Rain-100M, an experimental ~97M-parameter Qwen3-style language model trained from random initialization.

Repo: raincandy-u/Rain-100M

Data: HuggingFaceFW/fineweb-edu, ~3B tokens, English only

Tokenizer: custom 16k BPE, context length 4096

Architecture: 12 Transformer layers, hidden size 768, 12 heads, MLP 2048, SiLU, bf16

Rain-100M is a raw base model (not instruction-tuned or safety-aligned), aimed at small-scale research, debugging training pipelines, and CPU/edge experiments. If you run evaluations, finetunes, or visualizations with it, I would be very interested in your results!
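As a sanity check on the ~97M figure, the architecture numbers quoted above can be plugged into a back-of-the-envelope parameter count. This is a rough sketch under several assumptions that the post does not confirm: a vocabulary of exactly 16,000 tokens, full multi-head attention without biases (no grouped-query sharing), a gated MLP with three projection matrices, tied input and output embeddings, and two RMSNorm weight vectors per layer. The `count_params` helper is hypothetical.

```python
def count_params(vocab: int = 16_000, hidden: int = 768, layers: int = 12,
                 mlp: int = 2048, tied_embeddings: bool = True) -> int:
    """Estimate the parameter count of a decoder-only Transformer."""
    # Q, K, V, and output projections, each hidden x hidden (assumed no biases).
    attention = 4 * hidden * hidden
    # Gated MLP: gate, up, and down projections (assumed SwiGLU-style layout).
    gated_mlp = 3 * hidden * mlp
    # Two normalization weight vectors per layer.
    norms = 2 * hidden
    per_layer = attention + gated_mlp + norms

    embeddings = vocab * hidden
    # With tied embeddings the output head reuses the embedding matrix.
    lm_head = 0 if tied_embeddings else vocab * hidden
    final_norm = hidden
    return layers * per_layer + embeddings + lm_head + final_norm


total = count_params()
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands close to the ~97M quoted in the post; untying the embeddings or changing the vocabulary size shifts the total by several million, so the match should be read as plausible rather than confirmed.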