63. Train a small 110M-parameter model from scratch using the DeepSeek-V4 architecture, making it easy to experiment … (t.co) by @QingQ77 (Geek Lite) · backlist 2026-05-07 · rubric 92.0