How much smaller can you make your LM with overtraining?
This figure from the Chinchilla paper gives a clue about what to expect. Say you have a training budget of C = 6e20 FLOPs.
If N = 350M, the model reaches roughly the same loss as the compute-optimal model trained with C = 1e20 (N_opt = 900M).
=> 6x the training FLOPs for ~2.5x fewer inference FLOPs
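The trade-off arithmetic can be sketched in a few lines, assuming the standard Chinchilla approximations C ≈ 6·N·D for training FLOPs and ~2·N FLOPs per inference token (the "on par" loss equivalence itself is the figure's empirical claim, not derived here):

```python
# Overtraining trade-off, using the example numbers above.
# Assumptions: C ≈ 6*N*D (training), ~2*N FLOPs per token (inference).
N_small, C_small = 350e6, 6e20   # overtrained smaller model
N_opt, C_opt = 900e6, 1e20       # compute-optimal model at the lower budget

train_ratio = C_small / C_opt    # extra training compute paid: 6x
infer_ratio = N_opt / N_small    # per-token inference FLOPs saved: ~2.6x

# Tokens each run would see under C = 6*N*D
D_small = C_small / (6 * N_small)   # ~2.9e11 tokens
D_opt = C_opt / (6 * N_opt)         # ~1.9e10 tokens

print(f"{train_ratio:.1f}x training FLOPs for {infer_ratio:.1f}x fewer inference FLOPs")
```

So the inference saving tracks the parameter-count ratio (900M / 350M ≈ 2.6), paid for by a 6x larger training budget.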
https://twitter.com/arankomatsuzaki/status/1630257908238696449