itishappy 5 months ago
I'm no lawyer, but this repository sure appears to be relicensing the Harry Potter series under the GPL.
gunalx 5 months ago
If all the training data is in the txt files, it is obviously trained on copyrighted material, and on an immensely small amount of text. I'm impressed if the outputs even start to make sense at all.
nickpsecurity 5 months ago
True. The best solution for small models is to use Project Gutenberg to reduce risk: https://huggingface.co/datasets/manu/project_gutenberg
It's also enough data to be Chinchilla-optimal in under one epoch for SLMs.
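(As a rough sanity check of the Chinchilla claim, a minimal Python sketch assuming the usual rule of thumb of roughly 20 training tokens per parameter; the parameter counts below are just illustrative, not from the thread:)

    # Back-of-the-envelope Chinchilla token budget: ~20 tokens per parameter.
    TOKENS_PER_PARAM = 20

    def chinchilla_optimal_tokens(n_params: int) -> int:
        # Approximate compute-optimal training-token count for a model of n_params.
        return TOKENS_PER_PARAM * n_params

    # Illustrative "small LLM" sizes (assumed).
    for n_params in (10_000_000, 125_000_000, 1_000_000_000):
        tokens = chinchilla_optimal_tokens(n_params)
        print(f"{n_params/1e6:,.0f}M params -> ~{tokens/1e9:.1f}B tokens")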
burgerrito 5 months ago
...is that the whole Harry Potter book in one .txt file, hosted on GitHub!?

ClearAndPresent 5 months ago
That is all the Harry Potter books in one .txt file, hosted on GitHub.
Are you the author of the GitHub repo? If so, I might have a few suggestions.
Large power? 20MW?
Bro, use FineWeb. Random books are not objectively good training data.
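(A minimal sketch of pulling FineWeb through the Hugging Face datasets library in streaming mode; the dataset id and config name here are assumptions, so check the FineWeb dataset card before relying on them:)

    # Stream a FineWeb sample rather than loading it all into memory.
    # "HuggingFaceFW/fineweb" and "sample-10BT" are assumed names; verify on the dataset card.
    from datasets import load_dataset

    ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                      split="train", streaming=True)

    for i, example in enumerate(ds):
        print(example["text"][:200])   # each record should carry a "text" field
        if i >= 2:
            break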
“Small LLM” means “Small Large Language Model”