itishappy 7 hours ago
I'm no lawyer, but this repository sure appears to be relicensing the Harry Potter series under the GPL.
gunalx 8 hours ago
If all the training data is in the .txt files, it is obviously trained on copyrighted material, and an immensely small amount of text. I'm impressed if the outputs even start to make sense at all.
nickpsecurity 7 hours ago
True. The best solution for small models is to use Project Gutenberg to reduce risk:
https://huggingface.co/datasets/manu/project_gutenberg
It's also enough data to be Chinchilla optimal in under one epoch for SLMs.
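For context, the Chinchilla heuristic of roughly 20 training tokens per parameter makes this easy to sanity-check. A minimal Python sketch follows; the 125M-parameter model size is an illustrative assumption, and the "en" split name for the linked dataset should be verified against its dataset card:

    from datasets import load_dataset  # pip install datasets

    # Chinchilla heuristic: compute-optimal training uses roughly 20 tokens per parameter.
    def chinchilla_optimal_tokens(n_params: int, tokens_per_param: int = 20) -> int:
        return n_params * tokens_per_param

    # Hypothetical 125M-parameter SLM -> about 2.5B training tokens needed.
    print(f"{chinchilla_optimal_tokens(125_000_000) / 1e9:.1f}B tokens")

    # Stream the linked Project Gutenberg dataset instead of downloading it whole.
    # The "en" split name is an assumption; check the dataset card for exact names.
    books = load_dataset("manu/project_gutenberg", split="en", streaming=True)
    first_book = next(iter(books))
    print(list(first_book.keys()))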
burgerrito 8 hours ago
....is that the whole Harry Potter book in one .txt file, hosted on GitHub!?

ClearAndPresent 7 hours ago
That is all the Harry Potter books in one .txt file, hosted on GitHub.
Large power? 20MW?
Bro, use FineWeb. Random books are not objectively good training data.
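(For reference, FineWeb is on the Hugging Face Hub as HuggingFaceFW/fineweb. A minimal sketch of sampling it in streaming mode; the "sample-10BT" config is the usual small-run subset, but confirm the name on the dataset card:)

    from datasets import load_dataset  # pip install datasets

    # Stream a small FineWeb sample instead of fetching the multi-terabyte full set.
    # The "sample-10BT" config name is an assumption; confirm it on the dataset card.
    fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                      split="train", streaming=True)

    for i, doc in enumerate(fw):
        print(doc["text"][:80])  # each row carries the raw page text in "text"
        if i == 2:
            break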
Are you the author of the GitHub repo? If so, I might have a few suggestions.
“Small LLM” means “Small Large Language Model”