itishappy 5 months ago
I'm no lawyer, but this repository sure appears to be relicensing the Harry Potter series under the GPL.
gunalx 5 months ago
If all the training data is in the txt files, it is obviously trained on copyrighted material, and on an immensely small amount of text. I'm impressed if the outputs even start to make sense at all.
nickpsecurity 5 months ago
True. The best solution for small models is to use Project Gutenberg to reduce risk: https://huggingface.co/datasets/manu/project_gutenberg
It's also enough data to be Chinchilla-optimal in under one epoch for SLMs.
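(As a rough sanity check of the Chinchilla claim, a minimal Python sketch assuming the usual rule of thumb of roughly 20 training tokens per parameter; the parameter counts below are just illustrative, not from the thread:)

    # Back-of-the-envelope Chinchilla token budget: ~20 tokens per parameter.
    TOKENS_PER_PARAM = 20

    def chinchilla_optimal_tokens(n_params: int) -> int:
        # Approximate compute-optimal training-token count for a model of n_params.
        return TOKENS_PER_PARAM * n_params

    # Illustrative "small LLM" sizes (assumed).
    for n_params in (10_000_000, 125_000_000, 1_000_000_000):
        tokens = chinchilla_optimal_tokens(n_params)
        print(f"{n_params/1e6:,.0f}M params -> ~{tokens/1e9:.1f}B tokens")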
burgerrito 5 months ago
...is that the whole Harry Potter book in one .txt file, hosted on GitHub!?

ClearAndPresent 5 months ago
That is all the Harry Potter books in one .txt file, hosted on GitHub.
Are you the author of the GitHub repo? If so, I might have a few suggestions.
Large power? 20MW?
Bro, use FineWeb. Random books are not objectively good training data.
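(A minimal sketch of pulling FineWeb through the Hugging Face datasets library in streaming mode; the dataset id and config name here are assumptions, so check the FineWeb dataset card before relying on them:)

    # Stream a FineWeb sample rather than loading it all into memory.
    # "HuggingFaceFW/fineweb" and "sample-10BT" are assumed names; verify on the dataset card.
    from datasets import load_dataset

    ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                      split="train", streaming=True)

    for i, example in enumerate(ds):
        print(example["text"][:200])   # each record should carry a "text" field
        if i >= 2:
            break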
“Small LLM” means “Small Large Language Model”