FYI: For Flux, there is a lot more power in the text encoder, and you can prompt with more meaningful and comprehensive sentences. Thus, less of the traditional comma-separated, concise phrasing we saw with Stable Diffusion.
You should do the same with your training images. Caption everything you do not want the model to remember as "you" (what you're doing, wearing, who you're accompanied by, accessories, etc.).
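To make that concrete, here's a minimal sketch of the sidecar-caption layout many LoRA training scripts (kohya_ss-style tooling, for example) expect: one .txt file per image, with a trigger word standing in for the subject. The file names, directory, and "TOK" trigger word are all placeholders.

```python
# Hypothetical sidecar captions for a LoRA training set: one .txt per image.
# "TOK" is a placeholder trigger word for the subject; everything you do NOT
# want baked into the concept (clothing, props, setting) goes into the caption.
from pathlib import Path

captions = {
    "img_001.jpg": "a photo of TOK wearing a red raincoat on a rainy street, holding an umbrella",
    "img_002.jpg": "a photo of TOK sitting at a desk, wearing glasses, next to a laptop and a coffee mug",
}

dataset_dir = Path("training_images")  # placeholder path
dataset_dir.mkdir(exist_ok=True)
for image_name, caption in captions.items():
    (dataset_dir / image_name).with_suffix(".txt").write_text(caption)
```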
I did this for our beloved, dead cat... on Replicate, too. I loved the results, until at one point I suddenly got really creeped out about what I was doing.
This is going to be big business, I think. I have probably sent hundreds of thousands of emails, texts, chats, etc. It would be well within the realm of possibility to train an LLM on a loved one's communications corpus and allow you to chat with "them" after they're gone.
Possible? Yes. Convincing results? Probably. Good idea? I doubt it.
Oh man, I did this with my dad's voice after he died and set up a thing where I could talk with an LLM-backed assistant and have it respond in his voice and mannerisms. It was a very weird coping and grief period and I ultimately hit a point where I got really weirded out about what I was doing.
I think that was 1:1 a Black Mirror episode
The episode title was "Be Right Back"
This reminds me of paintings in Harry Potter.
Literally a Black Mirror episode.
I remember seeing here on HN that someone did that with a group chat, and it would reply as each friend.
This is exactly what I'd want to do for my "smart urn."
Code golf task: implement the whole pipeline above in the minimum number of (currently existing) ComfyUI nodes.
Extra challenge: extend that to produce videos (e.g. via "live portrait" nodes/models), to implement the digital version of the magic paintings (and newspaper photos) from Harry Potter.
EDIT:
I'm not joking. This feels like a weekend challenge today; "live portraits" in particular run fast on a half-decent consumer GPU, like my RTX 4070 Ti (the old one, not Super), and I believe (but haven't tested yet) that even training a LoRA from a couple dozen images is reasonably doable locally too.
In general, my experience with Stable Diffusion and ComfyUI is that, for a fully local scenario on a normal person's hardware (i.e. not someone's totally normal PC that happens to have eight 30xx GPUs in a cluster), the capabilities and speed are light years ahead of the LLM space.
Just for comparison, yesterday I - like half the techies on the planet - got to run me some local DeepSeek-R1. The 1.58-bit dynamic quant topped out at 0.16 tokens per second. That's about the same time as it takes an SD1.5 derivative to generate a decent-looking HD image. I could probably get them running in parallel in lock-step (SD on GPU, compute-bound; DeepSeek on CPU, RAM-bandwidth bound) and get one image per LLM token.
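A toy sketch of that lock-step idea, just to show the shape of it: the diffusion model and the LLM run concurrently on different bottlenecks. `generate_image` and `next_token` are placeholders for whatever local backends you'd actually wire in (a ComfyUI workflow call and a llama.cpp binding, say); this is an assumption-laden illustration, not a benchmarked setup.

```python
# Toy lock-step loop: image generation (GPU, compute-bound) and LLM decoding
# (CPU, RAM-bandwidth-bound) run side by side, yielding one image per token.
# Both worker functions below are placeholders, not real backends.
from concurrent.futures import ThreadPoolExecutor

def generate_image(prompt: str) -> bytes:
    """Placeholder: run an SD1.5-class model on the GPU and return image bytes."""
    raise NotImplementedError

def next_token(context: str) -> str:
    """Placeholder: decode one token from a heavily quantised LLM on the CPU."""
    raise NotImplementedError

def lockstep(prompt: str, context: str, steps: int = 10):
    with ThreadPoolExecutor(max_workers=2) as pool:
        for _ in range(steps):
            image_future = pool.submit(generate_image, prompt)
            token_future = pool.submit(next_token, context)
            context += token_future.result()
            yield image_future.result(), context
```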
Forget an urn, I want my digital ghost to haunt a furby.
Replicate does make this particularly easy while still being somewhat developer focused. I've used it for a few people in our group chat so we can make silly in-joke memes and stuff and the results are quite stunning. Replicate then offers the model up over a simple API (shown in the post) if you wanted to let people generate right from the chat, etc. Replicate is worth poking around a bit more broadly, too, they have some interesting models on there (though the pricing tends not to be very competitive if you were going to do it at scale.)
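For reference, calling a fine-tuned model through Replicate's Python client looks roughly like this. The model name, version hash, and trigger word are placeholders for your own fine-tune; the client reads REPLICATE_API_TOKEN from the environment.

```python
# Minimal sketch of querying a Replicate-hosted fine-tune over the API.
# The model identifier below is a placeholder, not a real model.
import replicate

output = replicate.run(
    "your-username/your-flux-finetune:VERSION_HASH",
    input={
        "prompt": "a photo of TOK as a renaissance oil painting",
        "num_outputs": 1,
    },
)
print(output)  # typically a list of image URLs
```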
I had set up automatic1111 a while back, and I believe the webui lets you give your image generation a starting image. It's kind of fun to have a cartoon of yourself based on an image.
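If the webui is started with the --api flag, the same img2img flow can be scripted against its HTTP endpoint. A rough sketch, with field values that are just reasonable starting points (check your webui version's /docs page for the exact schema):

```python
# Sketch of img2img via the AUTOMATIC1111 webui HTTP API (launched with --api).
# "me.jpg" and the prompt are placeholders; a lower denoising_strength keeps
# more of the original photo.
import base64
import requests

with open("me.jpg", "rb") as f:
    init_image = base64.b64encode(f.read()).decode()

payload = {
    "init_images": [init_image],
    "prompt": "cartoon illustration of a person, flat colors, clean lines",
    "denoising_strength": 0.55,
    "steps": 25,
}

resp = requests.post("http://127.0.0.1:7860/sdapi/v1/img2img", json=payload)
images = resp.json()["images"]  # base64-encoded result images
```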
What I want is to be able to feed in a bunch of videos and generate an animatable (from talking) 3D face from that data. I suppose in theory you only need 3 images (front and both sides). But mapping pixels to motion (facial expressions) is interesting.
There wouldn't be depth data, so it would have to be inferred from shadows.
Replicate has Hunyuan video training now. https://replicate.com/blog/fine-tune-video
Also, Kling 1.6 Elements works pretty okay if you use the same person/face for each element.
Kling also has lip sync.
Or this lip sync with replicate: https://replicate.com/bytedance/latentsync
Or there are HeyGen, D-ID, Synthesia, or tavus.io for fully interactive digital twins.
Thanks
Why do you want to do that?
My case is not directly nefarious. For example: taking the content of an old, popular YouTuber who streamed in the early 2000s and making a model of them for personal use, like a 3D chatbot with that person's quirks.
Edit: when I say "nefarious" I mean you can use that tech to impersonate someone (e.g. for political reasons), but my case is more the creeper type: cloning someone for personal use, e.g. Replika.
Tangent: the holo VTuber industry is interesting, since they build up these characters with some unique persona/theme and then people follow that specific model; they could make themselves into an AI easily since it's a rigged 3D asset but of course it would be boring compared to the real thing.
>they could make themselves into an AI easily since it's a rigged 3D asset but of course it would be boring compared to the real thing
The most popular vtuber on Twitch is an AI tho
You talking NeuroSama? I haven't kept up with it in a bit
I'm not sure if that's truly AI since the Turtle drives her
Edit: if the source was open I'd believe it
>I'm not sure if that's truly AI
It has always been an LLM. There is no human typing at insane speed into the TTS.
I'm referring to live interception of messages, which I guess has to be done to be compliant with Twitch's terms -- there is a human there.
edit: but yeah the fact that so many people interact with her shows generated content can keep people occupied
I did this a while back, though it was pictures of my wife in lingerie.
- I asked Grok to generate a list of racy prompts.
- Had Replicate generate them via script. About 10-20% were very poor; I filtered those out manually.
- It also has NSFW guardrails, but a simple retry or word juggle gives you a chance to get around it.
I think I spent $10
There is a parallel "underground" AI research world of stuff like this, with its hub on civitai.com instead of Hugging Face.
Often the innovations from that world are ahead of mainstream AI research by years. You should see what coomers did for LLM sampling to get over issues with "slop" responses, just for their own pervy interests. This was a full several years before the mainstream crowd ever cared.
Porn has always pushed the boundaries of media on the internet. I don't know why people are surprised! Since sex is something nearly everyone does, it would make sense that a lot of human progress was the result of trying to integrate sex with whatever new tech is out there at the time. I am sure a hundred years ago some inventors were pushing the boundaries of motors in sex toys, and in another hundred years some other inventor will be pushing the boundaries on putting sex in holograms.
It's kind of annoying that some of the best models out there have a tendency to produce very not safe for work results.
Look mom, I can make some cool astrology images for you! Whoops, that's boobs. That too. And this one. Ehh, hold up, I need to add a pile of negative prompts first...
Sketching nude humans is a huge part of how human painters learn. Because, surprisingly, clothed humans are just nude humans with some fabric over them, and the fabric can make it harder to tell what's going on.
Even if we assumed equal amounts of effort, it wouldn't be surprising if a large corpus of nude images in the training data improved model results.
But maybe we should have better negative-prompt presets for different levels of decency.
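Something like this, say: just named bundles of negative-prompt terms chosen per audience. A rough sketch; the term lists are illustrative, not a vetted safety filter.

```python
# Hypothetical "decency presets": named negative-prompt bundles you prepend
# depending on who is going to see the output.
NEGATIVE_PRESETS = {
    "strict": "nsfw, nude, nudity, cleavage, underwear, suggestive",
    "moderate": "nsfw, nude, nudity",
    "off": "",
}

def build_negative_prompt(level: str, extra: str = "") -> str:
    """Combine a preset with any case-specific negative terms."""
    base = NEGATIVE_PRESETS[level]
    return ", ".join(part for part in (base, extra) if part)

# e.g. build_negative_prompt("strict", "text, watermark")
```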
Thank you for sharing. Is there any model that can help convert pictures into cartoon or flat vector illustrations?
Look into using img2img in stable diffusion
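A minimal sketch of that with the diffusers img2img pipeline: start from the photo and push it toward a cartoon / flat-illustration style. The model choice, prompt, and strength are just reasonable starting points, not a recipe from the post.

```python
# Img2img with diffusers: keep the structure of the input photo, restyle it.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("portrait.jpg").convert("RGB").resize((512, 512))  # placeholder input

result = pipe(
    prompt="flat vector illustration of a person, cartoon style, minimal shading",
    image=init_image,
    strength=0.6,        # how far to move away from the original photo
    guidance_scale=7.5,
).images[0]
result.save("cartoon.png")
```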
This is fantastic, but now you need to train a model to detect AI-generated images from actual photos. Then of course, a model to beat the detector model, and then a model to catch the model that beats the detector model, and so on.
Thank you from people holding NVDA.
You may have re-invented GANs :-)
is something like this possible to do with video yet?
I’m imagining something where an influencer trains AI to make and post images of themselves on social media, then the influencer dies but the AI keeps going forever.
The impact is kind of interesting: how do you know someone's legit, that it's really the person doing the basejumping or whatever?
Thanos/NFTs: where did that take you? right back to me
Thinking hardware with a built-in chain interface for proof
Oh man dating apps too
That's true love though, two people meet up IRL they're both like wtf who are you