FYI: For Flux, there is a lot more power in the text encoder, and you can prompt with more meaningful and comprehensive sentences. Thus, less of the traditional comma-separated, concise phrasing we saw with Stable Diffusion.
You should do the same with your training images. Caption everything you do not want the model to remember as "you" (what you're doing, wearing, who you're accompanied by, accessories, etc.).
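To make that concrete, here's a minimal sketch of the sidecar-caption layout many LoRA training scripts (kohya_ss-style tooling, for example) expect: one .txt file per image, with a trigger word standing in for the subject. The file names, directory, and "TOK" trigger word are all placeholders.

```python
# Hypothetical sidecar captions for a LoRA training set: one .txt per image.
# "TOK" is a placeholder trigger word for the subject; everything you do NOT
# want baked into the concept (clothing, props, setting) goes into the caption.
from pathlib import Path

captions = {
    "img_001.jpg": "a photo of TOK wearing a red raincoat on a rainy street, holding an umbrella",
    "img_002.jpg": "a photo of TOK sitting at a desk, wearing glasses, next to a laptop and a coffee mug",
}

dataset_dir = Path("training_images")  # placeholder path
dataset_dir.mkdir(exist_ok=True)
for image_name, caption in captions.items():
    (dataset_dir / image_name).with_suffix(".txt").write_text(caption)
```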
I did this for our beloved, dead cat... on Replicate, too. I loved the results, until at one point I suddenly got really creeped out about what I was doing.
This is going to be big business, I think. I have probably sent hundreds of thousands of emails, texts, chats, etc. It would be well within the realm of possibility to train an LLM on a loved one's communications corpus and allow you to chat with "them" after they're gone.
Possible? Yes. Convincing results? Probably. Good idea? I doubt it.
Oh man, I did this with my dad's voice after he died and set up a thing where I could talk with an LLM-backed assistant and have it respond in his voice and mannerisms. It was a very weird coping and grief period and I ultimately hit a point where I got really weirded out about what I was doing.
I think that was 1:1 a Black Mirror episode
The episode title was "Be Right Back"
This reminds me of paintings in Harry Potter.
Literally a Black Mirror episode.
I remember seeing here on HN that someone did that with a group chat, and it would reply as each friend.
This is exactly what I'd want to do for my "smart urn."
Code golf task: implement the whole pipeline above in the minimum number of (currently existing) ComfyUI nodes.
Extra challenge: extend that to produce videos (e.g. via "live portrait" nodes/models), to implement the digital version of the magic paintings (and newspaper photos) from Harry Potter.
EDIT:
I'm not joking. This feels like a weekend challenge today; "live portraits" in particular run fast on a half-decent consumer GPU, like my RTX 4070 Ti (the old one, not Super), and I believe (but haven't tested yet) that even training a LoRA from a couple dozen images is reasonably doable locally too.
In general, my experience with Stable Diffusion and ComfyUI is that, for a fully local scenario on a normal person's hardware (i.e. not someone's totally normal PC that happens to have eight 30xx GPUs in a cluster), the capabilities and speed are light years ahead of the LLM space.
Just for comparison, yesterday I - like half the techies on the planet - got to run me some local DeepSeek-R1. The 1.58-bit dynamic quant topped out at 0.16 tokens per second. That's about the same time as it takes an SD1.5 derivative to generate a decent-looking HD image. I could probably get them running in parallel in lock-step (SD on GPU, compute-bound; DeepSeek on CPU, RAM-bandwidth bound) and get one image per LLM token.
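A toy sketch of that lock-step idea, just to show the shape of it: the diffusion model and the LLM run concurrently on different bottlenecks. `generate_image` and `next_token` are placeholders for whatever local backends you'd actually wire in (a ComfyUI workflow call and a llama.cpp binding, say); this is an assumption-laden illustration, not a benchmarked setup.

```python
# Toy lock-step loop: image generation (GPU, compute-bound) and LLM decoding
# (CPU, RAM-bandwidth-bound) run side by side, yielding one image per token.
# Both worker functions below are placeholders, not real backends.
from concurrent.futures import ThreadPoolExecutor

def generate_image(prompt: str) -> bytes:
    """Placeholder: run an SD1.5-class model on the GPU and return image bytes."""
    raise NotImplementedError

def next_token(context: str) -> str:
    """Placeholder: decode one token from a heavily quantised LLM on the CPU."""
    raise NotImplementedError

def lockstep(prompt: str, context: str, steps: int = 10):
    with ThreadPoolExecutor(max_workers=2) as pool:
        for _ in range(steps):
            image_future = pool.submit(generate_image, prompt)
            token_future = pool.submit(next_token, context)
            context += token_future.result()
            yield image_future.result(), context
```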
Forget an urn, I want my digital ghost to haunt a furby.
Replicate does make this particularly easy while still being somewhat developer focused. I've used it for a few people in our group chat so we can make silly in-joke memes and stuff and the results are quite stunning. Replicate then offers the model up over a simple API (shown in the post) if you wanted to let people generate right from the chat, etc. Replicate is worth poking around a bit more broadly, too, they have some interesting models on there (though the pricing tends not to be very competitive if you were going to do it at scale.)
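For reference, calling a fine-tuned model through Replicate's Python client looks roughly like this. The model name, version hash, and trigger word are placeholders for your own fine-tune; the client reads REPLICATE_API_TOKEN from the environment.

```python
# Minimal sketch of querying a Replicate-hosted fine-tune over the API.
# The model identifier below is a placeholder, not a real model.
import replicate

output = replicate.run(
    "your-username/your-flux-finetune:VERSION_HASH",
    input={
        "prompt": "a photo of TOK as a renaissance oil painting",
        "num_outputs": 1,
    },
)
print(output)  # typically a list of image URLs
```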
I had set up automatic1111 a while back, and I believe the webui lets you give your image generation a starting image. It's kind of fun to have a cartoon of yourself based on an image.
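If the webui is started with the --api flag, the same img2img flow can be scripted against its HTTP endpoint. A rough sketch, with field values that are just reasonable starting points (check your webui version's /docs page for the exact schema):

```python
# Sketch of img2img via the AUTOMATIC1111 webui HTTP API (launched with --api).
# "me.jpg" and the prompt are placeholders; a lower denoising_strength keeps
# more of the original photo.
import base64
import requests

with open("me.jpg", "rb") as f:
    init_image = base64.b64encode(f.read()).decode()

payload = {
    "init_images": [init_image],
    "prompt": "cartoon illustration of a person, flat colors, clean lines",
    "denoising_strength": 0.55,
    "steps": 25,
}

resp = requests.post("http://127.0.0.1:7860/sdapi/v1/img2img", json=payload)
images = resp.json()["images"]  # base64-encoded result images
```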
What I want is to be able to feed in a bunch of videos and generate an animatable (from talking) 3D face from that data. I suppose in theory you only need 3 images (front and both sides). But mapping pixels to motion (facial expressions) is interesting.
There wouldn't be depth data, so it would have to be inferred from shadows.
Replicate has Hunyuan video training now. https://replicate.com/blog/fine-tune-video
Also, Kling 1.6 Elements works pretty okay if you use the same person/face for each element.
Kling also has lip sync.
Or this lip sync with replicate: https://replicate.com/bytedance/latentsync
Or there are HeyGen, D-ID, Synthesia, or tavus.io for fully interactive digital twins.
Thanks
Why do you want to do that?
My case is not directly nefarious. For example: taking the content of an old, popular YouTuber who streamed in the early 2000s and making a model of them for personal use, like a 3D chatbot with that person's quirks.
Edit: when I say "nefarious" I mean you can use that tech to impersonate someone (e.g. for political reasons), but my case is more the creeper type: cloning someone for personal use, e.g. Replika.
Tangent: the holo VTuber industry is interesting, since they build up these characters with some unique persona/theme and then people follow that specific model; they could make themselves into an AI easily since it's a rigged 3D asset but of course it would be boring compared to the real thing.
>they could make themselves into an AI easily since it's a rigged 3D asset but of course it would be boring compared to the real thing
The most popular vtuber on Twitch is an AI tho
You talking NeuroSama? I haven't kept up with it in a bit
I'm not sure if that's truly AI since the Turtle drives her
Edit: if the source was open I'd believe it
>I'm not sure if that's truly AI
It has always been an LLM. There is no human typing at insane speed into the TTS.
I'm referring to live interception of messages, which I guess has to be done to be compliant with Twitch's terms -- there is a human there.
edit: but yeah the fact that so many people interact with her shows generated content can keep people occupied
I did this a while back, though it was pictures of my wife in lingerie.
- I asked Grok to generate a list of racy prompts.
- Had Replicate generate them via script. About 10-20% were very poor; I filtered those out manually.
- It also has NSFW guardrails, but a simple retry or word juggle gives you a chance to get around it.
I think I spent $10
There is a parallel "underground" AI research world of stuff like this, with its hub on civitai.com instead of Hugging Face.
Often the innovations from that world are ahead of mainstream AI research by years. You should see what coomers did for LLM sampling to get over issues with "slop" responses, just for their own pervy interests. This was a full several years before the mainstream crowd ever cared.
Porn has always pushed the boundaries of media on the internet. I don't know why people are surprised! Since sex is something nearly everyone does, it would make sense that a lot of human progress was the result of trying to integrate sex with whatever new tech is out there at the time. I am sure a hundred years ago some inventors were pushing the boundaries of motors in sex toys, and in another hundred years some other inventor will be pushing the boundaries on putting sex in holograms.
It's kind of annoying that some of the best models out there have a tendency to produce very not safe for work results.
Look mom, I can make some cool astrology images for you! Whoops, that's boobs. That too. And this one. Ehh, hold up, I need to add a pile of negative prompts first...
Sketching nude humans is a huge part of how human painters learn. Because, surprisingly, clothed humans are just nude humans with some fabric over them, and the fabric can make it harder to tell what's going on.
Even if we assumed equal amounts of effort, it wouldn't be surprising if a large corpus of nude images in the training data improved model results.
But maybe we should have better negative-prompt presets for different levels of decency.
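Something like this, say: just named bundles of negative-prompt terms chosen per audience. A rough sketch; the term lists are illustrative, not a vetted safety filter.

```python
# Hypothetical "decency presets": named negative-prompt bundles you prepend
# depending on who is going to see the output.
NEGATIVE_PRESETS = {
    "strict": "nsfw, nude, nudity, cleavage, underwear, suggestive",
    "moderate": "nsfw, nude, nudity",
    "off": "",
}

def build_negative_prompt(level: str, extra: str = "") -> str:
    """Combine a preset with any case-specific negative terms."""
    base = NEGATIVE_PRESETS[level]
    return ", ".join(part for part in (base, extra) if part)

# e.g. build_negative_prompt("strict", "text, watermark")
```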
Thank you for sharing. Is there any model that can help convert pictures into cartoon or flat vector illustrations?
Look into using img2img in stable diffusion
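A minimal sketch of that with the diffusers img2img pipeline: start from the photo and push it toward a cartoon / flat-illustration style. The model choice, prompt, and strength are just reasonable starting points, not a recipe from the post.

```python
# Img2img with diffusers: keep the structure of the input photo, restyle it.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("portrait.jpg").convert("RGB").resize((512, 512))  # placeholder input

result = pipe(
    prompt="flat vector illustration of a person, cartoon style, minimal shading",
    image=init_image,
    strength=0.6,        # how far to move away from the original photo
    guidance_scale=7.5,
).images[0]
result.save("cartoon.png")
```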
This is fantastic, but now you need to train a model to detect AI-generated images from actual photos. Then of course, a model to beat the detector model, and then a model to catch the model that beats the detector model, and so on.
Thank you from people holding NVDA.
You may have re-invented GANs :-)
is something like this possible to do with video yet?
I’m imagining something where an influencer trains AI to make and post images of themselves on social media, then the influencer dies but the AI keeps going forever.
The impact is kind of interesting: how do you know someone's legit, that it's really the person doing the basejumping or whatever?
Thanos/NFTs: where did that take you? right back to me
Thinking hardware with a built-in chain interface for proof
Oh man dating apps too
That's true love though, two people meet up IRL they're both like wtf who are you