12B → 8.9B code?

#12
by deltanym - opened

re:

TL;DR: There are 3.3B parameters that only encode a single input vector, which I replaced with 250M params.

Since FLUX is so big, I had to modify the architecture and ensure minimal knowledge was lost in the process. The most obvious thing to prune was this modulation layer. In the diagram, it may look small, but in total, FLUX has 3.3B parameters allocated to it. Without glossing over the details too much, this layer's job is to let the model know which timestep it's at during the denoising process. This layer also receives information from pooled CLIP vectors.
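As a rough sanity check on the 3.3B figure, here is a back-of-the-envelope count of the per-block modulation weights. This is a sketch, not the model's own code: the hidden size (3072), the block counts (19 double-stream, 38 single-stream), and the projection widths (6x hidden for each of the two modulations in a double block, 3x hidden in a single block) are assumptions based on the publicly released FLUX.1 reference configuration, and the function names are hypothetical.

```python
def linear_params(d_in: int, d_out: int) -> int:
    """Parameter count of a dense layer with bias: weights plus biases."""
    return d_in * d_out + d_out


def modulation_param_count(hidden: int = 3072,
                           double_blocks: int = 19,
                           single_blocks: int = 38) -> int:
    """Count parameters in the per-block modulation projections.

    Assumed layout (not confirmed by this thread):
      - each double-stream block has two modulations (image and text),
        each projecting hidden -> 6 * hidden
      - each single-stream block has one modulation, hidden -> 3 * hidden
    """
    double = double_blocks * 2 * linear_params(hidden, 6 * hidden)
    single = single_blocks * linear_params(hidden, 3 * hidden)
    return double + single


if __name__ == "__main__":
    total = modulation_param_count()
    print(f"per-block modulation parameters: {total / 1e9:.2f}B")
```

Under these assumptions the per-block projections alone come to a little over 3.2B parameters. The count leaves out the final-layer modulation and the small timestep, guidance, and pooled-CLIP embedding MLPs, which would have to account for the rest of the quoted 3.3B.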

Is the code for doing this available anywhere?
