Input tokens limited

#43
by dunzic - opened

The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens:

This happens when your prompt is too long for CLIP. Flux has a second text encoder, T5, that can handle up to 512 tokens (only 256 on Schnell). You have to pass this explicitly during the generation call with max_sequence_length=512.
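For reference, here is a minimal sketch of a full load-and-generate call, assuming the standard diffusers FluxPipeline (the model ID, dtype, and step/guidance values are illustrative, not prescriptive):

    import torch
    from diffusers import FluxPipeline

    # Loading the Dev checkpoint pulls in both text encoders:
    # CLIP (hard 77-token cap) and T5 (up to 512 tokens on Dev).
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        torch_dtype=torch.bfloat16,
    ).to("cuda")

    image = pipe(
        prompt="a very long, detailed prompt ...",
        num_inference_steps=28,
        guidance_scale=3.5,
        max_sequence_length=512,  # T5 limit on Dev; use 256 for Schnell
    ).images[0]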

Hi, thanks. I'm trying to force my FLUX Dev Colab (which uses CLIP by default) to use T5. I added max_sequence_length to my pipe call, but the Colab keeps reporting CLIP truncation ("The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens"), even with:

    image = pipe(
        prompt=processed_caption,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
        width=width1 if i == 0 else width2,
        height=height1 if i == 0 else height2,
        generator=generator,
        max_sequence_length=512
    ).images[0]

@QES Both the CLIP and T5 embeddings are passed to the model; T5 just supports a longer sequence length. Without seeing your pipeline load statement, I can't say for sure that T5 is being loaded, but it likely is. The warning will still show up, but generation will work fine and will include the additional T5 tokens past 77. Don't try to disable CLIP; it likely won't work well.
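One quick way to check what got loaded, a sketch assuming the usual diffusers FluxPipeline component names (text_encoder for CLIP, text_encoder_2 for T5):

    # If T5 was loaded, text_encoder_2 is a T5EncoderModel; the
    # pipeline keeps CLIP as text_encoder alongside it regardless.
    print(type(pipe.text_encoder).__name__)    # CLIPTextModel
    print(type(pipe.text_encoder_2).__name__)  # T5EncoderModel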

Thanks. I ran some tests and confirmed it: the 77-token message keeps showing, BUT the whole prompt is processed (I put precise details at the end of a long-ass prompt ;-)
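For anyone who wants to verify this without eyeballing the output, a minimal sketch, assuming the pipeline exposes the usual tokenizer attributes (tokenizer for CLIP, tokenizer_2 for T5):

    # Raw token counts per encoder for the same prompt. The pipeline
    # truncates the CLIP input to 77 tokens during encoding, while the
    # T5 input is kept up to max_sequence_length.
    clip_ids = pipe.tokenizer(processed_caption).input_ids
    t5_ids = pipe.tokenizer_2(processed_caption).input_ids
    print(f"CLIP token count: {len(clip_ids)} (encoded with a 77-token cap)")
    print(f"T5 token count:   {len(t5_ids)} (kept up to max_sequence_length)")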
