multimodalart HF staff committed on Feb 12

Commit

e704b78

•

1 Parent(s): 02c1544

Upload folder using huggingface_hub

Files changed (30)

LICENSE +23 -0
README.md +137 -0
feature_extractor/preprocessor_config.json +27 -0
figures/collage_2.jpg +0 -0
figures/collage_4.jpg +0 -0
figures/comparison.png +0 -0
figures/controlnet-canny.jpg +0 -0
figures/controlnet-face.jpg +0 -0
figures/controlnet-paint.jpg +0 -0
figures/fernando.jpg +0 -0
figures/fernando_original.jpg +0 -0
figures/image-to-image-example-rodent.jpg +0 -0
figures/image-variations-example-headset.jpg +0 -0
figures/model-overview.jpg +0 -0
figures/original.jpg +0 -0
figures/reconstructed.jpg +0 -0
figures/text-to-image-example-penguin.jpg +0 -0
image_encoder/config.json +23 -0
image_encoder/model.safetensors +3 -0
model_index.json +30 -0
prior/config.json +61 -0
prior/diffusion_pytorch_model.safetensors +3 -0
scheduler/scheduler_config.json +6 -0
text_encoder/config.json +25 -0
text_encoder/model.safetensors +3 -0
tokenizer/merges.txt +0 -0
tokenizer/special_tokens_map.json +30 -0
tokenizer/tokenizer.json +0 -0
tokenizer/tokenizer_config.json +30 -0
tokenizer/vocab.json +0 -0

LICENSE ADDED

	@@ -0,0 +1,23 @@

+Stable Cascade NON-COMMERCIAL COMMUNITY LICENSE AGREEMENT
+Dated: February 10, 2024
+“AUP” means the Stability AI Acceptable Use Policy available at https://stability.ai/use-policy, as may be updated from time to time.
+"Agreement" means the terms and conditions for use, reproduction, distribution and modification of the Software Products set forth herein.
+"Derivative Work(s)” means (a) any derivative work of the Software Products as recognized by U.S. copyright laws and (b) any modifications to a Model, and any other model created which is based on or derived from the Model or the Model’s output. For clarity, Derivative Works do not include the output of any Model.
+“Documentation” means any specifications, manuals, documentation, and other written information provided by Stability AI related to the Software.
+"Licensee" or "you" means you, or your employer or any other person or entity (if you are entering into this Agreement on such person or entity's behalf), of the age required under applicable laws, rules or regulations to provide legal consent and that has legal authority to bind your employer or such other person or entity if you are entering in this Agreement on their behalf.
+"Stability AI" or "we" means Stability AI Ltd.
+"Software" means, collectively, Stability AI’s proprietary models and algorithms, including machine-learning models, trained model weights and other elements of the foregoing, made available under this Agreement.
+“Software Products” means Software and Documentation.
+By using or distributing any portion or element of the Software Products, you agree to be bound by this Agreement.
+License Rights and Redistribution.
+Subject to your compliance with this Agreement, the AUP (which is hereby incorporated herein by reference), and the Documentation, Stability AI grants you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty free and limited license under Stability AI’s intellectual property or other rights owned by Stability AI embodied in the Software Products to reproduce, distribute, and create Derivative Works of the Software Products for purposes other than commercial or production use.
+b.	If you distribute or make the Software Products, or any Derivative Works thereof, available to a third party, the Software Products, Derivative Works, or any portion thereof, respectively, will remain subject to this Agreement and you must (i) provide a copy of this Agreement to such third party, and (ii) retain the following attribution notice within a "Notice" text file distributed as a part of such copies: "Stable Cascade is licensed under the Stable Cascade Research License, Copyright (c) Stability AI Ltd. All Rights Reserved.” If you create a Derivative Work of a Software Product, you may add your own attribution notices to the Notice file included with the Software Product, provided that you clearly indicate which attributions apply to the Software Product and you must state in the NOTICE file that you changed the Software Product and how it was modified.
+2. 	  Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE SOFTWARE PRODUCTS  AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING THE SOFTWARE PRODUCTS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE SOFTWARE PRODUCTS AND ANY OUTPUT AND RESULTS.
+3.   Limitation of Liability. IN NO EVENT WILL STABILITY AI OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF STABILITY AI OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.
+3.   Intellectual Property.
+a. 	No trademark licenses are granted under this Agreement, and in connection with the Software Products, neither Stability AI nor Licensee may use any name or mark owned by or associated with the other or any of its affiliates, except as required for reasonable and customary use in describing and redistributing the Software Products.
+Subject to Stability AI’s ownership of the Software Products and Derivative Works made by or for Stability AI, with respect to any Derivative Works that are made by you, as between you and Stability AI, you are and will be the owner of such Derivative Works.
+If you institute litigation or other proceedings against Stability AI (including a cross-claim or counterclaim in a lawsuit) alleging that the Software Products or associated outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Stability AI from and against any claim by any third party arising out of or related to your use or distribution of the Software Products in violation of this Agreement.
+4.   Term and Termination. The term of this Agreement will commence upon your acceptance of this Agreement or access to the Software Products and will continue in full force and effect until terminated in accordance with the terms and conditions herein. Stability AI may terminate this Agreement if you are in breach of any term or condition of this Agreement. Upon termination of this Agreement, you shall delete and cease use of the Software Products. Sections 2-4 shall survive the termination of this Agreement.

README.md ADDED

	@@ -0,0 +1,137 @@

+---
+pipeline_tag: text-to-image
+license: other
+license_name: stable-cascade-nc-community
+license_link: LICENSE
+---
+# Stable Cascade Prior
+<!-- Provide a quick summary of what the model is/does. -->
+<img src="figures/collage_1.jpg" width="800">
+This model is built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture. Its main
+difference from other models such as Stable Diffusion is that it works in a much smaller latent space. Why is this
+important? The smaller the latent space, the **faster** you can run inference and the **cheaper** the training becomes.
+How small is the latent space? Stable Diffusion uses a compression factor of 8, resulting in a 1024x1024 image being
+encoded to 128x128. Stable Cascade achieves a compression factor of 42, meaning that it is possible to encode a
+1024x1024 image to 24x24, while maintaining crisp reconstructions. The text-conditional model is then trained in the
+highly compressed latent space. Previous versions of this architecture achieved a 16x cost reduction over Stable
+Diffusion 1.5. <br> <br>
+Therefore, this kind of model is well suited for use cases where efficiency is important. Furthermore, all known extensions
+such as finetuning, LoRA, ControlNet, IP-Adapter, and LCM are possible with this method as well.
+## Model Details
+### Model Description
+Stable Cascade is a diffusion model trained to generate images given a text prompt.
+- **Developed by:** Stability AI
+- **Funded by:** Stability AI
+- **Model type:** Generative text-to-image model
+### Model Sources
+For research purposes, we recommend our `StableCascade` GitHub repository (https://github.com/Stability-AI/StableCascade).
+- **Repository:** https://github.com/Stability-AI/StableCascade
+- **Paper:** https://openreview.net/forum?id=gU58d5QeGv
+### Model Overview
+Stable Cascade consists of three models: Stage A, Stage B and Stage C, representing a cascade to generate images,
+hence the name "Stable Cascade".
+Stages A and B are used to compress images, similar to the role of the VAE in Stable Diffusion.
+However, with this setup, a much higher compression of images can be achieved. While the Stable Diffusion models use a
+spatial compression factor of 8, encoding an image with resolution of 1024 x 1024 to 128 x 128, Stable Cascade achieves
+a compression factor of 42. This encodes a 1024 x 1024 image to 24 x 24, while being able to accurately decode the
+image. This comes with the great benefit of cheaper training and inference. Furthermore, Stage C is responsible
+for generating the small 24 x 24 latents given a text prompt. The following picture shows this visually.
+<img src="figures/model-overview.jpg" width="600">
+For this release, we are providing two checkpoints for Stage C, two for Stage B and one for Stage A. Stage C comes with
+a 1 billion and 3.6 billion parameter version, but we highly recommend using the 3.6 billion version, as most work was
+put into its finetuning. The two versions for Stage B amount to 700 million and 1.5 billion parameters. Both achieve
+great results; however, the 1.5 billion version excels at reconstructing small and fine details. Therefore, you will achieve the
+best results if you use the larger variant of each. Lastly, Stage A contains 20 million parameters and is fixed due to
+its small size.
+## Evaluation
+<img height="300" src="figures/comparison.png"/>
+According to our evaluation, Stable Cascade performs best in both prompt alignment and aesthetic quality in almost all
+comparisons. The above picture shows the results from a human evaluation using a mix of parti-prompts (link) and
+aesthetic prompts. Specifically, Stable Cascade (30 inference steps) was compared against Playground v2 (50 inference
+steps), SDXL (50 inference steps), SDXL Turbo (1 inference step) and Würstchen v2 (30 inference steps).
+## Code Example
+```shell
+# Install `diffusers` from this branch while the PR is a work in progress
+pip install git+https://github.com/kashif/diffusers.git@wuerstchen-v3
+```
+```python
+import torch
+from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline
+device = "cuda"
+dtype = torch.bfloat16
+num_images_per_prompt = 2
+# Load the prior (Stage C) and the decoder pipelines.
+prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", torch_dtype=dtype).to(device)
+decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", torch_dtype=dtype).to(device)
+prompt = "Anthropomorphic cat dressed as a pilot"
+negative_prompt = ""
+# Generate the compressed image embeddings from the text prompt.
+prior_output = prior(
+    prompt=prompt,
+    height=1024,
+    width=1024,
+    negative_prompt=negative_prompt,
+    guidance_scale=4.0,
+    num_images_per_prompt=num_images_per_prompt,
+)
+# Decode the embeddings into PIL images.
+decoder_output = decoder(
+    image_embeddings=prior_output.image_embeddings,
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    guidance_scale=0.0,
+    output_type="pil",
+).images
+```
+## Uses
+### Direct Use
+The model is intended for research purposes for now. Possible research areas and tasks include:
+- Research on generative models.
+- Safe deployment of models which have the potential to generate harmful content.
+- Probing and understanding the limitations and biases of generative models.
+- Generation of artworks and use in design and other artistic processes.
+- Applications in educational or creative tools.
+Excluded uses are described below.
+### Out-of-Scope Use
+The model was not trained to produce factual or true representations of people or events,
+and therefore using the model to generate such content is out of scope for the abilities of this model.
+The model should not be used in any way that violates Stability AI's [Acceptable Use Policy](https://stability.ai/use-policy).
+## Limitations and Bias
+### Limitations
+- Faces and people in general may not be generated properly.
+- The autoencoding part of the model is lossy.
+### Recommendations
+The model is intended for research purposes only.
+## How to Get Started with the Model
+Check out https://github.com/Stability-AI/StableCascade
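
The compression figures quoted in the README can be checked with a little arithmetic. The sketch below is illustrative only: it takes the 42.67 value from `resolution_multiple` in `model_index.json` of this commit, and it assumes the latent side length is obtained by rounding up, which is an assumption rather than something the README states.

```python
import math

# Values quoted in the README; 42.67 is `resolution_multiple` in model_index.json.
IMAGE_SIDE = 1024
SD_FACTOR = 8
CASCADE_FACTOR = 42.67


def latent_side(image_side: int, factor: float) -> int:
    """Side length of the latent grid. Rounding up is an assumption of this sketch."""
    return math.ceil(image_side / factor)


sd_side = latent_side(IMAGE_SIDE, SD_FACTOR)
cascade_side = latent_side(IMAGE_SIDE, CASCADE_FACTOR)

# How many times fewer spatial positions the smaller latent grid has.
ratio = (sd_side ** 2) / (cascade_side ** 2)

print(sd_side, cascade_side, round(ratio, 2))
```

Under these assumptions the two grids come out at 128 x 128 and 24 x 24, so the Stable Cascade latent has roughly 28 times fewer spatial positions per image.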

feature_extractor/preprocessor_config.json ADDED

	@@ -0,0 +1,27 @@

+{
+  "crop_size": {
+    "height": 224,
+    "width": 224
+  },
+  "do_center_crop": true,
+  "do_convert_rgb": true,
+  "do_normalize": true,
+  "do_rescale": true,
+  "do_resize": true,
+  "image_mean": [
+    0.48145466,
+    0.4578275,
+    0.40821073
+  ],
+  "image_processor_type": "CLIPImageProcessor",
+  "image_std": [
+    0.26862954,
+    0.26130258,
+    0.27577711
+  ],
+  "resample": 3,
+  "rescale_factor": 0.00392156862745098,
+  "size": {
+    "shortest_edge": 224
+  }
+}
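
To make the numeric fields in this config concrete, here is a minimal sketch of the rescale and normalize steps applied to a single 8-bit RGB pixel. It uses only the constants from the config above; the resize and center-crop steps are omitted, and the helper name `normalize_pixel` is hypothetical, not part of any library.

```python
# Constants copied from preprocessor_config.json above.
IMAGE_MEAN = (0.48145466, 0.4578275, 0.40821073)
IMAGE_STD = (0.26862954, 0.26130258, 0.27577711)
RESCALE_FACTOR = 0.00392156862745098  # equals 1 / 255


def normalize_pixel(rgb):
    """Rescale one 8-bit RGB pixel to [0, 1], then subtract the mean and divide by the std."""
    return tuple(
        (channel * RESCALE_FACTOR - mean) / std
        for channel, mean, std in zip(rgb, IMAGE_MEAN, IMAGE_STD)
    )


white = normalize_pixel((255, 255, 255))
black = normalize_pixel((0, 0, 0))
print(white)
print(black)
```

A full preprocessor would first resize the shortest edge to 224 and center-crop to 224 x 224, as the `size` and `crop_size` fields indicate.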

figures/collage_2.jpg ADDED

figures/collage_4.jpg ADDED

figures/comparison.png ADDED

figures/controlnet-canny.jpg ADDED

figures/controlnet-face.jpg ADDED

figures/controlnet-paint.jpg ADDED

figures/fernando.jpg ADDED

figures/fernando_original.jpg ADDED

figures/image-to-image-example-rodent.jpg ADDED

figures/image-variations-example-headset.jpg ADDED

figures/model-overview.jpg ADDED

figures/original.jpg ADDED

figures/reconstructed.jpg ADDED

figures/text-to-image-example-penguin.jpg ADDED

image_encoder/config.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "_name_or_path": "StableCascade-prior/image_encoder",
+  "architectures": [
+    "CLIPVisionModelWithProjection"
+  ],
+  "attention_dropout": 0.0,
+  "dropout": 0.0,
+  "hidden_act": "quick_gelu",
+  "hidden_size": 1024,
+  "image_size": 224,
+  "initializer_factor": 1.0,
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "layer_norm_eps": 1e-05,
+  "model_type": "clip_vision_model",
+  "num_attention_heads": 16,
+  "num_channels": 3,
+  "num_hidden_layers": 24,
+  "patch_size": 14,
+  "projection_dim": 768,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.38.0.dev0"
+}
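
The shape-related fields in this config imply a few derived quantities. The sketch below computes them, assuming the usual vision-transformer layout in which one class token is prepended to the patch sequence; that layout is an assumption here, since the config does not spell it out.

```python
# Values copied from image_encoder/config.json above.
IMAGE_SIZE = 224
PATCH_SIZE = 14
HIDDEN_SIZE = 1024
NUM_HEADS = 16

patches_per_side = IMAGE_SIZE // PATCH_SIZE
num_patches = patches_per_side ** 2

# Assumption: a single class token is added in front of the patch tokens.
sequence_length = num_patches + 1

# Each attention head works on an equal slice of the hidden dimension.
head_dim = HIDDEN_SIZE // NUM_HEADS

print(patches_per_side, num_patches, sequence_length, head_dim)
```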

image_encoder/model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e4b33d864f89a793357a768cb07d0dc18d6a14e6664f4110a0d535ca9ba78da8
+size 607980488

model_index.json ADDED

	@@ -0,0 +1,30 @@

+{
+  "_class_name": "StableCascadePriorPipeline",
+  "_diffusers_version": "0.26.0.dev0",
+  "_name_or_path": "StableCascade-prior/",
+  "feature_extractor": [
+    "transformers",
+    "CLIPImageProcessor"
+  ],
+  "image_encoder": [
+    "transformers",
+    "CLIPVisionModelWithProjection"
+  ],
+  "prior": [
+    "stable_cascade",
+    "StableCascadeUnet"
+  ],
+  "resolution_multiple": 42.67,
+  "scheduler": [
+    "diffusers",
+    "DDPMWuerstchenScheduler"
+  ],
+  "text_encoder": [
+    "transformers",
+    "CLIPTextModelWithProjection"
+  ],
+  "tokenizer": [
+    "transformers",
+    "CLIPTokenizerFast"
+  ]
+}

prior/config.json ADDED

	@@ -0,0 +1,61 @@

+{
+  "_class_name": "StableCascadeUnet",
+  "_diffusers_version": "0.26.0.dev0",
+  "_name_or_path": "StableCascade-prior/prior",
+  "block_repeat": [
+    [
+      1,
+      1
+    ],
+    [
+      1,
+      1
+    ]
+  ],
+  "blocks": [
+    [
+      8,
+      24
+    ],
+    [
+      24,
+      8
+    ]
+  ],
+  "c_clip_img": 768,
+  "c_clip_seq": 4,
+  "c_clip_text": 1280,
+  "c_clip_text_pooled": 1280,
+  "c_cond": 2048,
+  "c_effnet": null,
+  "c_hidden": [
+    2048,
+    2048
+  ],
+  "c_in": 16,
+  "c_out": 16,
+  "c_pixels": null,
+  "c_r": 64,
+  "dropout": [
+    0.1,
+    0.1
+  ],
+  "kernel_size": 3,
+  "level_config": [
+    "CTA",
+    "CTA"
+  ],
+  "nhead": [
+    32,
+    32
+  ],
+  "patch_size": 1,
+  "self_attn": true,
+  "switch_level": [
+    false
+  ],
+  "t_conds": [
+    "sca",
+    "crp"
+  ]
+}

prior/diffusion_pytorch_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:44a4cd9540f327f2fb4ac09179e4e87912a01cdb1b3b86c79f0f853976fb4c98
+size 7178377816
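
The three lines above are a Git LFS pointer rather than the weights themselves. As a rough cross-check against the README's "3.6 billion parameter" figure, the sketch below parses the pointer and estimates a parameter count from the file size. It assumes the weights are stored at 2 bytes per parameter (bfloat16, as in the encoder configs of this commit) and ignores the safetensors header, so the result is only approximate. The helper name `parse_lfs_pointer` is hypothetical.

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:44a4cd9540f327f2fb4ac09179e4e87912a01cdb1b3b86c79f0f853976fb4c98
size 7178377816
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of an LFS pointer into a dict; size becomes an int."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    fields["size"] = int(fields["size"])
    return fields


BYTES_PER_PARAM = 2  # assumption: weights stored as bfloat16

pointer = parse_lfs_pointer(POINTER_TEXT)
approx_params = pointer["size"] / BYTES_PER_PARAM

print(f"{approx_params / 1e9:.2f}B parameters (approximate)")
```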

scheduler/scheduler_config.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "_class_name": "DDPMWuerstchenScheduler",
+  "_diffusers_version": "0.26.0.dev0",
+  "s": 0.008,
+  "scaler": 1.0
+}
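
The value `s = 0.008` matches the small offset commonly used in cosine noise schedules. The sketch below shows such a schedule under the assumption that `s` plays that role here; it is not taken from the scheduler's source, and it leaves out the `scaler` field (set to 1.0 in this config) as well as any clamping the real scheduler may apply.

```python
import math

S = 0.008  # `s` from scheduler_config.json; assumed to be the cosine-schedule offset


def alpha_cumprod(t: float, s: float = S) -> float:
    """Cosine cumulative signal level for t in [0, 1], normalized so that t = 0 gives 1."""
    f_t = math.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    f_0 = math.cos(s / (1 + s) * math.pi / 2) ** 2
    return f_t / f_0


# Signal level at eleven evenly spaced points from t = 0 (clean) to t = 1 (noise).
levels = [alpha_cumprod(i / 10) for i in range(11)]
for i, level in enumerate(levels):
    print(f"t={i / 10:.1f}  alpha_cumprod={level:.4f}")
```

In this form the offset keeps the schedule from changing too abruptly near `t = 0`, which is the usual motivation for a small positive `s`.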

text_encoder/config.json ADDED

	@@ -0,0 +1,25 @@

+{
+  "_name_or_path": "StableCascade-prior/text_encoder",
+  "architectures": [
+    "CLIPTextModelWithProjection"
+  ],
+  "attention_dropout": 0.0,
+  "bos_token_id": 49406,
+  "dropout": 0.0,
+  "eos_token_id": 49407,
+  "hidden_act": "gelu",
+  "hidden_size": 1280,
+  "initializer_factor": 1.0,
+  "initializer_range": 0.02,
+  "intermediate_size": 5120,
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 77,
+  "model_type": "clip_text_model",
+  "num_attention_heads": 20,
+  "num_hidden_layers": 32,
+  "pad_token_id": 1,
+  "projection_dim": 1280,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.38.0.dev0",
+  "vocab_size": 49408
+}

text_encoder/model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:260e0127aca3c89db813637ae659ebb822cb07af71fedc16cbd980e9518dfdcd
+size 1389382688

tokenizer/merges.txt ADDED

The diff for this file is too large to render. See raw diff

tokenizer/special_tokens_map.json ADDED

	@@ -0,0 +1,30 @@

+{
+  "bos_token": {
+    "content": "<|startoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer/tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer/tokenizer_config.json ADDED

	@@ -0,0 +1,30 @@

+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "49406": {
+      "content": "<|startoftext|>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "49407": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<|startoftext|>",
+  "clean_up_tokenization_spaces": true,
+  "do_lower_case": true,
+  "eos_token": "<|endoftext|>",
+  "errors": "replace",
+  "model_max_length": 77,
+  "pad_token": "<|endoftext|>",
+  "tokenizer_class": "CLIPTokenizer",
+  "unk_token": "<|endoftext|>"
+}
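
To illustrate how the special tokens and `model_max_length` in this config fit together, here is a minimal sketch that frames a list of token ids with the start and end tokens and pads it to 77 entries. The real tokenizer also performs lowercasing and byte-pair encoding, which are not shown; the helper `frame_and_pad` and the example ids are hypothetical.

```python
# Ids and length copied from tokenizer_config.json above.
BOS_ID = 49406      # "<|startoftext|>"
EOS_ID = 49407      # "<|endoftext|>"
PAD_ID = 49407      # pad_token is also "<|endoftext|>" in this config
MAX_LENGTH = 77     # model_max_length


def frame_and_pad(token_ids):
    """Wrap ids in BOS/EOS, truncating the body if needed, then pad to MAX_LENGTH."""
    body = list(token_ids)[: MAX_LENGTH - 2]
    framed = [BOS_ID] + body + [EOS_ID]
    return framed + [PAD_ID] * (MAX_LENGTH - len(framed))


# Hypothetical ids standing in for an already-encoded prompt.
short = frame_and_pad([320, 2368])
long = frame_and_pad(range(1000, 1200))

print(len(short), short[:5])
print(len(long), long[-1])
```

Because the pad token and the end token share the same id here, padding positions are indistinguishable from the end token by id alone.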

tokenizer/vocab.json ADDED

The diff for this file is too large to render. See raw diff