yuanze1024 committed on
Commit
6d4bcdf
·
1 Parent(s): 1d5bb62

init space 2

.gitignore ADDED
@@ -0,0 +1 @@
 
 
1
+ __pycache__/
README.md ADDED
@@ -0,0 +1,23 @@
1
+ ---
2
+ title: SeqTex
3
+ emoji: 🗺️
4
+ colorFrom: yellow
5
+ colorTo: green
6
+ sdk: gradio
7
+ sdk_version: 5.34.2
8
+ python_version: 3.12
9
+ models:
10
+ - Wan-AI/Wan2.1-T2V-1.3B-Diffusers
11
+ - VAST-AI/SeqTex-Transformer
12
+ - black-forest-labs/FLUX.1-dev
13
+ - Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0
14
+ - madebyollin/sdxl-vae-fp16-fix
15
+ - stabilityai/stable-diffusion-xl-base-1.0
16
+ - xinsir/controlnet-union-sdxl-1.0
17
+ app_file: app.py
18
+ pinned: false
19
+ license: mit
20
+ short_description: SeqTex generates mesh textures from text prompts
21
+ ---
22
+
23
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
THIRD_PARTY_LICENSES.md ADDED
@@ -0,0 +1,127 @@
1
+
2
+ # Third Party Licenses
3
+
4
+ This project uses third-party libraries that are subject to their own licenses.
5
+
6
+ ## nvdiffrast
7
+
8
+ **Project:** https://github.com/NVlabs/nvdiffrast
9
+ **License:** NVIDIA Source Code License (1-Way Commercial)
10
+ **Usage:** Non-commercial use (research or evaluation purposes only)
11
+
12
+ ```text
13
+ Copyright (c) 2020, NVIDIA Corporation. All rights reserved.
14
+
15
+ This work is made available under the Nvidia Source Code License (1-Way Commercial).
16
+ The Work and any derivative works thereof only may be used or intended for use
17
+ non-commercially. "Non-commercially" means for research or evaluation purposes only
18
+ and not for any direct or indirect monetary gain.
19
+
20
+ Full license: https://github.com/NVlabs/nvdiffrast/blob/main/LICENSE.txt
21
+ ```
22
+
23
+ **Key Points:**
24
+ - ✅ Research/Academic Use: Permitted
25
+ - ❌ Commercial Use: Requires separate licensing from NVIDIA
26
+ - 📞 Commercial Licensing: https://www.nvidia.com/en-us/research/inquiries/
27
+
28
+ ## Wan Team Libraries
29
+
30
+ **Project:** Various components in `wan/` directory
31
+ **License:** Apache License 2.0
32
+ **Copyright:** Copyright (c) 2024 Wan Team
33
+
34
+ ```text
35
+ Licensed under the Apache License, Version 2.0 (the "License");
36
+ you may not use this file except in compliance with the License.
37
+ You may obtain a copy of the License at
38
+
39
+ http://www.apache.org/licenses/LICENSE-2.0
40
+
41
+ Unless required by applicable law or agreed to in writing, software
42
+ distributed under the License is distributed on an "AS IS" BASIS,
43
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
44
+ See the License for the specific language governing permissions and
45
+ limitations under the License.
46
+ ```
47
+
48
+ ## Hugging Face Diffusers
49
+
50
+ **Project:** https://github.com/huggingface/diffusers
51
+ **License:** Apache License 2.0
52
+ **Copyright:** Copyright 2024 The HuggingFace Team
53
+
54
+ ```text
55
+ Licensed under the Apache License, Version 2.0 (the "License");
56
+ you may not use this file except in compliance with the License.
57
+ You may obtain a copy of the License at
58
+
59
+ http://www.apache.org/licenses/LICENSE-2.0
60
+
61
+ Unless required by applicable law or agreed to in writing, software
62
+ distributed under the License is distributed on an "AS IS" BASIS,
63
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
64
+ See the License for the specific language governing permissions and
65
+ limitations under the License.
66
+ ```
67
+
68
+ ## Hugging Face Transformers
69
+
70
+ **Project:** https://github.com/huggingface/transformers
71
+ **License:** Apache License 2.0
72
+ **Copyright:** Copyright 2024 The HuggingFace Team
73
+
74
+ ```text
75
+ Licensed under the Apache License, Version 2.0 (the "License");
76
+ you may not use this file except in compliance with the License.
77
+ You may obtain a copy of the License at
78
+
79
+ http://www.apache.org/licenses/LICENSE-2.0
80
+
81
+ Unless required by applicable law or agreed to in writing, software
82
+ distributed under the License is distributed on an "AS IS" BASIS,
83
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
84
+ See the License for the specific language governing permissions and
85
+ limitations under the License.
86
+ ```
87
+
88
+ ## PEFT (Parameter-Efficient Fine-Tuning)
89
+
90
+ **Project:** https://github.com/huggingface/peft
91
+ **License:** Apache License 2.0
92
+ **Copyright:** Copyright 2024 The HuggingFace Team
93
+
94
+ ```text
95
+ Licensed under the Apache License, Version 2.0 (the "License");
96
+ you may not use this file except in compliance with the License.
97
+ You may obtain a copy of the License at
98
+
99
+ http://www.apache.org/licenses/LICENSE-2.0
100
+
101
+ Unless required by applicable law or agreed to in writing, software
102
+ distributed under the License is distributed on an "AS IS" BASIS,
103
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
104
+ See the License for the specific language governing permissions and
105
+ limitations under the License.
106
+ ```
107
+
108
+ ## Other Dependencies
109
+
110
+ The following dependencies are used under their respective licenses:
111
+
112
+ - **PyTorch & TorchVision**: BSD-3-Clause License
113
+ - **Einops**: MIT License
114
+ - **OmegaConf**: BSD-3-Clause License
115
+ - **Trimesh**: MIT License
116
+ - **Gradio**: Apache License 2.0
117
+ - **OpenCV**: BSD-3-Clause License
118
+ - **NumPy**: BSD-3-Clause License
119
+ - **ImageIO**: BSD-2-Clause License
120
+
121
+ ## Individual Contributors
122
+
123
+ **Qi Xin** - Condition Transformer implementation in `utils/controlnet_union.py`
124
+ - Copyright by Qi Xin (2024/07/06)
125
+ - Condition Transformer component for fusing single or multiple conditions with the input image
126
+
127
+ For the complete list of dependencies and their licenses, please refer to the respective package repositories.
app.py CHANGED
@@ -1,231 +1,192 @@
1
- import numpy as np
2
- import torch
3
- from einops import rearrange
4
- from PIL import Image
5
- from utils.image_generation import generate_image_condition
6
  from utils.mesh_utils import Mesh
7
  from utils.render_utils import render_views
8
- from utils.texture_generation import generate_texture
9
-
10
- import gradio as gr
11
- from gradio_litmodel3d import LitModel3D
12
 
13
  EXAMPLES = [
14
  ["examples/birdhouse.glb", True, False, False, False, 42, "First View", "SDXL", False, "A rustic birdhouse featuring a snow-covered roof, wood textures, and two decorative cardinal birds. It has a circular entryway and conveys a winter-themed aesthetic."],
15
- ["examples/mario.glb", False, False, False, True, 6666, "Third View", "FLUX", True, "Mario, a cartoon character wearing a red cap and blue overalls, with brown hair and a mustache, and white gloves, in a fighting pose. The clothes he wears are not in a reflection mode."],
 
16
  ]
 
 
17
 
18
- def tensor_to_pil(tensor, mask=None, normalize: bool = True):
19
- """
20
- Convert tensor to PIL Image.
21
- :param tensor: torch.Tensor, shape can be (Nv, H, W, C), (Nv, C, H, W), (H, W, C), (C, H, W)
22
- :param mask: torch.Tensor, shape same as tensor, effective when C=3
23
- :return: PIL.Image
24
- """
25
- # Move to cpu
26
- tensor = tensor.detach()
27
- if tensor.is_cuda:
28
- tensor = tensor.cpu()
29
- if mask is not None and mask.is_cuda:
30
- mask = mask.cpu()
31
-
32
- # Convert to float32
33
- tensor = tensor.float()
34
- if mask is not None:
35
- mask = mask.float()
36
-
37
- if normalize:
38
- tensor = (tensor + 1.0) / 2.0
39
- tensor = torch.clamp(tensor, 0.0, 1.0)
40
- if mask is not None:
41
- if mask.shape[-1] not in [1, 3]:
42
- mask = mask.unsqueeze(-1)
43
- tensor = torch.cat([tensor, mask], dim=-1)
44
-
45
- shape = tensor.shape
46
- # 4D: (Nv, H, W, C) or (Nv, C, H, W)
47
- if len(shape) == 4:
48
- Nv = shape[0]
49
- if shape[-1] in [3, 4]: # (Nv, H, W, C)
50
- tensor = rearrange(tensor, 'nv h w c -> h (nv w) c')
51
- else: # (Nv, C, H, W)
52
- tensor = rearrange(tensor, 'nv c h w -> h (nv w) c')
53
- # 3D: (H, W, C) or (C, H, W)
54
- elif len(shape) == 3:
55
- if shape[-1] in [3, 4]: # (H, W, C)
56
- tensor = rearrange(tensor, 'h w c -> h w c')
57
- else: # (C, H, W)
58
- tensor = rearrange(tensor, 'c h w -> h w c')
59
- else:
60
- raise ValueError(f"Unsupported tensor shape: {shape}")
61
-
62
- # Convert to numpy
63
- np_img = (tensor.numpy() * 255).round().astype(np.uint8)
64
-
65
- # Create PIL Image
66
- if np_img.shape[2] == 3:
67
- return Image.fromarray(np_img, mode="RGB")
68
- elif np_img.shape[2] == 4:
69
- return Image.fromarray(np_img, mode="RGBA")
70
- else:
71
- raise ValueError("Only support 3 or 4 channel images.")
72
 
73
- if __name__ == '__main__':
74
- with gr.Blocks() as demo:
75
- gr.Markdown("# 🎨 SeqTex: Generate Mesh Textures in Video Sequence")
76
-
77
- gr.Markdown("""
78
- ## 🚀 Welcome to SeqTex!
79
- **SeqTex** is a cutting-edge AI system that generates high-quality textures for 3D meshes using image prompts (here we use image generator to get them from textual prompts).
80
-
81
- Choose to either **try our example models** below or **upload your own 3D mesh** to create stunning textures.
82
- """)
83
-
84
- gr.Markdown("---")
85
-
86
- gr.Markdown("## 🔧 Step 1: Upload & Process 3D Mesh")
87
- gr.Markdown("""
88
- **📋 How to prepare your 3D mesh:**
89
- - Upload your 3D mesh in **.obj** or **.glb** format
90
- - **💡 Pro Tip**:
91
- - For optimal results, ensure your mesh includes only one part with <span style="color:#e74c3c; font-weight:bold;">UV parameterization</span>
92
- - Otherwise, we'll combine all parts and generate UV parameterization using *xAtlas* (may take longer for high-poly meshes; may also fail for certain meshes)
93
- - **⚠️ Important**: We recommend adjusting your model using *Mesh Orientation Adjustments* to be **Z-UP oriented** for best results
94
- """)
95
- position_map_tensor, normal_map_tensor, position_images_tensor, normal_images_tensor, mask_images_tensor, w2cs, mesh, mvp_matrix = gr.State(), gr.State(), gr.State(), gr.State(), gr.State(), gr.State(), gr.State(), gr.State()
96
-
97
- # fixed_texture_map = Image.open("image.webp").convert("RGB")
98
- # Step 1
99
- with gr.Row():
100
- with gr.Column():
101
- mesh_upload = gr.File(label="📁 Upload 3D Mesh", file_types=[".obj", ".glb"])
102
- # uv_tool = gr.Radio(["xAtlas", "UVAtlas"], label="UV parameterizer", value="xAtlas")
103
-
104
- gr.Markdown("**🔄 Mesh Orientation Adjustments** (if needed):")
105
- y2z = gr.Checkbox(label="Y → Z Transform", value=False, info="Rotate: Y becomes Z, -Z becomes Y")
106
- y2x = gr.Checkbox(label="Y → X Transform", value=False, info="Rotate: Y becomes X, -X becomes Y")
107
- z2x = gr.Checkbox(label="Z → X Transform", value=False, info="Rotate: Z becomes X, -X becomes Z")
108
- upside_down = gr.Checkbox(label="🔃 Flip Vertically", value=False, info="Fix upside-down mesh orientation")
109
 
110
- with gr.Column():
111
- step1_button = gr.Button("🔄 Process Mesh & Generate Views", variant="primary")
112
- step1_progress = gr.Textbox(label="📊 Processing Status", interactive=False)
113
- model_input = gr.Model3D(label="📐 Processed 3D Model", height=500)
114
-
115
- with gr.Row(equal_height=True):
116
- rgb_views = gr.Image(label="📷 Generated Views (Front, Back, Left, Right)", type="pil", scale=3)
117
- position_map = gr.Image(label="🗺️ Position Map", type="pil", scale=1)
118
- normal_map = gr.Image(label="🧭 Normal Map", type="pil", scale=1)
119
-
120
- step1_button.click(
121
- Mesh.process,
122
- inputs=[mesh_upload, gr.State("xAtlas"), y2z, y2x, z2x, upside_down],
123
- outputs=[position_map_tensor, normal_map_tensor, position_images_tensor, normal_images_tensor, mask_images_tensor, w2cs, mesh, mvp_matrix, step1_progress]
124
- ).then(
125
- tensor_to_pil,
126
- inputs=[normal_images_tensor, mask_images_tensor],
127
- outputs=[rgb_views]
128
- ).then(
129
- tensor_to_pil,
130
- inputs=[position_map_tensor],
131
- outputs=[position_map]
132
- ).then(
133
- tensor_to_pil,
134
- inputs=[normal_map_tensor],
135
- outputs=[normal_map]
136
- ).then(
137
- Mesh.export,
138
- inputs=[mesh],
139
- outputs=[model_input]
140
- )
141
-
142
- # Step 2
143
- gr.Markdown("---")
144
- gr.Markdown("## 👁️ Step 2: Select View & Generate Image Condition")
145
- gr.Markdown("""
146
- **📋 How to generate image condition:**
147
- - Your mesh will be rendered from **four viewpoints** (front, back, left, right)
148
- - Choose **one view** as your image condition
149
- - Enter a **descriptive text prompt** for the desired texture
150
- - Select your preferred AI model:
151
- - <span style="color:#27ae60; font-weight:bold;">🎯 SDXL</span>: Fast generation with depth + normal control, better details
152
- - <span style="color:#3498db; font-weight:bold;">⚡ FLUX</span>: High-quality generation with depth control (slower due to CPU offloading). Better work with **Edge Refinement**
153
- """)
154
- with gr.Row():
155
- with gr.Column():
156
- img_condition_seed = gr.Number(label="🎲 Random Seed", minimum=0, maximum=9999, step=1, value=42, info="Change for different results")
157
- selected_view = gr.Radio(["First View", "Second View", "Third View", "Fourth View"], label="📐 Camera View", value="First View", info="Choose which viewpoint to use as reference")
158
- with gr.Row():
159
- model_choice = gr.Radio(["SDXL", "FLUX"], label="🤖 AI Model", value="SDXL", info="SDXL: Fast, depth+normal control | FLUX: High-quality, slower processing")
160
- edge_refinement = gr.Checkbox(label="✨ Edge Refinement", value=True, info="Smooth boundary artifacts (recommended for cleaner results)")
161
- text_prompt = gr.Textbox(label="💬 Texture Description", placeholder="Describe the desired texture appearance (e.g., 'rustic wooden surface with weathered paint')", lines=2)
162
- step2_button = gr.Button("🎯 Generate Image Condition", variant="primary")
163
- step2_progress = gr.Textbox(label="📊 Generation Status", interactive=False)
164
-
165
- with gr.Column():
166
- condition_image = gr.Image(label="🖼️ Generated Image Condition", type="pil") # , interactive=False
167
-
168
- step2_button.click(
169
- generate_image_condition,
170
- inputs=[position_images_tensor, normal_images_tensor, mask_images_tensor, w2cs, text_prompt, selected_view, img_condition_seed, model_choice, edge_refinement],
171
- outputs=[condition_image, step2_progress],
172
- concurrency_id="gpu_intensive"
173
- )
174
-
175
- # Step 3
176
- gr.Markdown("---")
177
- gr.Markdown("## 🎨 Step 3: Generate Final Texture")
178
- gr.Markdown("""
179
- **📋 How to generate final texture:**
180
- - The **SeqTex pipeline** will create a complete texture map for your model
181
- - View the results from multiple angles and download your textured 3D model (the viewport is a little bit dark)
182
- """)
183
- texture_map_tensor, mv_out_tensor = gr.State(), gr.State()
184
- with gr.Row():
185
- with gr.Column(scale=1):
186
- step3_button = gr.Button("🎨 Generate Final Texture", variant="primary")
187
- step3_progress = gr.Textbox(label="📊 Texture Generation Status", interactive=False)
188
- texture_map = gr.Image(label="🏆 Generated Texture Map", interactive=False)
189
- with gr.Column(scale=2):
190
- rendered_imgs = gr.Image(label="🖼️ Final Rendered Views")
191
- mv_branch_imgs = gr.Image(label="🖼️ SeqTex Direct Output")
192
- with gr.Column(scale=1.5):
193
- # model_display = gr.Model3D(label="🏆 Final Textured Model", height=500)
194
- model_display = LitModel3D(label="Model with Texture",
195
- exposure=30.0,
196
- height=500)
197
-
198
- step3_button.click(
199
- generate_texture,
200
- inputs=[position_map_tensor, normal_map_tensor, position_images_tensor, normal_images_tensor, condition_image, text_prompt, selected_view],
201
- outputs=[texture_map_tensor, mv_out_tensor, step3_progress],
202
- concurrency_id="gpu_intensive"
203
- ).then(
204
- tensor_to_pil,
205
- inputs=[texture_map_tensor, gr.State(None), gr.State(False)],
206
- outputs=[texture_map]
207
- ).then(
208
- tensor_to_pil,
209
- inputs=[mv_out_tensor, gr.State(None), gr.State(False)],
210
- outputs=[mv_branch_imgs]
211
- ).then(
212
- render_views,
213
- inputs=[mesh, texture_map_tensor, mvp_matrix],
214
- outputs=[rendered_imgs]
215
- ).then(
216
- Mesh.export,
217
- inputs=[mesh, gr.State(None), texture_map],
218
- outputs=[model_display]
219
- )
220
-
221
- # Add example inputs for user convenience
222
- gr.Markdown("---")
223
- gr.Markdown("## 🚀 Try Our Examples")
224
- gr.Markdown("**Quick Start**: Click on any example below to see SeqTex in action with pre-configured settings!")
225
- gr.Examples(
226
- examples=EXAMPLES,
227
- inputs=[mesh_upload, y2z, y2x, z2x, upside_down, img_condition_seed, selected_view, model_choice, edge_refinement, text_prompt],
228
- cache_examples=False
229
- )
230
-
231
- demo.launch(server_name="0.0.0.0", server_port=52424)
 
1
+ import os
2
+ import gradio as gr
3
+
4
+ from utils import tensor_to_pil
5
+ from utils.image_generation import generate_image_condition, get_flux_pipe, get_sdxl_pipe
6
  from utils.mesh_utils import Mesh
7
  from utils.render_utils import render_views
8
+ from utils.texture_generation import generate_texture, get_seqtex_pipe
 
 
 
9
 
10
  EXAMPLES = [
11
  ["examples/birdhouse.glb", True, False, False, False, 42, "First View", "SDXL", False, "A rustic birdhouse featuring a snow-covered roof, wood textures, and two decorative cardinal birds. It has a circular entryway and conveys a winter-themed aesthetic."],
12
+ ["examples/shoe.glb", True, False, False, False, 42, "Second View", "SDXL", False, "Modern sneaker exhibiting a mesh upper and wavy rubber outsole. Features include lacing for adjustability and padded components for comfort. Normal maps emphasize geometric detail."],
13
+ # ["examples/mario.glb", False, False, False, True, 6666, "Third View", "FLUX", True, "Mario, a cartoon character wearing a red cap and blue overalls, with brown hair and a mustache, and white gloves, in a fighting pose. The clothes he wears are not in a reflection mode."],
14
  ]
15
+ LOAD_FIRST = True
16
+
17
 
18
+ with gr.Blocks(delete_cache=(600, 600)) as demo:
19
+ gr.Markdown("# 🎨 SeqTex: Generate Mesh Textures in Video Sequence")
20
 
21
+ gr.Markdown("""
22
+ ## 🚀 Welcome to SeqTex!
23
+ **SeqTex** is a cutting-edge AI system that generates high-quality textures for 3D meshes from image prompts (here, an image generator produces those prompts from your text description).
24
 
25
+ Choose to either **try our example models** below or **upload your own 3D mesh** to create stunning textures.
26
+ """)
27
+
28
+ gr.Markdown("---")
29
+
30
+ gr.Markdown("## 🔧 Step 1: Upload & Process 3D Mesh")
31
+ gr.Markdown("""
32
+ **📋 How to prepare your 3D mesh:**
33
+ - Upload your 3D mesh in **.obj** or **.glb** format
34
+ - **💡 Pro Tip**:
35
+ - For optimal results, ensure your mesh includes only one part with <span style="color:#e74c3c; font-weight:bold;">UV parameterization</span>
36
+ - Otherwise, we'll combine all parts and generate UV parameterization using *xAtlas* (may take longer for high-poly meshes; may also fail for certain meshes)
37
+ - **⚠️ Important**: We recommend adjusting your model using *Mesh Orientation Adjustments* to be **Z-UP oriented** for best results
38
+ """)
39
+ position_map_tensor_path = gr.State()
40
+ normal_map_tensor_path = gr.State()
41
+ position_images_tensor_path = gr.State()
42
+ normal_images_tensor_path = gr.State()
43
+ mask_images_tensor_path = gr.State()
44
+ w2c_tensor_path = gr.State()
45
+ mesh = gr.State()
46
+ mvp_matrix_tensor_path = gr.State()
47
+
48
+ # fixed_texture_map = Image.open("image.webp").convert("RGB")
49
+ # Step 1
50
+ with gr.Row():
51
+ with gr.Column():
52
+ mesh_upload = gr.File(label="📁 Upload 3D Mesh", file_types=[".obj", ".glb"])
53
+ # uv_tool = gr.Radio(["xAtlas", "UVAtlas"], label="UV parameterizer", value="xAtlas")
54
+
55
+ gr.Markdown("**🔄 Mesh Orientation Adjustments** (if needed):")
56
+ y2z = gr.Checkbox(label="Y → Z Transform", value=False, info="Rotate: Y becomes Z, -Z becomes Y")
57
+ y2x = gr.Checkbox(label="Y → X Transform", value=False, info="Rotate: Y becomes X, -X becomes Y")
58
+ z2x = gr.Checkbox(label="Z → X Transform", value=False, info="Rotate: Z becomes X, -X becomes Z")
59
+ upside_down = gr.Checkbox(label="🔃 Flip Vertically", value=False, info="Fix upside-down mesh orientation")
60
+ step1_button = gr.Button("🔄 Process Mesh & Generate Views", variant="primary")
61
+ step1_progress = gr.Textbox(label="📊 Processing Status", interactive=False)
62
+
63
+ with gr.Column():
64
+ model_input = gr.Model3D(label="📐 Processed 3D Model", height=500)
65
+
66
+ with gr.Row(equal_height=True):
67
+ rgb_views = gr.Image(label="📷 Generated Views", type="pil", scale=3)
68
+ position_map = gr.Image(label="🗺️ Position Map", type="pil", scale=1)
69
+ normal_map = gr.Image(label="🧭 Normal Map", type="pil", scale=1)
70
+
71
+ step1_button.click(
72
+ Mesh.process,
73
+ inputs=[mesh_upload, gr.State("xAtlas"), y2z, y2x, z2x, upside_down],
74
+ outputs=[position_map_tensor_path, normal_map_tensor_path, position_images_tensor_path, normal_images_tensor_path, mask_images_tensor_path, w2c_tensor_path, mesh, mvp_matrix_tensor_path, step1_progress]
75
+ ).success(
76
+ tensor_to_pil,
77
+ inputs=[normal_images_tensor_path, mask_images_tensor_path],
78
+ outputs=[rgb_views]
79
+ ).success(
80
+ tensor_to_pil,
81
+ inputs=[position_map_tensor_path],
82
+ outputs=[position_map]
83
+ ).success(
84
+ tensor_to_pil,
85
+ inputs=[normal_map_tensor_path],
86
+ outputs=[normal_map]
87
+ ).success(
88
+ Mesh.export,
89
+ inputs=[mesh, gr.State(None), gr.State(None)],
90
+ outputs=[model_input]
91
+ )
92
+
93
+ # Step 2
94
+ gr.Markdown("---")
95
+ gr.Markdown("## 👁️ Step 2: Select View & Generate Image Condition")
96
+ gr.Markdown("""
97
+ **📋 How to generate image condition:**
98
+ - Your mesh will be rendered from **four viewpoints** (front, back, left, right)
99
+ - Choose **one view** as your image condition
100
+ - Enter a **descriptive text prompt** for the desired texture
101
+ - Select your preferred AI model:
102
+ - <span style="color:#27ae60; font-weight:bold;">🎯 SDXL</span>: Fast generation with depth + normal control and better details (may suffer from incorrect highlights)
103
+ - <span style="color:#3498db; font-weight:bold;">⚡ FLUX</span>: ~~High-quality generation with depth control (slower due to CPU offloading). Works better with **Edge Refinement**~~ (Not supported due to the memory limit of the HF Space; you can try it locally)
104
+ """)
105
+ with gr.Row():
106
+ with gr.Column():
107
+ img_condition_seed = gr.Number(label="🎲 Random Seed", minimum=0, maximum=9999, step=1, value=42, info="Change for different results")
108
+ selected_view = gr.Radio(["First View", "Second View", "Third View", "Fourth View"], label="📐 Camera View", value="First View", info="Choose which viewpoint to use as reference")
109
+ with gr.Row():
110
+ # model_choice = gr.Radio(["SDXL", "FLUX"], label="🤖 AI Model", value="SDXL", info="SDXL: Fast, depth+normal control | FLUX: High-quality, slower processing")
111
+ model_choice = gr.Radio(["SDXL"], label="🤖 AI Model", value="SDXL", info="SDXL: Fast, depth+normal control | FLUX: High-quality, slower processing (not supported due to the memory limit of the HF Space)")
112
+ edge_refinement = gr.Checkbox(label="✨ Edge Refinement", value=True, info="Smooth boundary artifacts (recommended for removing lighting highlights at the boundary)")
113
+ text_prompt = gr.Textbox(label="💬 Texture Description", placeholder="Describe the desired texture appearance (e.g., 'rustic wooden surface with weathered paint')", lines=2)
114
+ step2_button = gr.Button("🎯 Generate Image Condition", variant="primary")
115
+ step2_progress = gr.Textbox(label="📊 Generation Status", interactive=False)
116
+
117
+ with gr.Column():
118
+ condition_image = gr.Image(label="🖼️ Generated Image Condition", type="pil") # , interactive=False
119
+
120
+ step2_button.click(
121
+ generate_image_condition,
122
+ inputs=[position_images_tensor_path, normal_images_tensor_path, mask_images_tensor_path, w2c_tensor_path, text_prompt, selected_view, img_condition_seed, model_choice, edge_refinement],
123
+ outputs=[condition_image, step2_progress],
124
+ )
125
+
126
+ # Step 3
127
+ gr.Markdown("---")
128
+ gr.Markdown("## 🎨 Step 3: Generate Final Texture")
129
+ gr.Markdown("""
130
+ **📋 How to generate final texture:**
131
+ - The **SeqTex pipeline** will create a complete texture map for your model
132
+ - View the results from multiple angles and download your textured 3D model (the viewport is a little bit dark)
133
+ """)
134
+ texture_map_tensor_path = gr.State()
135
+ with gr.Row():
136
+ with gr.Column(scale=1):
137
+ step3_button = gr.Button("🎨 Generate Final Texture", variant="primary")
138
+ step3_progress = gr.Textbox(label="📊 Texture Generation Status", interactive=False)
139
+ texture_map = gr.Image(label="🏆 Generated Texture Map", interactive=False)
140
+ with gr.Column(scale=2):
141
+ rendered_imgs = gr.Image(label="🖼️ Final Rendered Views")
142
+ mv_branch_imgs = gr.Image(label="🖼️ SeqTex Direct Output")
143
+ with gr.Column(scale=1.5):
144
+ model_display = gr.Model3D(label="🏆 Final Textured Model", height=500)
145
+ # model_display = LitModel3D(label="Model with Texture",
146
+ # exposure=30.0,
147
+ # height=500)
148
+
149
+ step3_button.click(
150
+ generate_texture,
151
+ inputs=[position_map_tensor_path, normal_map_tensor_path, position_images_tensor_path, normal_images_tensor_path, condition_image, text_prompt, selected_view],
152
+ outputs=[texture_map_tensor_path, texture_map, mv_branch_imgs, step3_progress],
153
+ ).success(
154
+ render_views,
155
+ inputs=[mesh, texture_map_tensor_path, mvp_matrix_tensor_path],
156
+ outputs=[rendered_imgs]
157
+ ).success(
158
+ Mesh.export,
159
+ inputs=[mesh, gr.State(None), texture_map],
160
+ outputs=[model_display]
161
+ )
162
+
163
+ # Add example inputs for user convenience
164
+ gr.Markdown("---")
165
+ gr.Markdown("## 🚀 Try Our Examples")
166
+ gr.Markdown("**Quick Start**: Click on any example below to see SeqTex in action with pre-configured settings!")
167
+ gr.Examples(
168
+ examples=EXAMPLES,
169
+ inputs=[mesh_upload, y2z, y2x, z2x, upside_down, img_condition_seed, selected_view, model_choice, edge_refinement, text_prompt],
170
+ cache_examples=False
171
+ )
172
+
173
+ # Acknowledgments
174
+ gr.Markdown("---")
175
+ gr.Markdown("## 🙏 Acknowledgments")
176
+ gr.Markdown("""
177
+ **Special thanks to [Toshihiro Hayashi](mailto:[email protected])** for his valuable support and assistance in fixing bugs for this demo.
178
+ """)
179
+
180
+ if LOAD_FIRST:
181
+ import gc
182
+ get_seqtex_pipe()
183
+ print("SeqTex pipeline loaded successfully.")
184
+ get_sdxl_pipe()
185
+ print("SDXL pipeline loaded successfully.")
186
+ # get_flux_pipe()
187
+ # Note: FLUX pipeline is available in code but not loaded due to GPU memory constraints on HF Space
188
+ print("Note: FLUX and other models are available for local deployment.")
189
+ gc.collect()
190
+
191
+ assert os.environ.get("OPENCV_IO_ENABLE_OPENEXR") == "1", "OpenEXR support is required for this demo; set OPENCV_IO_ENABLE_OPENEXR=1."
192
+ demo.launch(server_name="0.0.0.0")
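The rewired callbacks above keep heavy tensors out of `gr.State`: Step 1 saves them to disk (via `save_tensor_to_file`) and only file paths flow between the steps, which are chained with `.success()` so later stages are skipped when an earlier one raises. Below is a minimal, self-contained sketch of that pattern with illustrative names; it is not the Space itself.

```python
import os
import tempfile
import uuid

import gradio as gr
import torch


def _save(t: torch.Tensor) -> str:
    # Persist the tensor and hand back only its path.
    path = os.path.join(tempfile.gettempdir(), f"{uuid.uuid4().hex}.pt")
    torch.save(t, path)
    return path


def step1_process():
    # Stand-in for Mesh.process: pretend we rendered some views.
    return _save(torch.rand(4, 512, 512, 3)), "Step 1 done."


def step2_consume(tensor_path: str) -> str:
    # Later steps reload the tensor from the path stored in gr.State.
    views = torch.load(tensor_path, map_location="cpu", weights_only=True)
    return f"Loaded views with shape {tuple(views.shape)}"


with gr.Blocks() as sketch:
    views_path = gr.State()
    status = gr.Textbox(label="Status", interactive=False)
    result = gr.Textbox(label="Result", interactive=False)
    run = gr.Button("Run")
    # .success() only fires if the previous callback finished without raising.
    run.click(step1_process, outputs=[views_path, status]).success(
        step2_consume, inputs=[views_path], outputs=[result]
    )

if __name__ == "__main__":
    sketch.launch()
```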
examples/shoe.glb ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3945c77a6f98eb18aae9c97f253e4c6b06daf83194c21f91ea4b955756bace7e
3
+ size 8842904
packages.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ ninja-build
requirements.txt ADDED
@@ -0,0 +1,24 @@
1
+ torchvision==0.21.0
2
+
3
+ einops
4
+ omegaconf==2.3.0
5
+ jaxtyping
6
+ typeguard
7
+ imageio
8
+ trimesh==4.6.4
9
+ peft==0.14.0
10
+ diffusers==0.33.1
11
+ bitsandbytes
12
+ transformers==4.52.4
13
+ ftfy
14
+ accelerate
15
+ sentencepiece
16
+ ipdb
17
+ clean-fid
18
+ apex==0.9.10.dev0
19
+ xatlas
20
+ gradio_litmodel3d
21
+ spaces
22
+ numpy
23
+ opencv-python
24
+ https://huggingface.co/spaces/VAST-AI/MV-Adapter-Img2Texture/resolve/main/wheels/nvdiffrast-0.3.3-cp310-cp310-linux_x86_64.whl?download=true
utils/__init__.py CHANGED
@@ -0,0 +1,67 @@
1
+ import numpy as np
2
+ import torch
3
+ from einops import rearrange
4
+ from PIL import Image
5
+
6
+
7
+ def tensor_to_pil(tensor, mask=None, normalize: bool = True):
8
+ """
9
+ Convert tensor to PIL Image.
10
+ :param tensor: torch.Tensor or str (file path to tensor), shape can be (Nv, H, W, C), (Nv, C, H, W), (H, W, C), (C, H, W)
11
+ :param mask: torch.Tensor or str (file path to tensor), shape same as tensor, effective when C=3
12
+ :return: PIL.Image
13
+ """
14
+ # If input is a file path, load the tensor
15
+ if isinstance(tensor, str):
16
+ from utils.file_utils import load_tensor_from_file
17
+ tensor = load_tensor_from_file(tensor, map_location="cpu")
18
+ if mask is not None and isinstance(mask, str):
19
+ from utils.file_utils import load_tensor_from_file
20
+ mask = load_tensor_from_file(mask, map_location="cpu")
21
+ # Move to cpu
22
+ tensor = tensor.detach()
23
+ if tensor.is_cuda:
24
+ tensor = tensor.cpu()
25
+ if mask is not None and mask.is_cuda:
26
+ mask = mask.cpu()
27
+
28
+ # Convert to float32
29
+ tensor = tensor.float()
30
+ if mask is not None:
31
+ mask = mask.float()
32
+
33
+ if normalize:
34
+ tensor = (tensor + 1.0) / 2.0
35
+ tensor = torch.clamp(tensor, 0.0, 1.0)
36
+ if mask is not None:
37
+ if mask.shape[-1] not in [1, 3]:
38
+ mask = mask.unsqueeze(-1)
39
+ tensor = torch.cat([tensor, mask], dim=-1)
40
+
41
+ shape = tensor.shape
42
+ # 4D: (Nv, H, W, C) or (Nv, C, H, W)
43
+ if len(shape) == 4:
44
+ Nv = shape[0]
45
+ if shape[-1] in [3, 4]: # (Nv, H, W, C)
46
+ tensor = rearrange(tensor, 'nv h w c -> h (nv w) c')
47
+ else: # (Nv, C, H, W)
48
+ tensor = rearrange(tensor, 'nv c h w -> h (nv w) c')
49
+ # 3D: (H, W, C) or (C, H, W)
50
+ elif len(shape) == 3:
51
+ if shape[-1] in [3, 4]: # (H, W, C)
52
+ tensor = rearrange(tensor, 'h w c -> h w c')
53
+ else: # (C, H, W)
54
+ tensor = rearrange(tensor, 'c h w -> h w c')
55
+ else:
56
+ raise ValueError(f"Unsupported tensor shape: {shape}")
57
+
58
+ # Convert to numpy
59
+ np_img = (tensor.numpy() * 255).round().astype(np.uint8)
60
+
61
+ # Create PIL Image
62
+ if np_img.shape[2] == 3:
63
+ return Image.fromarray(np_img, mode="RGB")
64
+ elif np_img.shape[2] == 4:
65
+ return Image.fromarray(np_img, mode="RGBA")
66
+ else:
67
+ raise ValueError("Only support 3 or 4 channel images.")
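A hypothetical call to `tensor_to_pil` exercising the `(Nv, H, W, C)` path, where the four views are tiled horizontally and the mask becomes an alpha channel:

```python
import torch

from utils import tensor_to_pil

# Four views in (Nv, H, W, C) layout with values in [-1, 1]; normalize=True maps them to [0, 1].
views = torch.rand(4, 512, 512, 3) * 2.0 - 1.0
mask = torch.ones(4, 512, 512)            # broadcast to an alpha channel -> RGBA output
strip = tensor_to_pil(views, mask=mask)   # views are tiled horizontally: H x (Nv * W)
strip.save("views.png")
```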
utils/file_utils.py ADDED
@@ -0,0 +1,15 @@
1
+ import os
2
+ import uuid
3
+ import torch
4
+ from gradio.utils import get_upload_folder
5
+
6
+ def save_tensor_to_file(tensor, prefix="tensor"):
7
+ upload_dir = get_upload_folder()
8
+ os.makedirs(upload_dir, exist_ok=True)
9
+ path = os.path.join(upload_dir, f"{prefix}_{uuid.uuid4().hex}.pt")
10
+ torch.save(tensor, path)
11
+ return path
12
+
13
+ def load_tensor_from_file(path, map_location=None):
14
+ # Use weights_only=True for security and to suppress FutureWarning (only tensors are loaded in this app)
15
+ return torch.load(path, map_location=map_location, weights_only=True)
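A small round-trip example for these helpers (hypothetical tensor; the file lands in Gradio's upload folder):

```python
import torch

from utils.file_utils import load_tensor_from_file, save_tensor_to_file

position_map = torch.rand(1024, 1024, 3)                         # hypothetical tensor
path = save_tensor_to_file(position_map, prefix="position_map")  # .../position_map_<uuid>.pt
restored = load_tensor_from_file(path, map_location="cpu")
assert torch.equal(position_map, restored)
```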
utils/image_generation.py CHANGED
@@ -1,6 +1,8 @@
 
1
  import threading
2
 
3
  import cv2
 
4
  import numpy as np
5
  import spaces
6
  import torch
@@ -12,12 +14,11 @@ from einops import rearrange
12
  from PIL import Image
13
  from torchvision.transforms import ToPILImage
14
 
15
- import gradio as gr
16
-
17
  from .controlnet_union import ControlNetModel_Union
18
  from .pipeline_controlnet_union_sd_xl import \
19
  StableDiffusionXLControlNetUnionPipeline
20
  from .render_utils import get_silhouette_image
 
21
 
22
  IMG_PIPE = None
23
  IMG_PIPE_LOCK = threading.Lock()
@@ -26,8 +27,9 @@ FLUX_PIPE = None
26
  FLUX_PIPE_LOCK = threading.Lock()
27
  FLUX_SUFFIX = None
28
  FLUX_NEGATIVE = None
 
29
 
30
- def lazy_get_flux_pipe():
31
  """
32
  Lazy load the FLUX pipeline with ControlNet for image generation.
33
  """
@@ -44,16 +46,21 @@ def lazy_get_flux_pipe():
44
  controlnet_model_union = 'Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0'
45
 
46
  controlnet = FluxControlNetModel.from_pretrained(controlnet_model_union, torch_dtype=torch.bfloat16)
 
47
  FLUX_PIPE = FluxControlNetPipeline.from_pretrained(
48
  base_model,
49
  controlnet=controlnet,
50
- torch_dtype=torch.bfloat16
 
51
  )
52
  # Use model CPU offload for better GPU utilization during inference
53
- FLUX_PIPE.enable_model_cpu_offload()
 
 
 
54
  return FLUX_PIPE
55
 
56
- def lazy_get_sdxl_pipe():
57
  """
58
  Lazy load the SDXL pipeline with ControlNet for image generation.
59
  """
@@ -74,8 +81,11 @@ def lazy_get_sdxl_pipe():
74
  torch_dtype=torch.float16,
75
  scheduler=eulera_scheduler,
76
  )
77
- # Move pipeline to CUDA device
78
- IMG_PIPE = IMG_PIPE.to("cuda")
 
 
 
79
  return IMG_PIPE
80
 
81
 
@@ -94,11 +104,11 @@ def generate_sdxl_condition(depth_img, normal_img, text_prompt, mask, seed=42, e
94
  :return: Generated image condition (e.g., PIL Image).
95
  """
96
  progress(0.1, desc="Loading SDXL pipeline...")
97
- pipeline = lazy_get_sdxl_pipe()
98
  progress(0.3, desc="SDXL pipeline loaded successfully.")
99
 
100
- positive_prompt = text_prompt + ", photo-realistic style, high quality, 8K, highly detailed texture, soft lightning, uniform color, foreground"
101
- negative_prompt = 'longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality'
102
 
103
  img_generation_resolution = 1024 # SDXL performs better at 1024x1024
104
  image = pipeline(prompt=[positive_prompt]*1,
@@ -154,7 +164,7 @@ def generate_flux_condition(depth_img, text_prompt, mask, seed=42, edge_refineme
154
  :return: Generated image condition (PIL Image).
155
  """
156
  progress(0.1, desc="Loading FLUX pipeline...")
157
- pipeline = lazy_get_flux_pipe()
158
  progress(0.3, desc="FLUX pipeline loaded successfully.")
159
 
160
  # Enhanced prompt for better results
@@ -262,7 +272,7 @@ def refine_image_edges(rgb_tensor, mask_tensor):
262
 
263
  return refined_rgb_tensor
264
 
265
- @spaces.GPU(duration=120)
266
  def generate_image_condition(position_imgs, normal_imgs, mask_imgs, w2c, text_prompt, selected_view="First View", seed=42, model="SDXL", edge_refinement=True, progress=gr.Progress()):
267
  """
268
  Generate the image condition based on the selected view's silhouette and text prompt.
@@ -278,6 +288,20 @@ def generate_image_condition(position_imgs, normal_imgs, mask_imgs, w2c, text_pr
278
  :param edge_refinement: Whether to apply edge refinement to smooth mask boundaries (default: True).
279
  :return: Generated condition image and status message.
280
  """
281
 
282
  progress(0, desc="Handling geometry information...")
283
  silhouette = get_silhouette_image(position_imgs, normal_imgs, mask_imgs=mask_imgs, w2c=w2c, selected_view=selected_view)
@@ -291,6 +315,7 @@ def generate_image_condition(position_imgs, normal_imgs, mask_imgs, w2c, text_pr
291
  return condition, "SDXL condition generated successfully."
292
  elif model == "FLUX":
293
  # FLUX only supports depth control, not normal
 
294
  condition = generate_flux_condition(depth_img, text_prompt, mask, seed, edge_refinement=edge_refinement, progress=progress)
295
  return condition, "FLUX condition generated successfully (depth-only control)."
296
  else:
 
1
+ import os
2
  import threading
3
 
4
  import cv2
5
+ import gradio as gr
6
  import numpy as np
7
  import spaces
8
  import torch
 
14
  from PIL import Image
15
  from torchvision.transforms import ToPILImage
16
 
 
 
17
  from .controlnet_union import ControlNetModel_Union
18
  from .pipeline_controlnet_union_sd_xl import \
19
  StableDiffusionXLControlNetUnionPipeline
20
  from .render_utils import get_silhouette_image
21
+ from utils.file_utils import load_tensor_from_file
22
 
23
  IMG_PIPE = None
24
  IMG_PIPE_LOCK = threading.Lock()
 
27
  FLUX_PIPE_LOCK = threading.Lock()
28
  FLUX_SUFFIX = None
29
  FLUX_NEGATIVE = None
30
+ CPU_OFFLOAD = False
31
 
32
+ def get_flux_pipe():
33
  """
34
  Lazy load the FLUX pipeline with ControlNet for image generation.
35
  """
 
46
  controlnet_model_union = 'Shakker-Labs/FLUX.1-dev-ControlNet-Union-Pro-2.0'
47
 
48
  controlnet = FluxControlNetModel.from_pretrained(controlnet_model_union, torch_dtype=torch.bfloat16)
49
+ assert os.environ.get("SEQTEX_SPACE_TOKEN", "") != "", "Please set the SEQTEX_SPACE_TOKEN environment variable to a Hugging Face token that has access to black-forest-labs/FLUX.1-dev."
50
  FLUX_PIPE = FluxControlNetPipeline.from_pretrained(
51
  base_model,
52
  controlnet=controlnet,
53
+ torch_dtype=torch.bfloat16,
54
+ token=os.environ["SEQTEX_SPACE_TOKEN"]
55
  )
56
  # Use model CPU offload for better GPU utilization during inference
57
+ if CPU_OFFLOAD:
58
+ FLUX_PIPE.enable_model_cpu_offload()
59
+ else:
60
+ FLUX_PIPE.to("cuda")
61
  return FLUX_PIPE
62
 
63
+ def get_sdxl_pipe():
64
  """
65
  Lazy load the SDXL pipeline with ControlNet for image generation.
66
  """
 
81
  torch_dtype=torch.float16,
82
  scheduler=eulera_scheduler,
83
  )
84
+ # Use model CPU offload for better GPU utilization during inference
85
+ if CPU_OFFLOAD:
86
+ IMG_PIPE.enable_model_cpu_offload()
87
+ else:
88
+ IMG_PIPE.to("cuda")
89
  return IMG_PIPE
90
 
91
 
 
104
  :return: Generated image condition (e.g., PIL Image).
105
  """
106
  progress(0.1, desc="Loading SDXL pipeline...")
107
+ pipeline = get_sdxl_pipe()
108
  progress(0.3, desc="SDXL pipeline loaded successfully.")
109
 
110
+ positive_prompt = text_prompt + ", photo-realistic style, high quality, 8K, highly detailed texture, soft diffuse lighting, uniform lighting, flat lighting, even illumination, matte surface, low contrast, uniform color, foreground"
111
+ negative_prompt = 'longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, harsh lighting, high contrast, bright highlights, specular reflections, shiny surface, glossy, reflective, strong shadows, dramatic lighting, spotlight, direct sunlight, glare, bloom, lens flare'
112
 
113
  img_generation_resolution = 1024 # SDXL performs better at 1024x1024
114
  image = pipeline(prompt=[positive_prompt]*1,
 
164
  :return: Generated image condition (PIL Image).
165
  """
166
  progress(0.1, desc="Loading FLUX pipeline...")
167
+ pipeline = get_flux_pipe()
168
  progress(0.3, desc="FLUX pipeline loaded successfully.")
169
 
170
  # Enhanced prompt for better results
 
272
 
273
  return refined_rgb_tensor
274
 
275
+ @spaces.GPU()
276
  def generate_image_condition(position_imgs, normal_imgs, mask_imgs, w2c, text_prompt, selected_view="First View", seed=42, model="SDXL", edge_refinement=True, progress=gr.Progress()):
277
  """
278
  Generate the image condition based on the selected view's silhouette and text prompt.
 
288
  :param edge_refinement: Whether to apply edge refinement to smooth mask boundaries (default: True).
289
  :return: Generated condition image and status message.
290
  """
291
+ # If any input is a file path, load the tensor from file
292
+ if isinstance(position_imgs, str):
293
+ position_imgs = load_tensor_from_file(position_imgs, map_location="cuda")
294
+ if isinstance(normal_imgs, str):
295
+ normal_imgs = load_tensor_from_file(normal_imgs, map_location="cuda")
296
+ if isinstance(mask_imgs, str):
297
+ mask_imgs = load_tensor_from_file(mask_imgs, map_location="cuda")
298
+ if isinstance(w2c, str):
299
+ w2c = load_tensor_from_file(w2c, map_location="cuda")
300
+
301
+ position_imgs = position_imgs.to("cuda")
302
+ normal_imgs = normal_imgs.to("cuda")
303
+ mask_imgs = mask_imgs.to("cuda")
304
+ w2c = w2c.to("cuda")
305
 
306
  progress(0, desc="Handling geometry information...")
307
  silhouette = get_silhouette_image(position_imgs, normal_imgs, mask_imgs=mask_imgs, w2c=w2c, selected_view=selected_view)
 
315
  return condition, "SDXL condition generated successfully."
316
  elif model == "FLUX":
317
  # FLUX only supports depth control, not normal
318
+ raise NotImplementedError("The FLUX model is not supported in the HF Space; remove this guard to run it locally.")
319
  condition = generate_flux_condition(depth_img, text_prompt, mask, seed, edge_refinement=edge_refinement, progress=progress)
320
  return condition, "FLUX condition generated successfully (depth-only control)."
321
  else:
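The `get_sdxl_pipe` / `get_flux_pipe` loaders above follow a lazy, lock-guarded singleton pattern: the first request pays the load cost once, and concurrent requests wait on the lock instead of building duplicate pipelines. A stripped-down sketch of the same pattern, using the SDXL base model id from this repo and an illustrative `get_pipe` name (the real loader builds the ControlNet-Union pipeline):

```python
import threading

import torch
from diffusers import StableDiffusionXLPipeline

_PIPE = None
_PIPE_LOCK = threading.Lock()
CPU_OFFLOAD = False  # mirrors the flag above; offload trades speed for memory


def get_pipe():
    """Load the pipeline once; concurrent callers wait instead of reloading."""
    global _PIPE
    if _PIPE is not None:      # fast path: already built
        return _PIPE
    with _PIPE_LOCK:           # slow path: only the first caller builds it
        if _PIPE is not None:
            return _PIPE
        pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
        )
        if CPU_OFFLOAD:
            pipe.enable_model_cpu_offload()  # weights stay on CPU until used
        else:
            pipe = pipe.to("cuda")
        _PIPE = pipe
    return _PIPE
```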
utils/mesh_utils.py CHANGED
@@ -1,16 +1,16 @@
1
- import os
2
  import tempfile
3
 
 
4
  import numpy as np
 
5
  import torch
6
  import trimesh
7
  import xatlas
8
  from PIL import Image
9
 
10
- import gradio as gr
11
-
12
  from .render_utils import (get_mvp_matrix, get_pure_texture, render_geo_map,
13
  render_geo_views_tensor, render_views, setup_lights)
 
14
 
15
 
16
  class Mesh:
@@ -19,7 +19,7 @@ class Mesh:
19
  Initialize the Mesh object with a mesh file path.
20
  :param mesh_path: Path to the mesh file (e.g., .obj or .glb).
21
  """
22
- self.device = device
23
  if mesh_path is not None:
24
  # Initialize _parts dictionary to store all parts
25
  self._parts = {}
@@ -91,11 +91,16 @@ class Mesh:
91
  progress(0.4, f"The model has SINGLE UV parameterization, no need to reparameterize.")
92
  self._vmapping = None # No vmapping needed when not reparameterizing
93
 
 
 
 
 
94
  def to(self, device):
95
  """
96
  Move the mesh data to the specified device.
97
  :param device: The target device (e.g., 'cuda' or 'cpu').
98
  """
 
99
  self._v_pos = self._v_pos.to(device)
100
  self._t_pos_idx = self._t_pos_idx.to(device)
101
  if self._v_tex is not None:
@@ -104,6 +109,7 @@ class Mesh:
104
  if hasattr(self, '_vmapping') and self._vmapping is not None:
105
  self._vmapping = self._vmapping.to(device)
106
  self._v_normal = self._v_normal.to(device)
 
107
 
108
  @property
109
  def has_multi_parts(self):
@@ -404,7 +410,12 @@ class Mesh:
404
  texture_map = Image.fromarray(texture_map)
405
  assert type(texture_map) is Image.Image, "texture_map should be a PIL.Image"
406
  texture_map = texture_map.transpose(Image.FLIP_TOP_BOTTOM).convert("RGB")
407
- material = trimesh.visual.texture.SimpleMaterial(image=texture_map)
 
 
 
 
 
408
  else:
409
  default_texture = Image.new("RGB", (1024, 1024), (200, 200, 200))
410
  material = trimesh.visual.texture.SimpleMaterial(image=default_texture)
@@ -445,6 +456,7 @@ class Mesh:
445
  return save_path
446
 
447
  @classmethod
 
448
  def process(cls, mesh_file, uv_tool="xAtlas", y2z=True, y2x=False, z2x=False, upside_down=False, img_size=(512, 512), uv_size=(1024, 1024), device='cuda', progress=gr.Progress()):
449
  """
450
  Handle the mesh processing, which includes normalization, parts merging, and UV mapping.
@@ -486,7 +498,15 @@ class Mesh:
486
  position_map, normal_map = render_geo_map(mesh)
487
 
488
  progress(1, f"Mesh processing completed.")
489
- return position_map, normal_map, position_images, normal_images, mask_images.squeeze(-1), w2c, mesh, mvp_matrix, "Mesh processing completed."
 
 
 
 
 
 
 
 
490
 
491
 
492
  if __name__ == '__main__':
 
 
1
  import tempfile
2
 
3
+ import gradio as gr
4
  import numpy as np
5
+ import spaces
6
  import torch
7
  import trimesh
8
  import xatlas
9
  from PIL import Image
10
 
 
 
11
  from .render_utils import (get_mvp_matrix, get_pure_texture, render_geo_map,
12
  render_geo_views_tensor, render_views, setup_lights)
13
+ from utils.file_utils import save_tensor_to_file
14
 
15
 
16
  class Mesh:
 
19
  Initialize the Mesh object with a mesh file path.
20
  :param mesh_path: Path to the mesh file (e.g., .obj or .glb).
21
  """
22
+ self._device = device
23
  if mesh_path is not None:
24
  # Initialize _parts dictionary to store all parts
25
  self._parts = {}
 
91
  progress(0.4, f"The model has SINGLE UV parameterization, no need to reparameterize.")
92
  self._vmapping = None # No vmapping needed when not reparameterizing
93
 
94
+ @property
95
+ def device(self):
96
+ return self._device
97
+
98
  def to(self, device):
99
  """
100
  Move the mesh data to the specified device.
101
  :param device: The target device (e.g., 'cuda' or 'cpu').
102
  """
103
+ self._device = device
104
  self._v_pos = self._v_pos.to(device)
105
  self._t_pos_idx = self._t_pos_idx.to(device)
106
  if self._v_tex is not None:
 
109
  if hasattr(self, '_vmapping') and self._vmapping is not None:
110
  self._vmapping = self._vmapping.to(device)
111
  self._v_normal = self._v_normal.to(device)
112
+ return self
113
 
114
  @property
115
  def has_multi_parts(self):
 
410
  texture_map = Image.fromarray(texture_map)
411
  assert type(texture_map) is Image.Image, "texture_map should be a PIL.Image"
412
  texture_map = texture_map.transpose(Image.FLIP_TOP_BOTTOM).convert("RGB")
413
+ material = trimesh.visual.material.PBRMaterial(
414
+ baseColorTexture=texture_map,
415
+ baseColorFactor=[255, 255, 255, 255], # set to white so the base color does not tint the texture
416
+ metallicFactor=0.0,
417
+ roughnessFactor=1.0
418
+ )
419
  else:
420
  default_texture = Image.new("RGB", (1024, 1024), (200, 200, 200))
421
  material = trimesh.visual.texture.SimpleMaterial(image=default_texture)
 
456
  return save_path
457
 
458
  @classmethod
459
+ @spaces.GPU(duration=30)
460
  def process(cls, mesh_file, uv_tool="xAtlas", y2z=True, y2x=False, z2x=False, upside_down=False, img_size=(512, 512), uv_size=(1024, 1024), device='cuda', progress=gr.Progress()):
461
  """
462
  Handle the mesh processing, which includes normalization, parts merging, and UV mapping.
 
498
  position_map, normal_map = render_geo_map(mesh)
499
 
500
  progress(1, f"Mesh processing completed.")
501
+ position_map_path = save_tensor_to_file(position_map, prefix="position_map")
502
+ normal_map_path = save_tensor_to_file(normal_map, prefix="normal_map")
503
+ position_images_path = save_tensor_to_file(position_images, prefix="position_images")
504
+ normal_images_path = save_tensor_to_file(normal_images, prefix="normal_images")
505
+ mask_images_path = save_tensor_to_file(mask_images.squeeze(-1), prefix="mask_images")
506
+ w2c_path = save_tensor_to_file(w2c, prefix="w2c")
507
+ mvp_matrix_path = save_tensor_to_file(mvp_matrix, prefix="mvp_matrix")
508
+ # Move the mesh to CPU before returning so it can be held in gr.State
509
+ return position_map_path, normal_map_path, position_images_path, normal_images_path, mask_images_path, w2c_path, mesh.to("cpu"), mvp_matrix_path, "Mesh processing completed."
510
 
511
 
512
  if __name__ == '__main__':
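`Mesh.export` now builds a `PBRMaterial` (white base color, metallic 0, roughness 1) so viewers display the texture without tinting or strong specular response. A hedged, standalone sketch of that export path with placeholder geometry, UVs, and texture (the real code uses the SeqTex mesh, its xAtlas UVs, and the generated texture map flipped vertically):

```python
import numpy as np
import trimesh
from PIL import Image

# Placeholder texture; the real pipeline uses the generated texture map and
# flips it vertically (FLIP_TOP_BOTTOM) before building the material.
texture = Image.new("RGB", (1024, 1024), (180, 140, 90))

material = trimesh.visual.material.PBRMaterial(
    baseColorTexture=texture,
    baseColorFactor=[255, 255, 255, 255],  # white, so the texture is not tinted
    metallicFactor=0.0,
    roughnessFactor=1.0,
)

mesh = trimesh.creation.box()                # placeholder geometry
uv = np.random.rand(len(mesh.vertices), 2)   # placeholder UVs (SeqTex uses xAtlas UVs)
mesh.visual = trimesh.visual.TextureVisuals(uv=uv, material=material)
mesh.export("textured.glb")
```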
utils/rasterize.py CHANGED
@@ -1,17 +1,25 @@
1
  import nvdiffrast.torch as dr
2
  import torch
3
-
4
- from torch import Tensor
5
  from jaxtyping import Float, Integer
6
- from typing import Union, Tuple
 
7
 
8
  class NVDiffRasterizerContext:
9
- def __init__(self, context_type: str, device: torch.device) -> None:
10
  self.device = device
11
  self.ctx = self.initialize_context(context_type, device)
12
 
13
  def initialize_context(
14
- self, context_type: str, device: torch.device
15
  ) -> Union[dr.RasterizeGLContext, dr.RasterizeCudaContext]:
16
  if context_type == "gl":
17
  return dr.RasterizeGLContext(device=device)
 
1
+ # This file uses nvdiffrast library, which is licensed under the NVIDIA Source Code License (1-Way Commercial).
2
+ # nvdiffrast is available for non-commercial use (research or evaluation purposes only).
3
+ # For commercial use, please contact NVIDIA for licensing: https://www.nvidia.com/en-us/research/inquiries/
4
+ #
5
+ # nvdiffrast copyright: Copyright (c) 2020, NVIDIA Corporation. All rights reserved.
6
+ # Full license: https://github.com/NVlabs/nvdiffrast/blob/main/LICENSE.txt
7
+
8
+ from typing import Tuple, Union
9
+
10
  import nvdiffrast.torch as dr
11
  import torch
 
 
12
  from jaxtyping import Float, Integer
13
+ from torch import Tensor
14
+
15
 
16
  class NVDiffRasterizerContext:
17
+ def __init__(self, context_type: str, device) -> None:
18
  self.device = device
19
  self.ctx = self.initialize_context(context_type, device)
20
 
21
  def initialize_context(
22
+ self, context_type: str, device
23
  ) -> Union[dr.RasterizeGLContext, dr.RasterizeCudaContext]:
24
  if context_type == "gl":
25
  return dr.RasterizeGLContext(device=device)
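For reference, a hypothetical standalone construction of the context class defined above (requires nvdiffrast and a CUDA device; the "cuda" context type avoids the OpenGL/EGL display that the "gl" type needs):

```python
from utils.rasterize import NVDiffRasterizerContext

ctx = NVDiffRasterizerContext("cuda", "cuda")
print(type(ctx.ctx).__name__)  # expected: RasterizeCudaContext
```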
utils/render_utils.py CHANGED
@@ -3,6 +3,7 @@ from functools import cache
3
  from typing import Dict, Union
4
 
5
  import numpy as np
 
6
  import torch
7
  import torch.nn.functional as F
8
  from einops import rearrange
@@ -15,8 +16,22 @@ from .rasterize import (NVDiffRasterizerContext,
15
  rasterize_position_and_normal_maps,
16
  render_geo_from_mesh,
17
  render_rgb_from_texture_mesh_with_mask)
 
18
 
19
- CTX = NVDiffRasterizerContext('cuda', 'cuda')
20
 
21
  def setup_lights():
22
  """
@@ -24,6 +39,7 @@ def setup_lights():
24
  """
25
  raise NotImplementedError("setup_lights function is not implemented yet.")
26
 
 
27
  def render_views(mesh, texture, mvp_matrix, lights=None, img_size=(512, 512)) -> Image.Image:
28
  """
29
  Render the RGB color images of the mesh. The background will be transparent.
@@ -34,11 +50,22 @@ def render_views(mesh, texture, mvp_matrix, lights=None, img_size=(512, 512)) ->
34
  :param img_size: The size of the output image, a tuple (height, width).
35
  :return: A concatenated PIL Image.
36
  """
  if texture.shape[-1] != 3:
38
  texture = texture.permute(1, 2, 0)
39
  image_height, image_width = img_size
40
  rgb_cond, mask = render_rgb_from_texture_mesh_with_mask(
41
- CTX, mesh, texture, mvp_matrix, image_height, image_width, torch.tensor([0.0, 0.0, 0.0], device='cuda'))
42
 
43
  if mvp_matrix.shape[0] == 0:
44
  return None
@@ -65,20 +92,24 @@ def render_views(mesh, texture, mvp_matrix, lights=None, img_size=(512, 512)) ->
65
 
66
  return concatenated_image
67
 
 
68
  def render_geo_views_tensor(mesh, mvp_matrix, img_size=(512, 512)) -> tuple[torch.Tensor, torch.Tensor]:
69
  """
70
  render the geometry information including position and normal from views that mvp matrix implies.
71
  """
 
72
  image_height, image_width = img_size
73
- position_images, normal_images, mask_images = render_geo_from_mesh(CTX, mesh, mvp_matrix, image_height, image_width)
74
  return position_images, normal_images, mask_images
75
 
 
76
  def render_geo_map(mesh, map_size=(1024, 1024)) -> tuple[torch.Tensor, torch.Tensor]:
77
  """
78
  Render the geometry information including position and normal from UV parameterization.
79
  """
 
80
  map_height, map_width = map_size
81
- position_images, normal_images, mask = rasterize_position_and_normal_maps(CTX, mesh, map_height, map_width)
82
  # out_imgs = []
83
  # if mask.ndim == 4:
84
  # mask = mask[0]
@@ -318,6 +349,7 @@ def _get_depth_noraml_map_with_mask(xyz_map, normal_map, mask, w2c, device="cuda
318
 
319
  return depth_map, normal_map, mask
320
 
 
321
  def get_silhouette_image(position_imgs, normal_imgs, mask_imgs, w2c, selected_view="First View") -> tuple[Image.Image, Image.Image]:
322
  """
323
  Get the silhouette image based on geometry image.
 
3
  from typing import Dict, Union
4
 
5
  import numpy as np
6
+ import spaces
7
  import torch
8
  import torch.nn.functional as F
9
  from einops import rearrange
 
16
  rasterize_position_and_normal_maps,
17
  render_geo_from_mesh,
18
  render_rgb_from_texture_mesh_with_mask)
19
+ from utils.file_utils import load_tensor_from_file
20
 
21
+ # Global variable to store the singleton context
22
+ _CTX_INSTANCE = None
23
+
24
+ @spaces.GPU
25
+ def get_rasterizer_context():
26
+ """
27
+ Get the NVDiffRasterizer context using singleton pattern.
28
+ This ensures only one context is created and reused across the application.
29
+ """
30
+ global _CTX_INSTANCE
31
+ if _CTX_INSTANCE is None:
32
+ # Use string 'cuda' instead of torch.device to avoid early CUDA initialization
33
+ _CTX_INSTANCE = NVDiffRasterizerContext('cuda', 'cuda')
34
+ return _CTX_INSTANCE
35
 
36
  def setup_lights():
37
  """
 
39
  """
40
  raise NotImplementedError("setup_lights function is not implemented yet.")
41
 
42
+ @spaces.GPU
43
  def render_views(mesh, texture, mvp_matrix, lights=None, img_size=(512, 512)) -> Image.Image:
44
  """
45
  Render the RGB color images of the mesh. The background will be transparent.
 
50
  :param img_size: The size of the output image, a tuple (height, width).
51
  :return: A concatenated PIL Image.
52
  """
53
+ # If texture or mvp_matrix is a file path, load the tensor from file
54
+ if isinstance(texture, str):
55
+ texture = load_tensor_from_file(texture, map_location="cuda")
56
+ if isinstance(mvp_matrix, str):
57
+ mvp_matrix = load_tensor_from_file(mvp_matrix, map_location="cuda")
58
+ mesh = mesh.to("cuda")
59
+ texture = texture.to("cuda")
60
+ mvp_matrix = mvp_matrix.to("cuda")
61
+
62
+ print("Trying to render views...")
63
+ ctx = get_rasterizer_context()
64
  if texture.shape[-1] != 3:
65
  texture = texture.permute(1, 2, 0)
66
  image_height, image_width = img_size
67
  rgb_cond, mask = render_rgb_from_texture_mesh_with_mask(
68
+ ctx, mesh, texture, mvp_matrix, image_height, image_width, torch.tensor([0.0, 0.0, 0.0], device=texture.device))
69
 
70
  if mvp_matrix.shape[0] == 0:
71
  return None
 
92
 
93
  return concatenated_image
94
 
95
+ @spaces.GPU
96
  def render_geo_views_tensor(mesh, mvp_matrix, img_size=(512, 512)) -> tuple[torch.Tensor, torch.Tensor]:
97
  """
98
  render the geometry information including position and normal from views that mvp matrix implies.
99
  """
100
+ ctx = get_rasterizer_context()
101
  image_height, image_width = img_size
102
+ position_images, normal_images, mask_images = render_geo_from_mesh(ctx, mesh, mvp_matrix, image_height, image_width)
103
  return position_images, normal_images, mask_images
104
 
105
+ @spaces.GPU
106
  def render_geo_map(mesh, map_size=(1024, 1024)) -> tuple[torch.Tensor, torch.Tensor]:
107
  """
108
  Render the geometry information including position and normal from UV parameterization.
109
  """
110
+ ctx = get_rasterizer_context()
111
  map_height, map_width = map_size
112
+ position_images, normal_images, mask = rasterize_position_and_normal_maps(ctx, mesh, map_height, map_width)
113
  # out_imgs = []
114
  # if mask.ndim == 4:
115
  # mask = mask[0]
 
349
 
350
  return depth_map, normal_map, mask
351
 
352
+ @spaces.GPU
353
  def get_silhouette_image(position_imgs, normal_imgs, mask_imgs, w2c, selected_view="First View") -> tuple[Image.Image, Image.Image]:
354
  """
355
  Get the silhouette image based on geometry image.
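The singleton accessor introduced above defers all CUDA work until the first call inside a `@spaces.GPU` function, which matters on ZeroGPU Spaces where CUDA must not be initialized at import time. A minimal illustration of that deferral pattern; the dict stands in for `NVDiffRasterizerContext('cuda', 'cuda')` and the function names are illustrative:

```python
import spaces
import torch

_CTX = None  # created lazily so importing this module never touches CUDA


def get_ctx():
    """Stand-in for get_rasterizer_context(): build the context on first use."""
    global _CTX
    if _CTX is None:
        _CTX = {"device": "cuda"}  # placeholder for NVDiffRasterizerContext('cuda', 'cuda')
    return _CTX


@spaces.GPU
def render_once(n: int = 4) -> torch.Tensor:
    ctx = get_ctx()  # CUDA work happens only inside the GPU-allocated call
    return torch.zeros(n, 3, device=ctx["device"])
```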
utils/texture_generation.py CHANGED
@@ -11,13 +11,15 @@ from diffusers.models import AutoencoderKLWan
11
  from diffusers.schedulers import FlowMatchEulerDiscreteScheduler
12
  from einops import rearrange
13
  from jaxtyping import Float
14
- from peft import LoraConfig
15
  from PIL import Image
16
  from torch import Tensor
17
 
18
  from wan.pipeline_wan_t2tex_extra import WanT2TexPipeline
19
  from wan.wan_t2tex_transformer_3d_extra import WanT2TexTransformer3DModel
20
 
 
 
 
21
  TEX_PIPE = None
22
  VAE = None
23
  LATENTS_MEAN, LATENTS_STD = None, None
@@ -26,14 +28,8 @@ TEX_PIPE_LOCK = threading.Lock()
26
  @dataclass
27
  class Config:
28
  video_base_name: str = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
29
- seqtex_path: str = "https://huggingface.co/VAST-AI/SeqTex/resolve/main/.gitattributes/edm2_ema_12176_clean.pth"
30
- min_noise_level_index: int = 15 # which is same as paper [WorldMem](https://arxiv.org/pdf/2504.12369v1)
31
-
32
- use_causal_mask: bool = False
33
- addtional_qk_geometry: bool = False
34
- use_normal: bool = True
35
- use_position: bool = True
36
- randomly_init: bool = True # we load the weights from a corresponding ckpt
37
 
38
  num_views: int = 4
39
  uv_num_views: int = 1
@@ -47,45 +43,12 @@ class Config:
47
  eval_num_inference_steps: int = 30
48
  eval_seed: int = 42
49
 
50
- lora_rank: int = 128
51
- lora_alpha: int = 64
52
-
53
  cfg = Config()
54
 
55
- def load_model_weights(model_path: str, map_location="cpu"):
56
- """
57
- Load model weights from either a URL or local file path.
58
-
59
- Args:
60
- model_path (str): Path to model weights, can be URL or local file path
61
- map_location (str): Device to map the model to
62
-
63
- Returns:
64
- Dict: Loaded state dictionary
65
- """
66
- # Check if the path is a URL
67
- parsed_url = urlparse(model_path)
68
- if parsed_url.scheme in ('http', 'https'):
69
- # Load from URL using torch.hub
70
- try:
71
- state_dict = torch.hub.load_state_dict_from_url(
72
- model_path,
73
- map_location=map_location,
74
- progress=True
75
- )
76
- return state_dict
77
- except Exception as e:
78
- gr.Warning(f"Failed to load from URL: {e}")
79
- raise e
80
- else:
81
- # Load from local file path
82
- if not os.path.exists(model_path):
83
- raise FileNotFoundError(f"Local model file not found: {model_path}")
84
- return torch.load(model_path, map_location=map_location)
85
-
86
- def lazy_get_seqtex_pipe():
87
  """
88
  Lazy load the SeqTex pipeline for texture generation.
 
89
  """
90
  global TEX_PIPE, VAE, LATENTS_MEAN, LATENTS_STD
91
  if TEX_PIPE is not None:
@@ -95,42 +58,31 @@ def lazy_get_seqtex_pipe():
95
  if TEX_PIPE is not None:
96
  return TEX_PIPE
97
 
98
- # Pipeline
99
- TEX_PIPE = WanT2TexPipeline.from_pretrained(cfg.video_base_name)
100
-
101
- # Models
102
- transformer = WanT2TexTransformer3DModel(
103
- TEX_PIPE.transformer,
104
- use_causal_mask=cfg.use_causal_mask,
105
- addtional_qk_geo=cfg.addtional_qk_geometry,
106
- use_normal=cfg.use_normal,
107
- use_position=cfg.use_position,
108
- randomly_init=cfg.randomly_init,
109
  )
110
- transformer.add_adapter(
111
- LoraConfig(
112
- r=cfg.lora_rank,
113
- lora_alpha=cfg.lora_alpha,
114
- init_lora_weights=True,
115
- target_modules=["attn1.to_q", "attn1.to_k", "attn1.to_v", "attn1.to_out.0", "attn1.to_out.2",
116
- "ffn.net.0.proj", "ffn.net.2"],
117
- )
118
  )
119
- # load transformer
120
- state_dict = load_model_weights(cfg.seqtex_path, map_location="cpu")
121
- transformer.load_state_dict(state_dict, strict=True)
122
- TEX_PIPE.transformer = transformer
123
 
124
- VAE = AutoencoderKLWan.from_pretrained(cfg.video_base_name, subfolder="vae", torch_dtype=torch.float32).to("cuda").requires_grad_(False)
125
  TEX_PIPE.vae = VAE
126
 
127
- # Some useful parameters
128
  LATENTS_MEAN = torch.tensor(VAE.config.latents_mean).view(
129
  1, VAE.config.z_dim, 1, 1, 1
130
- ).to("cuda", dtype=torch.float32)
131
  LATENTS_STD = 1.0 / torch.tensor(VAE.config.latents_std).view(
132
  1, VAE.config.z_dim, 1, 1, 1
133
- ).to("cuda", dtype=torch.float32)
134
 
135
  scheduler: FlowMatchEulerDiscreteScheduler = (
136
  FlowMatchEulerDiscreteScheduler.from_config(
@@ -141,10 +93,8 @@ def lazy_get_seqtex_pipe():
141
  setattr(TEX_PIPE, "min_noise_level_index", min_noise_level_index)
142
  min_noise_level_timestep = scheduler.timesteps[min_noise_level_index]
143
  setattr(TEX_PIPE, "min_noise_level_timestep", min_noise_level_timestep)
144
- setattr(TEX_PIPE, "min_noise_level_sigma", min_noise_level_timestep / 1000.)
145
-
146
- TEX_PIPE = TEX_PIPE.to("cuda", dtype=torch.float32) # use float32 for inference
147
- return TEX_PIPE
148
 
149
  @torch.amp.autocast('cuda', dtype=torch.float32)
150
  def encode_images(
@@ -157,6 +107,11 @@ def encode_images(
157
  :param encode_as_first: Whether to encode all frames as the first frame.
158
  :return: Encoded latents with shape [B, C', F, H/8, W/8].
159
  """
 
 
 
 
 
160
  if images.min() < - 0.1:
161
  # images are in [-1, 1] range
162
  images = (images + 1.0) / 2.0 # Normalize to [0, 1] range
@@ -171,19 +126,6 @@ def encode_images(
171
 
172
  return latents
173
 
174
- # @torch.no_grad()
175
- # @torch.amp.autocast('cuda', dtype=torch.float32)
176
- # def decode_images(self, latents: Float[Tensor, "B C F H W"], decode_as_first: bool = False):
177
- # if decode_as_first:
178
- # F = latents.shape[2]
179
- # latents = latents.to(self.vae.dtype)
180
- # latents = latents / self.latents_std + self.latents_mean
181
- # latents = rearrange(latents, "B C F H W -> (B F) C 1 H W")
182
- # images = self.vae.decode(latents, return_dict=False)[0]
183
- # images = rearrange(images, "(B F) C Nv H W -> B C (F Nv) H W", F=F, Nv=1)
184
- # else:
185
- # raise NotImplementedError("Currently only support decode as first frame.")
186
- # return images
187
  @torch.amp.autocast('cuda', dtype=torch.float32)
188
  def decode_images(latents: Float[Tensor, "B C F H W"], decode_as_first: bool = False):
189
  """
@@ -192,6 +134,11 @@ def decode_images(latents: Float[Tensor, "B C F H W"], decode_as_first: bool = F
192
  :param decode_as_first: Whether to decode all frames as the first frame.
193
  :return: Decoded images with shape [B, C, F*Nv, H*8, W*8].
194
  """
 
 
 
 
 
195
  if decode_as_first:
196
  F = latents.shape[2]
197
  latents = latents.to(VAE.dtype)
@@ -207,6 +154,7 @@ def convert_img_to_tensor(image: Image.Image, device="cuda") -> Float[Tensor, "H
207
  """
208
  Convert a PIL Image to a tensor. If the image is RGBA, composite it onto a black background using the alpha channel as a mask.
209
  :param image: PIL Image to convert. [0, 255]
 
210
  :return: Tensor representation of the image. [0.0, 1.0], still [H, W, C]
211
  """
212
  # Convert to RGBA to ensure alpha channel exists
@@ -217,25 +165,33 @@ def convert_img_to_tensor(image: Image.Image, device="cuda") -> Float[Tensor, "H
217
  # Blend with black background using alpha mask
218
  rgb = rgb * alpha
219
  rgb = rgb.astype(np.float32) / 255.0 # Normalize to [0, 1]
220
- tensor = torch.from_numpy(rgb).to(device)
 
 
221
  return tensor
222
 
223
- @spaces.GPU(duration=120)
224
- @torch.cuda.amp.autocast(dtype=torch.float32)
225
- @torch.inference_mode
226
  @torch.no_grad
227
- def generate_texture(position_map, normal_map, position_images, normal_images, condition_image, text_prompt, selected_view, negative_prompt=None, device="cuda", progress=gr.Progress()):
 
228
  """
229
  Use SeqTex to generate texture for the mesh based on the image condition.
230
- :param position_images: List of position images from different views.
231
- :param normal_images: List of normal images from different views.
232
  :param condition_image: Image condition generated from the selected view.
233
  :param text_prompt: Text prompt for texture generation.
234
  :param selected_view: The view selected for generating the image condition.
235
- :return: Generated texture map, and multi-view frames in tensor.
236
  """
 
 
 
 
 
237
  progress(0, desc="Loading SeqTex pipeline...")
238
- tex_pipe = lazy_get_seqtex_pipe()
 
 
239
  progress(0.2, desc="SeqTex pipeline loaded successfully.")
240
  view_id_map = {
241
  "First View": 0,
@@ -306,4 +262,5 @@ def generate_texture(position_map, normal_map, position_images, normal_images, c
306
  uv_map_pred = torch.clamp(uv_map_pred, 0.0, 1.0)
307
 
308
  progress(1, desc="Texture generated successfully.")
309
- return uv_map_pred.float(), mv_out.float(), "Step 3: Texture generated successfully."
 
 
11
  from diffusers.schedulers import FlowMatchEulerDiscreteScheduler
12
  from einops import rearrange
13
  from jaxtyping import Float
 
14
  from PIL import Image
15
  from torch import Tensor
16
 
17
  from wan.pipeline_wan_t2tex_extra import WanT2TexPipeline
18
  from wan.wan_t2tex_transformer_3d_extra import WanT2TexTransformer3DModel
19
 
20
+ from . import tensor_to_pil
21
+ from utils.file_utils import save_tensor_to_file, load_tensor_from_file
22
+
23
  TEX_PIPE = None
24
  VAE = None
25
  LATENTS_MEAN, LATENTS_STD = None, None
 
28
  @dataclass
29
  class Config:
30
  video_base_name: str = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
31
+ seqtex_transformer_path: str = "VAST-AI/SeqTex-Transformer"
32
+ min_noise_level_index: int = 15 # refer to paper [WorldMem](https://arxiv.org/pdf/2504.12369v1)
 
 
 
 
 
 
33
 
34
  num_views: int = 4
35
  uv_num_views: int = 1
 
43
  eval_num_inference_steps: int = 30
44
  eval_seed: int = 42
45
 
 
 
 
46
  cfg = Config()
47
 
48
+ def get_seqtex_pipe():
49
  """
50
  Lazy load the SeqTex pipeline for texture generation.
51
+ Must be called within a @spaces.GPU context.
52
  """
53
  global TEX_PIPE, VAE, LATENTS_MEAN, LATENTS_STD
54
  if TEX_PIPE is not None:
 
58
  if TEX_PIPE is not None:
59
  return TEX_PIPE
60
 
61
+ assert os.environ.get("SEQTEX_SPACE_TOKEN"), "Please set the SEQTEX_SPACE_TOKEN environment variable with your Hugging Face token, which has access to VAST-AI/SeqTex-Transformer."
62
+ # Load transformer with auto-configured LoRA adapter first
63
+ transformer = WanT2TexTransformer3DModel.from_pretrained(
64
+ cfg.seqtex_transformer_path,
65
+ token=os.environ["SEQTEX_SPACE_TOKEN"]
66
+ )
67
68
+ # Pipeline - pass the pre-loaded transformer to avoid re-loading
69
+ TEX_PIPE = WanT2TexPipeline.from_pretrained(
70
+ cfg.video_base_name,
71
+ transformer=transformer,
72
+ torch_dtype=torch.bfloat16
 
73
  )
74
+ del transformer
 
 
 
75
 
76
+ VAE = AutoencoderKLWan.from_pretrained(cfg.video_base_name, subfolder="vae", torch_dtype=torch.float32)
77
  TEX_PIPE.vae = VAE
78
 
79
+ # Some useful parameters - delay CUDA initialization until GPU context
80
  LATENTS_MEAN = torch.tensor(VAE.config.latents_mean).view(
81
  1, VAE.config.z_dim, 1, 1, 1
82
+ ).to(torch.float32)
83
  LATENTS_STD = 1.0 / torch.tensor(VAE.config.latents_std).view(
84
  1, VAE.config.z_dim, 1, 1, 1
85
+ ).to(torch.float32)
86
 
87
  scheduler: FlowMatchEulerDiscreteScheduler = (
88
  FlowMatchEulerDiscreteScheduler.from_config(
 
93
  setattr(TEX_PIPE, "min_noise_level_index", min_noise_level_index)
94
  min_noise_level_timestep = scheduler.timesteps[min_noise_level_index]
95
  setattr(TEX_PIPE, "min_noise_level_timestep", min_noise_level_timestep)
96
+ setattr(TEX_PIPE, "min_noise_level_sigma", min_noise_level_timestep / 1000.)
97
+ return TEX_PIPE.to("cuda")
 
 
98
 
99
  @torch.amp.autocast('cuda', dtype=torch.float32)
100
  def encode_images(
 
107
  :param encode_as_first: Whether to encode all frames as the first frame.
108
  :return: Encoded latents with shape [B, C', F, H/8, W/8].
109
  """
110
+ global VAE, LATENTS_MEAN, LATENTS_STD
111
+ VAE = VAE.to("cuda").requires_grad_(False)
112
+ LATENTS_MEAN = LATENTS_MEAN.to("cuda")
113
+ LATENTS_STD = LATENTS_STD.to("cuda")
114
+
115
  if images.min() < - 0.1:
116
  # images are in [-1, 1] range
117
  images = (images + 1.0) / 2.0 # Normalize to [0, 1] range
 
126
 
127
  return latents
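The Wan VAE's per-channel latent statistics are stored as two broadcastable tensors, with LATENTS_STD kept as the reciprocal of the config's std. A minimal sketch of the resulting normalize/denormalize round trip is below, assuming the usual (z - mean) / std convention that matches the denormalization in the removed decode_images comment above; the toy statistics are random placeholders.

```python
import torch

# Toy per-channel statistics shaped for broadcasting over [B, C, F, H, W]
z_dim = 4
latents_mean = torch.randn(1, z_dim, 1, 1, 1)
latents_std_inv = 1.0 / torch.rand(1, z_dim, 1, 1, 1).clamp(min=0.1)  # mirrors LATENTS_STD = 1 / std

z = torch.randn(2, z_dim, 3, 8, 8)                # raw VAE latents
z_norm = (z - latents_mean) * latents_std_inv     # what the transformer sees
z_back = z_norm / latents_std_inv + latents_mean  # inverse applied before VAE.decode
assert torch.allclose(z, z_back, atol=1e-5)
```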
128
 
 
 
 
 
 
 
 
 
 
 
 
 
 
129
  @torch.amp.autocast('cuda', dtype=torch.float32)
130
  def decode_images(latents: Float[Tensor, "B C F H W"], decode_as_first: bool = False):
131
  """
 
134
  :param decode_as_first: Whether to decode all frames as the first frame.
135
  :return: Decoded images with shape [B, C, F*Nv, H*8, W*8].
136
  """
137
+ global VAE, LATENTS_MEAN, LATENTS_STD
138
+ VAE = VAE.to("cuda").requires_grad_(False)
139
+ LATENTS_MEAN = LATENTS_MEAN.to("cuda")
140
+ LATENTS_STD = LATENTS_STD.to("cuda")
141
+
142
  if decode_as_first:
143
  F = latents.shape[2]
144
  latents = latents.to(VAE.dtype)
 
154
  """
155
  Convert a PIL Image to a tensor. If the image is RGBA, composite it onto a black background using the alpha channel as a mask.
156
  :param image: PIL Image to convert. [0, 255]
157
+ :param device: Target device for the tensor.
158
  :return: Tensor representation of the image. [0.0, 1.0], still [H, W, C]
159
  """
160
  # Convert to RGBA to ensure alpha channel exists
 
165
  # Blend with black background using alpha mask
166
  rgb = rgb * alpha
167
  rgb = rgb.astype(np.float32) / 255.0 # Normalize to [0, 1]
168
+ tensor = torch.from_numpy(rgb)
169
+ if device != "cpu":
170
+ tensor = tensor.to(device)
171
  return tensor
172
 
173
+ @spaces.GPU(duration=90)
 
 
174
  @torch.no_grad
175
+ @torch.inference_mode
176
+ def generate_texture(position_map_path, normal_map_path, position_images_path, normal_images_path, condition_image, text_prompt, selected_view, negative_prompt=None, device="cuda", progress=gr.Progress()):
177
  """
178
  Use SeqTex to generate texture for the mesh based on the image condition.
179
+ :param position_map_path: File path to the saved UV-space position map tensor.
+ :param normal_map_path: File path to the saved UV-space normal map tensor.
+ :param position_images_path: File path to the saved multi-view position images tensor.
180
+ :param normal_images_path: File path to the saved multi-view normal images tensor.
181
  :param condition_image: Image condition generated from the selected view.
182
  :param text_prompt: Text prompt for texture generation.
183
  :param selected_view: The view selected for generating the image condition.
184
+ :return: File path of the generated texture map, PIL images of the texture map and the multi-view frames, and a status message.
185
  """
186
+ position_map = load_tensor_from_file(position_map_path, map_location=device)
187
+ normal_map = load_tensor_from_file(normal_map_path, map_location=device)
188
+ position_images = load_tensor_from_file(position_images_path, map_location=device)
189
+ normal_images = load_tensor_from_file(normal_images_path, map_location=device)
190
+
191
  progress(0, desc="Loading SeqTex pipeline...")
192
+ tex_pipe = get_seqtex_pipe()
193
+ # assert tex_pipe is in gpu
194
+ assert tex_pipe.device.type == "cuda", "SeqTex pipeline must be loaded in GPU context."
195
  progress(0.2, desc="SeqTex pipeline loaded successfully.")
196
  view_id_map = {
197
  "First View": 0,
 
262
  uv_map_pred = torch.clamp(uv_map_pred, 0.0, 1.0)
263
 
264
  progress(1, desc="Texture generated successfully.")
265
+ uv_map_pred_path = save_tensor_to_file(uv_map_pred, prefix="uv_map_pred")
266
+ return uv_map_pred_path, tensor_to_pil(uv_map_pred, normalize=False), tensor_to_pil(mv_out, normalize=False), "Step 3: Texture generated successfully."
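generate_texture now exchanges large tensors through file paths instead of passing CUDA tensors between @spaces.GPU calls. The real save_tensor_to_file / load_tensor_from_file live in utils/file_utils.py, which is not part of this diff, so the sketch below is only one plausible shape for them (temporary .pt files written from CPU), not the Space's actual implementation.

```python
import tempfile
import torch

def save_tensor_to_file(tensor: torch.Tensor, prefix: str = "tensor") -> str:
    """Persist a tensor to a temporary file and return its path, so ZeroGPU
    calls can exchange data without keeping CUDA tensors alive between them."""
    with tempfile.NamedTemporaryFile(prefix=f"{prefix}_", suffix=".pt", delete=False) as f:
        path = f.name
    torch.save(tensor.detach().cpu(), path)
    return path

def load_tensor_from_file(path: str, map_location: str = "cpu") -> torch.Tensor:
    """Reload a tensor saved by save_tensor_to_file onto the requested device."""
    return torch.load(path, map_location=map_location)
```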
wan/pipeline_wan_t2tex_extra.py CHANGED
@@ -1,19 +1,39 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  import copy
2
- from typing import Any, Callable, Dict, List, Optional, Union, Tuple
3
 
4
- from einops import rearrange
5
  import regex as re
6
  import torch
7
- from diffusers.pipelines.wan.pipeline_wan import WanPipeline
8
  from diffusers.pipelines.wan.pipeline_output import WanPipelineOutput
9
- from diffusers.callbacks import PipelineCallback, MultiPipelineCallbacks
10
  from diffusers.utils.torch_utils import randn_tensor
 
 
11
  from torch import Tensor
12
  from transformers import AutoTokenizer, UMT5EncoderModel
13
- from jaxtyping import Float
14
- import gradio as gr
15
 
16
  def get_sigmas(scheduler, timesteps, dtype=torch.float32, device="cuda"):
 
 
 
 
 
17
  sigmas = scheduler.sigmas.to(device=device, dtype=dtype)
18
  schedule_timesteps = scheduler.timesteps.to(device)
19
  timesteps = timesteps.to(device)
@@ -83,6 +103,8 @@ class WanT2TexPipeline(WanPipeline):
83
  negative_prompt_embeds: Optional[torch.Tensor] = None,
84
  output_type: Optional[str] = "np",
85
  return_dict: bool = True,
 
 
86
  attention_kwargs: Optional[Dict[str, Any]] = None,
87
  callback_on_step_end: Optional[
88
  Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks]
@@ -191,7 +213,8 @@ class WanT2TexPipeline(WanPipeline):
191
  self._current_timestep = None
192
  self._interrupt = False
193
 
194
- device = self._execution_device
 
195
 
196
  # 2. Define call parameters
197
  if prompt is not None and isinstance(prompt, str):
 
1
+ # Copyright 2025 The Wan Team and The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
  import copy
16
+ from typing import Any, Callable, Dict, List, Optional, Tuple, Union
17
 
18
+ import gradio as gr
19
  import regex as re
20
  import torch
21
+ from diffusers.callbacks import MultiPipelineCallbacks, PipelineCallback
22
  from diffusers.pipelines.wan.pipeline_output import WanPipelineOutput
23
+ from diffusers.pipelines.wan.pipeline_wan import WanPipeline
24
  from diffusers.utils.torch_utils import randn_tensor
25
+ from einops import rearrange
26
+ from jaxtyping import Float
27
  from torch import Tensor
28
  from transformers import AutoTokenizer, UMT5EncoderModel
29
+
 
30
 
31
  def get_sigmas(scheduler, timesteps, dtype=torch.float32, device="cuda"):
32
+ # Ensure device is available before using it
33
+ if isinstance(device, str) and device.startswith("cuda"):
34
+ if not torch.cuda.is_available():
35
+ device = "cpu"
36
+
37
  sigmas = scheduler.sigmas.to(device=device, dtype=dtype)
38
  schedule_timesteps = scheduler.timesteps.to(device)
39
  timesteps = timesteps.to(device)
 
103
  negative_prompt_embeds: Optional[torch.Tensor] = None,
104
  output_type: Optional[str] = "np",
105
  return_dict: bool = True,
106
+ device: Optional[str] = "cuda",
107
+
108
  attention_kwargs: Optional[Dict[str, Any]] = None,
109
  callback_on_step_end: Optional[
110
  Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks]
 
213
  self._current_timestep = None
214
  self._interrupt = False
215
 
216
+ device = torch.device(device) if isinstance(device, str) else device
217
+ self.to(device)
218
 
219
  # 2. Define call parameters
220
  if prompt is not None and isinstance(prompt, str):
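The added guard in get_sigmas only downgrades the device; the rest of the function is not shown in this hunk and is assumed to follow the usual diffusers pattern of looking up each timestep's index in the schedule and taking the corresponding sigma. A self-contained sketch of that lookup with the same CPU fallback, using a stub scheduler rather than a diffusers object:

```python
from types import SimpleNamespace
import torch

def get_sigmas(scheduler, timesteps, dtype=torch.float32, device="cuda"):
    # Fall back to CPU when CUDA is unavailable, as in the guard added above
    if isinstance(device, str) and device.startswith("cuda") and not torch.cuda.is_available():
        device = "cpu"
    sigmas = scheduler.sigmas.to(device=device, dtype=dtype)
    schedule_timesteps = scheduler.timesteps.to(device)
    timesteps = timesteps.to(device)
    # For each requested timestep, find its index in the schedule and take that sigma
    step_indices = [(schedule_timesteps == t).nonzero().item() for t in timesteps]
    return sigmas[step_indices]

scheduler = SimpleNamespace(
    sigmas=torch.linspace(1.0, 0.0, 5),
    timesteps=torch.tensor([1000, 750, 500, 250, 0]),
)
print(get_sigmas(scheduler, torch.tensor([750, 0])))  # sigmas 0.75 and 0.0
```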
wan/wan_t2tex_transformer_3d_extra.py CHANGED
@@ -13,30 +13,23 @@
13
  # limitations under the License.
14
 
15
  import copy
16
- import math
17
- from typing import Any, Dict, Optional, Tuple, Union
18
  from functools import cache
 
19
 
20
- from einops import rearrange, repeat
21
  import torch
22
  import torch.nn as nn
23
  import torch.nn.functional as F
24
- from diffusers.configuration_utils import ConfigMixin, register_to_config
25
- from diffusers.loaders import FromOriginalModelMixin, PeftAdapterMixin
26
  from diffusers.models import WanTransformer3DModel
27
  from diffusers.models.attention import FeedForward
28
  from diffusers.models.attention_processor import Attention
29
- from diffusers.models.cache_utils import CacheMixin
30
- from diffusers.models.embeddings import (PixArtAlphaTextProjection,
31
- TimestepEmbedding, Timesteps,
32
- get_1d_rotary_pos_embed)
33
- from diffusers.models.modeling_outputs import Transformer2DModelOutput
34
- from diffusers.models.modeling_utils import ModelMixin
35
  from diffusers.models.normalization import FP32LayerNorm
36
  from diffusers.models.transformers.transformer_wan import \
37
  WanTimeTextImageEmbedding
38
  from diffusers.utils import (USE_PEFT_BACKEND, logging, scale_lora_layers,
39
  unscale_lora_layers)
 
 
40
 
41
 
42
  class WanT2TexAttnProcessor2_0:
@@ -228,43 +221,6 @@ class WanRotaryPosEmbed(nn.Module):
228
  uv_freqs = torch.cat([uv_freqs_f, uv_freqs_h, uv_freqs_w], dim=-1).reshape(1, 1, uppf * upph * uppw, -1)
229
  return torch.cat([freqs, uv_freqs], dim=-2)
230
 
231
- # def pseudo_code(freqs, mv_tokens_shape, uv_tokens_shape, dimmension):
232
- # """
233
- # Input:
234
- # freqs: [S, D/2], S is the number of tokens, D is the dimension of tokens, 2 indicates Cos and Sin in original RoPE.
235
- # mv_tokens_shape: (mv_num_frames, mv_height, mv_width)
236
- # uv_tokens_shape: (uv_num_frames, uv_height, uv_width)
237
- # dimension: the dimension of tokens
238
- # Output:
239
- # """
240
- # mpf, mph, mpw = mv_tokens_shape # mv_num_frames, mv_height, mv_width
241
- # upf, uph, upw = uv_tokens_shape # uv_num_frames, uv_height, uv_width
242
-
243
- # # 1. To evenly split the freqs into 3 parts
244
- # freqs = freqs.split_with_sizes(
245
- # [
246
- # dimmension // 2 - 2 * (dimmension // 6),
247
- # dimmension // 6,
248
- # dimmension // 6,
249
- # ],
250
- # dim=1,
251
- # )
252
-
253
- # # 2. In time dimension, the freqs for UV are subsequent to the freqs for MV
254
- # freqs_f = freqs[0][:mpf].view(mpf, 1, 1, -1).expand(mpf, mph, mpw, -1)
255
- # uv_freqs_f = freqs[0][mpf:mpf+upf].view(upf, 1, 1, -1).expand(upf, uph, upw, -1)
256
-
257
- # # 3. The freqs in height and width dimension are the same for mv and uv
258
- # freqs_h = freqs[1][:mph].view(1, mph, 1, -1).expand(mpf, mph, mpw, -1)
259
- # uv_freqs_h = freqs[1][:uph].view(1, uph, 1, -1).expand(upf, uph, upw, -1)
260
- # freqs_w = freqs[2][:mpw].view(1, 1, mpw, -1).expand(mpf, mph, mpw, -1)
261
- # uv_freqs_w = freqs[2][:upw].view(1, 1, upw, -1).expand(upf, uph, upw, -1)
262
-
263
- # # 4. rearrange three 1D RoPEs into 3D RoPE in channel dimension
264
- # mv_rope = torch.cat([freqs_f, freqs_h, freqs_w], dim=-1).reshape(mpf * mph * mpw, -1)
265
- # uv_rope = torch.cat([uv_freqs_f, uv_freqs_h, uv_freqs_w], dim=-1).reshape(upf * uph * upw, -1)
266
- # return torch.cat([mv_rope, uv_rope], dim=-2)
267
-
268
  class WanT2TexTransformerBlock(nn.Module):
269
  def __init__(
270
  self,
@@ -400,71 +356,104 @@ class WanT2TexTransformer3DModel(WanTransformer3DModel):
400
  """
401
  3D Transformer model for T2Tex.
402
  """
403
- def __init__(self, original_model, use_causal_mask=False, addtional_qk_geo=False, randomly_init=False, **kwargs):
404
- super(WanT2TexTransformer3DModel, self).__init__(**original_model.config)
405
- if not randomly_init:
406
- self.load_state_dict(original_model.state_dict(), strict=True)
407
- self.addtional_qk_geo = addtional_qk_geo
408
- if addtional_qk_geo:
409
- raise ValueError("addtional_qk_geo did not work")
410
- warn("addtional_qk_geo is set to True, this will drastically increase the memory usage and slow down the training, without significant performance gain.")
411
-
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
412
  # 1. Patch & position embedding
413
- self.rope = WanRotaryPosEmbed(self.rope.attention_head_dim, self.rope.patch_size, self.rope.max_seq_len, addtional_qk_geo=addtional_qk_geo)
414
- self.use_normal, self.use_position = kwargs.get("use_normal", True), kwargs.get("use_position", True)
415
- if self.use_normal:
416
- self.norm_patch_embedding = copy.deepcopy(self.patch_embedding)
417
- # torch.nn.init.zeros_(self.norm_patch_embedding.weight.data)
418
- # torch.nn.init.zeros_(self.norm_patch_embedding.bias.data)
419
- if self.use_position:
420
- self.pos_patch_embedding = copy.deepcopy(self.patch_embedding)
421
- # torch.nn.init.zeros_(self.pos_patch_embedding.weight.data)
422
- # torch.nn.init.zeros_(self.pos_patch_embedding.bias.data)
423
 
424
  # 2. Condition embeddings
425
- inner_dim = original_model.config.num_attention_heads * original_model.config.attention_head_dim
426
  self.condition_embedder = WanTimeTaskTextImageEmbedding(
427
  original_model=self.condition_embedder,
428
  dim=inner_dim,
429
- time_freq_dim=original_model.config.freq_dim,
430
  time_proj_dim=inner_dim * 6,
431
- text_embed_dim=original_model.config.text_dim,
432
- image_embed_dim=original_model.config.image_dim,
433
- randomly_init=randomly_init,
434
  )
435
 
436
  # 3. Transformer blocks
437
- self.use_causal_mask = use_causal_mask
438
- self.num_attention_heads = original_model.config.num_attention_heads
439
 
440
  block = WanT2TexTransformerBlock(
441
  inner_dim,
442
- original_model.config.ffn_dim,
443
- original_model.config.num_attention_heads,
444
- original_model.config.qk_norm,
445
- original_model.config.cross_attn_norm,
446
- original_model.config.eps,
447
- original_model.config.added_kv_proj_dim,
448
  )
449
  self.blocks = None
450
  self.blocks = nn.ModuleList(
451
  [
452
  copy.deepcopy(block)
453
- for _ in range(original_model.config.num_layers)
454
  ]
455
  )
456
  self.scale_shift_table_uv = nn.Parameter(torch.randn(1, 2, inner_dim) / inner_dim**0.5)
457
- if not randomly_init:
458
- self.scale_shift_table_uv.data.copy_(self.scale_shift_table.data)
459
- self.blocks.load_state_dict(original_model.blocks.state_dict(), strict=False)
460
- for block in self.blocks:
461
- block.attnuv.load_state_dict(block.attn1.state_dict())
462
- block.scale_shift_table_uv.data.copy_(block.scale_shift_table.data)
463
- block.normuv2.load_state_dict(block.norm2.state_dict())
464
- block.ffnuv.load_state_dict(block.ffn.state_dict())
465
 
466
- # 4. Output norm & projection
467
- pass
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
468
 
469
  @cache
470
  def get_attention_bias(self, mv_length, uv_length):
@@ -521,23 +510,10 @@ class WanT2TexTransformer3DModel(WanTransformer3DModel):
521
  rotary_emb = self.rope(mv_hidden_states, uv_hidden_states)
522
 
523
  # Patchify
524
- if self.use_normal and self.use_position:
525
- mv_rgb_hidden_states, mv_pos_hidden_states, mv_norm_hidden_states = torch.chunk(mv_hidden_states, 3, dim=1)
526
- uv_rgb_hidden_states, uv_pos_hidden_states, uv_norm_hidden_states = torch.chunk(uv_hidden_states, 3, dim=1)
527
- mv_geometry_embedding = self.pos_patch_embedding(mv_pos_hidden_states) + self.norm_patch_embedding(mv_norm_hidden_states)
528
- uv_geometry_embedding = self.pos_patch_embedding(uv_pos_hidden_states) + self.norm_patch_embedding(uv_norm_hidden_states)
529
- elif self.use_normal:
530
- mv_rgb_hidden_states, mv_norm_hidden_states = torch.chunk(mv_hidden_states, 2, dim=1)
531
- uv_rgb_hidden_states, uv_norm_hidden_states = torch.chunk(uv_hidden_states, 2, dim=1)
532
- mv_geometry_embedding = self.norm_patch_embedding(mv_norm_hidden_states)
533
- uv_geometry_embedding = self.norm_patch_embedding(uv_norm_hidden_states)
534
- elif self.use_position:
535
- mv_rgb_hidden_states, mv_pos_hidden_states = torch.chunk(mv_hidden_states, 2, dim=1)
536
- uv_rgb_hidden_states, uv_pos_hidden_states = torch.chunk(uv_hidden_states, 2, dim=1)
537
- mv_geometry_embedding = self.pos_patch_embedding(mv_pos_hidden_states)
538
- uv_geometry_embedding = self.pos_patch_embedding(uv_pos_hidden_states)
539
- else:
540
- raise ValueError("use_normal and use_position are both False, please set at least one of them to True.")
541
 
542
  mv_hidden_states = self.patch_embedding(mv_rgb_hidden_states)
543
  uv_hidden_states = self.patch_embedding(uv_rgb_hidden_states)
@@ -564,12 +540,6 @@ class WanT2TexTransformer3DModel(WanTransformer3DModel):
564
  if encoder_hidden_states_image is not None:
565
  encoder_hidden_states = torch.concat([encoder_hidden_states_image, encoder_hidden_states], dim=1)
566
 
567
- # # Get attention bias
568
- # if self.use_causal_mask:
569
- # # This may be gainless, because the patch embedding is not causal, which will leak information to MV
570
- # attn_bias = self.get_attention_bias(post_patch_num_frames * post_patch_height * post_patch_width,
571
- # post_uv_num_frames * post_uv_height * post_uv_width)
572
- # else:
573
  attn_bias = None
574
 
575
  # 4. Transformer blocks
 
13
  # limitations under the License.
14
 
15
  import copy
 
 
16
  from functools import cache
17
+ from typing import Any, Dict, Optional, Tuple, Union
18
 
 
19
  import torch
20
  import torch.nn as nn
21
  import torch.nn.functional as F
 
 
22
  from diffusers.models import WanTransformer3DModel
23
  from diffusers.models.attention import FeedForward
24
  from diffusers.models.attention_processor import Attention
25
+ from diffusers.models.embeddings import get_1d_rotary_pos_embed
 
 
 
 
 
26
  from diffusers.models.normalization import FP32LayerNorm
27
  from diffusers.models.transformers.transformer_wan import \
28
  WanTimeTextImageEmbedding
29
  from diffusers.utils import (USE_PEFT_BACKEND, logging, scale_lora_layers,
30
  unscale_lora_layers)
31
+ from einops import rearrange, repeat
32
+ from peft import LoraConfig
33
 
34
 
35
  class WanT2TexAttnProcessor2_0:
 
221
  uv_freqs = torch.cat([uv_freqs_f, uv_freqs_h, uv_freqs_w], dim=-1).reshape(1, 1, uppf * upph * uppw, -1)
222
  return torch.cat([freqs, uv_freqs], dim=-2)
223
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
224
  class WanT2TexTransformerBlock(nn.Module):
225
  def __init__(
226
  self,
 
356
  """
357
  3D Transformer model for T2Tex.
358
  """
359
+ def __init__(self,
360
+ patch_size: Tuple[int] = (1, 2, 2),
361
+ num_attention_heads: int = 40,
362
+ attention_head_dim: int = 128,
363
+ in_channels: int = 16,
364
+ out_channels: int = 16,
365
+ text_dim: int = 4096,
366
+ freq_dim: int = 256,
367
+ ffn_dim: int = 13824,
368
+ num_layers: int = 40,
369
+ cross_attn_norm: bool = True,
370
+ qk_norm: Optional[str] = "rms_norm_across_heads",
371
+ eps: float = 1e-6,
372
+ image_dim: Optional[int] = None,
373
+ added_kv_proj_dim: Optional[int] = None,
374
+ rope_max_seq_len: int = 1024,
375
+ **kwargs
376
+ ):
377
+ super(WanT2TexTransformer3DModel, self).__init__(
378
+ patch_size=patch_size,
379
+ num_attention_heads=num_attention_heads,
380
+ attention_head_dim=attention_head_dim,
381
+ in_channels=in_channels,
382
+ out_channels=out_channels,
383
+ text_dim=text_dim,
384
+ freq_dim=freq_dim,
385
+ ffn_dim=ffn_dim,
386
+ num_layers=num_layers,
387
+ cross_attn_norm=cross_attn_norm,
388
+ qk_norm=qk_norm,
389
+ eps=eps,
390
+ image_dim=image_dim,
391
+ added_kv_proj_dim=added_kv_proj_dim,
392
+ rope_max_seq_len=rope_max_seq_len
393
+ )
394
  # 1. Patch & position embedding
395
+ self.rope = WanRotaryPosEmbed(self.rope.attention_head_dim, self.rope.patch_size, self.rope.max_seq_len)
396
+ self.norm_patch_embedding = copy.deepcopy(self.patch_embedding)
397
+ self.pos_patch_embedding = copy.deepcopy(self.patch_embedding)
 
 
 
 
 
 
 
398
 
399
  # 2. Condition embeddings
400
+ inner_dim = num_attention_heads * attention_head_dim
401
  self.condition_embedder = WanTimeTaskTextImageEmbedding(
402
  original_model=self.condition_embedder,
403
  dim=inner_dim,
404
+ time_freq_dim=freq_dim,
405
  time_proj_dim=inner_dim * 6,
406
+ text_embed_dim=text_dim,
407
+ image_embed_dim=image_dim,
 
408
  )
409
 
410
  # 3. Transformer blocks
411
+ self.num_attention_heads = num_attention_heads
 
412
 
413
  block = WanT2TexTransformerBlock(
414
  inner_dim,
415
+ ffn_dim,
416
+ num_attention_heads,
417
+ qk_norm,
418
+ cross_attn_norm,
419
+ eps,
420
+ added_kv_proj_dim,
421
  )
422
  self.blocks = None
423
  self.blocks = nn.ModuleList(
424
  [
425
  copy.deepcopy(block)
426
+ for _ in range(num_layers)
427
  ]
428
  )
429
  self.scale_shift_table_uv = nn.Parameter(torch.randn(1, 2, inner_dim) / inner_dim**0.5)
430
+
431
+ # 4. Auto-configure LoRA adapter for SeqTex
432
+ self.configure_lora_adapter()
 
 
 
 
 
433
 
434
+ def configure_lora_adapter(self, lora_rank: int = 128, lora_alpha: int = 64):
435
+ """
436
+ Configure the LoRA adapter for the SeqTex transformer, using the defaults or custom settings.
437
+
438
+ Args:
439
+ lora_rank (int, optional): LoRA rank. Defaults to 128.
440
+ lora_alpha (int, optional): LoRA alpha (scaling) parameter. Defaults to 64.
441
+ """
442
+ # Apply LoRA to the self-attention and FFN projections
443
+ target_modules = [
444
+ "attn1.to_q", "attn1.to_k", "attn1.to_v",
445
+ "attn1.to_out.0", "attn1.to_out.2",
446
+ "ffn.net.0.proj", "ffn.net.2"
447
+ ]
448
+
449
+ lora_config = LoraConfig(
450
+ r=lora_rank,
451
+ lora_alpha=lora_alpha,
452
+ init_lora_weights=True,
453
+ target_modules=target_modules,
454
+ )
455
+
456
+ self.add_adapter(lora_config)
457
 
458
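configure_lora_adapter relies on PEFT's suffix matching of target_modules to pick out the self-attention and FFN projections. The toy example below illustrates that matching with PEFT's low-level inject_adapter_in_model helper on a made-up module; the module names are chosen only to mirror the attn1.to_q/to_k/to_v suffixes, while the real model also targets the FFN projections and attaches the adapter via add_adapter from diffusers' PeftAdapterMixin.

```python
import torch.nn as nn
from peft import LoraConfig, inject_adapter_in_model

class ToyAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

class ToyBlock(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.attn1 = ToyAttention(dim)   # names chosen so the suffixes below match
        self.proj = nn.Linear(dim, dim)  # not targeted, stays a plain nn.Linear

config = LoraConfig(
    r=8,
    lora_alpha=4,
    init_lora_weights=True,
    target_modules=["attn1.to_q", "attn1.to_k", "attn1.to_v"],  # suffix-matched, as in the SeqTex config
)
model = inject_adapter_in_model(config, ToyBlock())
lora_params = [name for name, _ in model.named_parameters() if "lora_" in name]
print(len(lora_params))  # the injected LoRA A/B matrices show up as new parameters
```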
  @cache
459
  def get_attention_bias(self, mv_length, uv_length):
 
510
  rotary_emb = self.rope(mv_hidden_states, uv_hidden_states)
511
 
512
  # Patchify
513
+ mv_rgb_hidden_states, mv_pos_hidden_states, mv_norm_hidden_states = torch.chunk(mv_hidden_states, 3, dim=1)
514
+ uv_rgb_hidden_states, uv_pos_hidden_states, uv_norm_hidden_states = torch.chunk(uv_hidden_states, 3, dim=1)
515
+ mv_geometry_embedding = self.pos_patch_embedding(mv_pos_hidden_states) + self.norm_patch_embedding(mv_norm_hidden_states)
516
+ uv_geometry_embedding = self.pos_patch_embedding(uv_pos_hidden_states) + self.norm_patch_embedding(uv_norm_hidden_states)
 
 
 
 
 
 
 
 
 
 
 
 
 
517
 
518
  mv_hidden_states = self.patch_embedding(mv_rgb_hidden_states)
519
  uv_hidden_states = self.patch_embedding(uv_rgb_hidden_states)
 
540
  if encoder_hidden_states_image is not None:
541
  encoder_hidden_states = torch.concat([encoder_hidden_states_image, encoder_hidden_states], dim=1)
542
 
 
 
 
 
 
 
543
  attn_bias = None
544
 
545
  # 4. Transformer blocks