StyleTex: Style Image-Guided Texture Generation for 3D Models

1State Key Lab of CAD&CG, Zhejiang University    2Zhejiang University
*The first two authors contributed equally.    Corresponding authors.   

StyleTex takes an untextured mesh, a single reference image, and a text prompt describing the mesh and the desired style as inputs, and generates a stylized texture.

Abstract

Given a reference style image and a 3D mesh with its text description, style-guided texture generation aims to produce a texture that harmonizes with both the style of the reference image and the geometry of the input mesh. Although diffusion-based 3D texture generation methods, such as distillation sampling, have numerous promising applications in stylized games and films, they must address two challenges: 1) completely decoupling style and content from the reference image for 3D models, and 2) aligning the generated texture with the color tone and style of the reference image as well as with the given text prompt.

To this end, we introduce StyleTex, an innovative diffusion-model-based framework for creating stylized textures for 3D models. Our key insight is to decouple style information from the reference image while disregarding its content in diffusion-based distillation sampling. Specifically, given a reference image, we first extract its style feature from the image CLIP embedding by subtracting the embedding's orthogonal projection onto the direction of the content feature, which is represented by a text CLIP embedding. This disentanglement of the reference image's style and content yields distinct style and content features. We then inject the style feature into the cross-attention mechanism to incorporate it into the generation process, while utilizing the content feature as a negative prompt to further suppress content information. Finally, we incorporate these strategies into StyleTex to obtain stylized textures. We utilize Interval Score Matching to address over-smoothing and over-saturation, in combination with a geometry-aware ControlNet that ensures consistent geometry throughout the generative process.

The resulting textures generated by StyleTex retain the style of the reference image, while also aligning with the text prompts and intrinsic details of the given 3D mesh. Quantitative and qualitative experiments show that our method outperforms existing baseline methods by a significant margin.
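As a minimal sketch of the orthogonal decoupling step described above (function and variable names are illustrative and not taken from the released code), the content-free style feature can be obtained by subtracting the image embedding's projection onto the content direction:

```python
import torch
import torch.nn.functional as F

def decouple_style(image_embed: torch.Tensor, content_embed: torch.Tensor) -> torch.Tensor:
    """Return a style feature with the content component removed.

    image_embed:   CLIP embedding of the reference image, shape (d,).
    content_embed: CLIP text embedding describing the content, shape (d,).
    """
    content_dir = F.normalize(content_embed, dim=-1)        # unit vector along the content direction
    projection = (image_embed @ content_dir) * content_dir  # component of the image embedding along content
    return image_embed - projection                         # content-orthogonal style feature
```

By construction, the remaining vector is orthogonal to the content direction, so injecting it through cross-attention conveys style without re-introducing the reference image's content.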

Video

Pipeline

StyleTex's inputs are a reference style image $I_{ref}$, a text prompt $y$, and an untextured 3D mesh $\mathcal{M}$. During training, we use our ODCR method to extract a content-unrelated style feature $f_s^{ref}$ from the reference image. The style feature and text embeddings are fed into the U-Net to guide the optimization of the texture field. During inference, texture maps can be sampled from the texture field and directly employed in downstream game or film production, enabling the creation of stylized digital environments.
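Below is a hedged sketch of what one optimization step of the texture field might look like under this pipeline. All callables (`render`, `encode`, `add_noise`, `denoise`), embedding shapes, and parameter names are placeholders standing in for the renderer, VAE encoder, noise scheduler, and geometry-aware ControlNet U-Net, not the actual StyleTex implementation:

```python
import torch

def texture_optimization_step(texture_field, render, encode, add_noise, denoise,
                              style_feature, text_embed, content_embed,
                              camera, optimizer, cfg_scale=7.5):
    """One illustrative distillation step of the texture field."""
    rgb = render(texture_field, camera)              # render the current texture field
    latents = encode(rgb)                            # image -> latent space
    t = torch.randint(20, 981, (1,))                 # sample a diffusion timestep
    noisy, noise = add_noise(latents, t)             # perturb latents, keep the added noise

    # Positive branch: text prompt plus the content-free style feature (cross-attention).
    eps_pos = denoise(noisy, t, cond=torch.cat([text_embed, style_feature], dim=1))
    # Negative branch: the content feature is used as a negative prompt.
    eps_neg = denoise(noisy, t, cond=content_embed)
    eps = eps_neg + cfg_scale * (eps_pos - eps_neg)  # classifier-free guidance

    # Simplified score-distillation gradient; the paper instead uses Interval
    # Score Matching to reduce over-smoothing and over-saturation.
    grad = (eps - noise).detach()
    loss = (grad * latents).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```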

Comparison

More results

Re-rendering the input video

Using the textures generated by StyleTex, you can create a variety of imaginative stylized scenes in a rendering engine.

BibTeX


@article{10.1145/3687931,
  author = {Xie, Zhiyu and Zhang, Yuqing and Tang, Xiangjun and Wu, Yiqian and Chen, Dehan and Li, Gongsheng and Jin, Xiaogang},
  title = {StyleTex: Style Image-Guided Texture Generation for 3D Models},
  year = {2024},
  issue_date = {December 2024},
  publisher = {Association for Computing Machinery},
  volume = {43},
  number = {6},
  issn = {0730-0301},
  url = {https://doi.org/10.1145/3687931},
  doi = {10.1145/3687931},
  journal = {ACM Trans. Graph.},
  month = nov,
  articleno = {212},
  numpages = {14}
}