ViscoNet: Bridging and Harmonizing Visual and Textual Conditioning for ControlNet

University of Surrey
December, 2023
TL;DR: We add visual prompting to ControlNet. By adjusting the control strength at different spatial resolutions, our method harmonizes visual and text prompts, avoids the mode collapse suffered by ControlNet and T2I-Adapter, and supports a variety of image tasks. Everything is trained on a small dataset with a single GPU.

Abstract

This paper introduces ViscoNet, a novel method that enhances text-to-image human generation models with visual prompting. Unlike existing methods that rely on lengthy text descriptions to control the image structure, ViscoNet allows users to specify the visual appearance of the target object with a reference image. ViscoNet disentangles the object’s appearance from the image background and injects it into a pre-trained latent diffusion model (LDM) via a ControlNet branch. In this way, ViscoNet mitigates the style mode collapse problem and enables precise and flexible visual control. We demonstrate the effectiveness of ViscoNet on human image generation, where it can manipulate visual attributes and artistic styles by adjusting the control strength of the visual prompt at different spatial resolutions. We also show that ViscoNet can learn visual conditioning from small and specific object domains while preserving the generative power of the LDM backbone.

Method

We replace the text embedding in ControlNet with an image embedding. This severs the entanglement between ControlNet and the backbone LDM (Stable Diffusion). We apply human masking to the control signals so that the LDM does not overfit to the blank backgrounds of the small training dataset.
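
The masking step can be illustrated with a minimal sketch. It assumes the control signal is a plain image and the human mask is a binary map of the same size; the function name `mask_control_image` and the nested-list image layout are hypothetical, chosen only for illustration, and the actual ViscoNet implementation may apply the mask differently.

```python
def mask_control_image(image, mask):
    """Zero out every pixel that falls outside the human mask.

    image: H x W nested list of (r, g, b) tuples (hypothetical layout).
    mask:  H x W nested list of 0/1 flags, 1 = human, 0 = background.
    """
    same_height = len(image) == len(mask)
    same_width = all(len(r) == len(m) for r, m in zip(image, mask))
    if not (same_height and same_width):
        raise ValueError("image and mask must have the same height and width")
    return [
        [pixel if keep else (0, 0, 0) for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Toy 2 x 2 example: only the pixels on the diagonal belong to the person.
image = [[(10, 20, 30), (40, 50, 60)],
         [(70, 80, 90), (15, 25, 35)]]
mask = [[1, 0],
        [0, 1]]
masked = mask_control_image(image, mask)
```

With the background zeroed in the control signal, the control branch receives no information about the blank studio backdrop, which is the intuition behind using masking to limit overfitting.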

Latent Space Interpolation

Fixing the visual prompt and changing the age in the text prompt.

DeepFakes, Virtual Try-on, Pose Transfer

Stylization

We avoid mode collapse by reducing the visual control signal strength. For some challenging image styles, we also remove the face visual prompt.
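
One way to picture a reduced control strength is as a scalar multiplier on the control branch's outputs before they are added to the backbone. The sketch below assumes exactly that; `scale_control` and the flat-list representation of each residual are hypothetical stand-ins, not the authors' code.

```python
def scale_control(residuals, strength):
    """Multiply every control residual by one scalar strength in [0, 1].

    residuals: list of flat lists of floats, one per control output
               (a hypothetical stand-in for feature tensors).
    strength:  1.0 keeps full visual control, 0.0 disables it.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    return [[value * strength for value in residual] for residual in residuals]


residuals = [[2.0, -4.0], [1.0, 3.0]]
weakened = scale_control(residuals, 0.5)
```

A lower strength leaves more room for the text prompt to steer the style, while a strength near 1.0 keeps the output close to the reference appearance.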

Barbie stylization

Texture Transfer

We achieve texture transfer by only applying visual control signals at high spatial resolutions.
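
A rough sketch of resolution-selective control follows. It assumes each control output is tagged with the spatial resolution of its feature map, and that outputs below a chosen cutoff are simply zeroed. The name `gate_by_resolution`, the example resolutions, and the cutoff value are all assumptions made for illustration.

```python
def gate_by_resolution(residuals, min_resolution):
    """Keep control residuals at or above min_resolution and zero the rest.

    residuals:      list of (resolution, values) pairs, where values is a
                    flat list of floats (hypothetical stand-in for tensors).
    min_resolution: smallest feature-map size that still receives control.
    """
    return [
        (res, list(values) if res >= min_resolution else [0.0] * len(values))
        for res, values in residuals
    ]


# Assumed feature-map sizes, from fine (64) to coarse (8).
residuals = [(64, [0.5, 1.5]), (32, [2.0, 1.0]), (16, [3.0]), (8, [4.0])]
texture_only = gate_by_resolution(residuals, min_resolution=32)
```

The intuition is that fine-resolution features tend to carry local detail such as fabric texture, while coarse-resolution features carry overall shape, so dropping the coarse control lets the text and pose decide the structure.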

Demo App

BibTeX


@inproceedings{cheong2023visconet,
  author    = {Cheong, Soon Yau and Mustafa, Armin and Gilbert, Andrew},
  title     = {ViscoNet: Bridging and Harmonizing Visual and Textual Conditioning for ControlNet},
  booktitle = {ECCV Workshop Proceedings},
  month     = {September},
  year      = {2024}
}