Foundation Cures Personalization:
Improving Personalized Models' Prompt Consistency via Hidden Foundation Knowledge
Current facial personalization models often lack precise control over facial attributes. This is because identity embeddings can negatively affect cross-attention layers and undermine the normal function of other tokens. To address this limitation, we propose FreeCure, a training-free framework that effectively resolves the issue. FreeCure can be easily integrated into diverse baselines (including those based on Stable Diffusion and FLUX), enabling robust manipulation of various facial attributes while preserving strong identity fidelity.
We analyze the challenges current methods face in achieving precise control over facial attributes. The key issue lies in how personalization models process identity information: their cross-attention layers focus too heavily on identity-related tokens while undermining those corresponding to facial attributes (e.g., hairstyle, expression; see the left part). Consequently, these attributes become difficult to manipulate. However, these cross-attention layers (or adapters) are critical for preserving identity fidelity in the generated output, which makes them resistant to modification: adjusting the cross-attention maps frequently leads to the loss of essential identity characteristics (see the right part).
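The sketch below illustrates one way such an imbalance could be probed: compute a cross-attention map from image queries and text keys and compare how much mass falls on identity-related tokens versus attribute tokens. The shapes, module layout, and token indices are assumptions for illustration, not the paper's diagnostic code.

```python
# Minimal probe (assumed shapes) of cross-attention mass on selected text tokens.
import torch

def attention_mass(q, k, token_indices, num_heads=8):
    """q: (B, N_img, C) image queries; k: (B, N_txt, C) projected text keys."""
    b, n_img, c = q.shape
    d = c // num_heads
    qh = q.reshape(b, n_img, num_heads, d).transpose(1, 2)        # (B, H, N_img, d)
    kh = k.reshape(b, k.shape[1], num_heads, d).transpose(1, 2)   # (B, H, N_txt, d)
    attn = torch.softmax(qh @ kh.transpose(-1, -2) / d ** 0.5, dim=-1)
    # Average attention each image token pays to the selected text tokens.
    return attn[..., token_indices].mean().item()

# Hypothetical usage: compare identity-token mass against attribute-token mass.
q = torch.randn(1, 4096, 320)   # latent queries from one cross-attention layer
k = torch.randn(1, 77, 320)     # text-token keys
id_mass   = attention_mass(q, k, token_indices=[5])      # e.g., identity token
attr_mass = attention_mass(q, k, token_indices=[7, 8])   # e.g., "curly hair"
print(f"identity mass {id_mass:.4f} vs attribute mass {attr_mass:.4f}")
```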
While keeping cross-attention modules intact, we propose a novel foundation-aware self-attention (FASA), which enables attributes with high prompt consistency from the foundation model to replace their ill-aligned counterparts during personalized generation. To keep the identity unharmed, this strategy also leverages semantic segmentation models to generate scaling masks for these attributes, so that the replacement happens in a highly localized and harmonious manner. Furthermore, we use a simple but effective approach called asymmetric prompt guidance (APG) to restore abstract attributes such as expression.
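A minimal sketch of a FASA-style masked blend is given below, assuming two parallel denoising branches (a foundation branch, FD, and a personalized branch, PD) expose their self-attention outputs at matching layers and timesteps, and that a face-parsing mask for the target attribute is available. Function and variable names (blend_fasa, attr_mask, alpha) are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def blend_fasa(pd_feat, fd_feat, attr_mask, alpha=1.0):
    """
    pd_feat, fd_feat: (B, N, C) self-attention outputs from the personalized
                      and foundation branches at the same layer and timestep.
    attr_mask:        (B, 1, H, W) binary mask of the target attribute
                      (e.g., hair region from a face-parsing model).
    Returns PD features with the masked region replaced by FD features.
    """
    b, n, c = pd_feat.shape
    h = w = int(n ** 0.5)                                   # assume a square token grid
    m = F.interpolate(attr_mask.float(), size=(h, w), mode="nearest")
    m = m.reshape(b, 1, n).transpose(1, 2)                  # (B, N, 1)
    return pd_feat * (1 - alpha * m) + fd_feat * (alpha * m)

# Toy usage: tokens inside the "hair" mask take the foundation branch's
# features; everything else, including identity-bearing regions, is untouched.
pd = torch.randn(1, 1024, 640)
fd = torch.randn(1, 1024, 640)
mask = torch.zeros(1, 1, 64, 64); mask[:, :, :20, :] = 1
out = blend_fasa(pd, fd, mask)
```

Because the blend is confined to the mask and leaves cross-attention untouched, the identity-preserving pathway of the personalization model is not modified.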
We validate that FreeCure's subsequent enhancement processes do not impact attributes that have already been enhanced. The figure visualizes intermediate results at different stages of FreeCure. When APG is applied in the later stages, attributes previously enhanced through FASA remain unaffected. This illustrates that FreeCure's improvement across multiple attributes is robust and consistent.
We observe that, with identical initial noise, the foundation (FD) and personalized (PD) denoising processes of all baselines generate faces with similar attribute locations. It is important, however, to validate that FASA remains robust without this condition. We therefore relax it and regenerate faces with different initial noises. The figure shows that results under the two settings are comparable, which confirms that FASA can effectively enhance the results of PD even when its FD counterpart produces faces with different spatial structures.
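For clarity, the shared-versus-independent-noise setup can be sketched as follows. The foundation_step and personalized_step callables are placeholders for the respective denoising networks; this is a control-flow sketch under those assumptions, not the released pipeline.

```python
import torch

def run_parallel_denoise(foundation_step, personalized_step, steps=30,
                         shape=(1, 4, 64, 64), shared_noise=True, seed=0):
    g = torch.Generator().manual_seed(seed)
    z_fd = torch.randn(shape, generator=g)
    z_pd = z_fd.clone() if shared_noise else torch.randn(shape, generator=g)
    for t in reversed(range(steps)):
        z_fd = foundation_step(z_fd, t)      # plain text-to-image branch
        z_pd = personalized_step(z_pd, t)    # identity-conditioned branch
        # A FASA-style blend would hook in here, reading features from the FD pass.
    return z_fd, z_pd

# Toy usage with identity "denoisers", just to show the control flow of the
# relaxed setting where the two branches start from different noises.
identity = lambda z, t: z
fd_out, pd_out = run_parallel_denoise(identity, identity, shared_noise=False)
```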
FASA effectively captures and faithfully transfers information about attributes localized in small areas (e.g., eyes, pearl earrings). Furthermore, in regions unrelated to the target attributes, FASA maintains strong alignment with the original PD attention maps, demonstrating that it preserves the core functionality of the personalization model.
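One simple way such alignment could be quantified is to compare the PD attention maps with and without FASA, separately inside and outside the attribute mask, e.g., via cosine similarity. The tensors below are dummies and the helper name is hypothetical; this is a measurement sketch, not the paper's evaluation code.

```python
import torch
import torch.nn.functional as F

def masked_cosine(attn_a, attn_b, mask):
    """attn_a, attn_b: (N,) flattened attention maps; mask: (N,) in {0, 1}."""
    inside  = F.cosine_similarity(attn_a[mask > 0], attn_b[mask > 0], dim=0)
    outside = F.cosine_similarity(attn_a[mask == 0], attn_b[mask == 0], dim=0)
    return inside.item(), outside.item()

n = 4096
attn_orig, attn_fasa = torch.rand(n), torch.rand(n)   # dummy attention maps
mask = (torch.arange(n) < 800).float()                # fake attribute region
print(masked_cosine(attn_orig, attn_fasa, mask))      # high "outside" similarity
```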