Towards Equitable Representation in Text-to-Image Generation
Abstract
Accurate representation of diverse cultures in media not only enhances the well-being of global audiences but also fosters a deeper understanding and appreciation of cultural diversity. Generative image models, particularly those trained on large, web-crawled datasets such as LAION, often perpetuate harmful stereotypes and misrepresentations because of biases in their training data. In this work, we propose an approach for improving inclusivity and cultural representation in generated images. Our methodology has two components: first, the collection of a culturally diverse and representative dataset, the Cross-Cultural Understanding Benchmark (CCUB); and second, a novel fine-tuning technique, Self-Contrastive Fine-Tuning (SCoFT). SCoFT uses the model's existing biases as a signal for self-improvement: it mitigates the risk of overfitting on small datasets, encodes only the most salient, high-level information from the data, and steers the generative process away from pre-existing biases and towards more accurate cultural representations. In our empirical studies, 51 participants from five countries evaluated the cultural relevance and presence of stereotypes in images generated by our method and by a baseline, and the results show significant improvements: images generated after fine-tuning on the CCUB dataset were consistently rated as more culturally relevant and less stereotypical. These findings underscore the potential of our approach to enable more equitable and accurate representation of cultures in generative media.
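To make the self-contrastive idea concrete, the sketch below illustrates one possible training step in which a frozen copy of the pretrained denoiser serves as the "negative" that encodes the model's pre-existing biases, while the fine-tuned copy is pulled towards CCUB data and pushed away from that frozen prediction. This is a minimal illustration of the general principle, not the thesis' exact objective; the names `unet_ft`, `unet_frozen`, `scoft_step`, and the weight `lambda_contrast` are hypothetical, and a diffusers-style UNet interface (`.sample` output) is assumed.

```python
# Minimal sketch of a self-contrastive fine-tuning step for a latent
# diffusion model. All names and the exact loss form are illustrative
# assumptions, not the thesis' actual training code.
import torch
import torch.nn.functional as F

def scoft_step(unet_ft, unet_frozen, noisy_latents, timesteps, text_emb,
               target_noise, lambda_contrast=0.1):
    """One training step: fit the fine-tuned denoiser to CCUB data while
    pushing its predictions away from the frozen pretrained copy, which
    acts as a reference for the model's own biases."""
    # Prediction from the model being fine-tuned on CCUB.
    pred_ft = unet_ft(noisy_latents, timesteps,
                      encoder_hidden_states=text_emb).sample

    # Prediction from the frozen pretrained copy (the bias reference).
    with torch.no_grad():
        pred_frozen = unet_frozen(noisy_latents, timesteps,
                                  encoder_hidden_states=text_emb).sample

    # Standard denoising loss: match the noise actually added to CCUB images.
    loss_pos = F.mse_loss(pred_ft, target_noise)

    # Self-contrastive term: penalize agreement with the pretrained model,
    # steering generations away from its pre-existing biases.
    loss_neg = -F.mse_loss(pred_ft, pred_frozen)

    return loss_pos + lambda_contrast * loss_neg
```

Using the pretrained model itself as the negative means no extra annotation is needed beyond the small curated dataset, which is why a self-contrastive objective can remain effective even when the fine-tuning set is small.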
BibTeX
@mastersthesis{Liu-2024-140690,
author = {Zhixuan Liu},
title = {Towards Equitable Representation in Text-to-Image Generation},
year = {2024},
month = {May},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-24-14},
keywords = {computer vision for social good, text-to-image, fairness},
}