Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance
arXiv:2406.03628v3 Announce Type: replace Abstract: Imbalanced classification and spurious correlation are common challenges in data science and machine learning. Both issues are linked to data imbalance, with certain groups of data samples significantly underrepresented, which in turn would compromise the accuracy, robustness and generalizability of the learned models. Recent advances have proposed leveraging the flexibility and generative capabilities of large language models (LLMs), typically built on transformer architectures, to generate synthetic samples and to augment the observed data. […]