Multimodal tasks have started to play a significant role in artificial intelligence research. One prominent example is visual–linguistic tasks, such as visual question answering. The progress of modern machine learning systems is determined, among other things, by the data on which these systems are trained. Most modern visual question answering data sets contain a limited range of question types, which can be answered either by directly accessing the image itself or by using external data. At the same time, insufficient attention is paid to social interactions between people, which limits the scope of visual question answering systems. In this paper, we propose criteria, based on psychological research, for selecting images that are suitable for composing social interaction visual question answering questions. We believe these criteria should support the progress of visual question answering systems.
Download PDF from Oxford Academic (registration required): https://academic.oup.com/jigpal/advance-article-abstract/doi/10.1093/jigpal/jzae026/7632103
Anfisa A Chuganskaya, Alexey K Kovalev, Aleksandr I Panov. Sign-based image criteria for social interaction visual question answering // Logic Journal of the IGPL, jzae026, 2024.