Fine-Tuning Multimodal Transformer Models for Generating Actions in Virtual and Real Environments

Authors

Panov A. I., Kovalev A. K., Staroverov A. V., Gorodetsky A. S.

Abstract

In this work, we propose and investigate an original approach to controlling a robotic agent in a language-instructed object manipulation task with a pre-trained multimodal transformer of specialized architecture, which we refer to as RozumFormer. Our model is based on a bimodal (text-image) transformer architecture originally trained for tasks that use one or both modalities, such as language modeling, visual question answering, image captioning, text recognition, and text-to-image generation. We adapted the model to robotic manipulation by organizing the input token sequence in a particular way, as a sequence of text, image, and action tokens. We demonstrate that such a model adapts well to new tasks and that fine-tuning yields better results than training from scratch, both in simulation and in real environments. To transfer the model from the simulator to a real robot, we collected and annotated new datasets. In addition, experiments on controlling the agent in a visual environment with reinforcement learning showed that fine-tuning on a mixed dataset that includes examples from the original visual-linguistic tasks only slightly decreases performance on those tasks, which simplifies adding tasks from another domain.
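As a rough illustration of this token layout, the sketch below shows one way text, image, and action tokens could be mapped into a single shared id space so that an autoregressive transformer can consume them as one sequence. The vocabulary sizes, the offset scheme, and the [instruction | observation | actions] ordering are all illustrative assumptions, not the authors' actual RozumFormer implementation; see the paper and the repository for the real details.

```python
# A minimal sketch, NOT the authors' implementation: each modality is
# offset into its own id range, then the parts are concatenated into one
# token sequence for a decoder transformer. All sizes are assumptions.
import torch

TEXT_VOCAB = 30_000   # assumed text tokenizer size
IMAGE_VOCAB = 8_192   # assumed discrete image-token codebook size
ACTION_VOCAB = 256    # assumed number of discretized action bins


def build_sequence(text_ids: torch.Tensor,
                   image_ids: torch.Tensor,
                   action_ids: torch.Tensor) -> torch.Tensor:
    """Offset image and action ids past the preceding vocabularies
    and concatenate: [instruction | observation | actions]."""
    image_part = image_ids + TEXT_VOCAB
    action_part = action_ids + TEXT_VOCAB + IMAGE_VOCAB
    return torch.cat([text_ids, image_part, action_part])


# Example: a tokenized instruction, two image tokens, two action tokens.
seq = build_sequence(torch.tensor([5, 17, 42]),
                     torch.tensor([100, 2048]),
                     torch.tensor([3, 7]))
print(seq)  # tensor([    5,    17,    42, 30100, 32048, 38195, 38199])
```

Under this layout, fine-tuning would plausibly compute the loss only on the action positions, and the shared sequence format is what lets a mixed dataset keep the original visual-linguistic examples alongside the manipulation ones.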

External links

DOI: 10.1109/ACCESS.2023.3334791

Download the PDF from IEEE Xplore (in English): https://ieeexplore.ieee.org/document/10323309

Download the code on GitHub: https://github.com/cog-isa/rozumarm-vima

How to cite

Aleksei Staroverov, Andrey S. Gorodetsky, Andrei S. Krishtopik, Uliana A. Izmesteva, Dmitry A. Yudin, Alexey K. Kovalev, and Aleksandr I. Panov. Fine-Tuning Multimodal Transformer Models for Generating Actions in Virtual and Real Environments. IEEE Access, vol. 11, 2023. DOI: 10.1109/ACCESS.2023.3334791.