In this work, we propose and investigate an original approach that uses a pre-trained multimodal transformer with a specialized architecture, which we refer to as RozumFormer, to control a robotic agent in an object manipulation task guided by a language instruction. Our model builds on a bimodal (text-image) transformer architecture originally trained on tasks that use one or both modalities, such as language modeling, visual question answering, image captioning, text recognition, and text-to-image generation. We adapt this model to robotic manipulation by organizing its input as a single sequence of text, image, and action tokens. We demonstrate that the model adapts well to new tasks and achieves better results when fine-tuned than when fully trained, both in simulation and in real environments. To transfer the model from the simulator to a real robot, we collected and annotated new datasets. In addition, experiments on controlling the agent in a visual environment with reinforcement learning show that fine-tuning the model on a mixed dataset that includes examples from the original visual-linguistic tasks only slightly decreases performance on those tasks, which simplifies adding tasks from another domain.
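The abstract describes organizing the model's input as one sequence of text, image, and action tokens. The following minimal PyTorch sketch illustrates one way such a sequence could be assembled; it is not taken from the rozumarm-vima repository, and all dimensions, vocabulary sizes, and names (build_input, segment_embed, etc.) are illustrative assumptions.

# Illustrative sketch only (not the authors' implementation): interleaving
# prompt-text tokens, image patch tokens, and discretized action tokens
# into a single input sequence for a decoder transformer.
import torch
import torch.nn as nn

D_MODEL = 256          # hidden size (assumed)
N_ACTION_BINS = 256    # discretized action vocabulary size (assumed)

text_embed    = nn.Embedding(32000, D_MODEL)         # prompt token embeddings
patch_proj    = nn.Linear(768, D_MODEL)               # projects image patch features
action_embed  = nn.Embedding(N_ACTION_BINS, D_MODEL)  # discretized action tokens
segment_embed = nn.Embedding(3, D_MODEL)               # 0 = text, 1 = image, 2 = action

def build_input(prompt_ids, patch_feats, action_ids):
    """Concatenate text, image, and action tokens into one sequence."""
    text = text_embed(prompt_ids)   + segment_embed(torch.tensor(0))
    img  = patch_proj(patch_feats)  + segment_embed(torch.tensor(1))
    act  = action_embed(action_ids) + segment_embed(torch.tensor(2))
    return torch.cat([text, img, act], dim=0)  # (seq_len, D_MODEL)

# Example: 6 prompt tokens, 16 image patches, 4 past action tokens.
seq = build_input(
    prompt_ids=torch.randint(0, 32000, (6,)),
    patch_feats=torch.randn(16, 768),
    action_ids=torch.randint(0, N_ACTION_BINS, (4,)),
)
print(seq.shape)  # torch.Size([26, 256])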
DOI: 10.1109/ACCESS.2023.3334791
Download the PDF from IEEE Xplore (in English): https://ieeexplore.ieee.org/document/10323309
Download the code on GitHub: https://github.com/cog-isa/rozumarm-vima
Aleksei Staroverov, Andrey S. Gorodetsky, Andrei S. Krishtopik, Uliana A. Izmesteva, Dmitry A. Yudin, Alexey K. Kovalev, and Aleksandr I. Panov, "Fine-Tuning Multimodal Transformer Models for Generating Actions in Virtual and Real Environments," IEEE Access, vol. 11, 2023.