Fine-Tuning Multimodal Transformer Models for Generating Actions in Virtual and Real Environments

Authors

Panov A., Staroverov A., Kovalyov A., Gorodetsky A.

Abstract

In this work, we propose and investigate an original approach, which we refer to as RozumFormer, to using a pre-trained multimodal transformer of a specialized architecture to control a robotic agent in a language-instructed object manipulation task. Our model is based on a bimodal (text-image) transformer architecture originally trained to solve tasks that use one or both modalities, such as language modeling, visual question answering, image captioning, text recognition, and text-to-image generation. We adapted the model to robotic manipulation by organizing the input as a structured sequence of text, image, and action tokens. We demonstrate that the model adapts well to new tasks and that fine-tuning it gives better results than training from scratch, both in simulation and in real environments. To transfer the model from the simulator to a real robot, we collected and annotated new datasets. In addition, experiments on controlling an agent in a visual environment with reinforcement learning show that fine-tuning the model on a mixed dataset that includes examples from the original visual-linguistic tasks only slightly decreases performance on those tasks, which simplifies the addition of tasks from another domain.
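
The idea of feeding text, image, and action tokens to one transformer as a single sequence can be illustrated with a minimal Python sketch. It is not the authors' implementation: the vocabulary sizes, the number of action bins, the offsets, and the function names discretize_action and build_sequence are all assumptions chosen for illustration. The sketch gives each modality its own disjoint range of token ids and maps continuous action components to integer bins.

```python
from typing import List, Sequence

# All sizes below are illustrative assumptions, not values from the paper.
TEXT_VOCAB = 1000    # hypothetical number of text token ids
IMAGE_VOCAB = 512    # hypothetical number of discrete image token ids
ACTION_BINS = 256    # hypothetical number of bins per action component

# Each modality gets its own disjoint id range in the shared vocabulary.
IMAGE_OFFSET = TEXT_VOCAB
ACTION_OFFSET = TEXT_VOCAB + IMAGE_VOCAB


def discretize_action(action: Sequence[float],
                      low: float = -1.0,
                      high: float = 1.0) -> List[int]:
    """Map each continuous action component to an integer bin in [0, ACTION_BINS)."""
    bins = []
    for value in action:
        clipped = min(max(value, low), high)
        frac = (clipped - low) / (high - low)
        # The top edge (frac == 1.0) is folded into the last bin.
        bins.append(min(int(frac * ACTION_BINS), ACTION_BINS - 1))
    return bins


def build_sequence(text_ids: Sequence[int],
                   image_ids: Sequence[int],
                   action: Sequence[float]) -> List[int]:
    """Concatenate text, image, and action tokens into one id sequence."""
    if any(not 0 <= t < TEXT_VOCAB for t in text_ids):
        raise ValueError("text token id out of range")
    if any(not 0 <= i < IMAGE_VOCAB for i in image_ids):
        raise ValueError("image token id out of range")
    image_part = [IMAGE_OFFSET + i for i in image_ids]
    action_part = [ACTION_OFFSET + b for b in discretize_action(action)]
    return list(text_ids) + image_part + action_part


if __name__ == "__main__":
    # Instruction tokens, then image tokens, then the discretized action.
    print(build_sequence([5, 7], [0, 511], [-1.0, 1.0]))
```

Under these assumptions, a next-token model trained on such sequences can emit action tokens in the same way it emits text tokens; subtracting ACTION_OFFSET and inverting the binning recovers an approximate continuous action.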

External links

DOI: 10.1109/ACCESS.2023.3334791

Download PDF at IEEE Xplore: https://ieeexplore.ieee.org/document/10323309

Download the code at GitHub: https://github.com/cog-isa/rozumarm-vima
