In this work, we propose and investigate an original approach to controlling a robotic agent in a language-instructed object manipulation task with a pre-trained multimodal transformer of a specialized architecture, which we refer to as RozumFormer. Our model is based on a bimodal (text-image) transformer architecture originally trained on tasks that use one or both modalities, such as language modeling, visual question answering, image captioning, text recognition, and text-to-image generation. We adapted the model to robotic manipulation by organizing the input sequence in a particular way: it interleaves text, image, and action tokens. We demonstrate that such a model adapts well to new tasks and that fine-tuning yields better results than training from scratch, both in simulation and in real environments. To transfer the model from the simulator to a real robot, we collected and annotated new datasets. In addition, experiments on controlling the agent in a visual environment with reinforcement learning show that fine-tuning on a mixed dataset that includes examples from the initial visual-linguistic tasks only slightly decreases performance on those tasks, which simplifies adding tasks from another domain.
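The interleaved input described above can be pictured as one token stream in which each token carries a modality tag. The sketch below is purely illustrative, assuming invented modality ids and token values rather than the actual RozumFormer vocabulary:

```python
# Illustrative sketch of an interleaved text/image/action token sequence.
# Modality ids, token ids, and the Segment helper are assumptions for
# demonstration, not the model's real input format.
from dataclasses import dataclass
from typing import List, Tuple

TEXT, IMAGE, ACTION = 0, 1, 2  # assumed modality ids


@dataclass
class Segment:
    modality: int        # TEXT, IMAGE, or ACTION
    token_ids: List[int]


def build_sequence(segments: List[Segment]) -> Tuple[List[int], List[int]]:
    """Flatten (modality, tokens) segments into parallel token/modality streams."""
    tokens, modalities = [], []
    for seg in segments:
        tokens.extend(seg.token_ids)
        modalities.extend([seg.modality] * len(seg.token_ids))
    return tokens, modalities


# Example: instruction tokens, one image's patch tokens, then action tokens.
tokens, modalities = build_sequence([
    Segment(TEXT,   [101, 7, 42, 102]),  # e.g. a tokenized instruction
    Segment(IMAGE,  [900, 901, 902]),    # ids standing in for visual patches
    Segment(ACTION, [500, 501]),         # discretized action tokens
])
print(tokens)      # [101, 7, 42, 102, 900, 901, 902, 500, 501]
print(modalities)  # [0, 0, 0, 0, 1, 1, 1, 2, 2]
```

A transformer can then embed each token and add a modality embedding selected by the parallel stream, so text, image, and action tokens share one attention context while remaining distinguishable.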
Download PDF at IEEE Xplore: https://ieeexplore.ieee.org/document/10323309
Download the code at GitHub: https://github.com/cog-isa/rozumarm-vima
Aleksei Staroverov, Andrey S. Gorodetsky, Andrei S. Krishtopik, Uliana A. Izmesteva, Dmitry A. Yudin, Alexey K. Kovalev, and Aleksandr I. Panov. Fine-Tuning Multimodal Transformer Models for Generating Actions in Virtual and Real Environments. IEEE Access, vol. 11, 2023.