Generalization is one of the primary goals of contemporary machine learning research and is regarded as a pathway to artificial general intelligence. Although today’s large pretrained language models (LMs) continue to push the state of the art in natural language processing (NLP), most of these models target specific problem classes and suffer significant performance drops when applied to new tasks. Is it possible to pretrain language models that perform well across many diverse tasks?
A Google Research/Brain team addresses this question in the new paper Unifying Language Learning Paradigms, proposing UL2, a framework for pretraining universal language models that are effective across many different tasks. Their 20B-parameter model outperforms the 175B-parameter GPT-3 on the zero-shot SuperGLUE benchmark and triples the performance of T5-XXL on one-shot summarization tasks.
The UL2 framework aims to create a universally applicable language model that is consistently effective across diverse datasets, tasks, and setups. UL2 is driven by Mixture-of-Denoisers (MoD), a new pretraining objective that combines several pretraining paradigms so that a single model can maintain good performance across different tasks.
MoD uses three main paradigms during pretraining: R-Denoiser, a standard span-corruption denoiser that is good for acquiring knowledge rather than learning to generate fluent text; S-Denoiser, designed for denoising cases where a strict sequential order is observed when framing inputs-to-targets tasks; and X-Denoiser, which is used when the model must recover a large part of the input given only a small to moderate part of it. A new mode-switching feature works via discrete prompting, so the model can switch between the R, S, and X denoisers on demand when learning downstream tasks.
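To make the three denoisers and the mode-switching prompt more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the team's implementation: the mode tokens `[R]`, `[S]`, `[X]`, the `<extra_id_N>` sentinel format, the function names, and the corruption rates and span lengths are all hypothetical choices made for this example. The S-Denoiser is approximated here as a simple prefix/suffix split, which is one plausible reading of "strict sequential order".

```python
import random


def span_corrupt(tokens, corruption_rate, mean_span_len, rng):
    """Mask random spans and return (inputs, targets) using sentinel tokens."""
    n = len(tokens)
    num_to_mask = max(1, round(n * corruption_rate))
    masked = [False] * n
    # Draw spans until at least num_to_mask tokens are covered.
    while sum(masked) < num_to_mask:
        span_len = max(1, round(rng.expovariate(1 / mean_span_len)))
        start = rng.randrange(n)
        for i in range(start, min(n, start + span_len)):
            masked[i] = True

    inputs, targets, sentinel, prev_masked = [], [], 0, False
    for tok, is_masked in zip(tokens, masked):
        if is_masked:
            if not prev_masked:
                # Each masked span gets one sentinel on both sides.
                inputs.append(f"<extra_id_{sentinel}>")
                targets.append(f"<extra_id_{sentinel}>")
                sentinel += 1
            targets.append(tok)
        else:
            inputs.append(tok)
        prev_masked = is_masked
    return inputs, targets


def prefix_split(tokens, rng):
    """Sequential denoising: keep a prefix as input, predict the rest."""
    split = rng.randrange(1, len(tokens))
    return tokens[:split] + ["<extra_id_0>"], ["<extra_id_0>"] + tokens[split:]


# Hypothetical settings: the rates and span lengths are illustrative only.
DENOISERS = {
    "[R]": lambda toks, rng: span_corrupt(toks, 0.15, 3, rng),  # short spans, low rate
    "[S]": prefix_split,                                        # strict left-to-right order
    "[X]": lambda toks, rng: span_corrupt(toks, 0.50, 8, rng),  # long spans, high rate
}


def make_example(tokens, rng):
    """Sample one denoiser and prepend its mode token to the inputs."""
    mode = rng.choice(sorted(DENOISERS))
    inputs, targets = DENOISERS[mode](tokens, rng)
    return [mode] + inputs, targets


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "the quick brown fox jumps over the lazy dog today".split()
    inputs, targets = make_example(sentence, rng)
    print("inputs: ", " ".join(inputs))
    print("targets:", " ".join(targets))
```

In this sketch, the same mode token that was prepended during pretraining could also be prepended to a downstream input, which is the sense in which a discrete prompt switches the model between modes.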
In their empirical study, the team conducted extensive experiments on a variety of tasks ranging from supervised fine-tuning to few-shot in-context learning. In the evaluations, the proposed UL2 outperformed a T5 baseline by 43.6 percent and GPT-like models by 76.1 percent. The team also scaled UL2 up to 20B parameters and ran the model on more than 50 NLP tasks, where it achieved state-of-the-art performance on a large majority of tasks and setups. In the zero-shot and few-shot experiments, UL2 outperformed the 175B-parameter GPT-3 on the zero-shot SuperGLUE benchmark.
Author: Hecate He | Editor: Michael Sarazen