GLUE Benchmarking
References
- GLUE Explained
- GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
GLUE Explained
Introduction
The General Language Understanding Evaluation (GLUE) benchmark consists of nine sentence- and sentence-pair tasks used to evaluate NLP models.
Why GLUE? In the past, NLP models were often designed for a single, specific task, e.g. Named Entity Recognition (NER) or sentiment analysis. However, in an attempt to improve the generalizability of NLP models, researchers began applying techniques from transfer learning. Models could be pretrained on a general language understanding task (e.g. BERT is pretrained to predict masked tokens and to judge whether one sentence follows another). Once pretrained, such a model can then be finetuned - by swapping out its input and output layers and training it on a different task - as sketched below.
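A minimal sketch of that finetuning step, assuming the Hugging Face transformers and datasets libraries; the choice of bert-base-uncased, the SST-2 task, and the hyperparameters are illustrative, not prescribed by GLUE:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Load one GLUE task (SST-2, binary sentiment) and a pretrained encoder.
dataset = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# A fresh classification head replaces the pretraining output layer.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sst2-finetune",
                           per_device_train_batch_size=32,
                           num_train_epochs=3),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```

The same pretrained weights can be reused for each GLUE task; only the task-specific head and the finetuning data change.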
GLUE provides a consistent benchmark for these pretrained models. Concretely, a model is finetuned and evaluated on each of the nine GLUE tasks, and its average score across tasks can then be ranked and compared against other such models.
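A small sketch of how that average could be computed; the numbers below are purely illustrative, and it assumes (as on the official leaderboard) that tasks reporting two metrics are first averaged within the task before the nine task scores are macro-averaged:

```python
# Hypothetical per-task scores (illustrative values only).
task_scores = {
    "CoLA": [52.1],
    "SST-2": [93.5],
    "MRPC": [88.9, 84.8],   # F1, accuracy
    "STS-B": [87.1, 85.8],  # Pearson, Spearman
    "QQP": [71.2, 89.2],    # F1, accuracy
    "MNLI": [84.6],
    "QNLI": [90.5],
    "RTE": [66.4],
    "WNLI": [65.1],
}

# Average multi-metric tasks within the task, then macro-average across tasks.
per_task = {task: sum(metrics) / len(metrics)
            for task, metrics in task_scores.items()}
glue_score = sum(per_task.values()) / len(per_task)
print(f"GLUE score: {glue_score:.1f}")
```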
GLUE Tasks
See here.