BLEURT - ACL 2020
We discuss the BLEURT paper, which appeared at ACL 2020 and addresses the problem of learning evaluation metrics for certain NLP tasks. Note that there is a Google blog article on this paper too. That blog article gives a good overview of what’s going on in the paper; the following are my short notes, primarily for personal understanding.
In the literature and in previous work on machine translation or data-to-text tasks, we often use metrics such as BLEU or ROUGE.
Overall goal
We want to design a metric and show that it corresponds closely to human judgments.
For instance, consider machine translation. Humans can judge whether two sentences are roughly similar to each other (or not); a good metric is one that mimics this judgment.
Motivating examples
Metrics such as BLEU and ROUGE measure overlaps of n-grams. Thus, such measures fail to take into account the semantic similarity of two pieces of text (typically, sentences). Moreover, recent work at ACL 2020 indicates that small differences in BLEU are not indicative enough of model performance (also see this blog).
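To make the n-gram overlap issue concrete, here is a tiny check using NLTK's sentence-level BLEU. The sentences and exact scores are purely illustrative (not from the paper), but a paraphrase that shares few n-grams with its reference scores poorly despite saying the same thing.

```python
# Illustrative only: n-gram overlap metrics penalize paraphrases.
# Requires nltk (pip install nltk).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()
paraphrase = "a cat was sitting on the rug".split()   # same meaning, few shared n-grams
unrelated = "stock prices fell sharply on monday".split()

smooth = SmoothingFunction().method1  # avoid zero scores for short sentences

# The paraphrase scores far below 1.0 even though it is semantically
# equivalent to the reference, because it shares almost no n-grams with it.
print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))
print(sentence_bleu([reference], unrelated, smoothing_function=smooth))
```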
Recent papers have tried to improve upon these by proposing either
- Fully learned metrics. Examples: BEER, RUSE, ESIM, etc. These allow more expressivity but do require some amount of training data (here, training data means actual ratings: given two sentences s1 and s2, how close are they?).
- Hybrid metrics. Examples: YiSi, BERTScore, etc. These combine trained elements with handwritten logic (such as token alignment); a quick usage sketch for BERTScore follows this list.
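As a side note, BERTScore (one of the hybrid metrics above) is easy to try via the bert-score package. The snippet below is a minimal usage sketch; the package name and call signature are my recollection of its documented API rather than something taken from the paper.

```python
# Minimal BERTScore usage sketch (assumes: pip install bert-score).
from bert_score import score

candidates = ["a cat was sitting on the rug"]
references = ["the cat sat on the mat"]

# Returns precision, recall, F1 tensors computed from greedy token alignments
# over contextual embeddings (the alignment is the "handwritten logic" part).
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())
```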
Beginning steps
- Now that we have BERT, we have access to contextual embeddings of words in sentences.
- How do we make use of that? We can take the two sentences, compute their BERT embeddings, aggregate the embeddings in some way, and then measure the similarity between the two aggregated embeddings.
- As is usual, the aggregation step is done by taking the embedding of the [CLS] token (a minimal sketch of this appears at the end of these notes).
- While this is a feasible approach, there are some issues with it. BERT was trained on a certain corpus; so if we pick up data from another domain, the (pretrained) embeddings might not be very meaningful by themselves.
- So, a natural approach is to finetune BERT using data from the new domain. Here, we assume that we do have some ratings data, where each example essentially looks like (sentence_1, sentence_2, rating), and rating is (say) a real number between 0 and 1. However, the ratings data is not very large, so we only get so much mileage from this idea.
- How do we plan to use such a model? We feed it a candidate sentence together with its reference and train it to predict the rating; the predicted rating is then the metric (a sketch of this finetuning setup appears at the end of these notes).
- The next idea is to mitigate this problem: we want to get a lot of (synthetic) data in order to pretrain the BERT model, before finetuning it on the limited ratings data. The main question is how we get this synthetic ratings data. This is by now a familiar style of thinking: it follows the weak supervision motif of “think label functions instead of labels” (see for instance Snorkel). Thus, here too, we will generate a whole bunch of synthetic labeled data and use that for pretraining (a toy sketch of this idea closes these notes). Does this make the metric a “hybrid” metric too? No! Here the handcrafting has been folded into the upstream processing, so to speak: the handcrafting is to devise rules that generate the synthetic data. Since the handcrafting is only used to produce (a large amount of) training data, its quirks will then (hopefully) be smoothed out by the training process.
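To make the beginning steps concrete, here is a minimal sketch of comparing two sentences via their pretrained BERT [CLS] embeddings, using the Hugging Face transformers library. The checkpoint name and the cosine-similarity readout are my own illustrative choices, not what BLEURT itself does.

```python
# Sketch: compare two sentences via their pretrained BERT [CLS] embeddings.
# Assumes: pip install torch transformers; "bert-base-uncased" is just an example checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def cls_embedding(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # [CLS] is the first token of the encoded sequence.
    return outputs.last_hidden_state[:, 0, :]

s1 = "the cat sat on the mat"
s2 = "a cat was sitting on the rug"
sim = torch.cosine_similarity(cls_embedding(s1), cls_embedding(s2))
print(sim.item())  # similarity of the two pretrained [CLS] embeddings
```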
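Next, a minimal sketch of the finetuning step on (sentence_1, sentence_2, rating) triples: a single-output regression head on top of BERT. The data, checkpoint, and training loop here are illustrative and not the paper's exact setup.

```python
# Sketch: finetune BERT as a regressor on (sentence_1, sentence_2, rating) triples.
# Assumes: pip install torch transformers; ratings are floats in [0, 1]; the data is made up.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# num_labels=1 gives a single-output head; with float labels this is trained with an MSE loss.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression"
)

ratings_data = [
    ("the cat sat on the mat", "a cat was sitting on the rug", 0.9),
    ("the cat sat on the mat", "stock prices fell on monday", 0.1),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for s1, s2, rating in ratings_data:
    # The sentence pair is fed jointly, so the encoder sees both sides at once.
    inputs = tokenizer(s1, s2, return_tensors="pt", truncation=True)
    labels = torch.tensor([rating], dtype=torch.float)
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```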
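Finally, a toy sketch of the weak-supervision idea: perturb reference sentences to create synthetic pairs, and label each pair with cheap automatic signals (“label functions”) instead of human ratings. The perturbation and the signal below are placeholders; the paper uses its own set of perturbations and pretraining signals.

```python
# Toy sketch of the weak-supervision idea: generate perturbed sentence pairs and
# label them with cheap automatic signals ("label functions"), not human ratings.
# The perturbation and the signal here are illustrative placeholders.
import random

def drop_random_word(sentence: str) -> str:
    words = sentence.split()
    if len(words) > 1:
        words.pop(random.randrange(len(words)))
    return " ".join(words)

def token_overlap(a: str, b: str) -> float:
    # A crude lexical-overlap signal standing in for BLEU/ROUGE-style weak labels.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

references = [
    "the cat sat on the mat",
    "stock prices fell sharply on monday",
]

synthetic_data = []
for ref in references:
    perturbed = drop_random_word(ref)
    # Each pair gets a vector of weak labels; a model can be pretrained to predict them
    # before being finetuned on the small set of real human ratings.
    weak_labels = {"overlap": token_overlap(ref, perturbed)}
    synthetic_data.append((ref, perturbed, weak_labels))

print(synthetic_data)
```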