Document Ranking using BERT
Document ranking is a well-known problem in the NLP domain, with prominent use cases across industries in extractive question answering, chatbots, recommendations, and more. With the advent of transformer models such as BERT, RoBERTa, and GPT, which are pretrained on large datasets and capture semantic relationships as well as world knowledge, these use cases have become far more tractable.
We will talk about their use in document ranking. The simplest way to solve this problem is to convert it into a text classification problem, where the classifier predicts whether a document is relevant to the query or not. At inference time, we sort the documents by their predicted probabilities. This approach is a direct realization of the Probability Ranking Principle, which states that documents should be ranked in decreasing order of their estimated probability of relevance.
The first BERT-based document ranking architecture was monoBERT. It takes a pre-trained BERT model and fine-tunes it on query–document datasets to predict relevance using cross-entropy loss. However, it cannot handle long input sequences, and therefore struggles to rank texts whose length exceeds the model's designed input (512 tokens at most).
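Concretely, monoBERT feeds the query and the document into BERT as a single sentence pair and reads a relevance probability off the classification head. Below is a minimal sketch using the Hugging Face transformers library; the checkpoint name, the assumption that label index 1 means "relevant", and the example data are illustrative only (a real system would first fine-tune the model on query–document pairs):

```python
# Minimal monoBERT-style relevance scoring sketch (assumes a fine-tuned model;
# "bert-base-uncased" and the label convention here are placeholders).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def relevance_score(query: str, document: str) -> float:
    # Encode the (query, document) pair as one sequence:
    # [CLS] query [SEP] document [SEP], truncated to BERT's 512-token limit.
    inputs = tokenizer(query, document, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Probability of the "relevant" class (assumed to be label index 1).
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Rank candidate documents by predicted relevance probability.
docs = ["BERT is a transformer-based language model.", "Bananas are rich in potassium."]
ranked = sorted(docs, key=lambda d: relevance_score("what is BERT?", d), reverse=True)
print(ranked)
```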
To overcome these challenges, researchers have devised three approaches: Birch, BERT–MaxP, and CEDR.
1. Document Ranking by Birch
Birch solves the problem of long documents in two parts. First, during training, it exploits data where length issues don't exist (sentence-level relevance data) and then transfers the relevance-matching model to the domain or task of interest. Second, during inference, it converts the task of estimating document relevance into the task of scoring the relevance of individual sentences and then aggregating those scores.
To compute the relevance score Sf of a document, inference is applied to each individual sentence, and the top-n sentence scores are combined with the original document score Sdoc: Sf = α · Sdoc + (1 − α) · Σᵢ wᵢ · Sᵢ, where α and the weights wᵢ are tuned hyperparameters. Inference is applied to the top documents retrieved using Anserini (built on Apache Lucene), which also supplies the Sdoc scores.
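A minimal sketch of this aggregation, assuming the per-sentence BERT scores and the first-stage Anserini/BM25 document score are already computed (the example values and position weights below are made up):

```python
# Birch-style score aggregation: combine the first-stage document score with
# the top-n sentence-level BERT scores, weighted by alpha and per-rank weights.
def birch_score(s_doc: float, sentence_scores: list, weights: list, alpha: float) -> float:
    """Sf = alpha * Sdoc + (1 - alpha) * sum_i w_i * S_i over the top-n sentences."""
    top = sorted(sentence_scores, reverse=True)[:len(weights)]
    return alpha * s_doc + (1 - alpha) * sum(w * s for w, s in zip(weights, top))

# Example: combine a BM25 document score with the top-3 sentence scores.
print(birch_score(s_doc=12.4, sentence_scores=[0.91, 0.83, 0.40, 0.12],
                  weights=[1.0, 0.5, 0.25], alpha=0.6))
```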
2. Document Ranking by BERT–MaxP and Variants
During training, BERT–MaxP segments documents into overlapping passages and treats every segment from a relevant document as relevant and every segment from a non-relevant document as not relevant. During inference, it segments the document into passages, scores each passage, and applies a simple aggregation over the passage relevance scores; the variants below differ only in the aggregation function (see the sketch after the list).
- BERT–MaxP: take the maximum passage score as the document score
- BERT–FirstP: take the score of the first passage as the document score
- BERT–SumP: take the sum of all passage scores as the document score
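A minimal sketch of the three aggregation strategies, assuming passage_scores already holds a document's per-passage BERT relevance scores in reading order:

```python
# Aggregate per-passage relevance scores into a single document score.
def doc_score(passage_scores: list, strategy: str = "max") -> float:
    if strategy == "max":    # BERT-MaxP: the best passage represents the document
        return max(passage_scores)
    if strategy == "first":  # BERT-FirstP: the lead passage represents the document
        return passage_scores[0]
    if strategy == "sum":    # BERT-SumP: accumulate evidence across all passages
        return sum(passage_scores)
    raise ValueError(f"unknown strategy: {strategy}")

scores = [0.2, 0.9, 0.4]
print(doc_score(scores, "max"), doc_score(scores, "first"), doc_score(scores, "sum"))
```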
3. Contextual Embeddings for Document Ranking (CEDR)
All previous approaches use the [CLS] embedding for classification and for generating a document's relevance score, discarding the contextual embeddings that BERT produces for both the query and the candidate text. CEDR is the first approach to use BERT's contextual embeddings.
Documents that are too long for BERT are split into chunks, and BERT inference is applied to each chunk independently. CEDR creates an aggregate [CLS] representation by averaging the [CLS] representations of the chunks (i.e., average pooling). It then concatenates the contextual term embeddings from each chunk to form the sequence of contextual embeddings for the entire text, and constructs similarity matrices between query terms and candidate-text terms. Thus, query–document relevance scores are derived from two sources: the [CLS] token (as in monoBERT, Birch, and BERT–MaxP) and signals derived from the query–document term similarities.
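These two building blocks can be sketched as follows, assuming the per-chunk BERT outputs are already computed; the random tensors stand in for real contextual embeddings, and cosine similarity is the usual choice for the term–term similarity:

```python
# Sketch of CEDR-style aggregation: average [CLS] vectors across chunks, and
# build a query-document term similarity matrix from contextual embeddings.
import torch

def aggregate_cls(cls_per_chunk: torch.Tensor) -> torch.Tensor:
    # cls_per_chunk: (num_chunks, hidden) -> one (hidden,) document [CLS] vector
    return cls_per_chunk.mean(dim=0)

def similarity_matrix(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between every query term and every document term.
    q = torch.nn.functional.normalize(query_emb, dim=-1)  # (q_len, hidden)
    d = torch.nn.functional.normalize(doc_emb, dim=-1)    # (d_len, hidden)
    return q @ d.T                                        # (q_len, d_len)

# Chunk embeddings are concatenated along the sequence axis to recover
# embeddings for the full document before building the matrix.
chunks = [torch.randn(128, 768), torch.randn(96, 768)]
doc_emb = torch.cat(chunks, dim=0)                        # (224, 768)
sim = similarity_matrix(torch.randn(5, 768), doc_emb)     # (5, 224)
print(sim.shape)
```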
References
- Multi-Stage Document Ranking with BERT https://arxiv.org/pdf/1910.14424.pdf
- Applying BERT to Document Retrieval with Birch https://aclanthology.org/D19-3004.pdf
- Deeper Text Understanding for IR with Contextual Neural Language Modeling https://arxiv.org/pdf/1905.09217.pdf
- CEDR: Contextualized Embeddings for Document Ranking https://arxiv.org/pdf/1904.07094.pdf