ColPali and PLAID- Part 1

8 min readSep 20, 2024

ColPali:

Context:

To enhance the query-answering capabilities of large language models (LLMs), it is often more effective to first search for relevant information online or within external document collections, such as PDFs, before generating a grounded response using Retrieval-Augmented Generation (RAG). In practice, setting up efficient retrieval pipelines for PDF documents is essential for performance but can be quite complex. The process involves running Optical Character Recognition (OCR) on scanned PDFs, followed by using Document Layout Detection models to segment pages into different sections like paragraphs, figures, and titles. Once segmented, the structure and reading order of the page are reconstructed. Additionally, resource-intensive specialized models can optionally be applied to generate natural language captions for figures, images, and tables. A coherent chunking strategy is then used to split or merge text passages appropriately. Finally, a powerful neural embedding model, such as BGE M3, maps the text chunks into a semantically meaningful vector space, and the vector index is stored for future retrieval tasks.

Architecture:

The core idea is to utilize the alignment between text and image token embeddings learned during multi-modal fine-tuning. In this regard, we present ColPali, an extension of PaliGemma-3B, designed to generate multi-vector representations of text and images in the style of ColBERT.

During runtime , a user query is embedded by the language model, to obtain token embeddings. We are able to run a ColBERT-style “late interaction” (LI) operation to efficiently match query tokens to document patches

Late Interaction :

This mechanism is used in retrieval models like ColBERT for comparing a query q and a document d in a common embedding space .

This operator compares the embeddings by summing the maximum similarity between each query vector Eq(i)E_q^{(i)}Eq(i) and the document vectors Ed(j)E_d^{(j)}Ed(j). In this case, similarity is measured using the dot product ⟨Eq(i)∣Ed(j)⟩⟨E_q^{(i)} | E_d^{(j)}⟩⟨Eq(i)∣Ed(j)⟩, which captures how closely aligned the vectors are in the embedding space.

This calculates the maximum dot product for each query vector Eq(i)E_q^{(i)}Eq(i) with all document vectors Ed(1:Nd)E_d^{(1:N_d)}Ed(1:Nd), and then sums these maximum values. The idea is to find the most relevant parts of the document for each query vector.

Issue with ColPali especially the late interaction process:

Space (Compression)- uses dense vector representations, typically 128-dimensional or higher, for each token. These vectors are stored with high precision (e.g., 16-bit or 32-bit floating-point values).For large document collections, this results in substantial storage requirements. For instance, a single document with hundreds of tokens can require kilobytes of storage just for the embeddings. The overall footprint becomes massive when dealing with millions of documents, making it impractical for many real-world applications.
Latency- During retrieval, LI requires computing pairwise similarities between query and document token embeddings. This is computationally expensive, especially for long documents and queries. The exhaustive computation of these similarities can lead to high latency, making real-time retrieval challenging. Each query token needs to be compared with all document tokens, leading to a quadratic growth in the number of comparisons as the length of queries and documents increases. This complexity translates into longer response times, which is particularly problematic for applications requiring low-latency responses, such as interactive search engines.

PLAID is designed to address the space and latency issues inherent in ColPali.

Space (Compression) —
Centroid-based Encoding and Residual Quantization — PLAID uses a centroid-based encoding approach, similar to the residual representation in ColBERTv2.
Efficient Vector Compression — By quantizing the residuals into lower precision (e.g., 1 or 2 bits per dimension), PLAID further compresses the embeddings.
Latency — Centroid based candidate paragraph retreival helps in reduding the time.

Before going to PLAID, we should revisit the ColBERTV2 residual compression method.

ColBERTV2:

ColBERTv2, a retriever that couples an aggressive residual compression mechanism to improve the quality and space footprint of late interaction and is built on top of ColBERT.

Compression Strategy:

Step 1: Clustering

*Centroids (C ):
- A set of centroids is precomputed. Each centroid represents a cluster of vectors in the embedding space.
- For example, let’s assume we have 4 centroids: ( C = {C_1, C_2, C_3, C_4}).

Step 2: Encoding Vectors
Each vector ( v ) is assigned to its closest centroid ( C_t).
- For example, suppose we have a vector ( v = [0.9, 0.8]), and it is closest to centroid ( C_2 = [0.85, 0.75] ).

-The residual ( r ) is computed as ( r = v — C_t ).
- For our example: ( r = [0.9, 0.8] — [0.85, 0.75] = [0.05, 0.05] ).

Quantization of Residual:
- The residual ( r ) is quantized into a lower precision representation.
- In ColBERTv2, each dimension of ( r ) is quantised into one or two bits (( b = 1) or ( b = 2 )).
- For simplicity, let’s assume ( b = 2 ). Each component of ( r ) can be quantised into one of four possible values.

Step 3: Storage

- The centroid index ( t ) is stored, which requires (ceil(log C) ) bits.
- The quantised residual (r) is stored, which requires (bn) bits, where \( n \) is the dimensionality of the vector.

- For ( n = 128 ), ( b = 2 ), and ( |C| = 2^{32} ):
- Centroid index requires ( 32 ) bits (or 4 bytes).
- Quantised residual requires ( 128*2 = 256 ) bits (or 32 bytes).

Comparison:

- Traditional ColBERT uses 256-byte vector encodings at 16-bit precision.
- ColBERTv2 uses 20 or 36 bytes per vector, significantly reducing storage costs.

Total Storage:

- Centroid index: 4 bytes.
- Quantized residual: 32 bytes.
- Total: 36 bytes per vector.

Centroid Calculation and residual Generation:

Steps for Generating Centroids and Residuals
1. Passage Embedding Collection: Collect the embeddings of all tokens in the passages.

2. Clustering: Apply k-means clustering to group the token embeddings into clusters (centroids).

3. Centroid Calculation: Calculate the centroid for each cluster.

4. Residual Calculation: Compute the residuals for each token embedding relative to its assigned centroid.

5. Residual Encoding: Encode the residuals using 1-bit precision.

Let’s assume we have the following passage embeddings (each embedding has 6 dimensions):

1. Passage Embeddings:

- Passage 1: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
- Passage 2: [0.9, 1.0, 0.9, 1.0, 0.9, 1.0], [0.3, 0.4, 0.3, 0.4, 0.3, 0.4]
- Passage 3: [0.7, 0.8, 0.7, 0.8, 0.7, 0.8], [0.1, 0.2, 0.1, 0.2, 0.1, 0.2]
- Passage 4: [0.3, 0.4, 0.3, 0.4, 0.3, 0.4], [0.7, 0.8, 0.7, 0.8, 0.7, 0.8]
- Passage 5: [0.5, 0.6, 0.5, 0.6, 0.5, 0.6], [0.9, 1.0, 0.9, 1.0, 0.9, 1.0]
- Passage 6: [0.7, 0.8, 0.7, 0.8, 0.7, 0.8], [0.5, 0.6, 0.5, 0.6, 0.5, 0.6]

1. Passage Embedding Collection

Collect all token embeddings from the passages into a single matrix \( E \):

E = [
0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.6 \\
0.2 & 0.3 & 0.4 & 0.5 & 0.6 & 0.7 \\
0.9 & 1.0 & 0.9 & 1.0 & 0.9 & 1.0 \\
0.3 & 0.4 & 0.3 & 0.4 & 0.3 & 0.4 \\
0.7 & 0.8 & 0.7 & 0.8 & 0.7 & 0.8 \\
0.1 & 0.2 & 0.1 & 0.2 & 0.1 & 0.2 \\
0.3 & 0.4 & 0.3 & 0.4 & 0.3 & 0.4 \\
0.7 & 0.8 & 0.7 & 0.8 & 0.7 & 0.8 \\
0.5 & 0.6 & 0.5 & 0.6 & 0.5 & 0.6 \\
0.9 & 1.0 & 0.9 & 1.0 & 0.9 & 1.0 \\
0.7 & 0.8 & 0.7 & 0.8 & 0.7 & 0.8 \\
0.5 & 0.6 & 0.5 & 0.6 & 0.5 & 0.6
]

2. Clustering

Apply k-means clustering to the embeddings. Suppose we use \(k = 5\) clusters (centroids). The clustering algorithm assigns each token embedding to one of the 5 centroids.

Assume the clustering result is as follows (each row indicates the index of the assigned centroid):

Cluster Assignments =[
1 \\
1 \\
2 \\
3 \\
4 \\
1 \\
3 \\
4 \\
5 \\
2 \\
4 \\

]

3. Centroid Calculation

Calculate the centroid for each cluster. Each centroid is the mean of all embeddings assigned to that cluster.

Centroid 1 (C1):

C1 = 0.33[
0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.6 \\
0.2 & 0.3 & 0.4 & 0.5 & 0.6 & 0.7 \\
0.1 & 0.2 & 0.1 & 0.2 & 0.1 & 0.2
]

=[0.133 & 0.233 & 0.266 & 0.366 & 0.4 & 0.5]

Centroid 2 (C2):
C2 = 0.5[
0.9 & 1.0 & 0.9 & 1.0 & 0.9 & 1.0 \\
0.9 & 1.0 & 0.9 & 1.0 & 0.9 & 1.0
]

= [0.9 & 1.0 & 0.9 & 1.0 & 0.9 & 1.0]

Centroid 3 (C3):
C3 = 0.5[
0.3 & 0.4 & 0.3 & 0.4 & 0.3 & 0.4 \\
0.3 & 0.4 & 0.3 & 0.4 & 0.3 & 0.4
= 0.3 & 0.4 & 0.3 & 0.4 & 0.3 & 0.4

-Centroid 4 (C4):
C4 = 0.33[
0.7 & 0.8 & 0.7 & 0.8 & 0.7 & 0.8 \\
0.7 & 0.8 & 0.7 & 0.8 & 0.7 & 0.8 \\
0.7 & 0.8 & 0.7 & 0.8 & 0.7 & 0.8
=[0.7 & 0.8 & 0.7 & 0.8 & 0.7 & 0.8]

Centroid 5 (C5):
C5 = 0.5[
0.5 & 0.6 & 0.5 & 0.6 & 0.5 & 0.6 \\
0.5 & 0.6 & 0.5 & 0.6 & 0.5 & 0.6
]

= [0.5 & 0.6 & 0.5 & 0.6 & 0.5 & 0.6]

4. Residual Calculation

Compute the residuals for each token embedding relative to its assigned centroid. The residual is the difference between the token embedding and its centroid.

For example,
for the first token embedding [0.1, 0.2, 0.3, 0.4, 0.5, 0.6] assigned to (C1):

[0.133 & 0.233 & 0.266 & 0.366 & 0.4 & 0.5]
[-0.033 & -0.033 & 0.034 & 0.034 & 0.1 & 0.1]

5. Residual Encoding

Encode the residuals using 1-bit precision. With 1-bit precision, we can represent the residual as either -1 or 1 (or any two states). For simplicity, let’s assume that we encode the residual as follows:

- If the residual is negative, encode it as -1.
If the residual is positive, encode it as 1.
For the residual [-0.033, -0.033, 0.034, 0.034, 0.1, 0.1] — [-1 & -1 & 1 & 1 & 1 & 1]

Part 2 —

https://neerajku.medium.com/colpali-and-plaid-part-2-a316ccee2052

References:

ColPali: Efficient Document Retrieval with Vision Language Models
PLAID: An Efficient Engine for Late Interaction Retrieval
ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction