
Since the previous release of BEIR, I have been updating the repository to support evaluation of the latest SoTA embedding models.

1. Relaxed the faiss dependency: it is now optional! Users need to install faiss-cpu manually

A major complaint was that faiss-cpu caused installation errors when users installed beir alongside other packages. To avoid this, we removed the faiss-cpu dependency from BEIR in the previous version, v2.1.0. However, that still broke installations, because the faiss types were referenced in BEIR's Faiss search modules, which I sadly overlooked. I have removed those references, and in v2.2.0 the PyPI BEIR installation should be smooth without the faiss-cpu package.
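If you do need the Faiss-based search modules, install faiss-cpu yourself alongside beir:

```shell
pip install beir
pip install faiss-cpu  # only required for the Faiss search modules
```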

2. Extended models.HuggingFace to support multi-GPU inference! 🎊

Thanks to boilerplate code provided by the E5 team & MTEB, I have updated the HuggingFace code to use DDP, where the data is distributed across multiple GPUs for inference. Check out an example here: evaluate_huggingface.py.

It should work out of the box; just select the visible GPUs with CUDA_VISIBLE_DEVICES:

CUDA_VISIBLE_DEVICES=0,1,2,3 python evaluate_huggingface.py

3. Added EvaluateRetrieval.encode_and_retrieve() which first computes embeddings, saves them as a pickle (numpy), and loads them to search with faiss! 🥳

Thanks to the boilerplate code provided by the Tevatron team, I added this crucial feature to dense retrieval search. Previously, in EvaluateRetrieval.retrieve(), we would encode a sub-batch of corpus embeddings (usually 50K), compute the top-k similarity scores using PyTorch, and save them in a results heap.

Now we have introduced the EvaluateRetrieval.encode_and_retrieve() function, which first encodes the queries and then the corpus in batches, saving the embeddings (numpy floats) and text IDs as pickles. This is especially useful with API embedding providers, as recomputing embeddings costs both time and money.

  1. Encode the queries and passages and store them as pickles in your local folder. The embeddings are written to the encode_output_path folder: queries.pkl for queries and corpus.0.pkl, corpus.1.pkl, ... for passages in the corpus, where each pickle contains at most 50K documents. The overwrite parameter denotes whether existing embeddings should be overwritten.

    :::python
    self.retriever.encode(
        corpus=corpus,
        queries=queries,
        encode_output_path="./embeddings/",
        overwrite=False,
        query_filename="queries.pkl",
        corpus_filename="corpus..pkl",
        **kwargs,
    )

  2. After encoding, load the pickles back into numpy and use faiss to run an exact flat search for each query's most similar documents. Make sure you install the faiss-cpu library: pip install faiss-cpu. Provide the query_embeddings_file as a str and the list of corpus_embeddings_files as List[str]. The function returns a results dictionary containing the top-k passages with scores for each query_id.

    :::python
    self.retriever.search_from_files(
        query_embeddings_file=query_embeddings_file,
        corpus_embeddings_files=corpus_embeddings_files,
        top_k=self.top_k,
        **kwargs,
    )
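The whole encode-then-search flow can be sketched end to end. This is a toy illustration using plain pickles and a brute-force inner-product search standing in for faiss (faiss IndexFlatIP performs the same exact search, much faster); it is not BEIR's actual code:

```python
# Toy sketch of the encode-then-search flow: embeddings are pickled to
# disk in shards, then loaded back and searched exactly with a
# brute-force inner product over all shards.
import pickle, tempfile, os, heapq

def save_shard(path, ids, vectors):
    with open(path, "wb") as f:
        pickle.dump((ids, vectors), f)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search_from_files(query_vec, shard_paths, top_k=2):
    scored = []
    for path in shard_paths:
        with open(path, "rb") as f:
            ids, vectors = pickle.load(f)
        for doc_id, vec in zip(ids, vectors):
            scored.append((dot(query_vec, vec), doc_id))
    # Keep only the k highest-scoring documents across all shards.
    top = heapq.nlargest(top_k, scored)
    return {doc_id: score for score, doc_id in top}

tmp = tempfile.mkdtemp()
p0 = os.path.join(tmp, "corpus.0.pkl")
p1 = os.path.join(tmp, "corpus.1.pkl")
save_shard(p0, ["d1", "d2"], [[1.0, 0.0], [0.5, 0.5]])
save_shard(p1, ["d3"], [[0.0, 1.0]])
print(search_from_files([1.0, 0.2], [p0, p1], top_k=2))
```

Because the shards live on disk, a failed or repeated run can reuse them instead of re-calling the (paid) embedding API.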

4. Added LoRA evaluation models with vLLM support for much faster encoding and inference than HuggingFace! 🥳

Again, thanks to boilerplate code from the Tevatron team, we now support evaluating LoRA fine-tuned LLMs, such as rlhn/Qwen2.5-7B-rlhn-400K, with the vLLM package. Make sure you install the peft, accelerate, and vllm packages to use this.

An example of how to use LoRA evaluation models with vLLM support is shown in evaluate_lora_vllm.py.

NOTE: You can merge the LoRA model weights back to the original LLM model and use it for an even faster inference with vLLM!
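Merging works because a LoRA adapter is just an additive low-rank update: W' = W + (alpha/r) * B A yields an ordinary dense weight with identical outputs, which vLLM can then serve at full speed. A toy sketch of that arithmetic with plain lists (with peft, merge_and_unload() does this for real checkpoints):

```python
# LoRA stores a low-rank update (B @ A); the adapted layer computes
# W x + (alpha/r) * B @ A @ x. Folding the scaled update into W gives
# a plain dense layer with the same outputs.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def merge_lora(W, A, B, alpha, r):
    delta = matmul(B, A)        # low-rank update, same shape as W
    s = alpha / r               # LoRA scaling factor
    return [[W[i][j] + s * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]    # base weight (2x2)
B = [[1.0], [0.0]]              # (2x1), rank r = 1
A = [[0.0, 2.0]]                # (1x2)
print(merge_lora(W, A, B, alpha=2, r=1))  # → [[1.0, 4.0], [0.0, 1.0]]
```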

5. Added API evaluations such as Cohere, Voyage, etc.! 😎

Many users wish to benchmark their models against API providers such as OpenAI, Cohere, or Voyage, among others. To enable this, we now support evaluation with API-based models. We currently support two vendors, Cohere & Voyage, with more to come soon in the repository.
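In BEIR terms, a dense model only needs to expose encode_queries() and encode_corpus(), so an API wrapper boils down to batching texts through the vendor's embed endpoint. A hedged sketch, where all names are illustrative (not BEIR's actual classes) and embed_fn stands in for a real SDK client such as Cohere's or Voyage's:

```python
# Hypothetical API-backed model: batches texts through a caller-supplied
# embed function, exposing the encode_queries/encode_corpus interface
# that BEIR's dense retrieval expects.

class APIEmbeddingModel:
    def __init__(self, embed_fn, batch_size=96):
        self.embed_fn = embed_fn    # callable: list[str] -> list[vector]
        self.batch_size = batch_size

    def _encode(self, texts):
        vectors = []
        for i in range(0, len(texts), self.batch_size):
            vectors.extend(self.embed_fn(texts[i:i + self.batch_size]))
        return vectors

    def encode_queries(self, queries, **kwargs):
        return self._encode(queries)

    def encode_corpus(self, corpus, **kwargs):
        # BEIR corpus entries carry "title" and "text" fields.
        texts = [(d.get("title", "") + " " + d["text"]).strip() for d in corpus]
        return self._encode(texts)

fake_embed = lambda texts: [[float(len(t))] for t in texts]  # mock API
model = APIEmbeddingModel(fake_embed, batch_size=2)
print(model.encode_queries(["q1", "longer query"]))  # → [[2.0], [12.0]]
```

A real client would also want retries and rate limiting around embed_fn, since vendor endpoints throttle large corpora.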

6. Small but mighty: Added a util function to load a TREC runfile and compute the evaluation scores with BEIR.

A small but useful utility: load a TREC runfile and compute nDCG@K or similar metric scores. I have added a util function that loads a TREC runfile into a results dictionary, which can then be evaluated against qrels to quickly obtain the metric scores.
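The TREC run format is one whitespace-separated line per (query, document) pair: qid Q0 docid rank score run_tag. A loader sketched from that format spec (not copied from BEIR's util) turns it into the {query_id: {doc_id: score}} results dictionary that BEIR's evaluation consumes:

```python
# Parse TREC run lines ("qid Q0 docid rank score run_tag") into the
# nested results dict used by BEIR: {query_id: {doc_id: score}}.

def load_trec_runfile(lines):
    results = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        qid, _q0, doc_id, _rank, score, _tag = line.split()
        results.setdefault(qid, {})[doc_id] = float(score)
    return results

run = [
    "q1 Q0 doc3 1 14.89 my-run",
    "q1 Q0 doc7 2 13.21 my-run",
    "q2 Q0 doc1 1 9.05 my-run",
]
print(load_trec_runfile(run))
```

The resulting dictionary can be scored against qrels exactly like the output of EvaluateRetrieval.retrieve().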

I'm happy to take suggestions for improving the repository, e.g., which features users would like to see and how to keep the repository relevant even though it has been 4-5 years since its inception.

What's Changed

Full Changelog: https://github.com/beir-cellar/beir/compare/v2.1.0...v2.2.0

Source: README.md, updated 2025-06-04