# Advanced Information Retrieval 2023 - Project - Group 20

This is our project for the course Advanced Information Retrieval 2023, TU Graz.
In this project, we use BERTopic to extract topic information from documents and queries and incorporate the topic information for re-ranking documents after the initial retrieval with BM25.

Authors: Gatternig Elias, Müller Hanna, Palasser Georg

[View Design Document](./design-document.pdf)

[View Presentation Slides](./slides.pdf)

## Dataset

We used the publicly available [CISI collection](https://ir.dcs.gla.ac.uk/resources/test_collections/cisi/) from the University of Glasgow. It contains 1460 documents, 112 queries and a "ground truth" of query-document matches.
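
As a rough illustration of how such a collection can be read, the sketch below parses records that use dot-prefixed field markers (`.I` for the ID, `.T` for the title, `.A` for the author, `.W` for the text). The function name `parse_cisi`, the field mapping and the toy record are assumptions made for this example and are not taken from our notebooks.

```python
def parse_cisi(text):
    """Parse dot-marker records into {id: {field_name: joined_text}}."""
    # Assumed mapping from field markers to readable names.
    fields = {".T": "title", ".A": "author", ".W": "text", ".X": "refs"}
    records = {}
    current_id = None
    current_field = None
    for line in text.splitlines():
        stripped = line.strip()
        if line.startswith(".I "):
            # A new record starts; the ID follows the marker.
            current_id = int(line.split()[1])
            records[current_id] = {}
            current_field = None
        elif stripped in fields and current_id is not None:
            # A field marker on its own line switches the active field.
            current_field = fields[stripped]
            records[current_id].setdefault(current_field, [])
        elif current_id is not None and current_field is not None:
            # Any other line belongs to the active field.
            records[current_id][current_field].append(stripped)
    return {
        rec_id: {name: " ".join(lines).strip() for name, lines in rec.items()}
        for rec_id, rec in records.items()
    }


# Toy record (made up for illustration, not an actual CISI document).
sample = """.I 1
.T
A toy title
.A
Doe, J.
.W
Some text on
two lines.
"""

docs = parse_cisi(sample)
print(docs[1]["title"])
print(docs[1]["text"])
```

The same function could be pointed at the query file, since queries use the same marker layout; the relevance judgments have a different, tabular layout and would need their own reader.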

## How to run

First, make sure all requirements are met by running `pip install -r requirements.txt` in your terminal. This installs all the packages required to run our code. Check that all libraries were installed successfully.

Second, running the "reranker-cosine.ipynb" notebook requires the "GoogleNews-vectors-negative300.bin" word-embeddings file in the "models" folder. The file was not uploaded to the repository because of its large size, but it can be downloaded [here](https://www.kaggle.com/datasets/leadbest/googlenewsvectorsnegative300).
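
Before opening the notebook, it can help to confirm that the embeddings file is where it is expected. The sketch below is one way to do that; the helper names `find_embeddings` and `load_embeddings` are hypothetical, and loading via gensim's `KeyedVectors.load_word2vec_format` is an assumption about how a word2vec binary is typically read, not a description of the notebook's code.

```python
from pathlib import Path

EMBEDDINGS_FILE = "GoogleNews-vectors-negative300.bin"


def find_embeddings(models_dir="models"):
    """Return the path to the embeddings file, or None if it is missing."""
    path = Path(models_dir) / EMBEDDINGS_FILE
    return path if path.is_file() else None


def load_embeddings(models_dir="models"):
    """Load the word2vec binary; fail early with a clear message if absent."""
    path = find_embeddings(models_dir)
    if path is None:
        raise FileNotFoundError(
            f"{EMBEDDINGS_FILE} not found in '{models_dir}'. "
            "Download it and place it in that folder first."
        )
    # Assumption: gensim is installed; imported lazily so the check above
    # also works without it.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(str(path), binary=True)


if find_embeddings() is None:
    print("Embeddings file is missing from the models folder.")
```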

## Files

<u>**initial-retrieval.ipynb**</u>
Creates the initial retrieval for the queries using BM25. Retrieves 100 of the 1460 documents per query and saves the results in _initial_retrieval_with_bm25_scores.pkl_.
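
To illustrate the idea behind this step, here is a minimal, self-contained BM25 sketch in plain Python. It is not the notebook's implementation: the function name `bm25_rank`, the whitespace tokenization and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_rank(query, docs, top_n=100, k1=1.5, b=0.75):
    """Score every document against the query and return the top_n pairs."""
    if not docs:
        return []
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs or 1.0

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            # Length normalization penalizes long documents.
            denom = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores[doc_id] = score

    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


# Toy documents (made up for illustration).
toy_docs = {
    1: "information retrieval with topic models",
    2: "cooking pasta at home",
    3: "library information science",
}
print(bm25_rank("information retrieval", toy_docs, top_n=2))
```

The returned list of `(doc_id, score)` pairs is the kind of structure that could then be pickled for the re-ranking steps.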

<u>**reranker-cosine.ipynb**</u>
Re-ranks the retrieval results using cosine similarity and the pre-trained _GoogleNews-vectors-negative300_ embeddings. Keeps 50 of the initial 100 documents per query and saves the results in _reranker_embeddings_cosine_results.pkl_.
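
A minimal sketch of embedding-based re-ranking follows. It assumes that a text is represented by the average of its word vectors, which is one common choice and not necessarily what the notebook does; the names `average_vector`, `cosine` and `rerank_by_cosine`, as well as the two-dimensional toy embeddings, are made up for the example.

```python
import math


def average_vector(tokens, embeddings):
    """Average the vectors of all known tokens; None if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank_by_cosine(query, candidates, embeddings, top_n=50):
    """Re-rank candidate documents by similarity to the query vector."""
    q_vec = average_vector(query.lower().split(), embeddings)
    scored = []
    for doc_id, text in candidates.items():
        d_vec = average_vector(text.lower().split(), embeddings)
        if q_vec is None or d_vec is None:
            score = 0.0  # no known words, so no evidence of similarity
        else:
            score = cosine(q_vec, d_vec)
        scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy 2-dimensional embeddings (the real vectors have 300 dimensions).
toy_embeddings = {
    "library": [1.0, 0.0],
    "books": [0.9, 0.1],
    "pasta": [0.0, 1.0],
}
toy_candidates = {1: "pasta", 2: "books"}
print(rerank_by_cosine("library", toy_candidates, toy_embeddings, top_n=2))
```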

<u>**reranker-bertopic.ipynb**</u>
Creates topics for all documents and queries and re-ranks the initial retrieval results. Retrieves 50 documents per query and saves the results in _reranker_bertopic_results_topic_model.pkl_.
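
One possible way to fold topic information into a ranking is to interpolate the BM25 score with a topic similarity. The sketch below shows that idea only; it is a hypothetical scheme, not the method used in the notebook. The function name `rerank_with_topics`, the weight `alpha` and the toy topic vectors are assumptions, and the topic vectors are taken as given rather than produced by BERTopic here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank_with_topics(bm25_scores, query_topics, doc_topics, alpha=0.5, top_n=50):
    """Combine normalized BM25 scores with query-document topic similarity.

    alpha = 1.0 keeps the BM25 order; alpha = 0.0 ranks by topics only.
    """
    # Scale BM25 scores to [0, 1] so both parts are comparable.
    max_score = max(bm25_scores.values(), default=0.0) or 1.0
    combined = {}
    for doc_id, score in bm25_scores.items():
        topic_sim = cosine(query_topics, doc_topics[doc_id])
        combined[doc_id] = alpha * (score / max_score) + (1 - alpha) * topic_sim
    ranked = sorted(combined.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


# Toy inputs: two candidates and a two-topic distribution (made up).
toy_bm25 = {1: 2.0, 2: 1.8}
toy_query_topics = [1.0, 0.0]
toy_doc_topics = {1: [0.0, 1.0], 2: [1.0, 0.0]}
print(rerank_with_topics(toy_bm25, toy_query_topics, toy_doc_topics))
```

In this toy case document 2 overtakes document 1 because it shares the query's topic, even though its BM25 score is slightly lower.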

<u>**evaluation.ipynb**</u>
Takes the results of all three methods and calculates Recall@k, Precision@k, F1@k and nDCG@k. Creates plots to visualize the results.
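
For reference, the four metrics can be sketched in a few lines of plain Python. This version assumes binary relevance (a document is either relevant to a query or not); the function names and the toy ranking are illustrative and not taken from the notebook.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k if k else 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def f1_at_k(ranked, relevant, k):
    """Harmonic mean of Precision@k and Recall@k."""
    p = precision_at_k(ranked, relevant, k)
    r = recall_at_k(ranked, relevant, k)
    return 2 * p * r / (p + r) if p + r else 0.0


def ndcg_at_k(ranked, relevant, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        1 / math.log2(pos + 2)
        for pos, doc_id in enumerate(ranked[:k])
        if doc_id in relevant
    )
    ideal = sum(1 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


# Toy example: a ranking of four documents, three relevant documents overall.
toy_ranking = [3, 1, 7, 5]
toy_relevant = {1, 5, 9}
print(precision_at_k(toy_ranking, toy_relevant, 4))
print(recall_at_k(toy_ranking, toy_relevant, 4))
print(f1_at_k(toy_ranking, toy_relevant, 4))
print(ndcg_at_k(toy_ranking, toy_relevant, 4))
```

Averaging each metric over all queries gives one value per method and cutoff k, which is the form usually plotted for comparison.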

<u>**dataset_and_bertopic_analysis.ipynb**</u>
Contains analysis of the dataset and BERTopic experiments.