Laboratory guides and exercises

The laboratory classes will follow the guides provided here. Follow the guides' instructions carefully. They assume that you have the basic knowledge to set up a Java project with external resources (data files and JARs).


The document also includes a set of exercises that you should use to practice for the final exam.


Pre-configured Eclipse project

If you prefer to start from an Eclipse project that already contains all the materials needed for the laboratories, you can download this project and start from there.


Materials

Lab code. In every laboratory, you are expected to implement and understand different algorithms. Basic implementations covering each lab session are available for download. You should read them and adapt them to solve the lab exercises.


trec_eval. The trec_eval evaluation source code is available on GitHub. Linux, macOS and Windows bash users must compile their own versions. Windows users who don't have bash or Cygwin installed can download the Windows binaries.


Dataset. The StackOverflow CrossValidated dataset contains a set of questions and answers published in the corresponding online forum.

The relevance judgments consist of two files: one with a set of predefined queries and another that maps each query to its relevant documents.
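As a rough illustration of how the query-to-document file could be loaded, the sketch below assumes it follows the usual trec_eval qrels layout, with four whitespace-separated fields per line (query id, iteration, document id, relevance). That layout is an assumption and is not confirmed by this page, so check the actual file before relying on it. The class name `QrelsReader` and the sample ids are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the provided lab code.
class QrelsReader {

    // Parses lines of the assumed form "queryId iteration docId relevance" and
    // returns, for each query, the set of documents judged relevant (relevance > 0).
    static Map<String, Set<String>> parse(List<String> lines) {
        Map<String, Set<String>> relevant = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length != 4) {
                throw new IllegalArgumentException("Expected 4 fields: " + line);
            }
            int relevance = Integer.parseInt(fields[3]);
            if (relevance > 0) {
                relevant.computeIfAbsent(fields[0], k -> new LinkedHashSet<>())
                        .add(fields[2]);
            }
        }
        return relevant;
    }

    public static void main(String[] args) {
        // Made-up sample lines, for illustration only.
        List<String> sample = Arrays.asList(
                "1 0 doc10 1",
                "1 0 doc11 0",
                "2 0 doc20 1");
        System.out.println(parse(sample));
    }
}
```

To read a real file, the lines can be obtained with `java.nio.file.Files.readAllLines(Paths.get(path))` and passed to `parse`. Keeping only the relevant documents in a set per query makes it cheap to check whether a retrieved document counts as a hit.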

Libraries. In order to run and implement the laboratory exercises, you must install the following libraries:
  • Lucene: A library implementing several algorithms and retrieval models. You can build your own search engine with this library.
  • Luke: A GUI tool for inspecting Lucene indexes.
  • Jsoup: An HTML parser that lets you extract only the text of a page.
  • RankLib: A learning-to-rank library implementing several machine-learning-based ranking algorithms. It requires a training set and a test set.