Web Media Mining

Computer Science PhD Course

Laboratories: Web and Media Mining

  • Running large scale experiments
  • Topic Models
  • Information extraction
  • Neural embeddings

Laboratories: Introduction to Web Search

If you are not familiar with information retrieval (IR), you should do these laboratories. Follow the instructions in the guides carefully. They assume that you know how to set up a Java project with external resources (data files and JARs).

An Eclipse project is available with all the materials needed to implement the laboratories; you can download it and start from there.

trec_eval. The trec_eval evaluation source code is available on GitHub. Linux, macOS and Windows bash users must compile their own versions. Windows users who don't have bash or Cygwin installed can download the Windows binaries.
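trec_eval reads a results file with one line per retrieved document, in the form `query_id Q0 doc_id rank score run_tag`, and is typically invoked as `trec_eval qrels_file results_file`. The following is a minimal sketch of how a Java program might format such lines. The class name `TrecRunWriter`, its method names, the four-decimal score format, and the sample ids are assumptions made for illustration; they are not part of the laboratory materials.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not from the lab materials): formats ranked results
// as lines in the TREC results-file layout that trec_eval reads.
final class TrecRunWriter {

    // One result line: query id, the literal "Q0", document id, rank, score, run tag.
    // Locale.ROOT keeps the decimal separator a dot regardless of system settings.
    static String formatLine(String queryId, String docId, int rank, double score, String runTag) {
        return String.format(Locale.ROOT, "%s Q0 %s %d %.4f %s", queryId, docId, rank, score, runTag);
    }

    // Formats a ranked list for a single query. The caller is assumed to pass
    // documents already sorted by descending score; ranks are numbered from 1.
    static List<String> formatRun(String queryId, List<String> docIds, List<Double> scores, String runTag) {
        if (docIds.size() != scores.size()) {
            throw new IllegalArgumentException("docIds and scores must have the same length");
        }
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < docIds.size(); i++) {
            lines.add(formatLine(queryId, docIds.get(i), i + 1, scores.get(i), runTag));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Sample query id, document ids, and scores are made up for the example.
        List<String> lines = formatRun("301", Arrays.asList("DOC-A", "DOC-B"), Arrays.asList(7.25, 3.5), "lab1");
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
```

Writing these lines to a text file, one per retrieved document across all queries, gives a results file that can be passed to trec_eval together with the relevance judgments (qrels) file.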

Libraries. To implement and run the laboratory exercises, you must install the following libraries:
  • Lucene: A library that implements several algorithms and retrieval models. You can build your own search engine with it.
  • Luke: A GUI tool for inspecting Lucene indexes.
  • Jsoup: An HTML parser that can extract just the text from a page.
  • RankLib: A learning-to-rank library that implements several machine-learning-based ranking algorithms. It requires a training set and a test set.