Web and Media Search - Lab materials

Laboratories: Web and Media Mining

Running large scale experiments
Topic Models
Information extraction
Neural embeddings

Laboratories: Introduction to Web Search

If you are not familiar with information retrieval (IR), you should do these laboratories. Follow the guides' instructions carefully. They assume that you have the basic knowledge needed to set up a Java project with external resources (data files and JARs).

An Eclipse project with all the materials needed to implement the laboratories is available; you can download it and start from there.

trec_eval. The trec_eval evaluation source code is available on GitHub. Linux, macOS, and Windows bash users must compile their own versions. Windows users who don't have bash or Cygwin installed can download the Windows binaries.
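To give a feel for what the evaluation tool measures, here is a minimal sketch in plain Java of average precision for a single query, which is roughly the quantity that trec_eval's `map` measure averages over queries. The class name, method name, and document IDs are hypothetical, the ranking is assumed to contain no duplicate documents, and this is an illustration only, not trec_eval's actual code.

```java
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of trec_eval or the lab project.
final class AveragePrecisionSketch {

    // Average precision for one query: sum precision@k over every rank k
    // holding a relevant document, then divide by the total number of
    // relevant documents (relevant documents never retrieved contribute zero).
    // Assumes the ranking lists each document at most once.
    static double averagePrecision(List<String> ranking, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // no judged-relevant documents: define AP as 0
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranking.size(); i++) {
            if (relevant.contains(ranking.get(i))) {
                hits++;
                sum += (double) hits / (i + 1); // precision at rank i + 1
            }
        }
        return sum / relevant.size();
    }

    public static void main(String[] args) {
        // Made-up ranking and relevance judgments for a single query.
        List<String> ranking = List.of("d3", "d1", "d7", "d2");
        Set<String> relevant = Set.of("d1", "d2", "d9");
        System.out.println(averagePrecision(ranking, relevant));
    }
}
```

In the example, relevant documents appear at ranks 2 and 4 (precision 1/2 and 2/4), and one relevant document is never retrieved, so the result is (0.5 + 0.5) / 3.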

Libraries. In order to run and implement the laboratory exercises, you must install the following libraries:

Lucene: A library that implements several indexing algorithms and retrieval models. You can build your own search engine with it.
Luke: A GUI tool for inspecting Lucene indexes.
Jsoup: An HTML parser that lets you extract only the text of a page.
RankLib: A learning-to-rank library that implements several machine-learning-based ranking algorithms. It requires a training set and a test set.

Web Media Mining

Computer Science PhD Course