Web Media Mining
Computer Science PhD Course
Laboratories: Web and Media Mining
- Running large scale experiments
- Topic Models
- Information extraction
- Neural embeddings
Laboratories: Introduction to Web Search
If you are not familiar with information retrieval (IR), you should do these laboratories. Follow the guides' instructions carefully. They assume that you have the basic knowledge needed to set up a Java project with external resources (data files and JARs).
An Eclipse project is available with all the materials needed to implement the laboratories. You can download it and start from there.
trec_eval. The trec_eval evaluation source code is available on GitHub. Linux, macOS and Windows bash users must compile their own versions. Windows users who do not have bash or Cygwin installed can download the Windows binaries.
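trec_eval scores a ranked run file against a relevance judgments (qrels) file. As a minimal sketch, assuming the usual six-column TREC run format (query id, the literal Q0, document id, rank, score, run tag), one line of a run file could be produced as follows. The class name, method name and sample values are hypothetical and are not part of the laboratory materials.

```java
import java.util.Locale;

class TrecRunLine {
    // Formats one retrieved document as a run-file line:
    // <queryId> Q0 <docId> <rank> <score> <runTag>
    static String format(String queryId, String docId, int rank, double score, String runTag) {
        // Locale.ROOT keeps the decimal separator a dot on every machine.
        return String.format(Locale.ROOT, "%s Q0 %s %d %.4f %s",
                queryId, docId, rank, score, runTag);
    }

    public static void main(String[] args) {
        // Hypothetical query id, document id and run tag.
        System.out.println(format("301", "DOC-1", 1, 12.3456, "myRun"));
    }
}
```

With a run file written this way, the compiled tool is typically invoked as `trec_eval qrels.txt run.txt`, where both file names are placeholders.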
Libraries. To run and implement the laboratory exercises, you must install the following libraries:
- Lucene: A library implementing several algorithms and retrieval models. You can build your own search engine with it.
- Luke: A GUI tool to inspect Lucene indexes.
- Jsoup: An HTML parser that can extract only the text of a page.
- RankLib: A learning-to-rank library implementing several machine-learning-based ranking algorithms. It requires a training set and a test set.
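The training and test sets that RankLib requires are plain-text files. As a rough sketch, assuming the LETOR/SVMlight-style layout (`<label> qid:<id> <feature>:<value> ... # <comment>`, with feature ids starting at 1), one line of such a file could be written like this. The class name, method name and feature values are hypothetical.

```java
import java.util.Locale;

class LetorLine {
    // Builds one training/test line: relevance label, query id,
    // 1-based feature:value pairs, and the document id as a trailing comment.
    static String format(int relevance, String queryId, double[] features, String docId) {
        StringBuilder sb = new StringBuilder();
        sb.append(relevance).append(" qid:").append(queryId);
        for (int i = 0; i < features.length; i++) {
            sb.append(' ').append(i + 1).append(':')
              .append(String.format(Locale.ROOT, "%.4f", features[i]));
        }
        sb.append(" # ").append(docId);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two hypothetical features, e.g. a retrieval score and a document length signal.
        System.out.println(format(2, "301", new double[] {0.5, 12.25}, "DOC-1"));
    }
}
```

Each query's documents share the same `qid`, which is how the ranker knows which lines to compare against each other.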