Text, web and multimedia mining

Task:

Generate a bag-of-words file from the provided news articles using the Txt2Bow utility. Analyse how stop-word removal, stemming and the length of n-grams influence the resulting vectors. To do this, vary the following parameters:

  • -stopword:none (no stop-word removal)
  • -stopword:en523 (use pre-defined list of 523 stop-words)
  • -stemmer:none (no stemming)
  • -stemmer:porter (stemming using Porter stemmer)
  • -ngramlen:1 (no n-grams)
  • -ngramlen:5 (n-grams of length 5)
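
To get a feel for what these parameters change before running Txt2Bow, the effect can be reproduced on a single sentence. The following is only a minimal sketch and not how Txt2Bow works internally: the stop-word list is a tiny made-up one (not en523), `naive_stem` is a crude suffix stripper standing in for the Porter stemmer, and `max_ngram_len` is assumed to mean "all n-grams up to this length".

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption; NOT the en523 list).
STOP_WORDS = {"the", "a", "of", "in", "is", "are", "and", "to"}


def naive_stem(word):
    """Crude suffix stripper for illustration only; NOT the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text, remove_stop_words=False, stem=False, max_ngram_len=1):
    """Return a Counter mapping each n-gram (words joined by '_') to its frequency."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    bow = Counter()
    # Assumption: collect every n-gram of length 1..max_ngram_len.
    for n in range(1, max_ngram_len + 1):
        for i in range(len(tokens) - n + 1):
            bow["_".join(tokens[i : i + n])] += 1
    return bow


doc = "the olympic games are opening and the games opened in the city"

settings = [
    ("raw unigrams", dict()),
    ("stop-words removed", dict(remove_stop_words=True)),
    ("stop-words removed + stemmed", dict(remove_stop_words=True, stem=True)),
    ("same, n-grams up to 2", dict(remove_stop_words=True, stem=True, max_ngram_len=2)),
]
for name, kwargs in settings:
    bow = bag_of_words(doc, **kwargs)
    print(f"{name}: {len(bow)} distinct features")
```

Stop-word removal and stemming shrink the vocabulary (stemming merges "opening" and "opened" into one feature), while longer n-grams enlarge it. The same comparison on the real data is what the report should discuss.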

Perform k-means clustering for two different values of k using the BowKMeans utility and analyse the results.
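
As a rough illustration of what changing k does, here is a minimal k-means sketch on five hand-made bag-of-words vectors. It is not the BowKMeans implementation: the use of cosine similarity, the seeding with the first k vectors, the fixed iteration count and the toy vocabulary are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def k_means(vectors, k, iterations=20):
    """Plain k-means; returns the cluster index assigned to each vector."""
    # Assumption: seed the centroids with the first k vectors (deterministic).
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its most similar centroid.
        assignment = [
            max(range(k), key=lambda c: cosine_similarity(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy term counts over the hypothetical vocabulary [game, team, bank, market].
docs = [
    [3, 1, 0, 0],
    [0, 0, 2, 3],
    [2, 2, 0, 0],
    [0, 1, 3, 1],
    [1, 3, 0, 0],
]

for k in (2, 3):
    print(k, k_means(docs, k))
```

With k=2 the toy documents split into a "sport-like" and a "finance-like" group; with k=3 the sport-like group is split further. On the real data, compare cluster sizes and the most characteristic words of each cluster for the two chosen values of k.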

Perform classification using the BowTrainBinSVM and BowClassify utilities for two frequent and two rare categories. For each of the selected categories, find an article on the internet that is classified as positive.
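
Choosing the two frequent and two rare categories requires knowing how many articles carry each label. The sketch below counts category frequencies on a hand-made list; the article ids and the category codes `CAT_A`, `CAT_B` and `CAT_C` are hypothetical, and reading the labels out of the provided news file is left out because its format is not described here.

```python
from collections import Counter

# Hypothetical labelled articles: (article id, list of category codes).
# Only GSPO comes from the lecture example; the other codes are made up.
articles = [
    ("a1", ["GSPO"]),
    ("a2", ["GSPO", "CAT_A"]),
    ("a3", ["CAT_A"]),
    ("a4", ["CAT_A", "CAT_B"]),
    ("a5", ["GSPO"]),
    ("a6", ["CAT_C"]),
    ("a7", ["CAT_A", "GSPO"]),
]


def category_counts(labelled_articles):
    """Count how many articles carry each category (each article counted once per category)."""
    counts = Counter()
    for _, categories in labelled_articles:
        counts.update(set(categories))
    return counts


def pick_frequent_and_rare(labelled_articles, n=2):
    """Return the n most frequent and the n least frequent category codes."""
    ranked = sorted(
        category_counts(labelled_articles).items(),
        key=lambda item: (-item[1], item[0]),  # by count descending, then by name
    )
    names = [name for name, _ in ranked]
    return names[:n], names[-n:]


frequent, rare = pick_frequent_and_rare(articles)
print("frequent:", frequent)
print("rare:", rare)
```

A rare category gives the binary SVM only a few positive training examples, so it is worth reporting the positive and negative counts alongside the classification results for each selected category.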

Present the results in a 5-10 page report and a 5-10 slide presentation (both in English).

Example from lectures:

>Txt2Bow.exe -inlndoc:news.txt -o:news.bow -stopword:none -stemmer:none -ngramlen:1
>BowKMeans.exe -i:news.bow -clusts:5
>BowTrainBinSVM.exe -i:news.bow -o:news.bowmd -cat:GSPO
>BowClassify.exe -ibow:news.bow -imd:news.bowmd -qs:"olympic games"
>BowClassify.exe -ibow:news.bow -imd:news.bowmd -qh:article1.txt

Material:

Deadlines:

  • MPS – 24-03-2010
  • Statistics – 08-12-2010