org.knowceans.corpus.analysis
Class TopicsConverter

java.lang.Object
  extended by org.knowceans.corpus.analysis.TopicsConverter
Direct Known Subclasses:
AtmTopicsConverter

public class TopicsConverter
extends java.lang.Object

Extracts topics from the Phi and Theta variables and computes the Bayesian equivalent of phi[z][w], namely P(z|w) = P(w|z) P(z) / sum_z'(P(w|z') P(z')), or, equivalently, of theta[d][z], namely P(d|z) = P(z|d) P(d) / sum_d'(P(z|d') P(d')).

Author:
heinrich

Constructor Summary
TopicsConverter()
           
 
Method Summary
protected  void analyse(java.lang.String filename, boolean transposed, java.lang.String labelFilename, double threshold, double postThreshold, java.lang.String comment, java.lang.String postComment)
          Analyse binary probability matrix (conditional).
static java.util.Vector<org.knowceans.map.TreeMultiMap<java.lang.Double,java.lang.Integer>> extractTopics(double[][] a, double threshold, boolean transposed)
          Extract topic lists for the probability matrix a (topics in columns).
static void main(java.lang.String[] args)
           
protected static double[] normaliseRows(double[][] matrix)
          Normalises the rows of the matrix in place and returns the vector of normalisation factors.
static double[][] posterior(double[][] likelihood)
          Calculate posterior probability p(y|x) = p(x|y) / sum_y'(p(x|y')) with uniform prior p(y) = const.
static double[][] posterior(double[][] likelihood, double[] prior)
          Calculate posterior probability p(y|x) = p(x|y) p(y) / sum_y'(p(x|y') p(y')) with a prior given.
static void printMatrix(double[][] a)
           
 void run(java.lang.String corpus, java.lang.String model)
           
static void saveTopics(java.lang.String filename, java.util.Vector<org.knowceans.map.TreeMultiMap<java.lang.Double,java.lang.Integer>> topics, java.lang.String comment, java.lang.String additional)
          Saves the topic maps to a human-readable file; looks up the row labels (terms or document names) from the additional file.
static void test()
          Test driver for the posterior calculation.
static org.knowceans.map.TreeMultiMap<java.lang.Double,java.lang.Integer> truncateMap(org.knowceans.map.TreeMultiMap<java.lang.Double,java.lang.Integer> sorter, double threshold)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TopicsConverter

public TopicsConverter()
Method Detail

main

public static void main(java.lang.String[] args)

run

public void run(java.lang.String corpus,
                java.lang.String model)

analyse

protected void analyse(java.lang.String filename,
                       boolean transposed,
                       java.lang.String labelFilename,
                       double threshold,
                       double postThreshold,
                       java.lang.String comment,
                       java.lang.String postComment)
Analyse binary probability matrix (conditional).

Parameters:
filename - binary file with original matrix
transposed - true if the matrix in the binary file has transposed normalisation
labelFilename - file from which to load the list of labels for the non-topic indices
threshold - shade visualisation threshold for original matrix or NaN to disable
postThreshold - shade visualisation threshold for posterior matrix or NaN to disable
comment - for original matrix
postComment - for posterior matrix

extractTopics

public static java.util.Vector<org.knowceans.map.TreeMultiMap<java.lang.Double,java.lang.Integer>> extractTopics(double[][] a,
                                                                                                                 double threshold,
                                                                                                                 boolean transposed)
Extract topic lists for the probability matrix a (topics in columns). The method sorts all the columns and places them in a vector of tree maps that are sorted by probability in descending order.

Parameters:
a - matrix
threshold - lower bound down to which the probabilities are extracted. If negative, the threshold is interpreted as a count of how many entries to extract for each topic
transposed - true if topics are in rows instead of columns
Returns:
a vector of treemultimaps(probability->index) in decreasing order with topic ids as vector indices (multimap is necessary since two indices can have the same probability).
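
The extraction described above can be sketched as follows. This is a minimal illustration, not the TopicsConverter source: the class name ExtractTopicsSketch and the helper truncate are hypothetical, and java.util.TreeMap<Double, List<Integer>> stands in for org.knowceans.map.TreeMultiMap, whose API is not shown here. Two further assumptions are that a non-negative threshold is inclusive and that a negative threshold is rounded to an integer count.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; TreeMap<Double, List<Integer>> stands in for TreeMultiMap. */
final class ExtractTopicsSketch {

    /** Returns one map per topic, sorted by probability in descending order. */
    static List<TreeMap<Double, List<Integer>>> extractTopics(
            double[][] a, double threshold, boolean transposed) {
        int numTopics = transposed ? a.length : a[0].length;
        int numItems = transposed ? a[0].length : a.length;
        List<TreeMap<Double, List<Integer>>> topics = new ArrayList<>();
        for (int k = 0; k < numTopics; k++) {
            TreeMap<Double, List<Integer>> sorted = new TreeMap<>(Comparator.reverseOrder());
            for (int i = 0; i < numItems; i++) {
                double p = transposed ? a[k][i] : a[i][k];
                // Several indices may share one probability, hence a list per key.
                sorted.computeIfAbsent(p, key -> new ArrayList<>()).add(i);
            }
            topics.add(truncate(sorted, threshold));
        }
        return topics;
    }

    /** Keeps entries down to the threshold, or the top -threshold entries if negative. */
    static TreeMap<Double, List<Integer>> truncate(
            TreeMap<Double, List<Integer>> sorted, double threshold) {
        TreeMap<Double, List<Integer>> kept = new TreeMap<>(Comparator.reverseOrder());
        // Assumption: a negative threshold is rounded to the number of entries to keep.
        int limit = threshold < 0 ? (int) Math.round(-threshold) : Integer.MAX_VALUE;
        int count = 0;
        for (Map.Entry<Double, List<Integer>> e : sorted.entrySet()) {
            // Assumption: the probability threshold is inclusive.
            if (threshold >= 0 && e.getKey() < threshold) {
                break;
            }
            for (int index : e.getValue()) {
                if (count == limit) {
                    return kept;
                }
                kept.computeIfAbsent(e.getKey(), key -> new ArrayList<>()).add(index);
                count++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Three items (rows) and two topics (columns).
        double[][] a = {{0.7, 0.1}, {0.2, 0.1}, {0.1, 0.8}};
        System.out.println(extractTopics(a, 0.15, false));
        System.out.println(extractTopics(a, -2, false));
    }
}
```

Grouping equal probabilities under one key mirrors the reason given above for using a multimap: a plain map would silently drop all but one of the tied indices.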

truncateMap

public static org.knowceans.map.TreeMultiMap<java.lang.Double,java.lang.Integer> truncateMap(org.knowceans.map.TreeMultiMap<java.lang.Double,java.lang.Integer> sorter,
                                                                                             double threshold)

saveTopics

public static void saveTopics(java.lang.String filename,
                              java.util.Vector<org.knowceans.map.TreeMultiMap<java.lang.Double,java.lang.Integer>> topics,
                              java.lang.String comment,
                              java.lang.String additional)
Saves the topic maps to a human-readable file; looks up the row labels (terms or document names) from the additional file.

Parameters:
filename - target file
topics - vector of maps that contain the probability->index associations for each topic
comment - put on top of the target file
additional - filename of rowlabels information (.docs, .vocab, .actors)

test

public static void test()
Test driver for the posterior calculation.


printMatrix

public static void printMatrix(double[][] a)
Parameters:
a - matrix to print

posterior

public static double[][] posterior(double[][] likelihood,
                                   double[] prior)
Calculate posterior probability p(y|x) = p(x|y) p(y) / sum_y'(p(x|y') p(y')) with a prior given.

Parameters:
likelihood - likelihood[y][x] = p(x|y), i.e., normalised along rows
prior - prior[y] = p(y), i.e., must be normalised
Returns:
posterior[x][y] = p(y|x)
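
The formula above can be sketched in a few lines. This is an illustration under stated assumptions rather than the actual implementation: the class name PosteriorSketch is hypothetical, and the handling of an x whose evidence sum_y'(p(x|y') p(y')) is zero (left as all zeros here) is an assumption, since the documentation does not say what happens in that case.

```java
import java.util.Arrays;

/** Hypothetical sketch of Bayes inversion with an explicit prior. */
final class PosteriorSketch {

    /**
     * @param likelihood likelihood[y][x] = p(x|y), normalised along rows
     * @param prior      prior[y] = p(y), normalised
     * @return posterior[x][y] = p(y|x)
     */
    static double[][] posterior(double[][] likelihood, double[] prior) {
        int ny = likelihood.length;
        int nx = likelihood[0].length;
        double[][] post = new double[nx][ny];
        for (int x = 0; x < nx; x++) {
            double evidence = 0.0; // sum_y'(p(x|y') p(y'))
            for (int y = 0; y < ny; y++) {
                post[x][y] = likelihood[y][x] * prior[y]; // joint p(x, y)
                evidence += post[x][y];
            }
            // Assumption: rows with zero evidence stay all zero to avoid division by zero.
            if (evidence > 0.0) {
                for (int y = 0; y < ny; y++) {
                    post[x][y] /= evidence;
                }
            }
        }
        return post;
    }

    public static void main(String[] args) {
        double[][] likelihood = {{0.5, 0.5}, {0.25, 0.75}};
        double[] prior = {0.2, 0.8};
        System.out.println(Arrays.deepToString(posterior(likelihood, prior)));
    }
}
```

Note the index swap: the input is indexed [y][x], while the result is indexed [x][y], so each row of the result sums to one.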

posterior

public static double[][] posterior(double[][] likelihood)
Calculate posterior probability p(y|x) = p(x|y) / sum_y'(p(x|y')) with uniform prior p(y) = const.

Parameters:
likelihood - likelihood[y][x] = p(x|y), i.e., normalised along rows
Returns:
posterior[x][y] = p(y|x)
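
With a uniform prior, the constant p(y) cancels between numerator and denominator, so the posterior reduces to normalising each column of the likelihood matrix and transposing the result. A minimal sketch of that special case follows; the class name UniformPosteriorSketch is hypothetical, and treating an all-zero column as an all-zero posterior row is an assumption.

```java
import java.util.Arrays;

/** Hypothetical sketch of Bayes inversion with a uniform prior. */
final class UniformPosteriorSketch {

    /**
     * @param likelihood likelihood[y][x] = p(x|y), normalised along rows
     * @return posterior[x][y] = p(y|x) under p(y) = const
     */
    static double[][] posterior(double[][] likelihood) {
        int ny = likelihood.length;
        int nx = likelihood[0].length;
        double[][] post = new double[nx][ny];
        for (int x = 0; x < nx; x++) {
            double columnSum = 0.0; // sum_y'(p(x|y'))
            for (int y = 0; y < ny; y++) {
                columnSum += likelihood[y][x];
            }
            for (int y = 0; y < ny; y++) {
                // Assumption: an all-zero column yields an all-zero posterior row.
                post[x][y] = columnSum > 0.0 ? likelihood[y][x] / columnSum : 0.0;
            }
        }
        return post;
    }

    public static void main(String[] args) {
        double[][] likelihood = {{0.5, 0.5}, {0.25, 0.75}};
        System.out.println(Arrays.deepToString(posterior(likelihood)));
    }
}
```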

normaliseRows

protected static double[] normaliseRows(double[][] matrix)
Normalises the rows of the matrix in place and returns the vector of normalisation factors.

Parameters:
matrix - the matrix to normalise; modified in place
Returns:
the vector of the row sums before normalisation
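
In-place row normalisation that also reports the original row sums can be sketched as below. The class name NormaliseRowsSketch is hypothetical, and leaving a zero-sum row unchanged is an assumption, since the documentation does not describe that case.

```java
import java.util.Arrays;

/** Hypothetical sketch of in-place row normalisation. */
final class NormaliseRowsSketch {

    /** Divides each row by its sum and returns the sums before normalisation. */
    static double[] normaliseRows(double[][] matrix) {
        double[] sums = new double[matrix.length];
        for (int i = 0; i < matrix.length; i++) {
            double sum = 0.0;
            for (double value : matrix[i]) {
                sum += value;
            }
            sums[i] = sum;
            // Assumption: a zero-sum row is left unchanged to avoid division by zero.
            if (sum != 0.0) {
                for (int j = 0; j < matrix[i].length; j++) {
                    matrix[i][j] /= sum;
                }
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        double[][] m = {{1.0, 3.0}, {2.0, 2.0}};
        double[] sums = normaliseRows(m);
        System.out.println(Arrays.toString(sums));
        System.out.println(Arrays.deepToString(m));
    }
}
```

Returning the row sums lets the caller recover the unnormalised values, or reuse the sums as a normalisation constant, without copying the matrix first.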