org.knowceans.corpus.analysis
Class LdaSimilarityAnalyser

java.lang.Object
  extended by org.knowceans.corpus.analysis.LdaSimilarityAnalyser

public class LdaSimilarityAnalyser
extends java.lang.Object

LdaSimilarityAnalyser analyses the distances between terms and between documents. Subclasses can implement clusterings based on this.

By convention, conditional likelihoods are normalised along rows, i.e., p(col|row) = double[row][col];
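
The row convention above can be illustrated with a minimal sketch. The class RowNormSketch and its method normaliseRows are hypothetical names introduced here for illustration and are not part of this package; the sketch only assumes that a matrix of non-negative counts is turned into conditional probabilities so that p[row][col] = p(col|row).

```java
import java.util.Arrays;

/** Hypothetical helper illustrating the row-normalisation convention. */
final class RowNormSketch {

    /** Returns a copy of counts in which every row sums to 1, so that result[row][col] = p(col|row). */
    static double[][] normaliseRows(double[][] counts) {
        double[][] p = new double[counts.length][];
        for (int row = 0; row < counts.length; row++) {
            double sum = 0;
            for (double c : counts[row]) {
                sum += c;
            }
            p[row] = new double[counts[row].length];
            for (int col = 0; col < counts[row].length; col++) {
                // assumption: an all-zero row stays at zero instead of dividing by zero
                p[row][col] = sum > 0 ? counts[row][col] / sum : 0.0;
            }
        }
        return p;
    }

    public static void main(String[] args) {
        double[][] p = normaliseRows(new double[][] {{1, 3}, {2, 2}});
        System.out.println(Arrays.deepToString(p)); // prints [[0.25, 0.75], [0.5, 0.5]]
    }
}
```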

Author:
heinrich

Field Summary
private  java.lang.String comment
           
private  java.text.NumberFormat df
           
private  java.util.Vector<java.lang.String> docnames
           
private  boolean doJensenShannon
           
private  boolean doMutualLikelihood
           
(package private) static double log2
          logarithm base
private  java.lang.String outfilebase
           
(package private)  double[][] phi
          the LDA topic--word associations
private  double[][] theta
          the LDA document--topic associations
private  org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> vocabulary
           
 
Constructor Summary
LdaSimilarityAnalyser(java.lang.String ldabase, java.lang.String corpusbase, java.lang.String outfilebase, boolean terms, boolean docs, boolean js, boolean ml, java.lang.String comment)
          Construct an LdaSimilarityAnalyser object with path bases and action indicators for term and document processing
 
Method Summary
private  double[][] allJsDistances(double[][] pxz, boolean rownorm)
          Compute all Jensen-Shannon distances between the distributions in the matrix pxz.
private  double[][] allMutualLikelihoods(double[][] pxz, double[][] pzx)
          Compute all mutual likelihoods (via a dot product, integrating out z).
private  org.knowceans.map.IndexRanking bestJsMatches(double[][] pxz, int item, int max)
          Find matching items for the item at index item (a column index), using Jensen-Shannon distance.
private  org.knowceans.map.IndexRanking bestMutLikMatches(double[][] pxz, double[][] pzx, int item, int max)
          Find matching items for the item at index item (a column index), using mutual likelihood.
static double jsDistance(double[][] pxz, int px, int qx, boolean rownorm)
          Compute the Jensen-Shannon distance between px and qx, JS(p(x1) || p(x2)), which is used analogously to klDivergence (see there) -- JS-distance is just the symmetrised KL-divergence: JS(px || qx) = 1/2 [ KL(px || qx) + KL(qx || px) ]
static double klDivergence(double[][] pxz, int px, int qx, boolean rownorm)
          Compute the Kullback-Leibler divergence between distributions px and qx, KL(px || qx) = sum_x px(x) [log px(x) - log qx(x)], where the arguments px and qx are the rows or columns of a conditional probability distribution matrix.
static void main(java.lang.String[] args)
           
static double mutualLikelihood(double[][] pxz, double[][] pzx, int x1, int x2)
          Given the distributions p(x | z) and p(z | x), calculate the likelihood that the topics of item x1 can generate item x2, i.e., p(x2 | x1) = sum_z p(x2 | z) p(z | x1).
static double mylog(double arg)
          Specialised log function (currently the binary logarithm, base 2)
private  void progress(int i)
           
private  void run()
          Perform the actual processing based on the initialisation.
private  void saveDocSimilarities(org.knowceans.map.IndexRanking[] docMatches, java.lang.String ext)
          Saves the document similarities to a file
private  void saveTermSimilarities(org.knowceans.map.IndexRanking[] termMatches, java.lang.String ext)
          Saves the term similarities to a file
private static void test()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

log2

static double log2
logarithm base


phi

double[][] phi
the LDA topic--word associations


theta

private double[][] theta
the LDA document--topic associations


outfilebase

private java.lang.String outfilebase

comment

private java.lang.String comment

docnames

private java.util.Vector<java.lang.String> docnames

vocabulary

private org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> vocabulary

doJensenShannon

private boolean doJensenShannon

doMutualLikelihood

private boolean doMutualLikelihood

df

private java.text.NumberFormat df
Constructor Detail

LdaSimilarityAnalyser

public LdaSimilarityAnalyser(java.lang.String ldabase,
                             java.lang.String corpusbase,
                             java.lang.String outfilebase,
                             boolean terms,
                             boolean docs,
                             boolean js,
                             boolean ml,
                             java.lang.String comment)
                      throws java.io.IOException
Construct an LdaSimilarityAnalyser object with path bases and action indicators for term and document processing

Parameters:
ldabase - path base of lda parameter set (path + filename excluding extensions .phi.zip and .theta.zip)
corpusbase - path base of corpus files (path + filename excluding the .vocab, .docs etc. extensions)
outfilebase - path base for output files (outfilebase.termsims and/or outfilebase.docsims)
terms - whether to process terms
docs - whether to process documents
js - whether to compute Jensen-Shannon distances
ml - whether to compute mutual likelihoods
comment -
Throws:
java.io.IOException
java.lang.NumberFormatException
Method Detail

main

public static void main(java.lang.String[] args)

test

private static void test()

run

private void run()
Perform the actual processing based on the initialisation. This creates similarity matrices for both the terms and the documents.


progress

private void progress(int i)

saveTermSimilarities

private void saveTermSimilarities(org.knowceans.map.IndexRanking[] termMatches,
                                  java.lang.String ext)
Saves the term similarities to a file

Parameters:
termMatches - IndexRanking array of term matches
ext - extension appended to outfilebase

saveDocSimilarities

private void saveDocSimilarities(org.knowceans.map.IndexRanking[] docMatches,
                                 java.lang.String ext)
Saves the document similarities to a file

Parameters:
docMatches - IndexRanking array of document matches
ext - extension appended to outfilebase

bestJsMatches

private org.knowceans.map.IndexRanking bestJsMatches(double[][] pxz,
                                                     int item,
                                                     int max)
Find matching items for the item at index item (a column index), using Jensen-Shannon distance.

Parameters:
pxz - conditional probability matrix with row normalisation
item - index of the item to find matches for
max - maximum number of matches
Returns:

bestMutLikMatches

private org.knowceans.map.IndexRanking bestMutLikMatches(double[][] pxz,
                                                         double[][] pzx,
                                                         int item,
                                                         int max)
Find matching items for the item at index item (a column index), using mutual likelihood.

Parameters:
pxz - p(x|z) as double[z][x], with normalised rows
pzx - p(z|x) as double[x][z], with normalised rows
item - index of the item to find matches for
max - maximum number of matches
Returns:
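
The ranking step can be sketched without the IndexRanking type, whose API is not shown on this page. The following is a minimal illustration, assuming that matches are ranked by descending p(x | item) and that the query item itself is not excluded; the class BestMatchesSketch and the int[] return type are hypothetical stand-ins, not the actual implementation.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch: rank items by mutual likelihood given one query item. */
final class BestMatchesSketch {

    /** Returns the indices of the (at most) max items x with the highest p(x | item). */
    static int[] bestMutLikMatches(double[][] pxz, double[][] pzx, int item, int max) {
        int n = pzx.length;      // number of items x
        double[] score = new double[n];
        for (int x = 0; x < n; x++) {
            for (int z = 0; z < pxz.length; z++) {
                // p(x | item) = sum_z p(x | z) p(z | item)
                score[x] += pxz[z][x] * pzx[item][z];
            }
        }
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) {
            idx[i] = i;
        }
        // sort indices by descending score; only one score vector is kept in memory
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> score[i]).reversed());
        int[] best = new int[Math.min(max, n)];
        for (int i = 0; i < best.length; i++) {
            best[i] = idx[i];
        }
        return best;
    }
}
```

Working on one row of scores at a time is what keeps memory use low compared with building the full square matrix first.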

allMutualLikelihoods

private double[][] allMutualLikelihoods(double[][] pxz,
                                        double[][] pzx)
Compute all mutual likelihoods (via a dot product, integrating out z). Memory efficiency is better if only the ranking of the best matches is computed.

Parameters:
pzx - p(z|x) as double[x][z], normalised rows
pxz - p(x|z) as double[z][x], normalised rows
Returns:
square matrix of size pzx.length with elements p(xcol|xrow)
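
As a rough sketch of the dot product described above, the full matrix is the product of pzx (items by topics) and pxz (topics by items). The class name AllMutualLikelihoodsSketch is hypothetical and the loop order is a choice made here for illustration, not necessarily the one used by this class.

```java
/** Hypothetical sketch of the matrix product pzx * pxz. */
final class AllMutualLikelihoodsSketch {

    /** Returns ml with ml[row][col] = p(xcol | xrow) = sum_z p(z | xrow) p(xcol | z). */
    static double[][] allMutualLikelihoods(double[][] pxz, double[][] pzx) {
        int n = pzx.length;   // number of items x
        int k = pxz.length;   // number of topics z
        double[][] ml = new double[n][n];
        for (int row = 0; row < n; row++) {
            for (int z = 0; z < k; z++) {
                double w = pzx[row][z];
                for (int col = 0; col < n; col++) {
                    ml[row][col] += w * pxz[z][col];
                }
            }
        }
        return ml;
    }
}
```

If both inputs have normalised rows, every row of the result also sums to 1, which is a convenient sanity check.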

allJsDistances

private double[][] allJsDistances(double[][] pxz,
                                  boolean rownorm)
Compute all Jensen-Shannon distances between the distributions in the matrix pxz.

Parameters:
pxz - matrix with normalised rows or columns
rownorm - whether the rows (true) or the columns (false) are normalised
Returns:

mutualLikelihood

public static double mutualLikelihood(double[][] pxz,
                                      double[][] pzx,
                                      int x1,
                                      int x2)
Given the distributions p(x | z) and p(z | x), calculate the likelihood that the topics of item x1 can generate item x2, i.e., p(x2 | x1) = sum_z p(x2 | z) p(z | x1). I call this mutual likelihood, by analogy with the closely related mutual information, but it is actually the predictive likelihood of item x2 under the model z given that item x1 has been observed.

Parameters:
pxz - p(x|z) as double[z][x], with normalised rows
pzx - p(z|x) as double[x][z], with normalised rows
x1 - index of generator item
x2 - index of generated item
Returns:
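
The sum over z can be written out directly. This is a minimal sketch under the indexing documented above (pxz as double[z][x], pzx as double[x][z]); the class name MutualLikelihoodSketch is hypothetical and the body is an illustration of the formula, not the code of this class.

```java
/** Hypothetical sketch of p(x2 | x1) = sum_z p(x2 | z) p(z | x1). */
final class MutualLikelihoodSketch {

    static double mutualLikelihood(double[][] pxz, double[][] pzx, int x1, int x2) {
        double sum = 0;
        for (int z = 0; z < pxz.length; z++) {
            // pxz[z][x2] = p(x2 | z), pzx[x1][z] = p(z | x1)
            sum += pxz[z][x2] * pzx[x1][z];
        }
        return sum;
    }
}
```

For example, with two topics, p(x2 | z) = (0.2, 0.8) and p(z | x1) = (0.9, 0.1), the result is 0.2 * 0.9 + 0.8 * 0.1 = 0.26.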

jsDistance

public static double jsDistance(double[][] pxz,
                                int px,
                                int qx,
                                boolean rownorm)
Compute the Jensen-Shannon distance between px and qx, JS(p(x1) || p(x2)), which is used analogously to klDivergence (see there) -- JS-distance is just the symmetrised KL-divergence:
 JS(px || qx) = 1/2 [ KL(px || qx) + KL(qx || px) ]
 

Parameters:
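pxz - matrix
px - first distribution (row or column index into pxz)
qx - second distribution (row or column index into pxz)
rownorm - whether to use rows or columns as distributions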
Returns:
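
The symmetrised form stated above can be sketched as follows. The class JsDistanceSketch is hypothetical, and the handling of zero probabilities is an assumption made for this illustration. Note that the sketch follows the formula documented here; the textbook Jensen-Shannon divergence instead compares each distribution with the mixture of the two.

```java
/** Hypothetical sketch of JS(px || qx) = 1/2 [ KL(px || qx) + KL(qx || px) ]. */
final class JsDistanceSketch {

    static double jsDistance(double[][] pxz, int px, int qx, boolean rownorm) {
        return 0.5 * (kl(pxz, px, qx, rownorm) + kl(pxz, qx, px, rownorm));
    }

    /** KL divergence in bits between two rows (rownorm = true) or two columns of m. */
    private static double kl(double[][] m, int a, int b, boolean rownorm) {
        int n = rownorm ? m[a].length : m.length;
        double sum = 0;
        for (int i = 0; i < n; i++) {
            double p = rownorm ? m[a][i] : m[i][a];
            double q = rownorm ? m[b][i] : m[i][b];
            if (p > 0) { // assumption: terms with p = 0 contribute nothing
                sum += p * (Math.log(p) - Math.log(q)) / Math.log(2.0);
            }
        }
        return sum;
    }
}
```

Unlike the plain KL divergence, this quantity does not depend on the order of px and qx.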

klDivergence

public static double klDivergence(double[][] pxz,
                                  int px,
                                  int qx,
                                  boolean rownorm)
Compute the Kullback-Leibler divergence between distributions px and qx,
 KL(px || qx) = sum_x px(x) [log px(x) - log qx(x)]
 
where the arguments px and qx are the rows or columns of a conditional probability distribution matrix. This method does not check the sum=1 property of the distributions.

Parameters:
pxz - matrix
px - first distribution (row or column index into pxz)
qx - second distribution (row or column index into pxz)
rownorm - whether to use the rows (true) or the columns (false) of pxz as distributions
Returns:
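
A minimal sketch of the formula above, assuming base-2 logarithms (as suggested by mylog) and assuming that terms with px(x) = 0 are skipped. The class KlSketch is a hypothetical name; like the documented method, the sketch does not check that the distributions sum to 1.

```java
/** Hypothetical sketch of KL(px || qx) = sum_x px(x) [log px(x) - log qx(x)]. */
final class KlSketch {

    static double klDivergence(double[][] pxz, int px, int qx, boolean rownorm) {
        // length of one distribution: a row if rownorm, otherwise a column
        int n = rownorm ? pxz[px].length : pxz.length;
        double kl = 0;
        for (int i = 0; i < n; i++) {
            double p = rownorm ? pxz[px][i] : pxz[i][px];
            double q = rownorm ? pxz[qx][i] : pxz[i][qx];
            if (p > 0) { // assumption: 0 * log 0 is treated as 0
                // if q is 0 while p > 0, the result becomes infinite
                kl += p * (log2(p) - log2(q));
            }
        }
        return kl;
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }
}
```

For the rows (0.5, 0.5) and (0.25, 0.75), this gives 0.5 * log2(2) + 0.5 * log2(2/3), roughly 0.2075 bits.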

mylog

public static double mylog(double arg)
Specialised log function (currently the binary logarithm, base 2)

Parameters:
arg -
Returns:
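
A base-2 logarithm is usually obtained from the natural logarithm by a change of base. The sketch below assumes that the static field log2 caches ln 2 for this purpose; that is a guess based on the field name, and Log2Sketch and LOG2 are hypothetical names.

```java
/** Hypothetical sketch of a base-2 logarithm via change of base. */
final class Log2Sketch {

    /** Assumed role of the log2 field: the natural logarithm of 2. */
    static final double LOG2 = Math.log(2.0);

    static double mylog(double arg) {
        // log2(arg) = ln(arg) / ln(2)
        return Math.log(arg) / LOG2;
    }
}
```

With this choice, divergences computed through mylog are measured in bits.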