LdaSimilarityAnalyser

org.knowceans.corpus.analysis
Class LdaSimilarityAnalyser

java.lang.Object
  org.knowceans.corpus.analysis.LdaSimilarityAnalyser

public class LdaSimilarityAnalyser
extends java.lang.Object

LdaSimilarityAnalyser analyses the distances between terms or between documents. Subclasses can implement clusterings based on this.

By convention, conditional likelihoods are normalised along rows, i.e., p(col|row) = double[row][col].
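
The row convention can be illustrated with a minimal sketch. The class `RowNormSketch` and its method `normaliseRows` are hypothetical names introduced here for illustration and are not part of this class; the sketch only assumes that each row of the matrix is meant to sum to 1.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of LdaSimilarityAnalyser. */
final class RowNormSketch {

    /** Returns a copy of m in which every row with a positive sum is scaled to sum to 1. */
    static double[][] normaliseRows(double[][] m) {
        double[][] p = new double[m.length][];
        for (int row = 0; row < m.length; row++) {
            double sum = 0.0;
            for (double v : m[row]) {
                sum += v;
            }
            p[row] = new double[m[row].length];
            for (int col = 0; col < m[row].length; col++) {
                // p[row][col] is read as p(col | row)
                p[row][col] = sum > 0.0 ? m[row][col] / sum : 0.0;
            }
        }
        return p;
    }

    public static void main(String[] args) {
        double[][] counts = { {1, 3}, {2, 2} };
        System.out.println(Arrays.deepToString(normaliseRows(counts)));
    }
}
```

With this layout, `p[0][1]` holds p(col = 1 | row = 0), and summing over the second index of any row gives 1.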

Author:: heinrich

Field Summary
`private java.lang.String`	`comment`
`private java.text.NumberFormat`	`df`
`private java.util.Vector<java.lang.String>`	`docnames`
`private boolean`	`doJensenShannon`
`private boolean`	`doMutualLikelihood`
`(package private) static double`	`log2` basis
`private java.lang.String`	`outfilebase`
`(package private) double[][]`	`phi` the LDA topic--word associations
`private double[][]`	`theta` the LDA document--topic associations
`private org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer>`	`vocabulary`

Constructor Summary
`LdaSimilarityAnalyser(java.lang.String ldabase, java.lang.String corpusbase, java.lang.String outfilebase, boolean terms, boolean docs, boolean js, boolean ml, java.lang.String comment)` Construct an LdaSimilarityAnalyser object with path bases and action indicators for term and document processing

Method Summary
`private double[][]`	`allJsDistances(double[][] pxz, boolean rownorm)` Compute all JS distances between the distributions in the matrix pxz.
`private double[][]`	`allMutualLikelihoods(double[][] pxz, double[][] pzx)` Compute all mutual likelihoods (via a dot product, integrating out z).
`private org.knowceans.map.IndexRanking`	`bestJsMatches(double[][] pxz, int item, int max)` Find matching items for the item with index i (column!)
`private org.knowceans.map.IndexRanking`	`bestMutLikMatches(double[][] pxz, double[][] pzx, int item, int max)` Find matching items for the item with index i (column!)
`static double`	`jsDistance(double[][] pxz, int px, int qx, boolean rownorm)` Compute the Jensen-Shannon distance between px and qx, JS(p(x1) \|\| p(x2)), which is used analogously to klDivergence (see there) -- JS-distance is just the symmetrised KL-divergence: JS(px \|\| qx) = 1/2 [ KL(px \|\| qx) + KL(qx \|\| px) ]
`static double`	`klDivergence(double[][] pxz, int px, int qx, boolean rownorm)` Compute the Kullback-Leibler divergence between distributions px and qx, KL(px \|\| qx) = sum_x px(x) [log px(x) - log qx(x)] where arguments px and qx are the rows or columns of a conditional probability distribution matrix.
`static void`	`main(java.lang.String[] args)`
`static double`	`mutualLikelihood(double[][] pxz, double[][] pzx, int x1, int x2)` Given the distributions p(x \| z) and p(z \| x), calculate the likelihood that the topics of item x1 can generate item x2, i.e., p(x2 \| x1) = sum_z p(x2 \| z) p(z \| x1).
`static double`	`mylog(double arg)` Specialised log function (now logarithmus dualis)
`private void`	`progress(int i)`
`private void`	`run()` Perform the actual processing based on the initialisation.
`private void`	`saveDocSimilarities(org.knowceans.map.IndexRanking[] docMatches, java.lang.String ext)` Saves the document similarities to a file
`private void`	`saveTermSimilarities(org.knowceans.map.IndexRanking[] termMatches, java.lang.String ext)` Saves the term similarities to a file
`private static void`	`test()`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

log2

static double log2

basis

phi

double[][] phi

the LDA topic--word associations

theta

private double[][] theta

the LDA document--topic associations

outfilebase

private java.lang.String outfilebase

comment

private java.lang.String comment

docnames

private java.util.Vector<java.lang.String> docnames

vocabulary

private org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> vocabulary

doJensenShannon

private boolean doJensenShannon

doMutualLikelihood

private boolean doMutualLikelihood

df

private java.text.NumberFormat df

Constructor Detail

LdaSimilarityAnalyser

public LdaSimilarityAnalyser(java.lang.String ldabase,
                             java.lang.String corpusbase,
                             java.lang.String outfilebase,
                             boolean terms,
                             boolean docs,
                             boolean js,
                             boolean ml,
                             java.lang.String comment)
                      throws java.io.IOException

Construct an LdaSimilarityAnalyser object with path bases and action indicators for term and document processing

Parameters:: ldabase - path base of LDA parameter set (path + filename excluding extensions .phi.zip and .theta.zip); corpusbase - path base of corpus files (path + filename excluding .vocab, .docs etc. extensions); outfilebase - path base for output files (outfilebase.termsims and/or outfilebase.docsims); terms -; docs -; js - Jensen-Shannon distance; ml - mutual likelihood; comment -
Throws:: java.io.IOException; java.lang.NumberFormatException

Method Detail

main

public static void main(java.lang.String[] args)

test

private static void test()

run

private void run()

Perform the actual processing based on the initialisation. This creates similarity matrices for both the terms and the documents.

progress

private void progress(int i)

saveTermSimilarities

private void saveTermSimilarities(org.knowceans.map.IndexRanking[] termMatches,
                                  java.lang.String ext)

Saves the term similarities to a file

Parameters:: termMatches - RankingMap array of term matchings; ext - extension appended to outfilebase

saveDocSimilarities

private void saveDocSimilarities(org.knowceans.map.IndexRanking[] docMatches,
                                 java.lang.String ext)

Saves the document similarities to a file

Parameters:: docMatches - RankingMap array of document matchings; ext - extension appended to outfilebase

bestJsMatches

private org.knowceans.map.IndexRanking bestJsMatches(double[][] pxz,
                                                     int item,
                                                     int max)

Find matching items for the item with index i (column!) using Jensen-Shannon distance.

Parameters:: pxz - conditional probability matrix with row normalisation; max - maximum number of matches
Returns:
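
The idea of keeping only the best matches for one item can be sketched as follows. This is not the implementation behind `bestJsMatches`: the class `BestMatchesSketch`, the plain `int[]` result (instead of `org.knowceans.map.IndexRanking`), the precomputed distance array, and the exclusion of the item itself are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper, not part of LdaSimilarityAnalyser. */
final class BestMatchesSketch {

    /**
     * Returns up to max indices with the smallest distance to the given item.
     * distances[i] is assumed to hold the distance between item and i.
     * Assumption: the item itself is excluded from its own matches.
     */
    static int[] bestMatches(double[] distances, int item, int max) {
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < distances.length; i++) {
            if (i != item) {
                candidates.add(i);
            }
        }
        // Smaller distance means a better match.
        candidates.sort(Comparator.comparingDouble(i -> distances[i]));
        int n = Math.min(max, candidates.size());
        int[] best = new int[n];
        for (int k = 0; k < n; k++) {
            best[k] = candidates.get(k);
        }
        return best;
    }
}
```

For a likelihood-based ranking such as mutual likelihood, the comparator would be reversed, since larger values indicate better matches there.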

bestMutLikMatches

private org.knowceans.map.IndexRanking bestMutLikMatches(double[][] pxz,
                                                         double[][] pzx,
                                                         int item,
                                                         int max)

Find matching items for the item with index i (column!) using mutual likelihood.

Parameters:: pzx -; item -; pxz - conditional probability matrix with row normalisation; max - maximum number of matches
Returns:

allMutualLikelihoods

private double[][] allMutualLikelihoods(double[][] pxz,
                                        double[][] pzx)

Compute all mutual likelihoods (via a dot product, integrating out z). Better memory efficiency if only ranking of best is done.

Parameters:: pzx - p(z|x) as double[x][z], normalised rows; pxz - p(x|z) as double[z][x], normalised rows
Returns:: square matrix of size pzx.length with elements p(xcol|xrow)
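
The dot product described above amounts to the matrix product of p(z|x) and p(x|z). The following is a minimal sketch under the array layouts given in the parameter list; the class name `AllMutualLikelihoodsSketch` is hypothetical, and the actual method may be organised differently.

```java
/** Hypothetical helper, not part of LdaSimilarityAnalyser. */
final class AllMutualLikelihoodsSketch {

    /**
     * pxz is p(x|z) as double[z][x], pzx is p(z|x) as double[x][z], both with normalised rows.
     * Returns ml[x1][x2] = sum_z p(z|x1) p(x2|z), i.e. p(xcol | xrow).
     */
    static double[][] allMutualLikelihoods(double[][] pxz, double[][] pzx) {
        int nItems = pzx.length;
        int nTopics = pxz.length;
        double[][] ml = new double[nItems][nItems];
        for (int x1 = 0; x1 < nItems; x1++) {
            for (int x2 = 0; x2 < nItems; x2++) {
                double sum = 0.0;
                for (int z = 0; z < nTopics; z++) {
                    sum += pzx[x1][z] * pxz[z][x2];
                }
                ml[x1][x2] = sum;
            }
        }
        return ml;
    }
}
```

The result needs memory quadratic in the number of items, which is why ranking only the best matches per item is cheaper. If both inputs have normalised rows, each row of the result also sums to 1.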

allJsDistances

private double[][] allJsDistances(double[][] pxz,
                                  boolean rownorm)

Compute all JS distances between the distributions in the matrix pxz.

Parameters:: pxz - matrix with normalised rows / columns; rownorm - whether rows (true) or columns (false) are normalised
Returns:

mutualLikelihood

public static double mutualLikelihood(double[][] pxz,
                                      double[][] pzx,
                                      int x1,
                                      int x2)

Given the distributions p(x | z) and p(z | x), calculate the likelihood that the topics of item x1 can generate item x2, i.e., p(x2 | x1) = sum_z p(x2 | z) p(z | x1). I call this mutual likelihood, by analogy with the closely related mutual information, but it is actually the predictive likelihood of item x2 under the model z given that item x1 has been observed.

Parameters:: pxz - p(x|z) as double[z][x], with normalised rows; pzx - p(z|x) as double[x][z], with normalised rows; x1 - index of generator item; x2 - index of generated item
Returns:
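
The sum over z can be written out directly. This is a minimal sketch that follows the formula and array layouts stated above; the class name `MutualLikelihoodSketch` is hypothetical and the body is not taken from the library source.

```java
/** Hypothetical helper, not part of LdaSimilarityAnalyser. */
final class MutualLikelihoodSketch {

    /**
     * p(x2 | x1) = sum_z p(x2 | z) p(z | x1),
     * with pxz = p(x|z) as double[z][x] and pzx = p(z|x) as double[x][z].
     */
    static double mutualLikelihood(double[][] pxz, double[][] pzx, int x1, int x2) {
        double sum = 0.0;
        for (int z = 0; z < pxz.length; z++) {
            sum += pxz[z][x2] * pzx[x1][z];
        }
        return sum;
    }
}
```

Note that the measure is not symmetric in general: swapping x1 and x2 changes which item's topic mixture acts as the generator.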

jsDistance

public static double jsDistance(double[][] pxz,
                                int px,
                                int qx,
                                boolean rownorm)

Compute the Jensen-Shannon distance between px and qx, JS(p(x1) || p(x2)), which is used analogously to klDivergence (see there) -- JS-distance is just the symmetrised KL-divergence:

 JS(px || qx) = 1/2 [ KL(px || qx) + KL(qx || px) ]

Parameters:: pxz -; px -; qx -; rownorm -
Returns:
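
The formula above can be sketched on two plain probability vectors. The class `SymmetrisedKlSketch` and its methods are hypothetical; the base-2 logarithm and the treatment of zero entries are assumptions. The sketch follows the definition given on this page, which differs from the more common Jensen-Shannon divergence that compares both distributions to their mixture.

```java
/** Hypothetical helper, not part of LdaSimilarityAnalyser. */
final class SymmetrisedKlSketch {

    private static final double LOG2 = Math.log(2.0);

    /** KL(p || q) in bits; terms with p[i] == 0 contribute nothing (assumption). */
    static double kl(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) {
                sum += p[i] * (Math.log(p[i]) - Math.log(q[i])) / LOG2;
            }
        }
        return sum;
    }

    /** JS(p || q) = 1/2 [ KL(p || q) + KL(q || p) ], as defined on this page. */
    static double jsDistance(double[] p, double[] q) {
        return 0.5 * (kl(p, q) + kl(q, p));
    }
}
```

Unlike the KL divergence itself, this quantity is symmetric in its arguments, and it is zero when both distributions are identical.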

klDivergence

public static double klDivergence(double[][] pxz,
                                  int px,
                                  int qx,
                                  boolean rownorm)

Compute the Kullback-Leibler divergence between distributions px and qx,

 KL(px || qx) = sum_x px(x) [log px(x) - log qx(x)]

where arguments px and qx are the rows or columns of a conditional probability distribution matrix. This method does not check the sum=1 property of the distributions.

Parameters:: pxz - matrix; px - first pdf (row or column into pxz); qx - second pdf (row or column into pxz); rownorm - whether to use rows or columns as distributions
Returns:
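
How the rownorm flag selects rows or columns can be sketched as follows. The class name `KlSketch` is hypothetical, and the base-2 logarithm and the handling of zero probabilities are assumptions; the library's own treatment of zeros is not documented here.

```java
/** Hypothetical helper, not part of LdaSimilarityAnalyser. */
final class KlSketch {

    private static final double LOG2 = Math.log(2.0);

    private static double log2(double x) {
        return Math.log(x) / LOG2;
    }

    /**
     * KL(px || qx) = sum_x px(x) [log px(x) - log qx(x)], in bits.
     * px and qx index rows of pxz if rownorm is true, columns otherwise.
     * Like the documented method, this does not check that the distributions sum to 1.
     */
    static double klDivergence(double[][] pxz, int px, int qx, boolean rownorm) {
        int n = rownorm ? pxz[0].length : pxz.length;
        double kl = 0.0;
        for (int i = 0; i < n; i++) {
            double p = rownorm ? pxz[px][i] : pxz[i][px];
            double q = rownorm ? pxz[qx][i] : pxz[i][qx];
            // Assumption: terms with p == 0 are skipped (0 log 0 = 0).
            // If q == 0 while p > 0, the result is positive infinity.
            if (p > 0.0) {
                kl += p * (log2(p) - log2(q));
            }
        }
        return kl;
    }
}
```

The divergence is not symmetric: KL(px || qx) and KL(qx || px) generally differ, which is what motivates the symmetrised form used by jsDistance.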

mylog

public static double mylog(double arg)

Specialised log function (now logarithmus dualis)

Parameters:: arg -
Returns:
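
A base-2 logarithm can be obtained from the natural logarithm by a change of base. The following is a minimal sketch; the class name `Log2Sketch` is hypothetical, and whether the library method is implemented this way is an assumption.

```java
/** Hypothetical helper, not part of LdaSimilarityAnalyser. */
final class Log2Sketch {

    /** Natural logarithm of 2, used as the divisor for the change of base. */
    static final double LOG2 = Math.log(2.0);

    /** Binary logarithm: log2(arg) = ln(arg) / ln(2). */
    static double mylog(double arg) {
        return Math.log(arg) / LOG2;
    }
}
```

With this base, divergences computed from the logarithm are measured in bits.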
