java.lang.Object org.knowceans.corpus.analysis.LdaSimilarityAnalyser
public class LdaSimilarityAnalyser
LdaSimilarityAnalyser analyses the distances between terms and between documents. Subclasses can implement clusterings based on this.
By convention, conditional likelihoods are normalised along rows, i.e., p(col|row) is stored as double[row][col].
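The row convention can be illustrated with a small sketch. The class `RowNormSketch` and the helper `normaliseRows` are hypothetical names used only for illustration and are not part of this class; the sketch merely shows what it means for `double[row][col]` to hold p(col|row) with every row summing to 1.

```java
final class RowNormSketch {
    /**
     * Returns a copy of counts in which every row sums to 1,
     * so that result[row][col] can be read as p(col | row).
     */
    static double[][] normaliseRows(double[][] counts) {
        double[][] p = new double[counts.length][];
        for (int row = 0; row < counts.length; row++) {
            double sum = 0;
            for (double c : counts[row]) {
                sum += c;
            }
            p[row] = new double[counts[row].length];
            for (int col = 0; col < counts[row].length; col++) {
                // Rows without any mass stay all-zero (an assumption of this sketch).
                p[row][col] = sum > 0 ? counts[row][col] / sum : 0.0;
            }
        }
        return p;
    }
}
```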
Field Summary | |
---|---|
private java.lang.String |
comment
|
private java.text.NumberFormat |
df
|
private java.util.Vector<java.lang.String> |
docnames
|
private boolean |
doJensenShannon
|
private boolean |
doMutualLikelihood
|
(package private) static double |
log2
basis |
private java.lang.String |
outfilebase
|
(package private) double[][] |
phi
the LDA topic--word associations |
private double[][] |
theta
the LDA document--topic associations |
private org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> |
vocabulary
|
Constructor Summary | |
---|---|
LdaSimilarityAnalyser(java.lang.String ldabase,
java.lang.String corpusbase,
java.lang.String outfilebase,
boolean terms,
boolean docs,
boolean js,
boolean ml,
java.lang.String comment)
Construct an LdaSimilarityAnalyser object with path bases and action indicators for term and document processing. |
Method Summary | |
---|---|
private double[][] |
allJsDistances(double[][] pxz,
boolean rownorm)
Compute all JS distances between the distributions in the matrix pxz. |
private double[][] |
allMutualLikelihoods(double[][] pxz,
double[][] pzx)
Compute all mutual likelihoods (via a dot product, integrating out z). |
private org.knowceans.map.IndexRanking |
bestJsMatches(double[][] pxz,
int item,
int max)
Find matching items for the item with the given index (column!) |
private org.knowceans.map.IndexRanking |
bestMutLikMatches(double[][] pxz,
double[][] pzx,
int item,
int max)
Find matching items for the item with the given index (column!) |
static double |
jsDistance(double[][] pxz,
int px,
int qx,
boolean rownorm)
Compute the Jensen-Shannon distance between px and qx, JS(p(x1) || p(x2)), which is used analogously to klDivergence (see there) -- JS-distance is just the symmetrised KL-divergence: JS(px || qx) = 1/2 [ KL(px || qx) + KL(qx || px) ] |
static double |
klDivergence(double[][] pxz,
int px,
int qx,
boolean rownorm)
Compute the Kullback-Leibler divergence between distributions px and qx, KL(px || qx) = sum_x px(x) [log px(x) - log qx(x)] where arguments px and qx are the rows or columns of a conditional probability distribution matrix. |
static void |
main(java.lang.String[] args)
|
static double |
mutualLikelihood(double[][] pxz,
double[][] pzx,
int x1,
int x2)
Given the distributions p(x | z) and p(z | x), calculate the likelihood that the topics of item x1 can generate item x2, i.e., p(x2 | x1) = sum_z p(x2 | z) p(z | x1). |
static double |
mylog(double arg)
Specialised log function (now the binary logarithm, logarithmus dualis) |
private void |
progress(int i)
|
private void |
run()
Perform the actual processing based on the initialisation. |
private void |
saveDocSimilarities(org.knowceans.map.IndexRanking[] docMatches,
java.lang.String ext)
Saves the document similarities to a file |
private void |
saveTermSimilarities(org.knowceans.map.IndexRanking[] termMatches,
java.lang.String ext)
Saves the term similarities to a file |
private static void |
test()
|
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
static double log2
double[][] phi
private double[][] theta
private java.lang.String outfilebase
private java.lang.String comment
private java.util.Vector<java.lang.String> docnames
private org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> vocabulary
private boolean doJensenShannon
private boolean doMutualLikelihood
private java.text.NumberFormat df
Constructor Detail |
---|
public LdaSimilarityAnalyser(java.lang.String ldabase, java.lang.String corpusbase, java.lang.String outfilebase, boolean terms, boolean docs, boolean js, boolean ml, java.lang.String comment) throws java.io.IOException
ldabase - path base of lda parameter set (path + filename excluding extensions .phi.zip and .theta.zip)
corpusbase - path base of corpus files (path + filename excluding .vocab, .docs etc. extensions)
outfilebase - path base for output files (outfilebase.termsims and/or outfilebase.docsims)
terms -
docs -
js - Jensen-Shannon distance
ml - mutual likelihood
comment -
java.io.IOException
java.lang.NumberFormatException
Method Detail |
---|
public static void main(java.lang.String[] args)
private static void test()
private void run()
private void progress(int i)
private void saveTermSimilarities(org.knowceans.map.IndexRanking[] termMatches, java.lang.String ext)
termMatches - RankingMap array of term matchings
ext - extension appended to outfilebase
private void saveDocSimilarities(org.knowceans.map.IndexRanking[] docMatches, java.lang.String ext)
docMatches - RankingMap array of document matchings
ext - extension appended to outfilebase
private org.knowceans.map.IndexRanking bestJsMatches(double[][] pxz, int item, int max)
pxz - conditional probability matrix with row normalisation
max - maximum number of matches
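As a rough illustration of how a list of best matches might be derived from distances, the sketch below picks the `max` closest items for one query item. It is a sketch under assumptions: `BestMatchSketch`, `bestMatches` and the plain `double[] dist` input are hypothetical, and how `IndexRanking` actually stores or orders its entries is not documented on this page.

```java
import java.util.Arrays;
import java.util.Comparator;

final class BestMatchSketch {
    /**
     * Returns the indices of the (at most) max smallest entries of dist,
     * excluding the query index item, ordered by ascending distance.
     */
    static int[] bestMatches(double[] dist, int item, int max) {
        Integer[] idx = new Integer[dist.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        // Stable sort of the indices by their distance to the query item.
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist[i]));
        return Arrays.stream(idx)
                .filter(i -> i != item)   // the item itself is not a match
                .limit(max)
                .mapToInt(Integer::intValue)
                .toArray();
    }
}
```

For a similarity score such as the mutual likelihood, where larger values mean better matches, the comparator would be reversed.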
private org.knowceans.map.IndexRanking bestMutLikMatches(double[][] pxz, double[][] pzx, int item, int max)
pzx -
item -
pxz - conditional probability matrix with row normalisation
max - maximum number of matches
private double[][] allMutualLikelihoods(double[][] pxz, double[][] pzx)
pzx - p(z|x) as double[x][z], normalised rows
pxz - p(x|z) as double[z][x], normalised rows
private double[][] allJsDistances(double[][] pxz, boolean rownorm)
pxz - with normalised rows / columns
rownorm - normalised rows (true) or columns (false)
public static double mutualLikelihood(double[][] pxz, double[][] pzx, int x1, int x2)
pxz - p(x|z) as double[z][x], with normalised rows
pzx - p(z|x) as double[x][z], with normalised rows
x1 - index of generator item
x2 - index of generated item
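The formula p(x2 | x1) = sum_z p(x2 | z) p(z | x1) from the method summary can be sketched as follows. This is a minimal illustration written from the documented formula and array layouts, not the class's actual source; the class name `MutualLikelihoodSketch` is hypothetical.

```java
final class MutualLikelihoodSketch {
    /**
     * p(x2 | x1) = sum_z p(x2 | z) p(z | x1),
     * with pxz[z][x] = p(x | z) and pzx[x][z] = p(z | x), both row-normalised.
     */
    static double mutualLikelihood(double[][] pxz, double[][] pzx, int x1, int x2) {
        double sum = 0;
        for (int z = 0; z < pxz.length; z++) {
            sum += pxz[z][x2] * pzx[x1][z];
        }
        return sum;
    }
}
```

With pxz = {{0.5, 0.5}, {0.1, 0.9}} and pzx = {{0.8, 0.2}, {0.3, 0.7}}, the likelihood of item 0 generating item 1 is 0.8 * 0.5 + 0.2 * 0.9 = 0.58, whereas item 1 generating item 0 gives 0.3 * 0.5 + 0.7 * 0.1 = 0.22, so the measure is not symmetric. Evaluating it for all pairs amounts to the matrix product of pzx and pxz.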
public static double jsDistance(double[][] pxz, int px, int qx, boolean rownorm)
JS(px || qx) = 1/2 [ KL(px || qx) + KL(qx || px) ]
pxz -
px -
qx -
rownorm -
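The definition above can be sketched on two plain probability vectors. Everything in the block is an assumption made for illustration: the class `JsSketch`, the method names, the use of base-2 logarithms, and the choice to let zero entries of the first argument contribute nothing. Note that the quantity documented here is the symmetrised KL divergence; the Jensen-Shannon divergence as commonly defined instead compares each distribution with their mixture (p + q)/2.

```java
final class JsSketch {
    /** KL(p || q) in bits; entries with p[i] == 0 contribute 0. */
    static double kl(double[] p, double[] q) {
        double sum = 0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0) {
                // Becomes infinite if q[i] == 0 while p[i] > 0.
                sum += p[i] * (Math.log(p[i]) - Math.log(q[i]));
            }
        }
        return sum / Math.log(2.0);
    }

    /** Symmetrised KL as documented above: 1/2 [ KL(p || q) + KL(q || p) ]. */
    static double symmetrisedKl(double[] p, double[] q) {
        return 0.5 * (kl(p, q) + kl(q, p));
    }
}
```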
public static double klDivergence(double[][] pxz, int px, int qx, boolean rownorm)
KL(px || qx) = sum_x px(x) [log px(x) - log qx(x)]
where arguments px and qx are the rows or columns of a conditional probability distribution matrix. This method does not check the sum=1 property of the distributions.
pxz - matrix
px - first pdf (row or column into pxz)
qx - second pdf (row or column into pxz)
rownorm - whether to use rows or columns as distributions
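How the `rownorm` flag selects rows or columns of the matrix can be sketched as below. This is an illustration written from the documented formula only, not the class's source; `KlSketch` is a hypothetical name, and the base-2 logarithm and the skipping of zero entries of the first distribution are assumptions. Like the documented method, it does not check that the distributions sum to 1.

```java
final class KlSketch {
    /**
     * KL(px || qx) in bits, where px and qx index rows of pxz (rownorm == true)
     * or columns of pxz (rownorm == false).
     */
    static double klDivergence(double[][] pxz, int px, int qx, boolean rownorm) {
        int n = rownorm ? pxz[0].length : pxz.length;
        double sum = 0;
        for (int i = 0; i < n; i++) {
            double p = rownorm ? pxz[px][i] : pxz[i][px];
            double q = rownorm ? pxz[qx][i] : pxz[i][qx];
            if (p > 0) {
                sum += p * (Math.log(p) - Math.log(q));
            }
        }
        return sum / Math.log(2.0);
    }
}
```

The same two distributions give the same divergence whether they are stored as rows of a matrix (rownorm true) or as columns of its transpose (rownorm false).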
public static double mylog(double arg)
arg
-
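A binary logarithm is usually obtained by dividing the natural logarithm by ln 2. The sketch below shows that idea; `Log2Sketch` and its members are hypothetical, and treating the `log2` field as a cached ln 2 is an assumption not confirmed by this page.

```java
final class Log2Sketch {
    /** Cached natural logarithm of 2 (the assumed role of the log2 field). */
    static final double LOG2 = Math.log(2.0);

    /** Logarithm to base 2, computed via the natural logarithm. */
    static double log2(double arg) {
        return Math.log(arg) / LOG2;
    }
}
```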