org.knowceans.dirichlet.lda
Class LdaTopicSimilarities

java.lang.Object
  extended by org.knowceans.dirichlet.lda.LdaTopicSimilarities
Direct Known Subclasses:
AtmTopicSimilarities

public class LdaTopicSimilarities
extends java.lang.Object

LdaTopicSimilarities calculates similarities between terms and documents, both known and unknown. This is the interface for LDA queries, once the topics of an unknown string have been determined. This implementation supports both the symmetrised KL-divergence (= Jensen-Shannon distance) and a predictive likelihood.

By convention, conditional likelihoods are normalised along rows, i.e., p(col|row) = double[row][col]. If the distributions run along columns instead, some methods provide a transposed flag.

Author:
heinrich

Field Summary
(package private) static double log2
          the logarithm base
protected  double[][] phi
          the LDA topic--word associations phi[z][w] = p(w|z)
protected  double[][] phiPost
          the LDA word-topic associations phiPost[w][z] = p(z|w)
protected  double[][] theta
          the LDA document--topic associations theta[d][z] = p(z|d)
protected  double[][] thetaPost
          the LDA topic-document associations thetaPost[z][d] = p(d|z)
 
Constructor Summary
LdaTopicSimilarities(LdaGibbsSampler lda, boolean terms, boolean docs, boolean pl, boolean js)
          Initialise topic similarities using an existing LDA Gibbs sampler, whose phi and theta values are shared.
LdaTopicSimilarities(java.lang.String ldabase, boolean terms, boolean docs, boolean pl, boolean js)
          Construct an LdaTopicSimilarities object with a path base and action indicators for term and document processing
 
Method Summary
protected  org.knowceans.map.IndexRanking bestJsMatches(double[][] pzx, double[] qz, int max)
          Find matching items in pzx for the distribution qz using Jensen-Shannon distance.
protected  org.knowceans.map.IndexRanking bestJsMatches(double[][] pzx, int x, int max)
          Find matching items for the item with index x (row) using Jensen-Shannon distance.
protected  org.knowceans.map.IndexRanking bestMutLikMatches(double[][] pxz, double[][] pzx, int item, int max)
          Find matching items for the item with index item (column!) using mutual likelihood.
protected  org.knowceans.map.IndexRanking bestMutLikMatches(double[][] pxz, double[] qz, int max)
          Find matching items in pzx for distribution qz, p(d|q) = sum p(d|z) p(z|q)
 org.knowceans.map.IndexRanking docDocs(int doc, boolean mutLik, int max)
          Get the most similar documents for the document doc.
 org.knowceans.map.IndexRanking docTerms(int doc, boolean mutLik, int max)
          Get the most similar terms for the doc.
 double[][] getPhi()
           
 double[][] getPhiPost()
           
 double[][] getTheta()
           
 double[][] getThetaPost()
           
static double jsDistance(double[][] pzx, int xp, int xq, boolean transposed)
          Compute the Jensen-Shannon distance between px and qx, JS(p(x1) || p(x2)), which is used analogously to klDivergence (see there) -- JS-distance is just the symmetrised KL-divergence: JS(px || qx) = 1/2 [ KL(px || qx) + KL(qx || px) ]
static double jsDistance(double[] px, double[] qx)
          Compute the Jensen-Shannon distance between px and qx, JS(p(x1) || p(x2)), which is used analogously to klDivergence (see there) -- JS-distance is just the symmetrised KL-divergence: JS(px || qx) = 1/2 [ KL(px || qx) + KL(qx || px) ]
static double klDivergence(double[][] pzx, int xp, int xq, boolean transposed)
          Compute the Kullback-Leibler divergence between distributions px and qx, KL(px || qx) = sum_x px(x) [log px(x) - log qx(x)] where arguments xp and xq are the rows of a conditional probability distribution matrix.
static double klDivergence(double[] px, double[] qx)
          Compute the Kullback-Leibler divergence between distributions px and qx, KL(px || qx) = sum_x px(x) [log px(x) - log qx(x)] where arguments px and qx are the distributions with equal length.
static double mutualLikelihood(double[][] pxz, double[][] pzx, int x1, int x2)
          Given the distributions p(x | z) and p(z | x), calculate the likelihood that the topics of item x1 can generate item x2, i.e., p(x2 | x1) = sum_z p(x2 | z) p(z | x1).
static double mylog(double arg)
          Specialised log function (currently the base-2 logarithm, logarithmus dualis)
static double[][] posterior(double[][] likelihood)
          Calculate posterior probability p(y|x) = p(x|y) / sum_y'(p(x|y')) with uniform prior p(y) = const.
 org.knowceans.map.IndexRanking[] queryDocs(double[][] topics, boolean mutLik, int max)
          Get the most similar documents for the queries expressed as array of distributions over z.
 org.knowceans.map.IndexRanking queryDocs(double[] topics, boolean mutLik, int max)
          Get the most similar documents for the query expressed as distribution over z.
 org.knowceans.map.IndexRanking[] queryTerms(double[][] topics, boolean mutLik, int max)
          Get the most similar terms for the queries expressed as array of distributions over z.
 org.knowceans.map.IndexRanking queryTerms(double[] topics, boolean mutLik, int max)
          Get the most similar terms for the query expressed as distribution over z.
 org.knowceans.map.IndexRanking termDocs(int term, boolean mutLik, int max)
          Get the most similar docs for the term.
 org.knowceans.map.IndexRanking termTerms(int term, boolean mutLik, int max)
          Get the most similar terms for the term.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

log2

static double log2
the logarithm base


phi

protected double[][] phi
the LDA topic--word associations phi[z][w] = p(w|z)


phiPost

protected double[][] phiPost
the LDA word-topic associations phiPost[w][z] = p(z|w)


theta

protected double[][] theta
the LDA document--topic associations theta[d][z] = p(z|d)


thetaPost

protected double[][] thetaPost
the LDA topic-document associations thetaPost[z][d] = p(d|z)

Constructor Detail

LdaTopicSimilarities

public LdaTopicSimilarities(java.lang.String ldabase,
                            boolean terms,
                            boolean docs,
                            boolean pl,
                            boolean js)
                     throws java.io.IOException
Construct an LdaTopicSimilarities object with a path base and action indicators for term and document processing

Parameters:
ldabase - path base of the LDA parameter set (path + filename, excluding the extensions .phi.zip and .theta.zip)
terms - load term matrix (phi)
docs - load document matrix (theta)
pl - configure for use with the predictive likelihood (also called mutual likelihood here, as it appears to be symmetric)
js - configure for use with the Jensen-Shannon distance
Throws:
java.io.IOException

LdaTopicSimilarities

public LdaTopicSimilarities(LdaGibbsSampler lda,
                            boolean terms,
                            boolean docs,
                            boolean pl,
                            boolean js)
Initialise topic similarities using an existing LDA Gibbs sampler, whose phi and theta values are shared.

Parameters:
lda -
terms -
docs -
pl -
js -
Method Detail

queryDocs

public org.knowceans.map.IndexRanking queryDocs(double[] topics,
                                                boolean mutLik,
                                                int max)
Get the most similar documents for the query expressed as distribution over z.

Parameters:
topics - distribution over z, one element per topic
mutLik - use mutual / predictive likelihood (otherwise jensen-shannon)
max - maximum number of matches
Returns:

queryTerms

public org.knowceans.map.IndexRanking queryTerms(double[] topics,
                                                 boolean mutLik,
                                                 int max)
Get the most similar terms for the query expressed as distribution over z.

Parameters:
topics - distribution over z, one element per topic
mutLik - use mutual / predictive likelihood (otherwise jensen-shannon)
max - maximum number of matches
Returns:

queryDocs

public org.knowceans.map.IndexRanking[] queryDocs(double[][] topics,
                                                  boolean mutLik,
                                                  int max)
Get the most similar documents for the queries expressed as array of distributions over z.

Parameters:
topics - distributions over z, one row per query
mutLik - use mutual / predictive likelihood (otherwise jensen-shannon)
max - maximum number of matches
Returns:

queryTerms

public org.knowceans.map.IndexRanking[] queryTerms(double[][] topics,
                                                   boolean mutLik,
                                                   int max)
Get the most similar terms for the queries expressed as array of distributions over z.

Parameters:
topics - distributions over z, one row per query
mutLik - use mutual / predictive likelihood (otherwise jensen-shannon)
max - maximum number of matches
Returns:

docDocs

public org.knowceans.map.IndexRanking docDocs(int doc,
                                              boolean mutLik,
                                              int max)
Get the most similar documents for the document doc.

Parameters:
doc - document index
mutLik - use mutual / predictive likelihood (otherwise jensen-shannon)
max - maximum number of matches
Returns:

termTerms

public org.knowceans.map.IndexRanking termTerms(int term,
                                                boolean mutLik,
                                                int max)
Get the most similar terms for the term.

Parameters:
term - term index
mutLik - use mutual / predictive likelihood (otherwise jensen-shannon)
max - maximum number of matches
Returns:

docTerms

public org.knowceans.map.IndexRanking docTerms(int doc,
                                               boolean mutLik,
                                               int max)
Get the most similar terms for the doc.

Parameters:
doc - document index
mutLik - use mutual / predictive likelihood (otherwise jensen-shannon)
max - maximum number of matches
Returns:

termDocs

public org.knowceans.map.IndexRanking termDocs(int term,
                                               boolean mutLik,
                                               int max)
Get the most similar docs for the term.

Parameters:
term - term index
mutLik - use mutual / predictive likelihood (otherwise jensen-shannon)
max - maximum number of matches
Returns:

bestJsMatches

protected org.knowceans.map.IndexRanking bestJsMatches(double[][] pzx,
                                                       int x,
                                                       int max)
Find matching items for the item with index x (row) using Jensen-Shannon distance.

Parameters:
pzx - conditional probability matrix with row normalisation
x - the distribution to be matched as row of pzx
max - maximum number of matches
Returns:

bestJsMatches

protected org.knowceans.map.IndexRanking bestJsMatches(double[][] pzx,
                                                       double[] qz,
                                                       int max)
Find matching items in pzx for the distribution qz using Jensen-Shannon distance.

Parameters:
pzx - p(z|x) = pzx[x][z]
qz - q(z)
max -
Returns:

bestMutLikMatches

protected org.knowceans.map.IndexRanking bestMutLikMatches(double[][] pxz,
                                                           double[][] pzx,
                                                           int item,
                                                           int max)
Find matching items for the item with index item (column!) using mutual likelihood.

Parameters:
pxz - p(x|z) as double[z][x], with normalised rows
pzx - p(z|x) as double[x][z], with normalised rows
item - index of the query item (column of pxz)
max - maximum number of matches
Returns:

bestMutLikMatches

protected org.knowceans.map.IndexRanking bestMutLikMatches(double[][] pxz,
                                                           double[] qz,
                                                           int max)
Find matching items in pzx for distribution qz, p(d|q) = sum p(d|z) p(z|q)

Parameters:
pxz - conditional probability matrix with row normalisation
qz - query distribution
max - maximum number of matches
Returns:
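
The scoring step above, p(d|q) = sum_z p(d|z) p(z|q), amounts to a matrix-vector product. A minimal sketch, using the row-normalisation convention documented for this class; the class and method names are hypothetical illustrations, not part of the library:

```java
// Score every item x against a query distribution q(z):
// score[x] = sum_z p(x|z) * q(z), with pxz[z][x] = p(x|z), rows normalised.
public class QueryScoreSketch {
    public static double[] queryScores(double[][] pxz, double[] qz) {
        int numItems = pxz[0].length;
        double[] scores = new double[numItems];
        for (int z = 0; z < qz.length; z++) {
            for (int x = 0; x < numItems; x++) {
                scores[x] += pxz[z][x] * qz[z];
            }
        }
        return scores;
    }
}
```

Sorting the resulting scores in descending order and keeping the top max entries yields the ranking behaviour described here.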

mutualLikelihood

public static double mutualLikelihood(double[][] pxz,
                                      double[][] pzx,
                                      int x1,
                                      int x2)
Given the distributions p(x | z) and p(z | x), calculate the likelihood that the topics of item x1 can generate item x2, i.e., p(x2 | x1) = sum_z p(x2 | z) p(z | x1). I call this mutual likelihood, analogously to the closely related mutual information, but it is actually the predictive likelihood of item x2 under the model z, given that item x1 has been observed.

Parameters:
pxz - p(x|z) as double[z][x], with normalised rows
pzx - p(z|x) as double[x][z], with normalised rows
x1 - index of generator item
x2 - index of generated item
Returns:
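
A minimal sketch of this formula, using the index conventions documented above (pxz[z][x] = p(x|z), pzx[x][z] = p(z|x)); the class name is a hypothetical stand-in, not the original implementation:

```java
// Predictive ("mutual") likelihood p(x2 | x1) = sum_z p(x2 | z) p(z | x1).
public class MutualLikelihoodSketch {
    public static double mutualLikelihood(double[][] pxz, double[][] pzx,
            int x1, int x2) {
        double sum = 0;
        for (int z = 0; z < pxz.length; z++) {
            // p(x2 | z) taken from row z of pxz, p(z | x1) from row x1 of pzx
            sum += pxz[z][x2] * pzx[x1][z];
        }
        return sum;
    }
}
```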

jsDistance

public static double jsDistance(double[][] pzx,
                                int xp,
                                int xq,
                                boolean transposed)
Compute the Jensen-Shannon distance between px and qx, JS(p(x1) || p(x2)), which is used analogously to klDivergence (see there) -- JS-distance is just the symmetrised KL-divergence:
 JS(px || qx) = 1/2 [ KL(px || qx) + KL(qx || px) ]
 

Parameters:
pzx -
xp - index of px in the matrix pzx (row)
xq - index of qx in the matrix pzx (row)
transposed - use columns instead of rows as distributions
Returns:

klDivergence

public static double klDivergence(double[][] pzx,
                                  int xp,
                                  int xq,
                                  boolean transposed)
Compute the Kullback-Leibler divergence between distributions px and qx,
 KL(px || qx) = sum_x px(x) [log px(x) - log qx(x)]
 
where arguments xp and xq are the rows of a conditional probability distribution matrix. This method does not check the sum=1 property of the distributions.

Parameters:
pzx - matrix
xp - first pdf (row into pzx)
xq - second pdf (row into pzx)
transposed - use columns as distributions
Returns:

jsDistance

public static double jsDistance(double[] px,
                                double[] qx)
Compute the Jensen-Shannon distance between px and qx, JS(p(x1) || p(x2)), which is used analogously to klDivergence (see there) -- JS-distance is just the symmetrised KL-divergence:
 JS(px || qx) = 1/2 [ KL(px || qx) + KL(qx || px) ]
 

Parameters:
px -
qx -
Returns:

klDivergence

public static double klDivergence(double[] px,
                                  double[] qx)
Compute the Kullback-Leibler divergence between distributions px and qx,
 KL(px || qx) = sum_x px(x) [log px(x) - log qx(x)]
 
where arguments px and qx are the distributions with equal length. This method does not check the sum=1 property of the distributions.

Parameters:
px - first pdf
qx - second pdf
Returns:
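
The two measures can be sketched together. This uses a base-2 logarithm, matching the log2 field and mylog; the JS "distance" is the symmetrised KL-divergence exactly as defined in this class (not the variant computed against the averaged distribution), and zero entries of px are skipped, following the convention 0 log 0 = 0. The class name is hypothetical:

```java
// KL(px || qx) = sum_x px[x] (log px[x] - log qx[x]) and its symmetrisation.
public class DivergenceSketch {
    // base-2 logarithm (logarithmus dualis, cf. mylog)
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // does not check the sum=1 property of the distributions
    public static double klDivergence(double[] px, double[] qx) {
        double kl = 0;
        for (int i = 0; i < px.length; i++) {
            if (px[i] > 0) {
                kl += px[i] * (log2(px[i]) - log2(qx[i]));
            }
        }
        return kl;
    }

    // JS(px || qx) = 1/2 [ KL(px || qx) + KL(qx || px) ]
    public static double jsDistance(double[] px, double[] qx) {
        return 0.5 * (klDivergence(px, qx) + klDivergence(qx, px));
    }
}
```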

posterior

public static double[][] posterior(double[][] likelihood)
Calculate posterior probability p(y|x) = p(x|y) / sum_y'(p(x|y')) with uniform prior p(y) = const.

Parameters:
likelihood - likelihood[y][x] = p(x|y), i.e., normalised along rows
Returns:
posterior[x][y] = p(y|x)
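
Under the uniform prior, this amounts to normalising each column of the likelihood matrix and transposing the result. A minimal sketch; the class name is a hypothetical illustration:

```java
// p(y|x) = p(x|y) / sum_y' p(x|y'), reading likelihood[y][x] = p(x|y)
// and returning posterior[x][y] = p(y|x).
public class PosteriorSketch {
    public static double[][] posterior(double[][] likelihood) {
        int ny = likelihood.length;
        int nx = likelihood[0].length;
        double[][] post = new double[nx][ny];
        for (int x = 0; x < nx; x++) {
            // normalise column x of the likelihood into row x of the posterior
            double norm = 0;
            for (int y = 0; y < ny; y++) {
                norm += likelihood[y][x];
            }
            for (int y = 0; y < ny; y++) {
                post[x][y] = likelihood[y][x] / norm;
            }
        }
        return post;
    }
}
```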

mylog

public static double mylog(double arg)
Specialised log function (currently the base-2 logarithm, logarithmus dualis)

Parameters:
arg -
Returns:

getPhi

public final double[][] getPhi()

getPhiPost

public final double[][] getPhiPost()

getTheta

public final double[][] getTheta()

getThetaPost

public final double[][] getThetaPost()