org.knowceans.dirichlet.lda
Class LdaTopicSimilarities

java.lang.Object
  extended by org.knowceans.dirichlet.lda.LdaTopicSimilarities
Direct Known Subclasses:
AtmTopicSimilarities

public class LdaTopicSimilarities
extends java.lang.Object

LdaTopicSimilarities calculates similarities between terms and documents, both known and unknown. This is the interface for LDA queries, once the topics of an unknown string have been determined. This implementation supports both the symmetrised KL-divergence (= Jensen-Shannon distance) and a predictive likelihood.

By convention, conditional likelihoods are normalised along rows, i.e., p(col|row) = double[row][col]. If the distributions run along columns instead, some methods provide a transposed flag.

Author:
heinrich

Field Summary
(package private) static double log2
          the logarithm base
protected  double[][] phi
          the LDA topic--word associations phi[z][w] = p(w|z)
protected  double[][] phiPost
          the LDA word-topic associations phiPost[w][z] = p(z|w)
protected  double[][] theta
          the LDA document--topic associations theta[d][z] = p(z|d)
protected  double[][] thetaPost
          the LDA topic-document associations thetaPost[z][d] = p(d|z)
 
Constructor Summary
LdaTopicSimilarities(LdaGibbsSampler lda, boolean terms, boolean docs, boolean pl, boolean js)
          Initialise topic similarities using an existing LDA Gibbs sampler, whose phi and theta values are shared.
LdaTopicSimilarities(java.lang.String ldabase, boolean terms, boolean docs, boolean pl, boolean js)
          Construct an LdaTopicSimilarities object with a path base and action indicators for term and document processing
 
Method Summary
protected  org.knowceans.map.IndexRanking bestJsMatches(double[][] pzx, double[] qz, int max)
          Find matching items in pzx for the distribution qz using Jensen-Shannon distance.
protected  org.knowceans.map.IndexRanking bestJsMatches(double[][] pzx, int x, int max)
          Find matching items for the item with index x (row) using Jensen-Shannon distance.
protected  org.knowceans.map.IndexRanking bestMutLikMatches(double[][] pxz, double[][] pzx, int item, int max)
          Find matching items for the item with index item (column!) using mutual likelihood.
protected  org.knowceans.map.IndexRanking bestMutLikMatches(double[][] pxz, double[] qz, int max)
          Find matching items in pzx for distribution qz, p(d|q) = sum p(d|z) p(z|q)
 org.knowceans.map.IndexRanking docDocs(int doc, boolean mutLik, int max)
          Get the most similar documents for the document doc.
 org.knowceans.map.IndexRanking docTerms(int doc, boolean mutLik, int max)
          Get the most similar terms for the doc.
 double[][] getPhi()
           
 double[][] getPhiPost()
           
 double[][] getTheta()
           
 double[][] getThetaPost()
           
static double jsDistance(double[][] pzx, int xp, int xq, boolean transposed)
          Compute the Jensen-Shannon distance between px and qx, JS(p(x1) || p(x2)), which is used analogously to klDivergence (see there) -- JS-distance is just the symmetrised KL-divergence: JS(px || qx) = 1/2 [ KL(px || qx) + KL(qx || px) ]
static double jsDistance(double[] px, double[] qx)
          Compute the Jensen-Shannon distance between px and qx, JS(p(x1) || p(x2)), which is used analogously to klDivergence (see there) -- JS-distance is just the symmetrised KL-divergence: JS(px || qx) = 1/2 [ KL(px || qx) + KL(qx || px) ]
static double klDivergence(double[][] pzx, int xp, int xq, boolean transposed)
          Compute the Kullback-Leibler divergence between distributions px and qx, KL(px || qx) = sum_x px(x) [log px(x) - log qx(x)] where arguments xp and xq are the rows of a conditional probability distribution matrix.
static double klDivergence(double[] px, double[] qx)
          Compute the Kullback-Leibler divergence between distributions px and qx, KL(px || qx) = sum_x px(x) [log px(x) - log qx(x)] where arguments px and qx are the distributions with equal length.
static double mutualLikelihood(double[][] pxz, double[][] pzx, int x1, int x2)
          Given the distributions p(x | z) and p(z | x), calculate the likelihood that the topics of item x1 can generate item x2, i.e., p(x2 | x1) = sum_z p(x2 | z) p(z | x1).
static double mylog(double arg)
          Specialised log function (currently the base-2 logarithm, logarithmus dualis)
static double[][] posterior(double[][] likelihood)
          Calculate posterior probability p(y|x) = p(x|y) / sum_y'(p(x|y')) with uniform prior p(y) = const.
 org.knowceans.map.IndexRanking[] queryDocs(double[][] topics, boolean mutLik, int max)
          Get the most similar documents for the queries expressed as array of distributions over z.
 org.knowceans.map.IndexRanking queryDocs(double[] topics, boolean mutLik, int max)
          Get the most similar documents for the query expressed as distribution over z.
 org.knowceans.map.IndexRanking[] queryTerms(double[][] topics, boolean mutLik, int max)
          Get the most similar terms for the queries expressed as array of distributions over z.
 org.knowceans.map.IndexRanking queryTerms(double[] topics, boolean mutLik, int max)
          Get the most similar terms for the query expressed as distribution over z.
 org.knowceans.map.IndexRanking termDocs(int term, boolean mutLik, int max)
          Get the most similar docs for the term.
 org.knowceans.map.IndexRanking termTerms(int term, boolean mutLik, int max)
          Get the most similar terms for the term.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

log2

static double log2
the logarithm base


phi

protected double[][] phi
the LDA topic--word associations phi[z][w] = p(w|z)


phiPost

protected double[][] phiPost
the LDA word-topic associations phiPost[w][z] = p(z|w)


theta

protected double[][] theta
the LDA document--topic associations theta[d][z] = p(z|d)


thetaPost

protected double[][] thetaPost
the LDA topic-document associations thetaPost[z][d] = p(d|z)

Constructor Detail

LdaTopicSimilarities

public LdaTopicSimilarities(java.lang.String ldabase,
                            boolean terms,
                            boolean docs,
                            boolean pl,
                            boolean js)
                     throws java.io.IOException
Construct an LdaTopicSimilarities object with a path base and action indicators for term and document processing

Parameters:
ldabase - path base of the LDA parameter set (path + filename, excluding the extensions .phi.zip and .theta.zip)
terms - load term matrix (phi)
docs - load document matrix (theta)
pl - configure for use with the predictive likelihood (also called mutual likelihood here, as it appears to be symmetric)
js - configure for use with the Jensen-Shannon distance
Throws:
java.io.IOException

LdaTopicSimilarities

public LdaTopicSimilarities(LdaGibbsSampler lda,
                            boolean terms,
                            boolean docs,
                            boolean pl,
                            boolean js)
Initialise topic similarities using an existing LDA Gibbs sampler, whose phi and theta values are shared.

Parameters:
lda -
terms -
docs -
pl -
js -
Method Detail

queryDocs

public org.knowceans.map.IndexRanking queryDocs(double[] topics,
                                                boolean mutLik,
                                                int max)
Get the most similar documents for the query expressed as distribution over z.

Parameters:
topics - distribution over z, one element per topic
mutLik - use mutual / predictive likelihood (otherwise jensen-shannon)
max - maximum number of matches
Returns:

queryTerms

public org.knowceans.map.IndexRanking queryTerms(double[] topics,
                                                 boolean mutLik,
                                                 int max)
Get the most similar terms for the query expressed as distribution over z.

Parameters:
topics - distribution over z, one element per topic
mutLik - use mutual / predictive likelihood (otherwise jensen-shannon)
max - maximum number of matches
Returns:

queryDocs

public org.knowceans.map.IndexRanking[] queryDocs(double[][] topics,
                                                  boolean mutLik,
                                                  int max)
Get the most similar documents for the queries expressed as array of distributions over z.

Parameters:
topics - distributions over z, one row per query
mutLik - use mutual / predictive likelihood (otherwise jensen-shannon)
max - maximum number of matches
Returns:

queryTerms

public org.knowceans.map.IndexRanking[] queryTerms(double[][] topics,
                                                   boolean mutLik,
                                                   int max)
Get the most similar terms for the queries expressed as array of distributions over z.

Parameters:
topics - distributions over z, one row per query
mutLik - use mutual / predictive likelihood (otherwise jensen-shannon)
max - maximum number of matches
Returns:

docDocs

public org.knowceans.map.IndexRanking docDocs(int doc,
                                              boolean mutLik,
                                              int max)
Get the most similar documents for the document doc.

Parameters:
doc - document index
mutLik - use mutual / predictive likelihood (otherwise jensen-shannon)
max - maximum number of matches
Returns:

termTerms

public org.knowceans.map.IndexRanking termTerms(int term,
                                                boolean mutLik,
                                                int max)
Get the most similar terms for the term.

Parameters:
term - term index
mutLik - use mutual / predictive likelihood (otherwise jensen-shannon)
max - maximum number of matches
Returns:

docTerms

public org.knowceans.map.IndexRanking docTerms(int doc,
                                               boolean mutLik,
                                               int max)
Get the most similar terms for the doc.

Parameters:
doc - document index
mutLik - use mutual / predictive likelihood (otherwise jensen-shannon)
max - maximum number of matches
Returns:

termDocs

public org.knowceans.map.IndexRanking termDocs(int term,
                                               boolean mutLik,
                                               int max)
Get the most similar docs for the term.

Parameters:
term - term index
mutLik - use mutual / predictive likelihood (otherwise jensen-shannon)
max - maximum number of matches
Returns:

bestJsMatches

protected org.knowceans.map.IndexRanking bestJsMatches(double[][] pzx,
                                                       int x,
                                                       int max)
Find matching items for the item with index x (row) using Jensen-Shannon distance.

Parameters:
pzx - conditional probability matrix with row normalisation
x - the distribution to be matched as row of pzx
max - maximum number of matches
Returns:

bestJsMatches

protected org.knowceans.map.IndexRanking bestJsMatches(double[][] pzx,
                                                       double[] qz,
                                                       int max)
Find matching items in pzx for the distribution qz using Jensen-Shannon distance.

Parameters:
pzx - p(z|x) = pzx[x][z]
qz - q(z)
max -
Returns:

bestMutLikMatches

protected org.knowceans.map.IndexRanking bestMutLikMatches(double[][] pxz,
                                                           double[][] pzx,
                                                           int item,
                                                           int max)
Find matching items for the item with index item (column!) using mutual likelihood.

Parameters:
pxz - p(x|z) as double[z][x], with normalised rows
pzx - p(z|x) as double[x][z], with normalised rows
item - index of the query item (column of pxz)
max - maximum number of matches
Returns:

bestMutLikMatches

protected org.knowceans.map.IndexRanking bestMutLikMatches(double[][] pxz,
                                                           double[] qz,
                                                           int max)
Find matching items in pzx for distribution qz, p(d|q) = sum p(d|z) p(z|q)

Parameters:
pxz - conditional probability matrix with row normalisation
qz - query distribution
max - maximum number of matches
Returns:
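
The scoring step above, p(d|q) = sum_z p(d|z) p(z|q), amounts to a matrix-vector product. A minimal sketch, using the row-normalisation convention documented for this class; the class and method names are hypothetical illustrations, not part of the library:

```java
// Score every item x against a query distribution q(z):
// score[x] = sum_z p(x|z) * q(z), with pxz[z][x] = p(x|z), rows normalised.
public class QueryScoreSketch {
    public static double[] queryScores(double[][] pxz, double[] qz) {
        int numItems = pxz[0].length;
        double[] scores = new double[numItems];
        for (int z = 0; z < qz.length; z++) {
            for (int x = 0; x < numItems; x++) {
                scores[x] += pxz[z][x] * qz[z];
            }
        }
        return scores;
    }
}
```

Sorting the resulting scores in descending order and keeping the top max entries yields the ranking behaviour described here.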

mutualLikelihood

public static double mutualLikelihood(double[][] pxz,
                                      double[][] pzx,
                                      int x1,
                                      int x2)
Given the distributions p(x | z) and p(z | x), calculate the likelihood that the topics of item x1 can generate item x2, i.e., p(x2 | x1) = sum_z p(x2 | z) p(z | x1). I call this mutual likelihood, analogously to the closely related mutual information, but it is actually the predictive likelihood of item x2 under the model z, given that item x1 has been observed.

Parameters:
pxz - p(x|z) as double[z][x], with normalised rows
pzx - p(z|x) as double[x][z], with normalised rows
x1 - index of generator item
x2 - index of generated item
Returns:
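
A minimal sketch of this formula, using the index conventions documented above (pxz[z][x] = p(x|z), pzx[x][z] = p(z|x)); the class name is a hypothetical stand-in, not the original implementation:

```java
// Predictive ("mutual") likelihood p(x2 | x1) = sum_z p(x2 | z) p(z | x1).
public class MutualLikelihoodSketch {
    public static double mutualLikelihood(double[][] pxz, double[][] pzx,
            int x1, int x2) {
        double sum = 0;
        for (int z = 0; z < pxz.length; z++) {
            // p(x2 | z) taken from row z of pxz, p(z | x1) from row x1 of pzx
            sum += pxz[z][x2] * pzx[x1][z];
        }
        return sum;
    }
}
```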

jsDistance

public static double jsDistance(double[][] pzx,
                                int xp,
                                int xq,
                                boolean transposed)
Compute the Jensen-Shannon distance between px and qx, JS(p(x1) || p(x2)), which is used analogously to klDivergence (see there) -- JS-distance is just the symmetrised KL-divergence:
 JS(px || qx) = 1/2 [ KL(px || qx) + KL(qx || px) ]
 

Parameters:
pzx -
xp - index of px in the matrix pzx (row)
xq - index of qx in the matrix pzx (row)
transposed - use columns instead of rows as distributions
Returns:

klDivergence

public static double klDivergence(double[][] pzx,
                                  int xp,
                                  int xq,
                                  boolean transposed)
Compute the Kullback-Leibler divergence between distributions px and qx,
 KL(px || qx) = sum_x px(x) [log px(x) - log qx(x)]
 
where arguments xp and xq are the rows of a conditional probability distribution matrix. This method does not check the sum=1 property of the distributions.

Parameters:
pzx - matrix
xp - first pdf (row into pzx)
xq - second pdf (row into pzx)
transposed - use columns as distributions
Returns:

jsDistance

public static double jsDistance(double[] px,
                                double[] qx)
Compute the Jensen-Shannon distance between px and qx, JS(p(x1) || p(x2)), which is used analogously to klDivergence (see there) -- JS-distance is just the symmetrised KL-divergence:
 JS(px || qx) = 1/2 [ KL(px || qx) + KL(qx || px) ]
 

Parameters:
px -
qx -
Returns:

klDivergence

public static double klDivergence(double[] px,
                                  double[] qx)
Compute the Kullback-Leibler divergence between distributions px and qx,
 KL(px || qx) = sum_x px(x) [log px(x) - log qx(x)]
 
where arguments px and qx are the distributions with equal length. This method does not check the sum=1 property of the distributions.

Parameters:
px - first pdf
qx - second pdf
Returns:
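
The two measures can be sketched together. This uses a base-2 logarithm, matching the log2 field and mylog; the JS "distance" is the symmetrised KL-divergence exactly as defined in this class (not the variant computed against the averaged distribution), and zero entries of px are skipped, following the convention 0 log 0 = 0. The class name is hypothetical:

```java
// KL(px || qx) = sum_x px[x] (log px[x] - log qx[x]) and its symmetrisation.
public class DivergenceSketch {
    // base-2 logarithm (logarithmus dualis, cf. mylog)
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // does not check the sum=1 property of the distributions
    public static double klDivergence(double[] px, double[] qx) {
        double kl = 0;
        for (int i = 0; i < px.length; i++) {
            if (px[i] > 0) {
                kl += px[i] * (log2(px[i]) - log2(qx[i]));
            }
        }
        return kl;
    }

    // JS(px || qx) = 1/2 [ KL(px || qx) + KL(qx || px) ]
    public static double jsDistance(double[] px, double[] qx) {
        return 0.5 * (klDivergence(px, qx) + klDivergence(qx, px));
    }
}
```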

posterior

public static double[][] posterior(double[][] likelihood)
Calculate posterior probability p(y|x) = p(x|y) / sum_y'(p(x|y')) with uniform prior p(y) = const.

Parameters:
likelihood - likelihood[y][x] = p(x|y), i.e., normalised along rows
Returns:
posterior[x][y] = p(y|x)
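
Under the uniform prior, this amounts to normalising each column of the likelihood matrix and transposing the result. A minimal sketch; the class name is a hypothetical illustration:

```java
// p(y|x) = p(x|y) / sum_y' p(x|y'), reading likelihood[y][x] = p(x|y)
// and returning posterior[x][y] = p(y|x).
public class PosteriorSketch {
    public static double[][] posterior(double[][] likelihood) {
        int ny = likelihood.length;
        int nx = likelihood[0].length;
        double[][] post = new double[nx][ny];
        for (int x = 0; x < nx; x++) {
            // normalise column x of the likelihood into row x of the posterior
            double norm = 0;
            for (int y = 0; y < ny; y++) {
                norm += likelihood[y][x];
            }
            for (int y = 0; y < ny; y++) {
                post[x][y] = likelihood[y][x] / norm;
            }
        }
        return post;
    }
}
```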

mylog

public static double mylog(double arg)
Specialised log function (currently the base-2 logarithm, logarithmus dualis)

Parameters:
arg -
Returns:

getPhi

public final double[][] getPhi()

getPhiPost

public final double[][] getPhiPost()

getTheta

public final double[][] getTheta()

getThetaPost

public final double[][] getThetaPost()