org.knowceans.corpus.analysis
Class LdaAmqDistance

java.lang.Object
  extended by org.knowceans.corpus.analysis.LdaAmqDistance

public class LdaAmqDistance
extends java.lang.Object

LdaAmqDistance analyses the distance between the topics extracted by an LDA model and those of an LS-AMQ, effectively measuring the influence of authorship and querying information on the topic distributions.

In contrast to TopicCorrelationAnalyser, which compares clusterings of documents, LdaAmqDistance compares distributions over terms, which are the entities common to the two approaches, LDA and AMQ.

Author:
heinrich

Field Summary
(package private)  double[][] amqphi
          the AMQ word--topic associations
private  java.lang.String comment
           
(package private)  double[][] ldaphi
          the LDA word--topic associations
(package private) static double log2
          base for logarithms
(package private)  int nAmqTopics
          number of categories
(package private)  int nLdaTopics
          number of LDA topics
(package private)  int nTerms
          number of terms
private  java.lang.String outfile
           
private  double sumPCatDoc
           
 
Constructor Summary
LdaAmqDistance(java.lang.String ldaphifile, java.lang.String amqphifile, java.lang.String outfile, java.lang.String comment)
           
 
Method Summary
(package private)  double entropy(double[] p)
          entropy of the distribution
static void main(java.lang.String[] args)
           
 double metric(double[][] ldaphi, double[][] amqphi)
          Variation of Information metric for a priori and a posteriori relationships (Meila 2003).
private  double mutualInfo(double[] pv, double[] pw, double[][] pzv, double[][] pzw)
          calculate mutual info for the two clusterings if pjoint is known.
 double mylog(double arg)
           
 double[] pItem(double[][] pzw)
          averaged distributions 1/n_z * sum_z p(v=s|z)
private  double pJoint(double[][] pzv, double[][] pzw, int v, int w)
          calculate joint probability for the two clusterings.
private  void run()
           
 double sum(double[] v)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

log2

static double log2
base for logarithms


nTerms

int nTerms
number of terms


nLdaTopics

int nLdaTopics
number of LDA topics


nAmqTopics

int nAmqTopics
number of categories


ldaphi

double[][] ldaphi
the LDA word--topic associations


amqphi

double[][] amqphi
the AMQ word--topic associations


outfile

private java.lang.String outfile

comment

private java.lang.String comment

sumPCatDoc

private double sumPCatDoc
Constructor Detail

LdaAmqDistance

public LdaAmqDistance(java.lang.String ldaphifile,
                      java.lang.String amqphifile,
                      java.lang.String outfile,
                      java.lang.String comment)
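Parameters:
ldaphifile - file with the LDA word--topic associations
amqphifile - file with the AMQ word--topic associations
outfile - output file
comment -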
Method Detail

main

public static void main(java.lang.String[] args)

run

private void run()

metric

public double metric(double[][] ldaphi,
                     double[][] amqphi)
Variation of Information metric for a priori and a posteriori relationships (Meila 2003).

D(X, Y) = H(X) + H(Y) - 2 I(X, Y), with the entropy H(X) = - sum_x p(x) log p(x) and the mutual information I(X, Y) = KL( p(x,y) || p(x) p(y) ), i.e., the KL divergence between the actual joint distribution and the product of the marginals (x and y considered independent).
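
The formula above can be illustrated with a minimal sketch that computes the distance directly from a joint distribution table. This is not the class's own implementation: the class name VariationOfInformationSketch, the method names, the joint[x][y] layout, and the choice of base-2 logarithms are all assumptions made for illustration.

```java
/** Hypothetical helper for illustration; not part of LdaAmqDistance. */
final class VariationOfInformationSketch {

    /** Base-2 logarithm (assumed base; the formula holds for any base). */
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /**
     * D(X, Y) = H(X) + H(Y) - 2 I(X, Y) for a joint table
     * joint[x][y] = p(x, y) whose entries sum to 1.
     */
    static double variationOfInformation(double[][] joint) {
        int nx = joint.length;
        int ny = joint[0].length;

        // Marginals p(x) and p(y) obtained by summing out the other variable.
        double[] px = new double[nx];
        double[] py = new double[ny];
        for (int x = 0; x < nx; x++) {
            for (int y = 0; y < ny; y++) {
                px[x] += joint[x][y];
                py[y] += joint[x][y];
            }
        }

        // Entropies, using the convention 0 * log 0 = 0.
        double hx = 0.0;
        for (double p : px) {
            if (p > 0.0) hx -= p * log2(p);
        }
        double hy = 0.0;
        for (double p : py) {
            if (p > 0.0) hy -= p * log2(p);
        }

        // Mutual information I(X, Y) = KL( p(x,y) || p(x) p(y) ).
        double mi = 0.0;
        for (int x = 0; x < nx; x++) {
            for (int y = 0; y < ny; y++) {
                double pxy = joint[x][y];
                if (pxy > 0.0) mi += pxy * log2(pxy / (px[x] * py[y]));
            }
        }
        return hx + hy - 2.0 * mi;
    }
}
```

For two identical clusterings the mutual information equals the entropy, so the distance is 0; for independent clusterings the mutual information is 0 and the distance is H(X) + H(Y).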


pItem

public double[] pItem(double[][] pzw)
averaged distributions 1/n_z * sum_z p(v=s|z)

Parameters:
pzw - p(v|z), e.g., phi.
Returns:
the averaged distribution p(v=s)
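
The averaging described for pItem might look like the following sketch. The array layout pzv[z][s] = p(v=s|z), the uniform weight 1/n_z, and the names ItemAverageSketch and average are assumptions for illustration, not taken from the class source.

```java
/** Hypothetical helper for illustration; not part of LdaAmqDistance. */
final class ItemAverageSketch {

    /**
     * Averages the conditional distributions over z:
     * p(v=s) = (1/n_z) * sum_z p(v=s|z), with pzv[z][s] = p(v=s|z).
     */
    static double[] average(double[][] pzv) {
        int nz = pzv.length;
        int ns = pzv[0].length;
        double[] p = new double[ns];
        for (double[] row : pzv) {
            for (int s = 0; s < ns; s++) {
                p[s] += row[s] / nz; // each z contributes with weight 1/n_z
            }
        }
        return p;
    }
}
```

If every row of the input sums to 1, the averaged result sums to 1 as well.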

entropy

double entropy(double[] p)
entropy of the distribution

Parameters:
p - probability distribution
Returns:
entropy of p
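
As a rough illustration of the entropy computation, here is a minimal sketch. It assumes base-2 logarithms (suggested by the log2 field, but not confirmed by this page) and the usual convention 0 * log 0 = 0; the class name EntropySketch is hypothetical.

```java
/** Hypothetical helper for illustration; not part of LdaAmqDistance. */
final class EntropySketch {

    private static final double LOG2 = Math.log(2.0);

    /** Shannon entropy H(p) = - sum_i p_i log2 p_i, in bits. */
    static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p) {
            if (pi > 0.0) { // skip zero entries: 0 * log 0 is taken as 0
                h -= pi * Math.log(pi) / LOG2;
            }
        }
        return h;
    }
}
```

A uniform distribution over two outcomes has an entropy of 1 bit, and a distribution concentrated on a single outcome has an entropy of 0.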

mutualInfo

private double mutualInfo(double[] pv,
                          double[] pw,
                          double[][] pzv,
                          double[][] pzw)
calculate mutual info for the two clusterings if pjoint is known.

Parameters:
pv - categories distribution p(v=s)
pw - topics distribution p(w=t)
pzv - conditional distributions p(v=s|z)
pzw - conditional distributions p(w=t|z)
Returns:

pJoint

private double pJoint(double[][] pzv,
                      double[][] pzw,
                      int v,
                      int w)
calculate joint probability for the two clusterings.

Returns:
the entry pjoint[cat][topic] of the joint distribution, for cat = v and topic = w

mylog

public double mylog(double arg)

sum

public double sum(double[] v)