org.knowceans.sandbox.hlda
Class HldaGibbsSampler

java.lang.Object
  extended by org.knowceans.dirichlet.lda.LdaGibbsSampler
      extended by org.knowceans.sandbox.hlda.HldaGibbsSampler
All Implemented Interfaces:
java.io.Serializable

public class HldaGibbsSampler
extends LdaGibbsSampler

Gibbs sampler for estimating the best assignments of topics for words and documents in a corpus, using a hierarchical approach with a nested Chinese restaurant process (CRP) prior. The algorithm is introduced in "Hierarchical Topic Models and the Nested Chinese Restaurant Process" by David M. Blei, Thomas L. Griffiths, Michael I. Jordan and Joshua B. Tenenbaum (2004).

Author:
heinrich
See Also:
Serialized Form

Field Summary
(package private)  NestedCrpNode[][] c
          c[m][ell] is the restaurant corresponding to the ell'th topic for document m.
(package private)  int M
          number of documents
(package private)  NestedCrpNode ncrp
          the nested CRP structure into which c points
(package private)  int[][][] ncw
          ncw[ell][m][v] number of times topic ell was assigned word v in document m (need doc index because topics are indexed from the CRP tree nodes and need document-specific counts).
(package private)  int[][] ncwsum
          ncwsum[ell][m] number of words from document m assigned to topic ell
 
Fields inherited from class org.knowceans.dirichlet.lda.LdaGibbsSampler
backupIteration, conf, dispcol, numstats, phisum, rand, state, thetasum
 
Constructor Summary
HldaGibbsSampler(int[][] documents, int V)
          Initialise the Gibbs sampler with data.
 
Method Summary
 NestedCrpNode getNcrp()
          Retrieve estimated nested topic structure.
private  void gibbs(int L, double alpha, double eta, double gamma)
          Main method: Select initial state.
 void initialState(int L, double gamma)
          Initialisation: LDA: Must start with an assignment of observations to topics.
static void main(java.lang.String[] args)
           
protected  NestedCrpNode[] sampleCrpFullConditional(int m)
          Sample a tree path c_m from the full conditional distribution: p(c_m | c_-m, w_-m, z) \propto p(w_m | c, w_-m, z) p(c_m | c_-m), where p(c_m | c_-m) = m_i / (gamma + m - 1) for occupied tables and = gamma / (gamma + m - 1) for unoccupied tables is the CRP prior.
protected  void updateParams()
          Add to the statistics the values of theta and phi for the current state.
 
Methods inherited from class org.knowceans.dirichlet.lda.LdaGibbsSampler
getPhi, getState, getTheta, gibbs, gibbs, gibbsHeap, gibbsHeap, initialState, load, output, run, sampleCorpus, sampleLdaFullConditional, save, saveState, updatePhi, updateTheta, writeParameters
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

c

NestedCrpNode[][] c
c[m][ell] is the restaurant corresponding to the ell'th topic for document m. (Points at nodes in the hierarchy ncrp.)


ncrp

NestedCrpNode ncrp
the nested CRP structure into which c points


M

int M
number of documents


ncw

int[][][] ncw
ncw[ell][m][v] number of times topic ell was assigned word v in document m (need doc index because topics are indexed from the CRP tree nodes and need document-specific counts). TODO: also put the totals here which are now non-sparse in tree nodes.


ncwsum

int[][] ncwsum
ncwsum[ell][m] number of words from document m assigned to topic ell

Constructor Detail

HldaGibbsSampler

public HldaGibbsSampler(int[][] documents,
                        int V)
Initialise the Gibbs sampler with data.

Parameters:
documents - document content
V - vocabulary size
Method Detail

initialState

public void initialState(int L,
                         double gamma)
Initialisation: LDA: Must start with an assignment of observations to topics. CRP: random tree and respective assignments c.

Parameters:
L - number of topics (same as L in super)
gamma - CRP concentration parameter
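A minimal sketch of the CRP part of this initialisation, assuming a hypothetical Node class in place of NestedCrpNode: each document is seated along a random root-to-leaf path of depth L, choosing an existing child with probability proportional to its occupancy and a new child with probability proportional to gamma. The names Node, customers and samplePath are illustrative, not members of this class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch: random tree growth and path assignment for one document,
// as done level-by-level when building the initial nested CRP state.
public class CrpInitDemo {
    static class Node {
        int customers = 0;                 // documents seated at this restaurant
        List<Node> children = new ArrayList<>();
    }

    /** Seat one document along a random root-to-leaf path of length L. */
    static Node[] samplePath(Node root, int L, double gamma, Random rand) {
        Node[] path = new Node[L];
        root.customers++;
        path[0] = root;                    // every document passes through the root
        Node cur = root;
        for (int level = 1; level < L; level++) {
            double norm = gamma;           // mass for opening a new table
            for (Node ch : cur.children) norm += ch.customers;
            double u = rand.nextDouble() * norm;
            Node next = null;
            for (Node ch : cur.children) { // existing child: prob m_i / (gamma + m)
                u -= ch.customers;
                if (u < 0) { next = ch; break; }
            }
            if (next == null) {            // new child: prob gamma / (gamma + m)
                next = new Node();
                cur.children.add(next);
            }
            next.customers++;
            path[level] = next;
            cur = next;
        }
        return path;
    }

    public static void main(String[] args) {
        Node root = new Node();
        Node[] path = samplePath(root, 3, 1.0, new Random(7));
        System.out.println(path.length);   // 3: one node per level
    }
}
```

The very first document always opens a fresh table at every level, so the initial tree is a single path; later documents branch off with probability proportional to gamma.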

gibbs

private void gibbs(int L,
                   double alpha,
                   double eta,
                   double gamma)
Main method: Select an initial state. Repeat a large number of times: (1) select an element, (2) update it conditional on the other elements. If appropriate, output a summary for each run.

"Conceptually, we divide the Gibbs sampler into two parts. First, given the current state of the CRP, we sample the z_{m,n} variables of the underlying LDA model following the [LdaGibbsSampler] algorithm [...]. Second, given the values of the LDA hidden variables, we sample the c_{m,ell} variables which are associated with the CRP prior." [HLDA paper]

Parameters:
L - number of topics
alpha - hyperprior parameter for document--topic association
eta - hyperprior parameter for topic--term association (= Griffiths' beta when phi is used for Blei's beta matrix)
gamma - CRP concentration parameter
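The CRP prior used in the second part of the sweep can be sketched as follows; crpPrior and tableCounts are illustrative names, not members of this class. Note that the Javadoc's (gamma + m - 1) counts the current document in m, whereas here m counts only the other documents, so the denominator is gamma + m.

```java
// Sketch: CRP prior p(c_m | c_-m) over the children of one tree node,
// given the occupancy counts of the existing child restaurants.
public class CrpPriorDemo {
    /** Priors for each occupied table, plus a new table in the last slot. */
    static double[] crpPrior(int[] tableCounts, double gamma) {
        int m = 0;                            // customers already seated (excl. current)
        for (int c : tableCounts) m += c;
        double norm = gamma + m;
        double[] p = new double[tableCounts.length + 1];
        for (int i = 0; i < tableCounts.length; i++)
            p[i] = tableCounts[i] / norm;     // occupied: m_i / (gamma + m)
        p[tableCounts.length] = gamma / norm; // new table: gamma / (gamma + m)
        return p;
    }

    public static void main(String[] args) {
        // Two occupied tables with 3 and 1 customers, gamma = 1:
        // priors 3/5 and 1/5, and 1/5 for a new table.
        double[] p = crpPrior(new int[]{3, 1}, 1.0);
        System.out.println(java.util.Arrays.toString(p));
    }
}
```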

sampleCrpFullConditional

protected NestedCrpNode[] sampleCrpFullConditional(int m)
Sample a tree path c_m from the full conditional distribution: p(c_m | c_-m, w_-m, z) \propto p(w_m | c, w_-m, z) p(c_m | c_-m), where p(c_m | c_-m) = m_i / (gamma + m - 1) for occupied tables and = gamma / (gamma + m - 1) for unoccupied tables is the CRP prior.

p(w_m | c, w_-m, z) = Prod_{ell=1..L} [ Gamma(nsum_-m[ c[m][ell] ] + V * eta) / Prod_w Gamma(n_-m[ c[m][ell] ][w] + eta) ] * [ Prod_w Gamma(n_-m[ c[m][ell] ][w] + n_m[ c[m][ell] ][w] + eta) / Gamma(nsum_-m[ c[m][ell] ] + nsum_m[ c[m][ell] ] + V * eta) ]

Parameters:
m - document index
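Because all counts are integers, each Gamma ratio in the likelihood term reduces to a rising factorial, Gamma(a+n)/Gamma(a) = a(a+1)...(a+n-1), so one node's factor can be computed stably in log space. A hedged sketch, where nExcl and nDoc are illustrative names for the counts n_-m and n_m of a single node c[m][ell]:

```java
// Sketch: log of one node's factor in p(w_m | c, w_-m, z), using the
// rising-factorial form of the Gamma ratios.
public class CrpLikelihoodDemo {
    /** log( Gamma(a+n) / Gamma(a) ) for integer n >= 0. */
    static double logRisingFactorial(double a, int n) {
        double s = 0.0;
        for (int k = 0; k < n; k++) s += Math.log(a + k);
        return s;
    }

    /** Log of one node's factor: counts without document m (nExcl),
     *  counts contributed by document m (nDoc), smoothing eta. */
    static double logNodeFactor(int[] nExcl, int[] nDoc, double eta) {
        int V = nExcl.length;
        int sumExcl = 0, sumDoc = 0;
        double logp = 0.0;
        for (int w = 0; w < V; w++) {
            logp += logRisingFactorial(nExcl[w] + eta, nDoc[w]);
            sumExcl += nExcl[w];
            sumDoc += nDoc[w];
        }
        logp -= logRisingFactorial(sumExcl + V * eta, sumDoc);
        return logp;
    }

    public static void main(String[] args) {
        // Two-word vocabulary: 3 prior occurrences of word 0, 1 of word 1;
        // document m contributes one more occurrence of word 0.
        double logp = logNodeFactor(new int[]{3, 1}, new int[]{1, 0}, 0.5);
        System.out.println(Math.exp(logp)); // ≈ (3 + 0.5) / (4 + 2*0.5) = 0.7
    }
}
```

Multiplying these factors over the L nodes of each candidate path, times the CRP prior, gives the unnormalised full conditional from which c_m is drawn.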

updateParams

protected void updateParams()
Add to the statistics the values of theta and phi for the current state.

Overrides:
updateParams in class LdaGibbsSampler

getNcrp

public NestedCrpNode getNcrp()
Retrieve estimated nested topic structure. TODO: If sample lag > 0 then how is averaging done?

Returns:
root node of the nested CRP tree, representing the estimated topic hierarchy

main

public static void main(java.lang.String[] args)