org.knowceans.dirichlet.sandbox
Class LdaGibbsSamplerDpa

java.lang.Object
  extended by org.knowceans.dirichlet.sandbox.LdaGibbsSamplerDpa
All Implemented Interfaces:
java.io.Serializable

Deprecated. Use LdaGibbsSamplerCps instead; this class is kept only for backward compatibility and testing.

public class LdaGibbsSamplerDpa
extends java.lang.Object
implements java.io.Serializable

Gibbs sampler for estimating the best assignments of topics for words and documents in a corpus. The algorithm is introduced in Tom Griffiths' paper "Gibbs sampling in the generative model of Latent Dirichlet Allocation" (2002).

Author:
heinrich
See Also:
Serialized Form

Field Summary
protected  double alpha
          Deprecated. Dirichlet parameter (document--topic associations)
protected  VariationOfInformationAnalyser analyser
          Deprecated.  
protected  int backupInterval
          Deprecated.  
protected  int backupIteration
          Deprecated. iteration in the last backup
protected  double beta
          Deprecated. Dirichlet parameter (topic--term associations)
protected  int BURN_IN
          Deprecated. burn-in period
protected  java.lang.String corpusname
          Deprecated.  
protected  int dispcol
          Deprecated.  
protected  int[][] documents
          Deprecated. document data (term lists)
protected  int[] interSamples
          Deprecated. iteration numbers at which intermediate (single) samples are taken.
protected  boolean interSave
          Deprecated.  
protected  boolean interTopics
          Deprecated.  
protected  int ITERATIONS
          Deprecated. max iterations
protected  int K
          Deprecated. number of topics
protected  java.lang.String messageheader
          Deprecated.  
protected  java.lang.String[] messagerecipients
          Deprecated.  
protected  java.lang.String messagetext
          Deprecated.  
protected  int[][] nd
          Deprecated. nd[d][k] number of words in document d assigned to topic k.
protected  int[] ndsum
          Deprecated. ndsum[d] total number of words in document d.
protected  int numstats
          Deprecated. size of statistics
protected  int[][] nw
          Deprecated. cwt[k][j] number of instances of word (term) j assigned to topic k.
protected  int[] nwsum
          Deprecated. nwsum[k] total number of words assigned to topic k.
protected  java.lang.String outfilename
          Deprecated.  
protected  double[][] phisum
          Deprecated. cumulative statistics of phi
protected  int SAMPLE_LAG
          Deprecated. sample lag (if -1 only one sample taken)
private static long serialVersionUID
          Deprecated.  
protected  long t0
          Deprecated.  
protected  double[][] thetasum
          Deprecated. cumulative statistics of theta
protected  int THIN_INTERVAL
          Deprecated. thinning interval (the "update statistics interval" set through configureSampler)
protected  long timeElapsed
          Deprecated.  
protected  int V
          Deprecated. vocabulary size
protected  int[][] z
          Deprecated. topic assignments for each word.
 
Constructor Summary
LdaGibbsSamplerDpa(int[][] documents, int V)
          Deprecated. Initialise the Gibbs sampler with data.
 
Method Summary
protected  void configureMessaging(java.lang.String header, java.lang.String text, java.lang.String[] recipients)
          Deprecated. configure the sampler for messaging
protected  void configureOutput(java.lang.String corpusname, java.lang.String outfilename, int backupInterval, int[] interSamples, VariationOfInformationAnalyser analyser, boolean interSave, boolean interTopics)
          Deprecated. configure the sampler output
 void configureSampler(int iterations, int burnIn, int thinInterval, int sampleLag, int K, double alpha, double beta)
          Deprecated. Configure the Gibbs sampler.
private  VariationOfInformationAnalyser.DistMetric distance(java.lang.String outname)
          Deprecated. perform a distance calculation on the estimated results
 double getAlpha()
          Deprecated.  
 double getBeta()
          Deprecated.  
 int[][] getDocuments()
          Deprecated.  
 int getK()
          Deprecated.  
 double[][] getPhi()
          Deprecated. Retrieve estimated topic--word associations.
 double[][] getTheta()
          Deprecated. Retrieve estimated document--topic associations.
protected  long getTimer()
          Deprecated. get the current value of the timer.
 int getV()
          Deprecated.  
 int[][] getZ()
          Deprecated.  
private  void gibbs()
          Deprecated. Main method: select an initial state, then repeat the sampling step a large number of times.
 void initialState()
          Deprecated. Initialisation: must start with an assignment of observations to topics.
static java.lang.Object load(java.lang.String filename)
          Deprecated. read object from the stream
static void main(java.lang.String[] args)
          Deprecated.  
protected  void output(java.lang.String analysisfile, java.lang.String addheader, java.lang.String addmessage)
          Deprecated. Calculate distance (if doDist) and replace all occurrences of $@ and $# in strings by the complete distance information and the distance value only, respectively.
protected  void sampleCorpus()
          Deprecated. sample once through the corpus.
protected  int sampleLdaFullConditional(int m, int n)
          Deprecated. Sample a topic z_i from the full conditional distribution: p(z_i = j | z_-i, w) = (n_-i,j(w_i) + beta)/(n_-i,j(.) + W * beta) * (n_-i,j(d_i) + alpha)/(n_-i,.(d_i) + K * alpha)
 void save(java.lang.String filename)
          Deprecated. Object stream only for testing.
 void setAlpha(double alpha)
          Deprecated.  
 void setBeta(double beta)
          Deprecated.  
protected  void startTimer(long offset)
          Deprecated. start the timer with an initial offset.
protected  void updateParams()
          Deprecated. Add to the statistics the values of theta and phi for the current state.
protected static void writeParameters(java.lang.String file, java.lang.String corpusname, int k, double alpha, double beta, int m, int v, int w, long duration, int iterations, int samplelag, int burnin, org.knowceans.util.Arguments a)
          Deprecated. write statistics of the current run to a text file for later review
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

serialVersionUID

private static final long serialVersionUID
Deprecated. 
See Also:
Constant Field Values

documents

protected int[][] documents
Deprecated. 
document data (term lists)


V

protected int V
Deprecated. 
vocabulary size


K

protected int K
Deprecated. 
number of topics


alpha

protected double alpha
Deprecated. 
Dirichlet parameter (document--topic associations)


beta

protected double beta
Deprecated. 
Dirichlet parameter (topic--term associations)


z

protected int[][] z
Deprecated. 
topic assignments for each word.


nw

protected int[][] nw
Deprecated. 
cwt[k][j] number of instances of word (term) j assigned to topic k.


nd

protected int[][] nd
Deprecated. 
nd[d][k] number of words in document d assigned to topic k.


nwsum

protected int[] nwsum
Deprecated. 
nwsum[k] total number of words assigned to topic k.


ndsum

protected int[] ndsum
Deprecated. 
ndsum[d] total number of words in document d.


thetasum

protected double[][] thetasum
Deprecated. 
cumulative statistics of theta


phisum

protected double[][] phisum
Deprecated. 
cumulative statistics of phi


numstats

protected int numstats
Deprecated. 
size of statistics


interSamples

protected int[] interSamples
Deprecated. 
iteration numbers at which intermediate (single) samples are taken.


THIN_INTERVAL

protected int THIN_INTERVAL
Deprecated. 
thinning interval (the "update statistics interval" set through configureSampler)


BURN_IN

protected int BURN_IN
Deprecated. 
burn-in period


ITERATIONS

protected int ITERATIONS
Deprecated. 
max iterations


SAMPLE_LAG

protected int SAMPLE_LAG
Deprecated. 
sample lag (if -1 only one sample taken)


dispcol

protected int dispcol
Deprecated. 

corpusname

protected java.lang.String corpusname
Deprecated. 

outfilename

protected java.lang.String outfilename
Deprecated. 

messagetext

protected java.lang.String messagetext
Deprecated. 

messageheader

protected java.lang.String messageheader
Deprecated. 

messagerecipients

protected java.lang.String[] messagerecipients
Deprecated. 

interSave

protected boolean interSave
Deprecated. 

analyser

protected VariationOfInformationAnalyser analyser
Deprecated. 

interTopics

protected boolean interTopics
Deprecated. 

backupIteration

protected int backupIteration
Deprecated. 
iteration in the last backup


timeElapsed

protected long timeElapsed
Deprecated. 

backupInterval

protected int backupInterval
Deprecated. 

t0

protected long t0
Deprecated. 
Constructor Detail

LdaGibbsSamplerDpa

public LdaGibbsSamplerDpa(int[][] documents,
                          int V)
Deprecated. 
Initialise the Gibbs sampler with data.

Parameters:
documents - document data (term lists)
V - vocabulary size
Method Detail

initialState

public void initialState()
Deprecated. 
Initialisation: the sampler must start with an assignment of observations to topics. Many alternatives are possible; I chose to perform random assignments with equal probabilities.
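
The random initialisation described above can be sketched as follows. This is a minimal illustration, not the class's actual source: the class name InitialStateSketch and the seed parameter are hypothetical, and the count arrays are assumed to be indexed as in the field descriptions (nw[topic][term], nd[document][topic]).

```java
import java.util.Random;

/** Hypothetical helper for illustration; not part of LdaGibbsSamplerDpa. */
final class InitialStateSketch {
    final int[][] z;     // z[m][n]: topic assigned to word n of document m
    final int[][] nw;    // nw[k][t]: instances of term t assigned to topic k (assumed index order)
    final int[][] nd;    // nd[m][k]: words in document m assigned to topic k
    final int[] nwsum;   // nwsum[k]: total words assigned to topic k
    final int[] ndsum;   // ndsum[m]: total words in document m

    InitialStateSketch(int[][] documents, int V, int K, long seed) {
        Random rng = new Random(seed);
        int M = documents.length;
        z = new int[M][];
        nw = new int[K][V];
        nd = new int[M][K];
        nwsum = new int[K];
        ndsum = new int[M];
        for (int m = 0; m < M; m++) {
            int N = documents[m].length;
            z[m] = new int[N];
            for (int n = 0; n < N; n++) {
                // Every topic is equally likely for every word.
                int topic = rng.nextInt(K);
                z[m][n] = topic;
                nw[topic][documents[m][n]]++;
                nd[m][topic]++;
                nwsum[topic]++;
            }
            ndsum[m] = N;
        }
    }
}
```

Whatever the random draws are, the counts stay consistent: the entries of nwsum add up to the corpus size, and each row of nd adds up to the matching entry of ndsum.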


gibbs

private void gibbs()
Deprecated. 
Main method: select an initial state, then repeat a large number of times: (1) select an element; (2) update it conditional on the other elements. If appropriate, output a summary for each run.


sampleCorpus

protected void sampleCorpus()
Deprecated. 
sample once through the corpus.


sampleLdaFullConditional

protected int sampleLdaFullConditional(int m,
                                       int n)
Deprecated. 
Sample a topic z_i from the full conditional distribution: p(z_i = j | z_-i, w) = (n_-i,j(w_i) + beta)/(n_-i,j(.) + W * beta) * (n_-i,j(d_i) + alpha)/(n_-i,.(d_i) + K * alpha)

Parameters:
m - document
n - word
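
A sketch of how one draw from this full conditional might look. It is an illustration under stated assumptions rather than the class's real code: the class FullConditionalSketch and its static signature are hypothetical, W in the formula is taken to be the vocabulary size V, and the count arrays are assumed to be indexed nw[topic][term] and nd[document][topic] as in the field descriptions.

```java
import java.util.Random;

/** Hypothetical helper for illustration; not part of LdaGibbsSamplerDpa. */
final class FullConditionalSketch {
    static int sample(int[][] documents, int[][] z, int[][] nw, int[][] nd,
                      int[] nwsum, int[] ndsum, int m, int n,
                      int V, int K, double alpha, double beta, Random rng) {
        int t = documents[m][n];
        int old = z[m][n];
        // Remove the current assignment so the counts become the "-i" counts.
        nw[old][t]--;
        nd[m][old]--;
        nwsum[old]--;
        ndsum[m]--;

        // Cumulative, unnormalised probabilities of the K topics.
        double[] p = new double[K];
        double sum = 0.0;
        for (int k = 0; k < K; k++) {
            sum += (nw[k][t] + beta) / (nwsum[k] + V * beta)
                 * (nd[m][k] + alpha) / (ndsum[m] + K * alpha);
            p[k] = sum;
        }

        // Inverse-CDF draw: first topic whose cumulative mass exceeds u.
        double u = rng.nextDouble() * sum;
        int topic = 0;
        while (topic < K - 1 && u >= p[topic]) {
            topic++;
        }

        // Record the new assignment in the counts.
        nw[topic][t]++;
        nd[m][topic]++;
        nwsum[topic]++;
        ndsum[m]++;
        z[m][n] = topic;
        return topic;
    }
}
```

The probabilities are left unnormalised because the draw only needs their relative sizes; scaling the uniform variate by the running total has the same effect as normalising.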

updateParams

protected void updateParams()
Deprecated. 
Add to the statistics the values of theta and phi for the current state.
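
One plausible reading of this accumulation step is sketched below. The class UpdateParamsSketch is hypothetical, and the smoothed point estimates it uses for theta and phi are an assumption based on the standard collapsed Gibbs sampler for LDA; the page itself does not state the formulas.

```java
/** Hypothetical helper for illustration; not part of LdaGibbsSamplerDpa. */
final class UpdateParamsSketch {
    final double[][] thetasum; // running sum of theta estimates (M x K)
    final double[][] phisum;   // running sum of phi estimates (K x V)
    int numstats;              // number of samples accumulated so far

    UpdateParamsSketch(int M, int K, int V) {
        thetasum = new double[M][K];
        phisum = new double[K][V];
    }

    void update(int[][] nd, int[] ndsum, int[][] nw, int[] nwsum,
                double alpha, double beta) {
        int M = thetasum.length;
        int K = phisum.length;
        int V = phisum[0].length;
        // Assumed estimate: theta[m][k] = (nd + alpha) / (ndsum + K * alpha).
        for (int m = 0; m < M; m++) {
            for (int k = 0; k < K; k++) {
                thetasum[m][k] += (nd[m][k] + alpha) / (ndsum[m] + K * alpha);
            }
        }
        // Assumed estimate: phi[k][t] = (nw + beta) / (nwsum + V * beta).
        for (int k = 0; k < K; k++) {
            for (int t = 0; t < V; t++) {
                phisum[k][t] += (nw[k][t] + beta) / (nwsum[k] + V * beta);
            }
        }
        numstats++;
    }
}
```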


getTheta

public double[][] getTheta()
Deprecated. 
Retrieve estimated document--topic associations. If sample lag > 0 then the mean value of all sampled statistics for theta[][] is taken.

Returns:
theta multinomial mixture of document topics (M x K)
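
The two cases described for theta (mean of the accumulated samples versus a single estimate from the current state) can be sketched like this. The class ThetaSketch and its static signature are hypothetical, and the single-sample formula is an assumption taken from the standard LDA Gibbs sampler. The same pattern would apply to phi with nw, nwsum and beta.

```java
/** Hypothetical helper for illustration; not part of LdaGibbsSamplerDpa. */
final class ThetaSketch {
    static double[][] theta(int sampleLag, double[][] thetasum, int numstats,
                            int[][] nd, int[] ndsum, double alpha) {
        int M = nd.length;
        int K = nd[0].length;
        double[][] theta = new double[M][K];
        for (int m = 0; m < M; m++) {
            for (int k = 0; k < K; k++) {
                if (sampleLag > 0) {
                    // Mean over all samples accumulated so far.
                    theta[m][k] = thetasum[m][k] / numstats;
                } else {
                    // Single estimate from the current counts (assumed formula).
                    theta[m][k] = (nd[m][k] + alpha) / (ndsum[m] + K * alpha);
                }
            }
        }
        return theta;
    }
}
```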

getPhi

public double[][] getPhi()
Deprecated. 
Retrieve estimated topic--word associations. If sample lag > 0 then the mean value of all sampled statistics for phi[][] is taken.

Returns:
phi multinomial mixture of topic words (K x V)

configureSampler

public void configureSampler(int iterations,
                             int burnIn,
                             int thinInterval,
                             int sampleLag,
                             int K,
                             double alpha,
                             double beta)
Deprecated. 
Configure the Gibbs sampler.

Parameters:
iterations - number of total iterations
burnIn - number of burn-in iterations
thinInterval - update statistics interval
sampleLag - sample interval (-1 for just one sample at the end)
K - number of topics
alpha - symmetric prior parameter on document--topic associations
beta - symmetric prior parameter on topic--term associations
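
To make the interplay of iterations, burnIn and sampleLag concrete, the sketch below lists the iterations at which statistics would be collected. The rule it uses (collect when the iteration is past burn-in and is a multiple of the sample lag, or once at the end when the lag is not positive) is an assumption about the schedule, and the class SamplingScheduleSketch is hypothetical. The thinInterval parameter is left out because its exact role is not documented here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of LdaGibbsSamplerDpa. */
final class SamplingScheduleSketch {
    static List<Integer> sampledIterations(int iterations, int burnIn, int sampleLag) {
        List<Integer> sampled = new ArrayList<>();
        if (sampleLag <= 0) {
            // A lag of -1 means just one sample, taken at the end of the run.
            sampled.add(iterations - 1);
            return sampled;
        }
        for (int i = 0; i < iterations; i++) {
            // Assumed rule: skip the burn-in period, then sample every sampleLag iterations.
            if (i > burnIn && i % sampleLag == 0) {
                sampled.add(i);
            }
        }
        return sampled;
    }
}
```

Under this assumed rule, 30 iterations with a burn-in of 10 and a sample lag of 5 would collect statistics at iterations 15, 20 and 25.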

configureOutput

protected void configureOutput(java.lang.String corpusname,
                               java.lang.String outfilename,
                               int backupInterval,
                               int[] interSamples,
                               VariationOfInformationAnalyser analyser,
                               boolean interSave,
                               boolean interTopics)
Deprecated. 
configure the sampler output

Parameters:
corpusname -
outfilename -
backupInterval -
interSamples -
analyser -
interSave -
interTopics -

configureMessaging

protected void configureMessaging(java.lang.String header,
                                  java.lang.String text,
                                  java.lang.String[] recipients)
Deprecated. 
configure the sampler for messaging

Parameters:
header -
text -
recipients -

main

public static void main(java.lang.String[] args)
Deprecated. 

startTimer

protected void startTimer(long offset)
Deprecated. 
Start the timer with an initial offset. The value of time elapsed is also taken into account.

Parameters:
offset -

getTimer

protected long getTimer()
Deprecated. 
get the current value of the timer.

Returns:

output

protected void output(java.lang.String analysisfile,
                      java.lang.String addheader,
                      java.lang.String addmessage)
Deprecated. 
Calculate distance (if doDist) and replace all occurrences of $@ and $# in strings by the complete distance information and the distance value only, respectively. For console output, the placeholders ($@, $#) in the addmessage text are replaced with distance information and the message text is printed. The method works in a separate thread. If mailing is enabled, addmessage is appended to the messagetext variable, addheader is appended to the header, and all $@ and $# occurrences are replaced. The email is sent afterwards.

Parameters:
analysisfile -
addheader -
addmessage -
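
The placeholder substitution described above could look roughly like this. The class DistancePlaceholderSketch and its method are hypothetical, and how the distance value is formatted is an assumption; only the two placeholders $@ and $# come from the description.

```java
/** Hypothetical helper for illustration; not part of LdaGibbsSamplerDpa. */
final class DistancePlaceholderSketch {
    /**
     * Replaces $@ with the complete distance information and $# with the
     * distance value only. String.replace treats both arguments literally,
     * so the dollar signs need no escaping.
     */
    static String fill(String template, String fullInfo, double value) {
        return template
                .replace("$@", fullInfo)
                .replace("$#", Double.toString(value));
    }
}
```

Because the replacements run one after the other, a "$#" that happens to occur inside the complete distance information would also be replaced; a real implementation may need to guard against that.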

save

public void save(java.lang.String filename)
Deprecated. 
Object stream only for testing.
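
Since the class implements java.io.Serializable, saving and loading through an object stream presumably follows the standard Java serialization pattern. The sketch below shows that pattern with hypothetical static helpers in a hypothetical class ObjectStreamSketch; it is not the class's actual code.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical helper for illustration; not part of LdaGibbsSamplerDpa. */
final class ObjectStreamSketch {
    /** Writes any serializable object to the given file. */
    static void save(Serializable obj, String filename) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(filename))) {
            out.writeObject(obj);
        }
    }

    /** Reads back whatever object was stored; the caller casts the result. */
    static Object load(String filename) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(filename))) {
            return in.readObject();
        }
    }
}
```

Returning java.lang.Object, as the load method on this page does, leaves the cast to the caller.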


load

public static java.lang.Object load(java.lang.String filename)
Deprecated. 
read object from the stream

Parameters:
filename -
Returns:

distance

private VariationOfInformationAnalyser.DistMetric distance(java.lang.String outname)
Deprecated. 
perform a distance calculation on the estimated results

Returns:

writeParameters

protected static void writeParameters(java.lang.String file,
                                      java.lang.String corpusname,
                                      int k,
                                      double alpha,
                                      double beta,
                                      int m,
                                      int v,
                                      int w,
                                      long duration,
                                      int iterations,
                                      int samplelag,
                                      int burnin,
                                      org.knowceans.util.Arguments a)
Deprecated. 
write statistics of the current run to a text file for later review

Parameters:
file -
corpusname -
k - topics
alpha - hyperparameter
beta - hyperparameter
m - doc count
v - vocabulary/term count
w - word count
duration - training duration
iterations - no. of total iterations
samplelag - sampling lag
burnin - burnin samples
a - Arguments object

getAlpha

public final double getAlpha()
Deprecated. 

setAlpha

public final void setAlpha(double alpha)
Deprecated. 

getBeta

public final double getBeta()
Deprecated. 

setBeta

public final void setBeta(double beta)
Deprecated. 

getDocuments

public final int[][] getDocuments()
Deprecated. 

getK

public final int getK()
Deprecated. 

getV

public final int getV()
Deprecated. 

getZ

public final int[][] getZ()
Deprecated.