org.knowceans.dirichlet.lsam
Class LsamGibbsSampler

java.lang.Object
  extended by org.knowceans.dirichlet.lsam.LsamGibbsSampler

public class LsamGibbsSampler
extends java.lang.Object

Gibbs sampler for estimating the best assignments of topics for words and actors in a corpus with co-authoring information and personalised queries. The algorithm is based on Mark Steyvers et al., "Probabilistic Author-Topic Models for Information Discovery" (2004), and is presented in G. H., "Probabilistic associative browsing for virtual expert communities" (2005).

TODO: What happens if one author is not searching for anything?
TODO: How can demand be weighted other than within the bounds of pdf normalisation (introduce a null topic by filling with 0 terms? use p(a|z)?)
TODO: Weight the influence of relations in phi.
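
For illustration, the data layout expected by the constructor can be sketched with a tiny synthetic example (all indices and values below are made up, and the meaning of the relation indices is only an assumption):

    // three documents over a vocabulary of V = 5 terms, A = 3 actors, R = 2 relations
    int[][] documents = {
        { 0, 1, 1, 4 },       // document 0: term indices of its words
        { 2, 3, 3 },          // document 1
        { 0, 4, 2, 2, 1 }     // document 2
    };
    int[][] actors = {
        { 0, 1 },             // actors of document 0 (co-authors / queriers)
        { 2 },                // document 1
        { 0, 2 }              // document 2
    };
    int[] relations = { 0, 0, 1 };   // relation type of each document (e.g. 0 = supply, 1 = demand)
    LsamGibbsSampler sampler = new LsamGibbsSampler(documents, actors, relations, 5, 3, 2);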

Author:
heinrich

Field Summary
(package private)  int A
          total number of actors
(package private)  int[][] actors
          actor data (co-author lists and queriers)
(package private)  double alpha
          Dirichlet parameter (actor--topic associations)
(package private)  double beta
          Dirichlet parameter (topic--term associations)
private static int BURN_IN
          burn-in period
(package private)  int[][][] cat
          cat[r][i][j] number of times actor i is assigned to topic j in relation r.
(package private)  int[][] catsum
          catsum[r][i] total number of word (=topic) assignments to actor i in relation r.
(package private)  int[][] cwt
          cwt[i][j] number of instances of word i (term) assigned to topic j.
(package private)  int[] cwtsum
          cwtsum[j] total number of words assigned to topic j.
private static int dispcol
           
private static int ITERATIONS
          max iterations
(package private)  int K
          number of topics
(package private) static java.text.NumberFormat lnf
           
(package private)  int[][] media
          document data (term lists)
(package private)  int numstats
          size of statistics
(package private)  double[][] phisum
          cumulative statistics of phi
(package private)  double[][][] psisum
          cumulative statistics of psi (one per relation)
(package private)  int R
          total number of relation types
static org.knowceans.util.CokusRandom rand
           
(package private)  int[] relations
          relation that each medium maps to
private static int SAMPLE_LAG
          sample lag (if <=0 only one sample taken at the end)
(package private) static java.lang.String[] shades
           
private static int THIN_INTERVAL
          thinning interval (interval between statistics updates)
(package private)  int V
          vocabulary size
(package private)  int[][] x
          actor assignments for each word.
(package private)  int[][] z
          topic assignments for each word.
 
Constructor Summary
LsamGibbsSampler(int[][] documents, int[][] actors, int[] relations, int V, int A, int R)
          Initialise the Gibbs sampler with data.
 
Method Summary
 void configure(int iterations, int burnIn, int thinInterval, int sampleLag)
          Configure the Gibbs sampler
 double[][] getPhi()
          Retrieve estimated topic--word associations.
 double[][][] getPsi()
          Retrieve estimated actor--topic associations, one matrix for each relation.
private  void gibbs(int K, double alpha, double beta)
          Main method: select an initial state, then iterate the Gibbs updates.
 void initialState(int K)
          Initialisation: must start with an assignment of observations to topics.
private static double kl(double[] p, double[] q)
          KL-divergence between distributions p and q.
static void main(java.lang.String[] args)
          Driver with example data.
private static double[][] matchActors(double[][][] psi, int p, int q)
          Build matrix of actor--actor matches according to the Kullback-Leibler divergence, M_ij = KL(demand_i||supply_j).
private  void sampleFullConditional(int m, int n)
          Sample an actor--topic pair (x_i, z_i) from the full conditional distribution: p(x_i = q, z_i = j | z_-i, x_-i, w, a_d, r_d) ∝ (cwt[w_i][j] + beta) / (cwtsum[j] + V * beta) * (cat[r][q][j] + alpha) / (catsum[r][q] + K * alpha)
static java.lang.String shadeDouble(double d, double max)
          create a string representation whose gray value appears as an indicator of magnitude, cf. Hinton diagrams in statistics.
static void test(java.lang.String[] args)
          test driver with a small synthetic data set
private  void updateParams()
          Add to the statistics the values of psi and phi for the current state.
private static void writeParameters(java.lang.String file, java.lang.String corpusname, int k, double alpha, double beta, int x, int m, int v, int w, long duration, int iterations, int samplelag, int burnin, org.knowceans.util.Arguments a)
          write statistics of the current run to a text file for later review
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

media

int[][] media
document data (term lists)


actors

int[][] actors
actor data (co-author lists and queriers)


relations

int[] relations
relation that each medium maps to


V

int V
vocabulary size


A

int A
total number of actors


R

int R
total number of relation types


K

int K
number of topics


alpha

double alpha
Dirichlet parameter (actor--topic associations)


beta

double beta
Dirichlet parameter (topic--term associations)


z

int[][] z
topic assignments for each word. TODO: could be z[r][m][n] to facilitate multiple document relations


x

int[][] x
actor assignments for each word.


cwt

int[][] cwt
cwt[i][j] number of instances of word i (term) assigned to topic j.


cat

int[][][] cat
cat[r][i][j] number of times actor i is assigned to topic j in relation r.


cwtsum

int[] cwtsum
cwtsum[j] total number of words assigned to topic j.


catsum

int[][] catsum
catsum[r][i] total number of word (=topic) assignments to actor i in relation r.


numstats

int numstats
size of statistics


psisum

double[][][] psisum
cumulative statistics of psi (one per relation)


phisum

double[][] phisum
cumulative statistics of phi


THIN_INTERVAL

private static int THIN_INTERVAL
thinning interval (interval between statistics updates)


BURN_IN

private static int BURN_IN
burn-in period


ITERATIONS

private static int ITERATIONS
max iterations


SAMPLE_LAG

private static int SAMPLE_LAG
sample lag (if <=0 only one sample taken at the end)


dispcol

private static int dispcol

rand

public static org.knowceans.util.CokusRandom rand

shades

static java.lang.String[] shades

lnf

static java.text.NumberFormat lnf

Constructor Detail

LsamGibbsSampler

public LsamGibbsSampler(int[][] documents,
                        int[][] actors,
                        int[] relations,
                        int V,
                        int A,
                        int R)
Initialise the Gibbs sampler with data.

Parameters:
documents - document data (term lists)
actors - actor data (co-author lists and queriers)
relations - relation that each document (medium) maps to
V - vocabulary size
A - total number of actors
R - total number of relation types
Method Detail

initialState

public void initialState(int K)
Initialisation: must start with an assignment of observations to topics. Many alternatives are possible; I chose to perform random assignments with equal probabilities.

Parameters:
K - number of topics
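
A minimal sketch of what such a random initialisation could look like (illustrative only; it uses java.util.Random instead of the class's CokusRandom field, and the count updates are only indicated in comments):

    java.util.Random rng = new java.util.Random();
    int M = media.length;
    z = new int[M][];
    x = new int[M][];
    for (int m = 0; m < M; m++) {
        int N = media[m].length;
        z[m] = new int[N];
        x[m] = new int[N];
        for (int n = 0; n < N; n++) {
            int topic = rng.nextInt(K);                            // uniform topic
            int actor = actors[m][rng.nextInt(actors[m].length)];  // uniform actor from the document's actor list
            z[m][n] = topic;
            x[m][n] = actor;
            // count updates would follow here:
            // cwt[media[m][n]][topic]++; cwtsum[topic]++;
            // cat[relations[m]][actor][topic]++; catsum[relations[m]][actor]++;
        }
    }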

gibbs

private void gibbs(int K,
                   double alpha,
                   double beta)
Main method: select an initial state, then repeat a large number of times: 1. select an element; 2. update it conditional on the other elements. If appropriate, output a summary for each run.

Parameters:
K - number of topics
alpha - symmetric prior parameter on actor--topic associations
beta - symmetric prior parameter on topic--term associations
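
The loop structure described above can be sketched as follows (illustrative only; console output and the dispcol bookkeeping are omitted):

    for (int i = 0; i < ITERATIONS; i++) {
        // 1.+2. resample the (actor, topic) pair of every word from its full conditional
        for (int m = 0; m < z.length; m++) {
            for (int n = 0; n < z[m].length; n++) {
                sampleFullConditional(m, n);
            }
        }
        // after burn-in, accumulate phi/psi statistics every SAMPLE_LAG iterations
        if (i > BURN_IN && SAMPLE_LAG > 0 && i % SAMPLE_LAG == 0) {
            updateParams();
        }
    }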

sampleFullConditional

private void sampleFullConditional(int m,
                                   int n)
Sample an actor--topic pair (x_i, z_i) from the full conditional distribution: p(x_i = q, z_i = j | z_-i, x_-i, w, a_d, r_d) ∝ (cwt[w_i][j] + beta) / (cwtsum[j] + V * beta) * (cat[r][q][j] + alpha) / (catsum[r][q] + K * alpha)

Parameters:
m - document
n - word
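
A sketch of this sampling step, using the count arrays documented above (the decrement/re-increment of the counts and the cumulative sampling of the index are only indicated in comments; the actual method may differ in detail):

    int r = relations[m];
    int t = media[m][n];                       // term index of the current word
    // ... remove the current assignment (x[m][n], z[m][n]) from cwt, cwtsum, cat, catsum ...
    double[] p = new double[actors[m].length * K];
    for (int a = 0; a < actors[m].length; a++) {
        int q = actors[m][a];
        for (int j = 0; j < K; j++) {
            p[a * K + j] = (cwt[t][j] + beta) / (cwtsum[j] + V * beta)
                         * (cat[r][q][j] + alpha) / (catsum[r][q] + K * alpha);
        }
    }
    // ... sample an index u from the unnormalised distribution p (e.g. via its cumulative sum),
    //     decode it into (q, j) = (actors[m][u / K], u % K), re-add the counts,
    //     and store x[m][n] = q, z[m][n] = j ...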

updateParams

private void updateParams()
Add to the statistics the values of psi and phi for the current state.


getPsi

public double[][][] getPsi()
Retrieve estimated actor--topic associations, one matrix for each relation. (In the LS-AMQ model with R=2, psi[0] is theta and psi[1] is xi). If sample lag > 0 then the mean value of all sampled statistics for psi[][][] is taken.

Returns:
psi multinomial mixture of author topics (R x A x K)

getPhi

public double[][] getPhi()
Retrieve estimated topic--word associations. If sample lag > 0 then the mean value of all sampled statistics for phi[][] is taken.

IDEA: here only the supply relation could be considered, e.g. via a per-relation count cwtsum[r][].

Returns:
phi multinomial mixture of topic words (K x V)
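
A sketch of this estimation, assuming the averaging behaviour described above (the non-averaged branch reads phi directly off the current counts):

    double[][] phi = new double[K][V];
    for (int k = 0; k < K; k++) {
        for (int w = 0; w < V; w++) {
            phi[k][w] = (SAMPLE_LAG > 0)
                    ? phisum[k][w] / numstats                       // mean over all sampled states
                    : (cwt[w][k] + beta) / (cwtsum[k] + V * beta);  // single-sample estimate
        }
    }
    return phi;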

configure

public void configure(int iterations,
                      int burnIn,
                      int thinInterval,
                      int sampleLag)
Configure the Gibbs sampler

Parameters:
iterations - number of total iterations
burnIn - number of burn-in iterations
thinInterval - update statistics interval
sampleLag - sample interval (<= 0 for just one sample at the end)
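
An illustrative call on a sampler instance (the variable name sampler and the values are only for illustration):

    // 10000 iterations, 2000 burn-in, thinning interval 100, one sample every 10 iterations
    sampler.configure(10000, 2000, 100, 10);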

main

public static void main(java.lang.String[] args)
Driver with example data. TODO: implement averaging of phi and psi.

Parameters:
args -

writeParameters

private static void writeParameters(java.lang.String file,
                                    java.lang.String corpusname,
                                    int k,
                                    double alpha,
                                    double beta,
                                    int x,
                                    int m,
                                    int v,
                                    int w,
                                    long duration,
                                    int iterations,
                                    int samplelag,
                                    int burnin,
                                    org.knowceans.util.Arguments a)
write statistics of the current run to a text file for later review

Parameters:
file -
corpusname -
k - topics
alpha - hyperparameter
beta - hyperparameter
x - actor count
m - doc count
v - vocabulary/term count
w - word count
duration - training duration
iterations - no. of total iterations
samplelag - sampling lag
burnin - number of burn-in iterations
a - Arguments object

test

public static void test(java.lang.String[] args)
test driver with a small synthetic data set

Parameters:
args -

matchActors

private static double[][] matchActors(double[][][] psi,
                                      int p,
                                      int q)
Build matrix of actor--actor matches according to the Kullback-Leibler divergence, M_ij = KL(demand_i||supply_j).

Parameters:
psi - matrix[][][] of actor--topic associations, by relations in first index
p - relation of left KL argument
q - relation of right KL argument
Returns:
matrix M of actor--actor matches (A x A), with M_ij = KL(psi[p][i] || psi[q][j])

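A sketch of how such a match matrix could be built (psi has dimensions R x A x K, so psi[p].length is the number of actors; kl is the class's private KL-divergence method):

    double[][] M = new double[psi[p].length][psi[q].length];
    for (int i = 0; i < psi[p].length; i++) {
        for (int j = 0; j < psi[q].length; j++) {
            M[i][j] = kl(psi[p][i], psi[q][j]);   // M_ij = KL(demand_i || supply_j)
        }
    }
    return M;
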
kl

private static double kl(double[] p,
                         double[] q)
KL-divergence between distributions p and q. (The measure is asymmetric: KL(p||q) = sum_x p(x) * log(p(x)/q(x)) for discrete p, q.)

Parameters:
p - discrete pdf
q - discrete pdf (q.length = p.length)
Returns:
KL divergence
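
For reference, a minimal sketch of the discrete form, written as a standalone helper; the zero-probability guard (treating 0 * log 0 as 0) is an assumption, not taken from the source:

    static double klDivergence(double[] p, double[] q) {
        double kl = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0) {                        // assumed convention: 0 * log 0 = 0
                kl += p[i] * Math.log(p[i] / q[i]);
            }
        }
        return kl;
    }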

shadeDouble

public static java.lang.String shadeDouble(double d,
                                           double max)
create a string representation whose gray value appears as an indicator of magnitude, cf. Hinton diagrams in statistics.

Parameters:
d - value
max - maximum value
Returns:
string whose gray shading indicates the magnitude of d relative to max