org.knowceans.dirichlet.lda
Class LdaQueryClient

java.lang.Object
  extended by org.knowceans.dirichlet.lda.LdaQueryClient

public class LdaQueryClient
extends java.lang.Object

LdaQueryClient is the central class for querying LDA parameter sets. The supported query types are full-text queries, document similarities and term similarities.

TODO: handle reset of lda model before / after sampling

TODO: handle parametrisation

TODO: if the index is split, which largely improves scalability, handle low-frequency terms: if a term is not in the LDA index and, for instance, has a document frequency of 1, its topics could be defined to be the topics of that document. For mindf>2 and a low-frequency term with df>1, a document could be sampled. This would, however, need an inverted index.

Author:
gregor

Field Summary
private  TermCorpus corpus
           
private  java.lang.String corpusbase
           
private  java.lang.String ldabase
           
private  ExtLdaConfiguration ldac
           
private  LdaGibbsQuerySampler ldaq
           
private  LdaTopicSimilarities ldat
           
private  double maxdistance
           
private  int maxresults
           
private  LdaMarkovState mcmc
           
private  double minlikelihood
           
private  boolean usePredLikelihood
           
 
Constructor Summary
LdaQueryClient(java.lang.String corpusbase, java.lang.String ldabase)
           
 
Method Summary
 void describeCorpus()
          Show the topic distribution in the corpus
 void describeDoc(int doc)
           
 void describeTerm(int term)
           
 void describeTopics(double[] pzx)
           
private  void docQueries(java.lang.String file, java.lang.String[] queries)
          Find documents for the queries array.
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getDocResults(double[] topics)
          Get a list of documents that matches the topics.
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getDocResults(int[] words)
          Get a list of documents that matches the query.
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getDocResults(java.lang.String query)
          Get a list of documents that matches the query.
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getDocSimilarities(int doc)
          Get documents similar to this one.
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getDocTermSimilarities(int doc)
          Get terms similar to the given document.
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermDocSimilarities(int term)
          Get documents similar to the given term.
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermResults(double[] topics)
          Get a list of terms that matches the topics.
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermResults(int[] words)
          Get a list of terms that matches the query.
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermResults(java.lang.String query)
          Get a list of terms that matches the query.
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermSimilarities(int term)
          Get terms similar to this one.
 double[] getTopics(int[] query)
          Get the topics for the query sampled from the model.
 int[] getWords(java.lang.String query)
          Get the query as indices of terms.
private  void init()
          Initialise the query client.
private static java.lang.String[] load(java.lang.String file)
          Load a file with queries.
static void main(java.lang.String[] args)
           
private  void test(java.lang.String[] queries)
           
private  void test2()
          Test document and term similarities
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ldabase

private java.lang.String ldabase

corpusbase

private java.lang.String corpusbase

mcmc

private LdaMarkovState mcmc

ldac

private ExtLdaConfiguration ldac

ldaq

private LdaGibbsQuerySampler ldaq

corpus

private TermCorpus corpus

ldat

private LdaTopicSimilarities ldat

maxresults

private int maxresults

maxdistance

private double maxdistance

minlikelihood

private double minlikelihood

usePredLikelihood

private boolean usePredLikelihood
Constructor Detail

LdaQueryClient

public LdaQueryClient(java.lang.String corpusbase,
                      java.lang.String ldabase)
Method Detail

main

public static void main(java.lang.String[] args)

init

private void init()
Initialise the query client.


getWords

public int[] getWords(java.lang.String query)
Get the query as indices of terms. This can be used to check whether the query has a valid length. For direct results, use getDocResults(String).

Parameters:
query -
Returns:
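
The conversion described above can be illustrated with a minimal sketch. It is not the library's implementation: the class name QueryTermIndexer, the vocabulary map and the handling of unknown terms (they are skipped) are assumptions made for this example. The query is taken to be space-separated, as in the query file format described for load(String).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class QueryTermIndexer {

    /** Hypothetical vocabulary: maps a term string to its term index. */
    private final Map<String, Integer> vocabulary;

    public QueryTermIndexer(Map<String, Integer> vocabulary) {
        this.vocabulary = vocabulary;
    }

    /**
     * Converts a space-separated query into term indices.
     * Assumption: terms that are not in the vocabulary are skipped,
     * so the returned array may be shorter than the query.
     */
    public int[] getWords(String query) {
        List<Integer> indices = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            Integer index = vocabulary.get(token);
            if (index != null) {
                indices.add(index);
            }
        }
        int[] words = new int[indices.size()];
        for (int i = 0; i < words.length; i++) {
            words[i] = indices.get(i);
        }
        return words;
    }

    public static void main(String[] args) {
        Map<String, Integer> vocab = new HashMap<>();
        vocab.put("topic", 0);
        vocab.put("model", 1);
        vocab.put("sampling", 2);
        QueryTermIndexer indexer = new QueryTermIndexer(vocab);
        // "zebra" is unknown and therefore dropped: prints [0, 1]
        System.out.println(Arrays.toString(indexer.getWords("topic model zebra")));
    }
}
```

Under these assumptions, a caller could check the length of the returned array to decide whether enough known terms remain to run a query.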

getTopics

public double[] getTopics(int[] query)
Get the topics for the query sampled from the model. This can be used to output the sampled topics. For direct results, use getDocResults(String).

Parameters:
query -
Returns:
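
How topics might be sampled for a query can be sketched with a small Gibbs sampler that keeps the trained topic-term probabilities fixed and only resamples the topic assignments of the query words. This is an illustration of the general technique, not the code of LdaGibbsQuerySampler: the class name QueryTopicSampler, the phi matrix layout (phi[k][t] for topic k and term t), the symmetric hyperparameter alpha and the fixed seed are all assumptions.

```java
import java.util.Arrays;
import java.util.Random;

public class QueryTopicSampler {

    /**
     * Estimates the topic distribution of a query by Gibbs sampling.
     * phi[k][t] is the (fixed) probability of term t under topic k.
     */
    public static double[] sampleTopics(int[] query, double[][] phi,
            double alpha, int iterations, long seed) {
        int numTopics = phi.length;
        Random random = new Random(seed);
        int[] assignment = new int[query.length];
        int[] topicCounts = new int[numTopics];

        // Start with a random topic for every query word.
        for (int i = 0; i < query.length; i++) {
            assignment[i] = random.nextInt(numTopics);
            topicCounts[assignment[i]]++;
        }

        double[] weights = new double[numTopics];
        for (int iter = 0; iter < iterations; iter++) {
            for (int i = 0; i < query.length; i++) {
                // Remove the current word from the counts.
                topicCounts[assignment[i]]--;
                // Full conditional: p(z = k) is proportional to
                // phi[k][w] * (n_k + alpha), with n_k excluding word i.
                double total = 0.0;
                for (int k = 0; k < numTopics; k++) {
                    weights[k] = phi[k][query[i]] * (topicCounts[k] + alpha);
                    total += weights[k];
                }
                // Draw a topic from the unnormalised weights.
                double u = random.nextDouble() * total;
                int topic = 0;
                double cumulative = weights[0];
                while (cumulative <= u && topic < numTopics - 1) {
                    topic++;
                    cumulative += weights[topic];
                }
                assignment[i] = topic;
                topicCounts[topic]++;
            }
        }

        // Point estimate of the query's topic distribution.
        double[] theta = new double[numTopics];
        double norm = query.length + numTopics * alpha;
        for (int k = 0; k < numTopics; k++) {
            theta[k] = (topicCounts[k] + alpha) / norm;
        }
        return theta;
    }

    public static void main(String[] args) {
        // Two topics over four terms; terms 0 and 1 only occur in topic 0.
        double[][] phi = { { 0.5, 0.5, 0.0, 0.0 }, { 0.0, 0.0, 0.5, 0.5 } };
        double[] theta = sampleTopics(new int[] { 0, 1, 0 }, phi, 0.1, 50, 1L);
        System.out.println(Arrays.toString(theta));
    }
}
```

The returned array plays the role of the double[] topics argument accepted by the getDocResults and getTermResults overloads; a real sampler would typically average over several samples after a burn-in period rather than use the final state.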

getTermResults

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermResults(java.lang.String query)
Get a list of terms that matches the query.

Parameters:
query -
Returns:

getTermResults

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermResults(int[] words)
Get a list of terms that matches the query.

Parameters:
words -
Returns:

getTermResults

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermResults(double[] topics)
Get a list of terms that matches the topics.

Parameters:
topics -
Returns:

getDocResults

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getDocResults(java.lang.String query)
Get a list of documents that matches the query.

Parameters:
query -
Returns:

getDocResults

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getDocResults(int[] words)
Get a list of documents that matches the query.

Parameters:
words -
Returns:

getDocResults

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getDocResults(double[] topics)
Get a list of documents that matches the topics.

Parameters:
topics -
Returns:
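
Matching documents against a topic distribution can be sketched as ranking by a distance between the query's topic distribution and each document's topic distribution. The sketch below is an assumption-laden illustration, not the library's code: the class TopicDistanceRanker, its Entry type (a stand-in for IndexRanking.IndexEntry, whose fields are not documented here), the docTopics matrix and the choice of the Jensen-Shannon divergence are all hypothetical. The maxDistance and maxResults parameters mirror the maxdistance and maxresults fields of this class.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopicDistanceRanker {

    /** Hypothetical result entry: a document index and its distance to the query. */
    public static final class Entry {
        public final int index;
        public final double distance;

        Entry(int index, double distance) {
            this.index = index;
            this.distance = distance;
        }
    }

    /** Kullback-Leibler divergence KL(p || q) in nats; terms with p[k] == 0 contribute 0. */
    static double kl(double[] p, double[] q) {
        double sum = 0.0;
        for (int k = 0; k < p.length; k++) {
            if (p[k] > 0.0) {
                sum += p[k] * Math.log(p[k] / q[k]);
            }
        }
        return sum;
    }

    /** Jensen-Shannon divergence: symmetric and bounded by ln 2. */
    static double jensenShannon(double[] p, double[] q) {
        double[] m = new double[p.length];
        for (int k = 0; k < p.length; k++) {
            m[k] = 0.5 * (p[k] + q[k]);
        }
        return 0.5 * kl(p, m) + 0.5 * kl(q, m);
    }

    /** Returns at most maxResults documents within maxDistance, closest first. */
    public static List<Entry> rank(double[] queryTopics, double[][] docTopics,
            double maxDistance, int maxResults) {
        List<Entry> entries = new ArrayList<>();
        for (int d = 0; d < docTopics.length; d++) {
            double distance = jensenShannon(queryTopics, docTopics[d]);
            if (distance <= maxDistance) {
                entries.add(new Entry(d, distance));
            }
        }
        entries.sort(Comparator.comparingDouble(e -> e.distance));
        return new ArrayList<>(entries.subList(0, Math.min(maxResults, entries.size())));
    }

    public static void main(String[] args) {
        double[] query = { 0.9, 0.1 };
        double[][] docs = { { 0.1, 0.9 }, { 0.9, 0.1 }, { 0.5, 0.5 } };
        for (Entry e : rank(query, docs, 1.0, 2)) {
            System.out.println(e.index + " " + e.distance);
        }
    }
}
```

The same ranking would apply to terms if each term were given a topic distribution; which distance the library actually uses is not stated in this documentation.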

getDocSimilarities

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getDocSimilarities(int doc)
Get documents similar to this one.

Parameters:
doc -
Returns:

getTermSimilarities

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermSimilarities(int term)
Get terms similar to this one.

Parameters:
term -
Returns:

getDocTermSimilarities

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getDocTermSimilarities(int doc)
Get terms similar to the given document.

Parameters:
doc -
Returns:

getTermDocSimilarities

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermDocSimilarities(int term)
Get documents similar to the given term.

Parameters:
term -
Returns:

describeCorpus

public void describeCorpus()
Show the topic distribution in the corpus


describeDoc

public void describeDoc(int doc)

describeTerm

public void describeTerm(int term)

describeTopics

public void describeTopics(double[] pzx)

test

private void test(java.lang.String[] queries)

test2

private void test2()
Test document and term similarities


load

private static java.lang.String[] load(java.lang.String file)
Load a file with queries. Format: space-separated terms, one query per line.

Parameters:
file -
Returns:
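
The file format stated above (space-separated terms, one query per line) can be read with a few lines of standard Java. This is a minimal sketch, not the private load method itself: the class name QueryFileLoader, the use of java.nio.file.Path instead of a String file name, the UTF-8 encoding and the skipping of blank lines are assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QueryFileLoader {

    /**
     * Reads one query per line. Each query is a string of space-separated terms.
     * Assumption: blank lines are ignored and the file is UTF-8 encoded.
     */
    public static String[] load(Path file) throws IOException {
        List<String> queries = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String query = line.trim();
            if (!query.isEmpty()) {
                queries.add(query);
            }
        }
        return queries.toArray(new String[0]);
    }

    public static void main(String[] args) throws IOException {
        // Write a small example file, then load it again.
        Path file = Files.createTempFile("queries", ".txt");
        Files.write(file, Arrays.asList("topic model", "gibbs sampling query"));
        for (String query : load(file)) {
            System.out.println(query);
        }
        Files.delete(file);
    }
}
```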

docQueries

private void docQueries(java.lang.String file,
                        java.lang.String[] queries)
Find documents for the queries array.

Parameters:
file -
queries -