org.knowceans.dirichlet.atm
Class AtmQueryClient

java.lang.Object
  extended by org.knowceans.dirichlet.atm.AtmQueryClient

public class AtmQueryClient
extends java.lang.Object

AtmQueryClient is the central class for querying ATM parameter sets. The supported query types are full-text queries, author similarities and term similarities.

TODO: handle reset of lda model before / after sampling

TODO: handle parametrisation

TODO: if the index is split, which greatly improves scalability, handle low-frequency terms: if a term is not in the lda index and, for instance, has a document frequency of 1, its topics could be defined to be the topics of that document. For mindf>2 and a low-frequency term with df>1, a document could be sampled. This would, however, need an inverted index.
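
The fallback described in the TODO above could look roughly like the following. This is only a sketch under stated assumptions, not part of this class: the class name LowFrequencyTermSketch, the method fallbackTopics, the inverted index represented as a Map from term index to document indices, the docTopics matrix and uniform sampling among documents are all hypothetical.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Random;

/** Hypothetical sketch of the low-frequency term fallback; not part of AtmQueryClient. */
final class LowFrequencyTermSketch {

    /**
     * Returns fallback topics for a term that is missing from the model index.
     *
     * @param term          term index (assumed to be absent from the model index)
     * @param invertedIndex assumed inverted index: term index -> indices of documents containing it
     * @param docTopics     assumed per-document topic distributions, one row per document
     * @param rand          random source used when the term occurs in several documents
     */
    static Optional<double[]> fallbackTopics(int term, Map<Integer, int[]> invertedIndex,
                                             double[][] docTopics, Random rand) {
        int[] docs = invertedIndex.get(term);
        if (docs == null || docs.length == 0) {
            // The term occurs nowhere in the corpus: no fallback is possible.
            return Optional.empty();
        }
        // df == 1: take the topics of the only document.
        // df > 1: sample one document (uniformly, which is an assumption).
        int doc = docs.length == 1 ? docs[0] : docs[rand.nextInt(docs.length)];
        return Optional.of(docTopics[doc].clone());
    }
}
```

The Optional return value makes the case of a term that appears in no document explicit instead of returning null.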

Author:
gregor

Field Summary
private  java.lang.String atmbase
           
private  AtmGibbsQuerySampler atmq
           
private  AtmTopicSimilarities atmt
           
private  AmqCorpus corpus
           
private  java.lang.String corpusbase
           
private  ExtLdaConfiguration ldac
           
private  double maxdistance
           
private  int maxresults
           
private  AtmMarkovState mcmc
           
private  double minlikelihood
           
private  boolean usePredLikelihood
           
 
Constructor Summary
AtmQueryClient(java.lang.String corpusbase, java.lang.String ldabase)
           
 
Method Summary
private  void authorQueries(java.lang.String file, java.lang.String[] queries)
          Find authors for the queries array.
 void describeAuthor(int author)
           
 void describeCorpus()
          Show the topic distribution in the corpus
 void describeTerm(int term)
           
 void describeTopics(double[] pzx)
           
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getAuthorResults(double[] topics)
          Get a list of authors that matches the topics.
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getAuthorResults(int[] words)
          Get a list of authors that matches the query.
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getAuthorResults(java.lang.String query)
          Get a list of authors that matches the query.
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getAuthorSimilarities(int author)
          Get authors similar to this one
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getAuthorTermSimilarities(int doc)
          Get similar terms to arg document
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermAuthorSimilarities(int term)
          Get authors similar to the given term
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermResults(double[] topics)
          Get a list of terms that matches the topics.
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermResults(int[] words)
          Get a list of terms that matches the query.
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermResults(java.lang.String query)
          Get a list of terms that matches the query.
 java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermSimilarities(int term)
          Get similar terms to this one
 double[] getTopics(int[] query)
          Get the topics for the query sampled from the model.
 int[] getWords(java.lang.String query)
          Get the query as indices of terms.
private  void init()
          Initialise the query client.
private static java.lang.String[] load(java.lang.String file)
          Load a file with queries.
static void main(java.lang.String[] args)
           
private  void test(java.lang.String[] queries)
           
private  void test2()
          Test document and term similarities
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

atmbase

private java.lang.String atmbase

corpusbase

private java.lang.String corpusbase

mcmc

private AtmMarkovState mcmc

ldac

private ExtLdaConfiguration ldac

atmq

private AtmGibbsQuerySampler atmq

corpus

private AmqCorpus corpus

atmt

private AtmTopicSimilarities atmt

maxresults

private int maxresults

maxdistance

private double maxdistance

minlikelihood

private double minlikelihood

usePredLikelihood

private boolean usePredLikelihood
Constructor Detail

AtmQueryClient

public AtmQueryClient(java.lang.String corpusbase,
                      java.lang.String ldabase)
Method Detail

main

public static void main(java.lang.String[] args)

init

private void init()
Initialise the query client.


getWords

public int[] getWords(java.lang.String query)
Get the query as indices of terms. This can be used to check whether the query has a valid length. For direct results, use getAuthorResults(String).

Parameters:
query -
Returns:
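
As an illustration of what converting a query into term indices might involve, here is a minimal sketch. It is not the implementation of getWords: the class QueryTermIndexSketch, the method toTermIndices, the vocabulary represented as a Map from term string to index, and the choice to silently drop unknown terms are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch of mapping a query string to term indices; not the actual getWords. */
final class QueryTermIndexSketch {

    /**
     * Splits the query on whitespace and looks each term up in the vocabulary.
     * Terms missing from the vocabulary are dropped (an assumption), so the
     * result can be shorter than the number of words in the query.
     */
    static int[] toTermIndices(String query, Map<String, Integer> vocabulary) {
        String trimmed = query.trim();
        if (trimmed.isEmpty()) {
            return new int[0];
        }
        return Arrays.stream(trimmed.split("\\s+"))
                .map(vocabulary::get)          // null for unknown terms
                .filter(Objects::nonNull)      // drop unknown terms
                .mapToInt(Integer::intValue)
                .toArray();
    }
}
```

A caller could check the length of the returned array before sampling, for example rejecting a query for which no term was found in the vocabulary.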

getTopics

public double[] getTopics(int[] query)
Get the topics for the query sampled from the model. This can be used to output the sampled topics. For direct results, use getAuthorResults(String).

Parameters:
query -
Returns:

getTermResults

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermResults(java.lang.String query)
Get a list of terms that matches the query.

Parameters:
query -
Returns:

getTermResults

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermResults(int[] words)
Get a list of terms that matches the query.

Parameters:
words -
Returns:

getTermResults

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermResults(double[] topics)
Get a list of terms that matches the topics.

Parameters:
topics -
Returns:

getAuthorResults

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getAuthorResults(java.lang.String query)
Get a list of authors that matches the query.

Parameters:
query -
Returns:

getAuthorResults

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getAuthorResults(int[] words)
Get a list of authors that matches the query.

Parameters:
words -
Returns:

getAuthorResults

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getAuthorResults(double[] topics)
Get a list of authors that matches the topics.

Parameters:
topics -
Returns:

getAuthorSimilarities

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getAuthorSimilarities(int author)
Get authors similar to this one

Parameters:
author -
Returns:

getTermSimilarities

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermSimilarities(int term)
Get similar terms to this one

Parameters:
term -
Returns:

getAuthorTermSimilarities

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getAuthorTermSimilarities(int doc)
Get similar terms to arg document

Parameters:
doc -
Returns:

getTermAuthorSimilarities

public java.util.List<org.knowceans.map.IndexRanking.IndexEntry> getTermAuthorSimilarities(int term)
Get authors similar to the given term

Parameters:
term -
Returns:

describeCorpus

public void describeCorpus()
Show the topic distribution in the corpus


describeAuthor

public void describeAuthor(int author)

describeTerm

public void describeTerm(int term)

describeTopics

public void describeTopics(double[] pzx)

test

private void test(java.lang.String[] queries)

test2

private void test2()
Test document and term similarities


load

private static java.lang.String[] load(java.lang.String file)
Load a file with queries. Format: space-separated terms, one query per line.

Parameters:
file -
Returns:
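
The query file format stated above (space-separated terms, one query per line) could be read along the following lines. This is a sketch, not the private load method itself: the class QueryFileSketch, the helper methods parse and terms, the UTF-8 encoding and the skipping of blank lines are assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of reading a query file; not the actual load implementation. */
final class QueryFileSketch {

    /** Reads the file (assumed UTF-8) and returns one query string per non-blank line. */
    static String[] load(String file) {
        try {
            return parse(Files.readAllLines(Paths.get(file), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Trims each line and skips blank ones (skipping blank lines is an assumption). */
    static String[] parse(List<String> lines) {
        List<String> queries = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                queries.add(trimmed);
            }
        }
        return queries.toArray(new String[0]);
    }

    /** Splits a single non-blank query into its whitespace-separated terms. */
    static String[] terms(String query) {
        return query.trim().split("\\s+");
    }
}
```

Keeping the parsing separate from the file access allows the format handling to be checked without touching the file system.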

authorQueries

private void authorQueries(java.lang.String file,
                           java.lang.String[] queries)
Find authors for the queries array.

Parameters:
file -
queries -