org.knowceans.corpus
Class LuceneMapCorpus

java.lang.Object
  extended by org.knowceans.corpus.LuceneCorpus
      extended by org.knowceans.corpus.LuceneMapCorpus
All Implemented Interfaces:
IRandomAccessTermCorpus, ITermCorpus

public class LuceneMapCorpus
extends LuceneCorpus
implements IRandomAccessTermCorpus

LuceneMapCorpus wraps a Lucene index in the ITermCorpus interface. For this, the Lucene index needs a stored field that identifies each document (technically, the identifier need not be unique) and a term vector field with the content.

This implementation uses a map of terms and documents to directly access mapping information without searching the Lucene index. Use this class if frequent lookups are necessary, e.g., for applications that focus on topic-based search rather than full-text search.
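The map-based layout described above can be pictured with plain Java collections. This is only a minimal sketch under assumptions: the class name MapCorpusSketch and the method addDoc are hypothetical, and only the shapes of docNames and docTerms follow the field summary of this class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the map-based layout; not part of org.knowceans.corpus.
final class MapCorpusSketch {
    // Document names, indexed by document id.
    private final List<String> docNames = new ArrayList<>();
    // One map per document: term id -> frequency of that term in the document.
    private final List<Map<Integer, Integer>> docTerms = new ArrayList<>();

    // Adds a document and returns its id (its position in both lists).
    int addDoc(String name, Map<Integer, Integer> termFreqs) {
        docNames.add(name);
        docTerms.add(new HashMap<>(termFreqs));
        return docNames.size() - 1;
    }

    // Both lookups are plain list accesses; no index search is involved.
    String lookupDoc(int doc) {
        return docNames.get(doc);
    }

    Map<Integer, Integer> getDocTerms(int doc) {
        return docTerms.get(doc);
    }

    public static void main(String[] args) {
        MapCorpusSketch corpus = new MapCorpusSketch();
        int id = corpus.addDoc("doc-a", Map.of(0, 2, 3, 1));
        System.out.println(corpus.lookupDoc(id) + " " + corpus.getDocTerms(id).get(0));
    }
}
```

The trade-off is memory: every document's term map stays resident, which is what makes repeated lookups cheap.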

Author:
gregor

Field Summary
private  java.util.ArrayList<java.lang.String> docNames
          Index of document names from the Lucene index.
protected  java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> docTerms
          Each document's term frequencies termid -> frequency(doc)
private  int nWords
          Number of words (term observations) in the corpus.
 
Fields inherited from class org.knowceans.corpus.LuceneCorpus
contentField, docNamesField, emptyDocs, INDEX_UNKNOWN, indexpath, ir, minDf, nTerms, nTermsLowDf, termIndex, termIndexLowDf
 
Constructor Summary
LuceneMapCorpus(java.lang.String indexPath, java.lang.String contentField, java.lang.String docNameField, int minDf, boolean useLowDf)
          Create a term corpus from the index found at the path, using the content field for the terms and the docNameField for the document names.
 
Method Summary
private  void buildDocNames()
          Create the list of document names from the Lucene index.
private  void buildDocTerms()
          Create the term frequency maps for the corpus, with the reduced vocabulary according to the document frequencies.
protected  void extract()
          Initialise the corpus by extracting the files from the index.
 java.util.ArrayList<java.lang.String> getDocNames()
          Get a list of all document names / ids.
 java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> getDocTerms()
          Get the list of document term maps (term id -> frequency).
 java.util.Map<java.lang.Integer,java.lang.Integer> getDocTerms(int doc)
          Get the document terms as a frequency map id->frequency.
 int getNwords()
          Get the number of words (term observations) in the corpus.
 org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> getTermIndex()
          Get the bijective map between terms and their ids.
 java.lang.String lookupDoc(int doc)
          Get document name from id.
private  void setupDocs()
          Creates document-terms maps.
 
Methods inherited from class org.knowceans.corpus.LuceneCorpus
buildTermIndex, buildTermIndexLowDf, getDocWords, getDocWords, getNdocs, getNterms, getNwords, isEmptyDoc, lookup, lookup, lookupDoc, setupIndex, writeCorpus, writeDocList, writeVocabulary
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.knowceans.corpus.ITermCorpus
getDocWords, getNdocs, getNterms, lookup, lookup, lookupDoc
 

Field Detail

docTerms

protected java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> docTerms
Each document's term frequencies termid -> frequency(doc)


docNames

private java.util.ArrayList<java.lang.String> docNames
Index of document names from the Lucene index.


nWords

private int nWords
Number of words (term observations) in the corpus.
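Since getNwords() is documented as the number of term observations, one plausible reading is that the value equals the sum of all frequencies in the document term maps. The sketch below illustrates that reading only; WordCountSketch and countWords are hypothetical names, and whether the class computes the field this way is an assumption.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper; assumes nWords is the sum of all per-document term frequencies.
final class WordCountSketch {
    // Sums the frequencies of every term in every document.
    static int countWords(List<Map<Integer, Integer>> docTerms) {
        int n = 0;
        for (Map<Integer, Integer> freqs : docTerms) {
            for (int f : freqs.values()) {
                n += f;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<Map<Integer, Integer>> docTerms = List.of(
                Map.of(0, 2, 1, 1),  // document 0: three observations
                Map.of(1, 4));       // document 1: four observations
        System.out.println(countWords(docTerms)); // prints 7
    }
}
```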

Constructor Detail

LuceneMapCorpus

public LuceneMapCorpus(java.lang.String indexPath,
                       java.lang.String contentField,
                       java.lang.String docNameField,
                       int minDf,
                       boolean useLowDf)
                throws java.io.IOException
Create a term corpus from the index found at the path, using the content field for the terms and the docNameField for the document names. Only terms with a document frequency of at least minDf enter the vocabulary; depending on useLowDf, the remaining low-df terms are either ignored or made accessible.

The content field must have been indexed.

The doc names field must have been stored.

Parameters:
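indexPath - path to the Lucene index
contentField - name of the indexed field whose terms form the corpus content
docNameField - name of the stored field that holds the document names
minDf - minimum document frequency a term needs to enter the vocabulary
useLowDf - whether terms with a document frequency below minDf are made accessible rather than ignored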
Throws:
java.io.IOException
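The effect of the minDf cutoff can be illustrated without Lucene. The following is a minimal sketch, assuming documents are given as token lists; MinDfSketch and vocabulary are hypothetical names, and the real constructor reads document frequencies from the index instead of counting them like this.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper illustrating a document-frequency cutoff.
final class MinDfSketch {
    // Returns the terms that occur in at least minDf documents, in sorted order.
    static Set<String> vocabulary(Collection<? extends Collection<String>> docs, int minDf) {
        Map<String, Integer> df = new TreeMap<>();
        for (Collection<String> doc : docs) {
            // Count each term once per document, however often it repeats.
            for (String term : new TreeSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            if (e.getValue() >= minDf) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("topic", "model", "topic"),
                List.of("topic", "index"),
                List.of("index", "search"));
        System.out.println(vocabulary(docs, 2)); // prints [index, topic]
    }
}
```

Note that "topic" counts once for the first document even though it appears twice there: document frequency counts documents, not occurrences.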
Method Detail

extract

protected void extract()
                throws java.io.IOException
Description copied from class: LuceneCorpus
Initialise the corpus by extracting the files from the index. This method must be called exactly once before any get method is called.

Overrides:
extract in class LuceneCorpus
Throws:
java.io.IOException

setupDocs

private void setupDocs()
                throws java.io.IOException
Creates document-terms maps.

Throws:
java.io.IOException

buildDocNames

private void buildDocNames()
                    throws java.io.IOException
Create the list of document names from the Lucene index.

TODO: doc names could be read directly from the index, but then only an interface like getDocName(index) can be provided, as there exists no access to all document names at once.

Throws:
java.io.IOException

buildDocTerms

private void buildDocTerms()
                    throws java.io.IOException
Create the term frequency maps for the corpus, with the reduced vocabulary according to the document frequencies.

Throws:
java.io.IOException
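Building frequency maps over a reduced vocabulary can be sketched as follows. This is an illustration under assumptions, not the method's actual code: DocTermsSketch and build are hypothetical names, the documents are plain token lists rather than Lucene term vectors, and the term index is an ordinary Map instead of an IBijectiveMap.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the real method reads term vectors from the Lucene index.
final class DocTermsSketch {
    // Builds one map (term id -> frequency) per document, keeping only indexed terms.
    static List<Map<Integer, Integer>> build(List<List<String>> docs,
                                             Map<String, Integer> termIndex) {
        List<Map<Integer, Integer>> docTerms = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<Integer, Integer> freqs = new HashMap<>();
            for (String term : doc) {
                Integer id = termIndex.get(term);
                if (id != null) { // terms outside the reduced vocabulary are skipped
                    freqs.merge(id, 1, Integer::sum);
                }
            }
            docTerms.add(freqs);
        }
        return docTerms;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("topic", "model", "topic"),
                List.of("rare"));
        Map<String, Integer> termIndex = Map.of("topic", 0, "model", 1);
        System.out.println(build(docs, termIndex));
    }
}
```

In this sketch a document whose terms all fall below the cutoff ends up with an empty map.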

lookupDoc

public java.lang.String lookupDoc(int doc)
Description copied from interface: ITermCorpus
Get document name from id.

Specified by:
lookupDoc in interface ITermCorpus
Overrides:
lookupDoc in class LuceneCorpus
Returns:
the name of the document with the given id

getDocNames

public java.util.ArrayList<java.lang.String> getDocNames()
Description copied from interface: IRandomAccessTermCorpus
Get a list of all document names / ids.

Specified by:
getDocNames in interface IRandomAccessTermCorpus
Returns:
the list of all document names / ids

getDocTerms

public java.util.Map<java.lang.Integer,java.lang.Integer> getDocTerms(int doc)
Description copied from interface: ITermCorpus
Get the document terms as a frequency map id->frequency.

Specified by:
getDocTerms in interface ITermCorpus
Overrides:
getDocTerms in class LuceneCorpus
Returns:
the term frequency map (term id -> frequency) of the given document

getDocTerms

public java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> getDocTerms()
Description copied from interface: IRandomAccessTermCorpus
Get the list of document term maps (term id -> frequency).

Specified by:
getDocTerms in interface IRandomAccessTermCorpus
Returns:
the list of per-document term frequency maps

getTermIndex

public org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> getTermIndex()
Description copied from interface: IRandomAccessTermCorpus
Get the bijective map between terms and their ids.

Specified by:
getTermIndex in interface IRandomAccessTermCorpus
Returns:
the bijective map between terms and their ids
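The idea of a bijective term / id map can be sketched with two ordinary hash maps kept in sync. This does not reproduce the API of org.knowceans.map.IBijectiveMap, which is not described here; TermIdMapSketch, add, idOf and termOf are hypothetical names, and returning -1 for an unknown term is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical two-way term/id map; it does not reproduce the IBijectiveMap API.
final class TermIdMapSketch {
    private final Map<String, Integer> termToId = new HashMap<>();
    private final Map<Integer, String> idToTerm = new HashMap<>();

    // Registers a term if it is new and returns its id (ids follow insertion order).
    int add(String term) {
        Integer id = termToId.get(term);
        if (id == null) {
            id = termToId.size();
            termToId.put(term, id);
            idToTerm.put(id, term);
        }
        return id;
    }

    // Forward lookup: term -> id, or -1 (an assumed sentinel) if the term is unknown.
    int idOf(String term) {
        return termToId.getOrDefault(term, -1);
    }

    // Inverse lookup: id -> term, or null if the id is unknown.
    String termOf(int id) {
        return idToTerm.get(id);
    }

    public static void main(String[] args) {
        TermIdMapSketch index = new TermIdMapSketch();
        index.add("topic");
        index.add("model");
        System.out.println(index.idOf("model") + " " + index.termOf(0)); // prints "1 topic"
    }
}
```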

getNwords

public int getNwords()
Description copied from interface: IRandomAccessTermCorpus
Get the number of words (term observations) in the corpus.

Specified by:
getNwords in interface IRandomAccessTermCorpus
Returns:
the number of words (term observations) in the corpus