org.knowceans.corpus
Class LuceneCorpus

java.lang.Object
  extended by org.knowceans.corpus.LuceneCorpus
All Implemented Interfaces:
ITermCorpus
Direct Known Subclasses:
LuceneMapCorpus

public class LuceneCorpus
extends java.lang.Object
implements ITermCorpus

LuceneCorpus provides an ITermCorpus interface around a Lucene index. For this, the Lucene index needs a stored field with some document identification (technically, not necessarily unique) and a term vector field with the content. This implementation accesses the fields of the Lucene index directly.

The corpus can split the Lucene index vocabulary by a document frequency (df) threshold.
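
A minimal usage sketch (the index path and field names below are illustrative assumptions, and it is assumed that the full constructor performs the extraction step):

    // "docs/index", "docname" and "contents" are hypothetical path / field names
    LuceneCorpus corpus = new LuceneCorpus("docs/index", "docname", "contents", 5, true);
    System.out.println(corpus.getNdocs() + " documents, "
        + corpus.getNterms() + " terms");   // presumably the size of the regular term index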

Author:
gregor

Field Summary
protected  java.lang.String contentField
          Lucene index field to extract the corpus information from.
protected  java.lang.String docNamesField
          Lucene index field to read the document names from.
protected  java.util.ArrayList<java.lang.Integer> emptyDocs
           
protected static int INDEX_UNKNOWN
           
protected  java.lang.String indexpath
           
protected  org.apache.lucene.index.IndexReader ir
           
protected  int minDf
          Minimum document frequency for terms allowed in the regular term index.
protected  int nTerms
          Number of terms in the regular term index.
protected  int nTermsLowDf
          Number of terms in the lowDf index.
protected  org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> termIndex
          Index of term<->id
protected  org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> termIndexLowDf
          Index of lowDf term<->id; the ids are above those of termIndex, i.e., a single term-document matrix could be created if necessary.
private  boolean useLowDf
           
 
Constructor Summary
LuceneCorpus(java.lang.String path, java.lang.String docNamesField)
          Initialise the corpus with just access to the IndexReader.
LuceneCorpus(java.lang.String indexPath, java.lang.String docNameField, java.lang.String contentField, int minDf, boolean useLowDf)
          Create a term corpus from the index found at the path, using the content field for the terms and the docNameField for the document names.
 
Method Summary
protected  java.util.ArrayList<java.lang.String> buildTermIndex(boolean useIgnored)
          Create the term index from the Lucene index.
protected  void buildTermIndexLowDf(java.util.ArrayList<java.lang.String> ignoredTerms)
          Create the hash map from the vector of low-df terms
protected  void extract()
          Initialise the corpus by extracting the files from the index.
 java.util.Map<java.lang.Integer,java.lang.Integer> getDocTerms(int doc)
          Get the document terms as a frequency map id->frequency.
private  java.util.Vector<java.lang.Integer> getDocWords(int doc, java.util.Random rand)
          Get the words of document doc as a scrambled sequence.
 int[][] getDocWords(java.util.Random rand)
          Get the documents as bag-of-words vectors, i.e., for each document a scrambled array of term indices is generated.
 int[] getDocWords(java.lang.String string)
          Get the words of an unknown document as a scrambled sequence.
 int getNdocs()
          Number of documents in corpus
 int getNterms()
          Number of terms in corpus
 int getNwords(int doc)
           
 boolean isEmptyDoc(int doc)
          Whether this document is non-empty after filtering.
 java.lang.String lookup(int term)
          Get the string for the particular index, either from the regular index or from the lowDf index.
 int lookup(java.lang.String term)
          Get the index of the particular term, either from the regular index or from the lowDf index, which results in an index >= nTerms.
 java.lang.String lookupDoc(int doc)
          Get document name from id.
 int lookupDoc(java.lang.String docName)
          Get the document index of the document with string id docName.
private  boolean ok(java.lang.String string)
           
protected  void setupIndex(boolean useIgnored)
          Creates term map and counts.
 void writeCorpus(java.lang.String filebase)
          Write the corpus to the file.
 void writeDocList(java.lang.String file)
          Write the document titles to a file (one document per line).
 void writeVocabulary(java.lang.String file, boolean sort)
          Write the vocabulary to the file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

INDEX_UNKNOWN

protected static final int INDEX_UNKNOWN
See Also:
Constant Field Values

indexpath

protected java.lang.String indexpath

ir

protected org.apache.lucene.index.IndexReader ir

emptyDocs

protected java.util.ArrayList<java.lang.Integer> emptyDocs

termIndex

protected org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> termIndex
Index of term<->id


termIndexLowDf

protected org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> termIndexLowDf
Index of lowDf term<->id; the ids are above those of termIndex, i.e., a single term-document matrix could be created if necessary.


minDf

protected int minDf
Minimum document frequency for terms allowed in the regular term index.


nTerms

protected int nTerms
Number of terms in the regular term index.


nTermsLowDf

protected int nTermsLowDf
Number of terms in the lowDf index.


contentField

protected java.lang.String contentField
Lucene index field to extract the corpus information from.


docNamesField

protected java.lang.String docNamesField
Lucene index field to read the document names from.


useLowDf

private boolean useLowDf
Constructor Detail

LuceneCorpus

public LuceneCorpus(java.lang.String indexPath,
                    java.lang.String docNameField,
                    java.lang.String contentField,
                    int minDf,
                    boolean useLowDf)
             throws java.io.IOException
Create a term corpus from the index found at the path, using the content field for the terms and the docNameField for the document names. Only terms with a document frequency greater than or equal to minDf are placed in the regular term index; depending on useLowDf, the remaining low-df terms are either ignored or made accessible through a separate low-df index.

The content field must have been indexed.

The doc names field must have been stored.
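
A sketch of the parameter semantics (path, field names and the example term are assumptions):

    LuceneCorpus corpus =
        new LuceneCorpus("/data/index", "docid", "contents", 3, true);
    // with useLowDf = true, a term whose df is below 3 is presumably not dropped
    // but kept in the low-df index; its id is then >= corpus.getNterms()
    int id = corpus.lookup("someRareTerm");
    // with useLowDf = false, the same lookup would return -1 (INDEX_UNKNOWN)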

Parameters:
indexPath -
docNameField -
contentField -
minDf -
useLowDf -
Throws:
java.io.IOException

LuceneCorpus

public LuceneCorpus(java.lang.String path,
                    java.lang.String docNamesField)
             throws java.io.IOException
Initialise the corpus with just access to the IndexReader. This is useful if only id information is required.
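
For example (path, field name and document name are assumptions), a name-only corpus for id lookups might look like:

    LuceneCorpus names = new LuceneCorpus("/data/index", "docid");
    int doc = names.lookupDoc("report-2005-07");   // hypothetical document name
    String name = names.lookupDoc(doc);            // map the id back to its name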

Parameters:
path -
docNamesField -
Throws:
java.io.IOException
Method Detail

extract

protected void extract()
                throws java.io.IOException
Initialise the corpus by extracting the files from the index. This method must be called exactly once before any get method is called.

Throws:
java.io.IOException

setupIndex

protected void setupIndex(boolean useIgnored)
                   throws java.io.IOException
Creates term map and counts.

Parameters:
useIgnored -
Throws:
java.io.IOException

buildTermIndex

protected java.util.ArrayList<java.lang.String> buildTermIndex(boolean useIgnored)
                                                        throws java.io.IOException
Create the term index from the Lucene index.

Parameters:
useIgnored -
Returns:
Throws:
java.io.IOException

ok

private boolean ok(java.lang.String string)
Parameters:
string -
Returns:

buildTermIndexLowDf

protected void buildTermIndexLowDf(java.util.ArrayList<java.lang.String> ignoredTerms)
Create the hash map from the vector of low-df terms

Parameters:
ignoredTerms -

lookupDoc

public java.lang.String lookupDoc(int doc)
Description copied from interface: ITermCorpus
Get document name from id.

Specified by:
lookupDoc in interface ITermCorpus
Returns:

lookupDoc

public int lookupDoc(java.lang.String docName)
Get the document index of the document with string id docName.

Specified by:
lookupDoc in interface ITermCorpus
Parameters:
docName -
Returns:

getDocTerms

public java.util.Map<java.lang.Integer,java.lang.Integer> getDocTerms(int doc)
Description copied from interface: ITermCorpus
Get the document terms as a frequency map id->frequency.
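
A usage sketch (document id 0 is an arbitrary assumption):

    java.util.Map<Integer, Integer> freqs = corpus.getDocTerms(0);
    for (java.util.Map.Entry<Integer, Integer> e : freqs.entrySet()) {
        // resolve the term id back to its string and print its frequency
        System.out.println(corpus.lookup(e.getKey()) + " : " + e.getValue());
    }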

Specified by:
getDocTerms in interface ITermCorpus
Returns:

getDocWords

public int[][] getDocWords(java.util.Random rand)
Get the documents as bag-of-words vectors, i.e., for each document a scrambled array of term indices is generated.
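
For example (the seed is arbitrary):

    int[][] docs = corpus.getDocWords(new java.util.Random(42));
    // docs[m][n] is the term index of the n-th (scrambled) word of document m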

Specified by:
getDocWords in interface ITermCorpus
Parameters:
rand - random number generator or null to use standard generator
Returns:

getDocWords

private java.util.Vector<java.lang.Integer> getDocWords(int doc,
                                                        java.util.Random rand)
Get the words of document doc as a scrambled sequence.

The getDocTerms... loop appears to scale badly; use LuceneMapCorpus for larger documents.

Parameters:
doc -
rand - random number generator or null to use standard generator
Returns:

getDocWords

public int[] getDocWords(java.lang.String string)
Get the words of an unknown document as a scrambled sequence. This method accesses the Lucene search index.
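
A sketch with unseen text (the query string is illustrative; the text is presumably analysed against the corpus vocabulary):

    int[] words = corpus.getDocWords("latent topic models for text");
    // words holds the (scrambled) term indices of the recognised vocabulary terms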

Parameters:
string -

lookup

public java.lang.String lookup(int term)
Get the string for the particular index, either from the regular index or from the lowDf index.

Specified by:
lookup in interface ITermCorpus
Parameters:
term - term index
Returns:
term or null if unknown

lookup

public int lookup(java.lang.String term)
Get the index of the particular term, either from the regular index or from the lowDf index, which results in an index >= nTerms.
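
A round-trip sketch (the term "model" is an assumption):

    int id = corpus.lookup("model");
    if (id != -1) {                                  // -1 == INDEX_UNKNOWN
        String term = corpus.lookup(id);             // back to the string "model"
        boolean lowDf = id >= corpus.getNterms();    // true if it came from the low-df index
    }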

Specified by:
lookup in interface ITermCorpus
Parameters:
term - string
Returns:
index of the term or -1 (INDEX_UNKNOWN)

getNdocs

public int getNdocs()
Description copied from interface: ITermCorpus
Number of documents in corpus

Specified by:
getNdocs in interface ITermCorpus
Returns:

getNterms

public int getNterms()
Description copied from interface: ITermCorpus
Number of terms in corpus

Specified by:
getNterms in interface ITermCorpus
Returns:

getNwords

public int getNwords(int doc)

isEmptyDoc

public final boolean isEmptyDoc(int doc)
Whether this document is non-empty after filtering.

Parameters:
doc -
Returns:

writeDocList

public void writeDocList(java.lang.String file)
                  throws java.io.IOException
Write the document titles to a file (one document per line).

Parameters:
file -
Throws:
java.io.IOException

writeCorpus

public void writeCorpus(java.lang.String filebase)
                 throws java.io.IOException
Write the corpus to the file. (Important note: both frequency numbers written are document frequencies and therefore equal, since the Lucene index does not support term frequency.)
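
A hedged export sketch (the file base is an assumption; the exact on-disk format is defined by the implementation):

    corpus.writeCorpus("nips");          // corpus files written under the base name "nips"
    corpus.writeDocList("nips.docs");    // document titles, one per line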

Parameters:
filebase -
Throws:
java.io.IOException

writeVocabulary

public void writeVocabulary(java.lang.String file,
                            boolean sort)
                     throws java.io.IOException
Write the vocabulary to the file. The corpus / term index is assumed to have been reordered / split during extraction.
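
For example (the file name is an assumption):

    corpus.writeVocabulary("nips.vocab", true);   // write the terms sorted alphabetically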

Parameters:
file -
sort - whether to sort the vocabulary in alphabetical order
Throws:
java.io.IOException