LuceneMapCorpus

org.knowceans.corpus
Class LuceneMapCorpus

java.lang.Object
  org.knowceans.corpus.LuceneCorpus
      org.knowceans.corpus.LuceneMapCorpus

All Implemented Interfaces: IRandomAccessTermCorpus, ITermCorpus

public class LuceneMapCorpus
extends LuceneCorpus
implements IRandomAccessTermCorpus

This class creates a term corpus interface (ITermCorpus) around a Lucene index. For this, the Lucene index needs a stored field with some document identification (technically, it need not be unique) and a term vector field with the content.

This implementation uses a map of terms and documents to directly access mapping information without searching the Lucene index. Use this class if frequent lookups are necessary, e.g., for applications that focus on topic-based search rather than full-text search.
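The direct-access idea described above can be pictured with a minimal, self-contained sketch. The class `DirectAccessSketch` and its constructor are hypothetical stand-ins and not part of org.knowceans; only the field shapes (`docNames`, `docTerms`) and the method names `lookupDoc` and `getDocTerms` mirror this page.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration; this is not the knowceans implementation.
final class DirectAccessSketch {
    // Document names, indexed by document id.
    private final List<String> docNames;
    // Per document: term id -> frequency of that term in the document.
    private final List<Map<Integer, Integer>> docTerms;

    DirectAccessSketch(List<String> docNames, List<Map<Integer, Integer>> docTerms) {
        if (docNames.size() != docTerms.size()) {
            throw new IllegalArgumentException("one term map per document expected");
        }
        this.docNames = docNames;
        this.docTerms = docTerms;
    }

    // Both lookups are plain list accesses; no index search is involved.
    String lookupDoc(int doc) {
        return docNames.get(doc);
    }

    Map<Integer, Integer> getDocTerms(int doc) {
        return docTerms.get(doc);
    }

    public static void main(String[] args) {
        DirectAccessSketch corpus = new DirectAccessSketch(
                List.of("doc-a", "doc-b"),
                List.of(Map.of(0, 3, 4, 1), Map.of(4, 2)));
        System.out.println(corpus.lookupDoc(1));
        System.out.println(corpus.getDocTerms(0).get(0));
    }
}
```

Keeping everything in memory trades heap space for lookup speed, which fits the frequent-lookup use case named above.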

Author: gregor

Field Summary
`private java.util.ArrayList<java.lang.String>`	`docNames` Index of document names.
`protected java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>>`	`docTerms` Each document's term frequencies, as a map termid -> frequency(doc).
`private int`	`nWords` Number of words (term observations) in the corpus.

Fields inherited from class org.knowceans.corpus.LuceneCorpus
`contentField, docNamesField, emptyDocs, INDEX_UNKNOWN, indexpath, ir, minDf, nTerms, nTermsLowDf, termIndex, termIndexLowDf`

Constructor Summary
`LuceneMapCorpus(java.lang.String indexPath, java.lang.String contentField, java.lang.String docNameField, int minDf, boolean useLowDf)` Create a term corpus from the index found at the path, using the content field for the terms and the docNameField for the document names.

Method Summary
`private void`	`buildDocNames()` Create the list of document names from the Lucene index.
`private void`	`buildDocTerms()` Create the term frequency maps for the corpus, with the reduced vocabulary according to the document frequencies.
`protected void`	`extract()` Initialise the corpus by extracting the files from the index.
`java.util.ArrayList<java.lang.String>`	`getDocNames()` Get a list of all document names / ids.
`java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>>`	`getDocTerms()` Get list of document term maps (index->freq)
`java.util.Map<java.lang.Integer,java.lang.Integer>`	`getDocTerms(int doc)` Get the document terms as a frequency map id->frequency.
`int`	`getNwords()` Get the number of words (term observations) in the corpus.
`org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer>`	`getTermIndex()` Get a bijective map term / id
`java.lang.String`	`lookupDoc(int doc)` Get document name from id.
`private void`	`setupDocs()` Creates document-terms maps.

Methods inherited from class org.knowceans.corpus.LuceneCorpus
`buildTermIndex, buildTermIndexLowDf, getDocWords, getDocWords, getNdocs, getNterms, getNwords, isEmptyDoc, lookup, lookup, lookupDoc, setupIndex, writeCorpus, writeDocList, writeVocabulary`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Methods inherited from interface org.knowceans.corpus.ITermCorpus
`getDocWords, getNdocs, getNterms, lookup, lookup, lookupDoc`

Field Detail

docTerms

protected java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> docTerms

Each document's term frequencies, as a map termid -> frequency(doc).

docNames

private java.util.ArrayList<java.lang.String> docNames

Index of document names.

nWords

private int nWords

Number of words (term observations) in the corpus.

Constructor Detail

LuceneMapCorpus

public LuceneMapCorpus(java.lang.String indexPath,
                       java.lang.String contentField,
                       java.lang.String docNameField,
                       int minDf,
                       boolean useLowDf)
                throws java.io.IOException

Create a term corpus from the index found at the path, using the content field for the terms and the docNameField for the document names. Only terms with a document frequency of at least minDf are used; the remaining low-df terms are either ignored or made accessible.

The content field must have been indexed.

The doc names field must have been stored.

Parameters: indexPath, contentField, docNameField, minDf, useLowDf
Throws: java.io.IOException
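The minDf rule can be illustrated without Lucene. The following is a minimal sketch, assuming documents are already tokenised into lists of strings; the class `MinDfFilterSketch` and its method `vocabulary` are hypothetical and do not reflect how the constructor reads the index.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of document-frequency filtering.
final class MinDfFilterSketch {
    // Returns the terms that occur in at least minDf documents.
    static Set<String> vocabulary(List<List<String>> docs, int minDf) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // Count each term once per document, however often it occurs there.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            if (e.getValue() >= minDf) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("topic", "model", "topic"),
                List.of("topic", "search"),
                List.of("model", "index"));
        System.out.println(vocabulary(docs, 2)); // prints [model, topic]
    }
}
```

Here "search" and "index" each appear in a single document, so with a threshold of 2 they fall into the low-df group.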

Method Detail

extract

protected void extract()
                throws java.io.IOException

Description copied from class: LuceneCorpus

Initialise the corpus by extracting the files from the index. This method must be called exactly once before any get method is called.

Overrides: extract in class LuceneCorpus

Throws: java.io.IOException

setupDocs

private void setupDocs()
                throws java.io.IOException

Creates document-terms maps.

Throws: java.io.IOException

buildDocNames

private void buildDocNames()
                    throws java.io.IOException

Create the list of document names from the Lucene index.

TODO: doc names could be read directly from the index, but then only an interface like getDocName(index) could be provided, as there is no way to access all document names at once.

Throws: java.io.IOException

buildDocTerms

private void buildDocTerms()
                    throws java.io.IOException

Create the term frequency maps for the corpus, with the reduced vocabulary according to the document frequencies.

Throws: java.io.IOException
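As a rough picture of what a term frequency map over a reduced vocabulary looks like, consider the sketch below. It assumes a document is available as a list of tokens and that the reduced vocabulary is a plain `Map<String, Integer>` from term to id; `DocTermsBuildSketch` and `termFrequencies` are hypothetical names, and the actual method works on the Lucene index instead.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the knowceans implementation.
final class DocTermsBuildSketch {
    // Counts term occurrences for one document, keeping only terms present in
    // the (already reduced) vocabulary; all other tokens are skipped.
    static Map<Integer, Integer> termFrequencies(List<String> tokens,
                                                 Map<String, Integer> termIndex) {
        Map<Integer, Integer> freqs = new HashMap<>();
        for (String token : tokens) {
            Integer id = termIndex.get(token);
            if (id != null) {
                freqs.merge(id, 1, Integer::sum);
            }
        }
        return freqs;
    }

    public static void main(String[] args) {
        Map<String, Integer> termIndex = Map.of("topic", 0, "model", 1);
        List<String> tokens = List.of("topic", "rare", "topic", "model");
        System.out.println(termFrequencies(tokens, termIndex));
    }
}
```

The token "rare" is not in the vocabulary, so it contributes nothing to the resulting map.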

lookupDoc

public java.lang.String lookupDoc(int doc)

Description copied from interface: ITermCorpus

Get document name from id.

Specified by: lookupDoc in interface ITermCorpus
Overrides: lookupDoc in class LuceneCorpus

getDocNames

public java.util.ArrayList<java.lang.String> getDocNames()

Description copied from interface: IRandomAccessTermCorpus

Get a list of all document names / ids.

Specified by: getDocNames in interface IRandomAccessTermCorpus

getDocTerms

public java.util.Map<java.lang.Integer,java.lang.Integer> getDocTerms(int doc)

Description copied from interface: ITermCorpus

Get the document terms as a frequency map id->frequency.

Specified by: getDocTerms in interface ITermCorpus
Overrides: getDocTerms in class LuceneCorpus

getDocTerms

public java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> getDocTerms()

Description copied from interface: IRandomAccessTermCorpus

Get list of document term maps (index->freq)

Specified by: getDocTerms in interface IRandomAccessTermCorpus

getTermIndex

public org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> getTermIndex()

Description copied from interface: IRandomAccessTermCorpus

Get a bijective map term / id

Specified by: getTermIndex in interface IRandomAccessTermCorpus
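A bijective term / id map can be thought of as two lookups kept in sync. The sketch below is a hypothetical illustration only: `TermIndexSketch` and its methods are invented for this example, the methods of the real `IBijectiveMap` interface may differ, and returning -1 for an unknown term is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical two-way term/id map; the real IBijectiveMap interface may differ.
final class TermIndexSketch {
    private final Map<String, Integer> termToId = new HashMap<>();
    private final List<String> idToTerm = new ArrayList<>();

    // Returns the id of the term, assigning the next free id on first sight.
    int add(String term) {
        Integer id = termToId.get(term);
        if (id == null) {
            id = idToTerm.size();
            termToId.put(term, id);
            idToTerm.add(term);
        }
        return id;
    }

    // Term -> id, or -1 if the term is unknown (assumed convention).
    int idOf(String term) {
        return termToId.getOrDefault(term, -1);
    }

    // Id -> term; throws IndexOutOfBoundsException for an unknown id.
    String termOf(int id) {
        return idToTerm.get(id);
    }

    public static void main(String[] args) {
        TermIndexSketch index = new TermIndexSketch();
        index.add("topic");
        index.add("model");
        System.out.println(index.idOf("model"));
        System.out.println(index.termOf(0));
    }
}
```

Because every term gets exactly one id and every id exactly one term, both directions can be answered without scanning.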

getNwords

public int getNwords()

Description copied from interface: IRandomAccessTermCorpus

Get the number of words (term observations) in the corpus.

Specified by: getNwords in interface IRandomAccessTermCorpus
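To make the distinction between terms and words concrete, the sketch below counts term observations by summing all frequencies over all document maps. That reading of "term observations" is an assumption, and `WordCountSketch` with its method `countWords` is a hypothetical helper, not part of the library.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration of counting term observations.
final class WordCountSketch {
    // Sums the frequencies of all terms in all documents.
    static int countWords(List<Map<Integer, Integer>> docTerms) {
        int nWords = 0;
        for (Map<Integer, Integer> doc : docTerms) {
            for (int freq : doc.values()) {
                nWords += freq;
            }
        }
        return nWords;
    }

    public static void main(String[] args) {
        List<Map<Integer, Integer>> docTerms = List.of(
                Map.of(0, 3, 4, 1),
                Map.of(4, 2));
        System.out.println(countWords(docTerms)); // prints 6
    }
}
```

The example corpus has two distinct terms (ids 0 and 4) but six words, since each occurrence counts.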
