java.lang.Object
  org.knowceans.corpus.LuceneCorpus
      org.knowceans.corpus.LuceneMapCorpus
public class LuceneMapCorpus
LuceneMapCorpus provides a term corpus interface (ITermCorpus) around a Lucene index. For this, the Lucene index needs a stored field holding a document identifier (which, technically, need not be unique) and a term vector field holding the content.
This implementation uses a map of terms and documents to directly access mapping information without searching the Lucene index. Use this class if frequent lookups are necessary, e.g., for applications that focus on topic-based search rather than full-text search.
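The map-based layout described above can be pictured with plain Java collections. The following is a minimal sketch, not the class's actual implementation: `MapCorpusSketch`, `addDoc`, `frequency`, and the sample data are hypothetical names introduced for illustration, and no Lucene index is involved. Only the field shapes (a list of document names and a list of term id -> frequency maps) are taken from the field summary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the in-memory layout; not part of org.knowceans. */
class MapCorpusSketch {
    // Mirrors the documented field shapes: one name and one frequency map per document.
    private final List<String> docNames = new ArrayList<>();
    private final List<Map<Integer, Integer>> docTerms = new ArrayList<>();

    /** Adds a document from its name and term ids (one array entry per occurrence). */
    public int addDoc(String name, int[] termIds) {
        Map<Integer, Integer> freqs = new HashMap<>();
        for (int t : termIds) {
            freqs.merge(t, 1, Integer::sum); // count repeated term ids
        }
        docNames.add(name);
        docTerms.add(freqs);
        return docNames.size() - 1; // the new document's id
    }

    /** Name lookup by document id; a list access, no index search. */
    public String lookupDoc(int doc) {
        return docNames.get(doc);
    }

    /** Frequency of a term in a document; 0 if the term does not occur. */
    public int frequency(int doc, int termId) {
        return docTerms.get(doc).getOrDefault(termId, 0);
    }

    public static void main(String[] args) {
        MapCorpusSketch corpus = new MapCorpusSketch();
        int d = corpus.addDoc("doc-a", new int[] {0, 2, 2, 5});
        System.out.println(corpus.lookupDoc(d) + ": term 2 occurs "
                + corpus.frequency(d, 2) + " time(s)");
    }
}
```

Keeping both structures in memory trades space for lookup speed, which matches the stated use case of frequent lookups.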
Field Summary | |
---|---|
private java.util.ArrayList<java.lang.String> | docNames: Index of document names from the
protected java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> | docTerms: Each document's term frequencies, termid -> frequency(doc)
private int | nWords: Words in the corpus
Fields inherited from class org.knowceans.corpus.LuceneCorpus |
---|
contentField, docNamesField, emptyDocs, INDEX_UNKNOWN, indexpath, ir, minDf, nTerms, nTermsLowDf, termIndex, termIndexLowDf |
Constructor Summary | |
---|---|
LuceneMapCorpus(java.lang.String indexPath, java.lang.String contentField, java.lang.String docNameField, int minDf, boolean useLowDf) | Create a term corpus from the index found at the path, using the content field for the terms and the docNameField for the document names.
Method Summary | |
---|---|
private void | buildDocNames(): Create the list of document names from the Lucene index.
private void | buildDocTerms(): Create the term frequency maps for the corpus, with the reduced vocabulary according to the document frequencies.
protected void | extract(): Initialise the corpus by extracting the files from the index.
java.util.ArrayList<java.lang.String> | getDocNames(): Get a list of all document names / ids.
java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> | getDocTerms(): Get list of document term maps (index->freq)
java.util.Map<java.lang.Integer,java.lang.Integer> | getDocTerms(int doc): Get the document terms as a frequency map id->frequency.
int | getNwords(): Get the number of words (term observations) in the corpus.
org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> | getTermIndex(): Get a bijective map term / id
java.lang.String | lookupDoc(int doc): Get document name from id.
private void | setupDocs(): Creates document-terms maps.
Methods inherited from class org.knowceans.corpus.LuceneCorpus |
---|
buildTermIndex, buildTermIndexLowDf, getDocWords, getDocWords, getNdocs, getNterms, getNwords, isEmptyDoc, lookup, lookup, lookupDoc, setupIndex, writeCorpus, writeDocList, writeVocabulary |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Methods inherited from interface org.knowceans.corpus.ITermCorpus |
---|
getDocWords, getNdocs, getNterms, lookup, lookup, lookupDoc |
Field Detail |
---|
protected java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> docTerms
private java.util.ArrayList<java.lang.String> docNames
private int nWords
Constructor Detail |
---|
public LuceneMapCorpus(java.lang.String indexPath, java.lang.String contentField, java.lang.String docNameField, int minDf, boolean useLowDf) throws java.io.IOException
The content field must have been indexed.
The doc names field must have been stored.
Parameters:
indexPath -
contentField -
docNameField -
minDf -
useLowDf -
Throws:
java.io.IOException
Method Detail |
---|
protected void extract() throws java.io.IOException
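Description copied from class: LuceneCorpus
Initialise the corpus by extracting the files from the index.
Overrides:
extract in class LuceneCorpus
Throws:
java.io.IOException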
private void setupDocs() throws java.io.IOException
Creates document-terms maps.
Throws:
java.io.IOException
private void buildDocNames() throws java.io.IOException
Create the list of document names from the Lucene index.
TODO: doc names could be read directly from the index, but then only an accessor such as getDocName(index) could be provided, because the index offers no access to all document names at once.
Throws:
java.io.IOException
private void buildDocTerms() throws java.io.IOException
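Create the term frequency maps for the corpus, with the reduced vocabulary according to the document frequencies.
Throws:
java.io.IOException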
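To illustrate what building term frequency maps over a vocabulary reduced by document frequency can look like, here is a minimal sketch using plain Java collections instead of a Lucene index. It rests on an assumption that this page does not confirm: that a term is kept when it occurs in at least `minDf` documents. `DocTermsSketch`, its two methods, and the sample documents are hypothetical and are not part of org.knowceans.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a document-frequency cutoff; not the library's code. */
class DocTermsSketch {

    /** Assigns ids to terms whose document frequency is at least minDf (assumed semantics). */
    public static Map<String, Integer> buildTermIndex(List<List<String>> docs, int minDf) {
        Map<String, Integer> df = new TreeMap<>(); // sorted, so id assignment is deterministic
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) { // count each term once per document
                df.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Integer> termIndex = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            if (e.getValue() >= minDf) {
                termIndex.put(e.getKey(), termIndex.size()); // next free id
            }
        }
        return termIndex;
    }

    /** Builds one term id -> frequency map per document, skipping terms outside the index. */
    public static List<Map<Integer, Integer>> buildDocTerms(List<List<String>> docs,
                                                            Map<String, Integer> termIndex) {
        List<Map<Integer, Integer>> docTerms = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<Integer, Integer> freqs = new HashMap<>();
            for (String term : doc) {
                Integer id = termIndex.get(term);
                if (id != null) { // term survived the document-frequency cutoff
                    freqs.merge(id, 1, Integer::sum);
                }
            }
            docTerms.add(freqs);
        }
        return docTerms;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("topic", "model", "topic"),
                Arrays.asList("topic", "search"),
                Arrays.asList("model", "index"));
        Map<String, Integer> termIndex = buildTermIndex(docs, 2);
        System.out.println(termIndex);
        System.out.println(buildDocTerms(docs, termIndex));
    }
}
```

In this sample, "search" and "index" each occur in only one document, so with a cutoff of 2 they receive no id and do not appear in any frequency map.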
public java.lang.String lookupDoc(int doc)
Description copied from interface: ITermCorpus
Get document name from id.
Specified by:
lookupDoc in interface ITermCorpus
Overrides:
lookupDoc in class LuceneCorpus
public java.util.ArrayList<java.lang.String> getDocNames()
Description copied from interface: IRandomAccessTermCorpus
Get a list of all document names / ids.
Specified by:
getDocNames in interface IRandomAccessTermCorpus
public java.util.Map<java.lang.Integer,java.lang.Integer> getDocTerms(int doc)
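Description copied from interface: ITermCorpus
Get the document terms as a frequency map id->frequency.
Specified by:
getDocTerms in interface ITermCorpus
Overrides:
getDocTerms in class LuceneCorpus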
public java.util.ArrayList<java.util.Map<java.lang.Integer,java.lang.Integer>> getDocTerms()
Description copied from interface: IRandomAccessTermCorpus
Get list of document term maps (index->freq)
Specified by:
getDocTerms in interface IRandomAccessTermCorpus
public org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> getTermIndex()
Description copied from interface: IRandomAccessTermCorpus
Get a bijective map term / id
Specified by:
getTermIndex in interface IRandomAccessTermCorpus
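A bijective term / id map lets callers translate in both directions. The sketch below shows one common way to get that behaviour with a hash map and a list. It is an illustration only: `TermIndexSketch` and its methods are hypothetical, the actual `IBijectiveMap` API may look quite different, and returning -1 for an unknown term is an assumption (the inherited `INDEX_UNKNOWN` constant presumably plays that role, but its value is not given on this page).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical two-way term/id map; the real IBijectiveMap API may differ. */
class TermIndexSketch {
    private final Map<String, Integer> termToId = new HashMap<>();
    private final List<String> idToTerm = new ArrayList<>();

    /** Returns the id of the term, assigning the next free id on first sight. */
    public int add(String term) {
        Integer id = termToId.get(term);
        if (id == null) {
            id = idToTerm.size();
            termToId.put(term, id);
            idToTerm.add(term); // position in the list equals the id
        }
        return id;
    }

    /** Forward lookup: term -> id, or -1 if the term is unknown (assumed sentinel). */
    public int lookup(String term) {
        return termToId.getOrDefault(term, -1);
    }

    /** Inverse lookup: id -> term, or null if the id is out of range. */
    public String lookup(int id) {
        return (id >= 0 && id < idToTerm.size()) ? idToTerm.get(id) : null;
    }

    public static void main(String[] args) {
        TermIndexSketch index = new TermIndexSketch();
        int topic = index.add("topic");
        index.add("model");
        System.out.println("topic -> " + topic + ", back again -> " + index.lookup(topic));
    }
}
```

Because ids are assigned densely from 0, the inverse direction needs no second hash map; a list indexed by id is enough.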
public int getNwords()
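Description copied from interface: IRandomAccessTermCorpus
Get the number of words (term observations) in the corpus.
Specified by:
getNwords in interface IRandomAccessTermCorpus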
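The number of words counts term observations, not distinct terms: a term that occurs three times in one document contributes three. As a hedged illustration of that distinction, the sketch below derives such a count from a list of term id -> frequency maps. `WordCountSketch` and `countWords` are hypothetical names, and whether the class computes `nWords` this way is not stated on this page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: word count as the sum of all per-document term frequencies. */
class WordCountSketch {

    /** Sums the frequencies of every term in every document. */
    public static int countWords(List<Map<Integer, Integer>> docTerms) {
        int nWords = 0;
        for (Map<Integer, Integer> freqs : docTerms) {
            for (int f : freqs.values()) {
                nWords += f;
            }
        }
        return nWords;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> doc0 = new HashMap<>();
        doc0.put(0, 2); // term 0 occurs twice
        doc0.put(3, 1); // term 3 occurs once
        Map<Integer, Integer> doc1 = new HashMap<>();
        doc1.put(0, 1); // term 0 occurs once

        List<Map<Integer, Integer>> docTerms = new ArrayList<>();
        docTerms.add(doc0);
        docTerms.add(doc1);

        // Two distinct terms (0 and 3) appear, but there are four observations in total.
        System.out.println("words = " + countWords(docTerms));
    }
}
```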