java.lang.Object
  org.knowceans.corpus.LuceneCorpus
public class LuceneCorpus
LuceneCorpus wraps a Lucene index in the ITermCorpus interface. For this, the Lucene index needs a stored field holding some document identification (which technically need not be unique) and a term vector field holding the content. This implementation directly (hence its name) accesses the fields of the Lucene index.
The corpus can split the Lucene index by a document frequency (df) threshold.
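The df split described above can be illustrated with a small, self-contained sketch. It does not use Lucene and is not the class's actual implementation: the class name `DfSplitSketch`, the toy input format (one list of tokens per document), and the rule that a term is regular when df >= minDf are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy illustration of splitting a vocabulary by document frequency (df). */
final class DfSplitSketch {

    /** Terms with df >= minDf (assumed rule); they would receive ids 0 .. nTerms-1. */
    final List<String> regularTerms = new ArrayList<>();

    /** Terms with df < minDf; their ids would continue after the regular ones. */
    final List<String> lowDfTerms = new ArrayList<>();

    DfSplitSketch(List<List<String>> docs, int minDf) {
        // Count in how many documents each term occurs, not how often it occurs.
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        // Route each term to one of the two vocabularies by comparing df to the threshold.
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            if (e.getValue() >= minDf) {
                regularTerms.add(e.getKey());
            } else {
                lowDfTerms.add(e.getKey());
            }
        }
    }
}
```

With three toy documents and a threshold of 2, a term that occurs in only one document ends up in `lowDfTerms`, while terms shared by two or more documents end up in `regularTerms`.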
| Field Summary | |
|---|---|
| protected java.lang.String | contentField: Lucene index field to extract the corpus information from. |
| protected java.lang.String | docNamesField: Lucene index field to read the document names from. |
| protected java.util.ArrayList<java.lang.Integer> | emptyDocs |
| protected static int | INDEX_UNKNOWN |
| protected java.lang.String | indexpath |
| protected org.apache.lucene.index.IndexReader | ir |
| protected int | minDf: Minimum document frequency for terms allowed in the regular term index. |
| protected int | nTerms: Terms in the regular term index. |
| protected int | nTermsLowDf: Terms in the lowDf index. |
| protected org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> | termIndex: Index of term<->id. |
| protected org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> | termIndexLowDf: Index of lowDf term<->id. The ids lie above those of termIndex, i.e., a single term-document matrix could be created if necessary. |
| private boolean | useLowDf |
| Constructor Summary | |
|---|---|
| LuceneCorpus(java.lang.String path, java.lang.String docNamesField) | Initialise the corpus with just access to the IndexReader. |
| LuceneCorpus(java.lang.String indexPath, java.lang.String docNameField, java.lang.String contentField, int minDf, boolean useLowDf) | Create a term corpus from the index found at the path, using the content field for the terms and the docNameField for the document names. |
| Method Summary | |
|---|---|
| protected java.util.ArrayList<java.lang.String> | buildTermIndex(boolean useIgnored): Create term index from the Lucene index. |
| protected void | buildTermIndexLowDf(java.util.ArrayList<java.lang.String> ignoredTerms): Create the hash map from the vector of low-df terms. |
| protected void | extract(): Initialise the corpus by extracting the files from the index. |
| java.util.Map<java.lang.Integer,java.lang.Integer> | getDocTerms(int doc): Get the document terms as a frequency map id->frequency. |
| private java.util.Vector<java.lang.Integer> | getDocWords(int doc, java.util.Random rand): Get the words of document doc as a scrambled sequence. |
| int[][] | getDocWords(java.util.Random rand): Get the documents as vectors of bag of words, i.e., per document, a scrambled array of term indices is generated. |
| int[] | getDocWords(java.lang.String string): Get the words of an unknown document as a scrambled sequence. |
| int | getNdocs(): Number of documents in corpus. |
| int | getNterms(): Number of terms in corpus. |
| int | getNwords(int doc) |
| boolean | isEmptyDoc(int doc): Whether this document is non-empty after filtering. |
| java.lang.String | lookup(int term): Get the string for the particular index, either from the regular index or from the lowDf index. |
| int | lookup(java.lang.String term): Get the index of the particular term, either from the regular index or from the lowDf index, which results in an index >= nTerms. |
| java.lang.String | lookupDoc(int doc): Get document name from id. |
| int | lookupDoc(java.lang.String docName): Get the document index of the document with string id docName. |
| private boolean | ok(java.lang.String string) |
| protected void | setupIndex(boolean useIgnored): Creates term map and counts. |
| void | writeCorpus(java.lang.String filebase): Write the corpus to the file. |
| void | writeDocList(java.lang.String file): Write the document titles in a file (one doc per line). |
| void | writeVocabulary(java.lang.String file, boolean sort): Write the vocabulary to the file. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
protected static final int INDEX_UNKNOWN
protected java.lang.String indexpath
protected org.apache.lucene.index.IndexReader ir
protected java.util.ArrayList<java.lang.Integer> emptyDocs
protected org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> termIndex
protected org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> termIndexLowDf
protected int minDf
protected int nTerms
protected int nTermsLowDf
protected java.lang.String contentField
protected java.lang.String docNamesField
private boolean useLowDf
| Constructor Detail |
|---|
public LuceneCorpus(java.lang.String indexPath,
java.lang.String docNameField,
java.lang.String contentField,
int minDf,
boolean useLowDf)
throws java.io.IOException
The content field must have been indexed.
The doc names field must have been stored.
Parameters:
    indexPath -
    docNameField -
    contentField -
    minDf -
    useLowDf -
Throws:
    java.io.IOException

public LuceneCorpus(java.lang.String path,
java.lang.String docNamesField)
throws java.io.IOException
Parameters:
    path -
    docNamesField -
Throws:
    java.io.IOException

| Method Detail |
|---|
protected void extract()
throws java.io.IOException
Throws:
    java.io.IOException

protected void setupIndex(boolean useIgnored)
throws java.io.IOException
Parameters:
    useIgnored -
Throws:
    java.io.IOException

protected java.util.ArrayList<java.lang.String> buildTermIndex(boolean useIgnored)
throws java.io.IOException
Parameters:
    useIgnored -
Throws:
    java.io.IOException

private boolean ok(java.lang.String string)
Parameters:
    string -

protected void buildTermIndexLowDf(java.util.ArrayList<java.lang.String> ignoredTerms)
Parameters:
    ignoredTerms -

public java.lang.String lookupDoc(int doc)
Specified by:
    lookupDoc in interface ITermCorpus

public int lookupDoc(java.lang.String docName)
Specified by:
    lookupDoc in interface ITermCorpus
Parameters:
    docName -

public java.util.Map<java.lang.Integer,java.lang.Integer> getDocTerms(int doc)
Specified by:
    getDocTerms in interface ITermCorpus

public int[][] getDocWords(java.util.Random rand)
Specified by:
    getDocWords in interface ITermCorpus
Parameters:
    rand - random number generator or null to use standard generator

private java.util.Vector<java.lang.Integer> getDocWords(int doc,
java.util.Random rand)
It seems that the getDocTerms... loop scales badly. Use LuceneMapCorpus for larger documents.
Parameters:
    doc -
    rand - random number generator or null to use standard generator

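How a frequency map of the kind returned by getDocTerms can be turned into a scrambled word sequence is sketched below. This is a minimal illustration using only the JDK, not the class's own code: the class name `DocWordsSketch`, the method name `scramble`, and the choice of `Collections.shuffle` are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical helper: expands id->frequency counts into a shuffled id sequence. */
final class DocWordsSketch {

    static int[] scramble(Map<Integer, Integer> termFreqs, Random rand) {
        // Repeat every term id as often as its frequency says.
        List<Integer> words = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : termFreqs.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                words.add(e.getKey());
            }
        }
        // A null generator falls back to a fresh default one, mirroring the rand parameter note.
        Collections.shuffle(words, rand != null ? rand : new Random());

        // Copy into a primitive array, the element type used by getDocWords(Random).
        int[] out = new int[words.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = words.get(i);
        }
        return out;
    }
}
```

The result has one entry per word occurrence, so its length equals the sum of all frequencies in the map; passing a seeded `Random` makes the ordering reproducible.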
public int[] getDocWords(java.lang.String string)
Parameters:
    string -

public java.lang.String lookup(int term)
Specified by:
    lookup in interface ITermCorpus
Parameters:
    term - term index

public int lookup(java.lang.String term)
Specified by:
    lookup in interface ITermCorpus
Parameters:
    term - string

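The two lookup directions over a regular and a lowDf index can be sketched with plain JDK maps. This is an illustration under assumptions, not the documented implementation: it uses `HashMap` instead of IBijectiveMap, the sentinel value -1 for `INDEX_UNKNOWN` is a guess (the page does not state the value), and returning the sentinel or null for unknown input is likewise assumed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical two-tier term index: regular ids first, lowDf ids from nTerms upwards. */
final class TwoTierLookupSketch {

    /** Sentinel for unknown terms; the value -1 is an assumption of this sketch. */
    static final int INDEX_UNKNOWN = -1;

    private final Map<String, Integer> termIndex = new HashMap<>();
    private final Map<String, Integer> termIndexLowDf = new HashMap<>();
    private final String[] byId;
    private final int nTerms;

    TwoTierLookupSketch(List<String> regular, List<String> lowDf) {
        nTerms = regular.size();
        byId = new String[nTerms + lowDf.size()];
        for (int i = 0; i < nTerms; i++) {
            termIndex.put(regular.get(i), i);
            byId[i] = regular.get(i);
        }
        // lowDf ids continue after the regular ids, so every lowDf id is >= nTerms.
        for (int i = 0; i < lowDf.size(); i++) {
            termIndexLowDf.put(lowDf.get(i), nTerms + i);
            byId[nTerms + i] = lowDf.get(i);
        }
    }

    /** Term string to id: try the regular index first, then the lowDf index. */
    int lookup(String term) {
        Integer id = termIndex.get(term);
        if (id == null) {
            id = termIndexLowDf.get(term);
        }
        return id != null ? id : INDEX_UNKNOWN;
    }

    /** Id to term string, or null when the id is out of range. */
    String lookup(int id) {
        return (id >= 0 && id < byId.length) ? byId[id] : null;
    }
}
```

Because both tiers share one id space, a caller can tell a lowDf term from a regular one simply by comparing the returned id with nTerms.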
public int getNdocs()
Specified by:
    getNdocs in interface ITermCorpus

public int getNterms()
Specified by:
    getNterms in interface ITermCorpus

public int getNwords(int doc)

public final boolean isEmptyDoc(int doc)
Parameters:
    doc -

public void writeDocList(java.lang.String file)
throws java.io.IOException
Parameters:
    file -
Throws:
    java.io.IOException

public void writeCorpus(java.lang.String filebase)
throws java.io.IOException
Parameters:
    filebase -
Throws:
    java.io.IOException

public void writeVocabulary(java.lang.String file,
boolean sort)
throws java.io.IOException
Parameters:
    file -
    sort - sorts the vocabulary in alphabetical order
Throws:
    java.io.IOException
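
The effect of the sort flag can be sketched as follows. The page does not specify the output file format, so writing one term per line in UTF-8 is an assumption, as are the class name `VocabularyWriterSketch` and the use of a `Path` plus a term list as inputs; this is not the class's actual writeVocabulary.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical vocabulary writer: one term per line (assumed format). */
final class VocabularyWriterSketch {

    static void writeVocabulary(Path file, List<String> termsById, boolean sort)
            throws IOException {
        // Work on a copy so the caller's id order stays untouched.
        List<String> lines = new ArrayList<>(termsById);
        if (sort) {
            // Alphabetical order; without sorting, the line number matches the term id.
            Collections.sort(lines);
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }
}
```

One consequence of this layout: an unsorted file doubles as an id-to-term table, whereas a sorted file is easier to read but loses the id correspondence.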