Formerly, the site knowceans.org served as a code repository. This is the mirror site and may become the only one in the future. Most of this code is GPL or LGPL.
latent dirichlet allocation in java:
- lda-j (version 20050325) is a Java 1.5 port of David Blei's lda-c.
- LdaGibbsSampler.java, a working "hack" of the MCMC algorithm for LDA in one Java class.
- See primer on parameter estimation for text
- lda.odc, a WinBUGS script to run LDA and an author-topic model with Gibbs sampling.
- See WinBUGS.
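
To make the Gibbs sampling idea behind the items above concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA with symmetric Dirichlet priors. It is not the code from LdaGibbsSampler.java or lda-j; the class name TinyLdaGibbs and all field and method names are assumptions made for this illustration.

```java
import java.util.Random;

/** Illustrative collapsed Gibbs sampler for LDA; every name here is hypothetical. */
class TinyLdaGibbs {
    final int[][] docs;        // docs[m][n]: term index of the n-th word in document m
    final int V, K;            // vocabulary size and number of topics
    final double alpha, beta;  // symmetric Dirichlet hyperparameters
    final int[][] z;           // z[m][n]: topic currently assigned to that word
    final int[][] nmk;         // nmk[m][k]: words in document m assigned to topic k
    final int[][] nkt;         // nkt[k][t]: times term t is assigned to topic k
    final int[] nk;            // nk[k]: total number of words assigned to topic k
    final Random rng;

    TinyLdaGibbs(int[][] docs, int V, int K, double alpha, double beta, long seed) {
        this.docs = docs;
        this.V = V;
        this.K = K;
        this.alpha = alpha;
        this.beta = beta;
        this.rng = new Random(seed);
        this.z = new int[docs.length][];
        this.nmk = new int[docs.length][K];
        this.nkt = new int[K][V];
        this.nk = new int[K];
        // Start from a uniformly random topic assignment.
        for (int m = 0; m < docs.length; m++) {
            z[m] = new int[docs[m].length];
            for (int n = 0; n < docs[m].length; n++) {
                z[m][n] = rng.nextInt(K);
                count(m, n, +1);
            }
        }
    }

    /** Adds delta to all count statistics touched by word n of document m. */
    private void count(int m, int n, int delta) {
        int k = z[m][n];
        nmk[m][k] += delta;
        nkt[k][docs[m][n]] += delta;
        nk[k] += delta;
    }

    /** One sweep: resample the topic of every word from its full conditional. */
    void sweep() {
        double[] p = new double[K];
        for (int m = 0; m < docs.length; m++) {
            for (int n = 0; n < docs[m].length; n++) {
                count(m, n, -1); // exclude the current word from the counts
                int t = docs[m][n];
                double sum = 0.0;
                for (int k = 0; k < K; k++) {
                    p[k] = (nmk[m][k] + alpha) * (nkt[k][t] + beta) / (nk[k] + V * beta);
                    sum += p[k];
                }
                // Draw a topic index proportionally to the unnormalised weights p.
                double u = rng.nextDouble() * sum;
                int k = 0;
                double acc = p[0];
                while (u >= acc && k < K - 1) {
                    k++;
                    acc += p[k];
                }
                z[m][n] = k;
                count(m, n, +1);
            }
        }
    }

    /** Point estimate of the topic-term distributions from the current counts. */
    double[][] phi() {
        double[][] phi = new double[K][V];
        for (int k = 0; k < K; k++)
            for (int t = 0; t < V; t++)
                phi[k][t] = (nkt[k][t] + beta) / (nk[k] + V * beta);
        return phi;
    }
}
```

A typical run would construct the sampler from an array of term-index documents, call sweep() for a burn-in period, and then read phi(). Real implementations usually also estimate the document-topic distributions and average over several samples, which this sketch leaves out.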
markov graph clustering in java:
- knowceans-mcl (version 20060805) provides a Java implementation of Markov graph clustering (MCL), which finds hard clusters in a graph.
- See the javadoc.
- See Stijn van Dongen's (2000) PhD thesis.
- See the faster (but much more complex) C implementation.
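
As a rough illustration of how MCL arrives at hard clusters, the sketch below alternates expansion (squaring the column-stochastic matrix) and inflation (element-wise powering followed by column renormalisation) on a dense matrix. It is a simplified sketch, not the knowceans-mcl API: the class TinyMcl, its method names, the convergence tolerance, and the rule for reading off cluster labels are all assumptions, and real implementations use sparse matrices with pruning.

```java
/** Illustrative dense-matrix Markov clustering; every name here is hypothetical. */
class TinyMcl {

    /** Rescales each column in place so that it sums to 1. */
    static void normalizeColumns(double[][] m) {
        int n = m.length;
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int i = 0; i < n; i++) s += m[i][j];
            if (s > 0.0)
                for (int i = 0; i < n; i++) m[i][j] /= s;
        }
    }

    /** Expansion step: returns the matrix product m * m. */
    static double[][] square(double[][] m) {
        int n = m.length;
        double[][] r = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                if (m[i][k] == 0.0) continue;
                for (int j = 0; j < n; j++) r[i][j] += m[i][k] * m[k][j];
            }
        return r;
    }

    /**
     * Clusters an undirected graph given as a symmetric adjacency matrix.
     * Returns one label per node: the index of the first attractor row that
     * holds non-negligible mass in the node's column.
     */
    static int[] cluster(double[][] adjacency, double inflation, int maxIter) {
        int n = adjacency.length;
        double[][] m = new double[n][n];
        for (int i = 0; i < n; i++) {
            m[i] = adjacency[i].clone();
            m[i][i] = 1.0; // add self-loops, as is customary for MCL
        }
        normalizeColumns(m);
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] next = square(m); // expansion
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    next[i][j] = Math.pow(next[i][j], inflation); // inflation
            normalizeColumns(next);
            double diff = 0.0;
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    diff = Math.max(diff, Math.abs(next[i][j] - m[i][j]));
            m = next;
            if (diff < 1e-9) break; // assumed tolerance for this sketch
        }
        int[] label = new int[n];
        for (int j = 0; j < n; j++) {
            label[j] = j; // fallback, normally overwritten below
            for (int i = 0; i < n; i++)
                if (m[i][j] > 1e-6) {
                    label[j] = i;
                    break;
                }
        }
        return label;
    }
}
```

Calling cluster with an inflation of 2.0 is a common starting point; larger inflation values tend to produce more and smaller clusters.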
adaptive rejection sampling in java:
- arms-java (version 20060516) provides a Java port of the adaptive rejection Metropolis sampler (ARMS), which can sample from virtually any univariate distribution.
- See the javadoc.
- See the original C/Fortran implementation by Wally Gilks.
- See the CVS of the SourceForge project knowceans.
- Samplers and densities / likelihood functions of various probability distributions as well as a Java port of the Mersenne Twister random generator can be found in the package knowceans-tools.jar (see below).
java dataset manipulation:
- NEW: knowceans citeseer-fetcher (version 20100406), simple Java code to construct a corpus from the OAI2 site of the CiteSeerX digital library. This rather quickly written code is assumed to be LGPL. It does not yet clean the high number of duplicates in the document titles and near-duplicates in the author names (for which I plan to add code later).
- See the javadoc.
- Some of the code requires the knowceans-tools package below.
some java base classes:
- NEW: knowceans-tools (version 20100406), many Java helper classes I frequently use: a command-line parser, a runtime stopwatch, some statistical distributions, estimators and samplers, helpers for vectors and matrices, Perl-like regular expression usage (which reduces Java coding), a thread pool, special invertible, regex and many-to-many implementations of the Map interface, data output formatters specialised for command-line output (like histograms and dot-encoded numbers), and many more.
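
To give an idea of what an invertible map can look like, here is a minimal sketch that keeps a forward and a backward HashMap in sync so that lookups work in both directions. It is not the class shipped in knowceans-tools: the name TinyInvertibleMap and its methods are assumptions, it does not implement the full Map interface, and it assumes non-null keys and values.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative one-to-one map with lookup in both directions; names are hypothetical. */
class TinyInvertibleMap<K, V> {
    private final Map<K, V> forward = new HashMap<>();
    private final Map<V, K> backward = new HashMap<>();

    /**
     * Associates key with value. Any previous pairing of either the key or
     * the value is removed first, so the mapping stays one-to-one.
     * Returns the value previously stored under key, or null.
     */
    V put(K key, V value) {
        V oldValue = forward.remove(key);
        if (oldValue != null) backward.remove(oldValue);
        K oldKey = backward.remove(value);
        if (oldKey != null) forward.remove(oldKey);
        forward.put(key, value);
        backward.put(value, key);
        return oldValue;
    }

    /** Forward lookup: the value stored under key, or null. */
    V get(K key) {
        return forward.get(key);
    }

    /** Inverse lookup: the key that maps to value, or null. */
    K getKey(V value) {
        return backward.get(value);
    }

    int size() {
        return forward.size();
    }
}
```

The design choice here is to trade memory for speed: storing every pair twice makes both lookup directions constant-time on average, at the cost of having to evict stale pairs on every put.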