Running from the Source

Running Syntactic requires Java, at least 2 GB of RAM, and free disk space of roughly 10 times the size of the corpus. Got these? Follow the instructions:

  1. Download the source from this link.
  2. Compile the code into a .jar file. We call ours Syntactic.jar.
  3. Prepare your corpus in plaintext. We have some useful tools to help you with that.

The run command looks like this:

java -jar Syntactic.jar [name] [input folder] [output folder] [clusters] [threshold] [epsilon]

      • [name] is the corpus name. Use only alphanumeric characters and underscores (_).
      • [input folder] is the folder where the corpus is. By default, only .txt files are read.
      • [output folder] is the folder in which Syntactic will create the output root folder, named after the corpus and a timestamp. If it is set to Output/, Syntactic will place everything in Output/CorpusName dd.MM.yy
      • [clusters] is the number of resulting groups. Good results appear above 75. Running time grows polynomially with the number of clusters. Default is 50.
      • [threshold] is the minimum frequency a word must have in order to be clustered. Default is 50.
      • [epsilon] is the merge distance: clusters that are not separated from each other by at least this distance are merged. Suitable values vary significantly. Typical values are between 0.05 and 0.5.
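
The naming rules in the list above can be sketched in a few lines of Java. This is only an illustration, not code from Syntactic: the class RunArguments and its methods are hypothetical, and the folder layout (corpus name, a space, then the date as dd.MM.yy) is an assumption read off the Output/CorpusName dd.MM.yy example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Syntactic.
final class RunArguments {
    // Date pattern taken from the example folder name (dd.MM.yy).
    private static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("dd.MM.yy");

    // True if the name uses only ASCII letters, digits and underscores.
    static boolean isValidName(String name) {
        return name != null && name.matches("[A-Za-z0-9_]+");
    }

    // Assumed layout: <output folder>/<name> <dd.MM.yy>
    static String outputRoot(String outputFolder, String name, LocalDate date) {
        String base = outputFolder.endsWith("/") ? outputFolder : outputFolder + "/";
        return base + name + " " + date.format(STAMP);
    }

    public static void main(String[] args) {
        System.out.println(isValidName("simple_wiki"));   // prints "true"
        System.out.println(isValidName("simple-wiki"));   // prints "false"
        // prints "Output/simple_wiki 05.03.12"
        System.out.println(outputRoot("Output/", "simple_wiki", LocalDate.of(2012, 3, 5)));
    }
}
```

Checking the name before starting a long run is cheap, and a name limited to letters, digits and underscores is safe to embed in a folder name.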


Bonus: Converting Wikipedia into Plaintext

So far, the best plaintext corpora we have found are Wikipedia dumps. You can get them at Wikipedia’s Database Output page. Note that these files are huge. Once you have a dump, you can convert it to plaintext with one of these methods:

  • After the Deadline has a good blog post on doing so.
  • This Ruby script works quite fast.
  • The slow way (we know it is slow because we have used it several times): download this MediaWiki to XML converter by Università di Pisa. Running it produces an XML file. Clean it by removing everything that matches the regular expression <.+?>.
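
The cleaning step in the last method can be sketched in Java as follows. This is a minimal illustration rather than the code we used: the class TagStripper is hypothetical, and the sample input is made up. Only the regular expression <.+?> comes from the text above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Syntactic.
final class TagStripper {
    // Reluctant quantifier: each match stops at the nearest '>'.
    private static final Pattern TAG = Pattern.compile("<.+?>");

    // Removes every match of <.+?> from the input.
    static String strip(String xml) {
        return TAG.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        // prints "Hello world."
        System.out.println(strip("<doc id=\"1\"><p>Hello <b>world</b>.</p></doc>"));
    }
}
```

Two caveats apply to this sketch. By default, Java's . does not match line terminators, so a tag that spans several lines is left in place unless the pattern is compiled with Pattern.DOTALL. Character entities such as &amp; are not tags and also remain in the text.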