Syntactic is an open source, unsupervised lexical categorization project.
The Syntactic Engine is a program that scans huge texts and sorts words into categories, without human intervention. In essence, it attempts to learn concepts of language autonomously.
The Syntactic Visualization displays the algorithm’s decisions in a human-readable way. It looks like this:
From a scan of the Simple English Wikipedia, using 100 clusters, Syntactic produced the following categorizations:
has, contains, provides, produces, includes, involves, takes, comprises, owns, covers, supports, adds, attracts, absorbs, carries, affects
people, students, players, scientists, women, birds, men, soldiers, americans, characters, peoples
after, before, upon, despite, beside, beneath, since, ago
two, three, four, six, five, seven, ten, nine, eight, twelve, various, multiple
may, november, april, june, february, january, december, july, march, october, august, september
north, south, southeast, middle, east, southwest, northwest, west, northeast, bottom
often, usually, generally, normally, typically, sometimes, frequently, now, currently, traditionally, always, rarely, actually, commonly, obviously, presently
How it Works
By reading huge texts (e.g. the Italian Wikipedia), Syntactic attempts to find correlations between the ways we use different words. If we say
“A banjo string”
a lot, and also
“A guitar string”
quite often, then Syntactic will guess that Guitar and Banjo belong in the same category.
How it really works
What Syntactic really does is divide words into groups, using statistics to guess what goes where. In order to find out how similar words are, we examine their use.
Let’s look at the words Between and Under. Obviously they’re very similar – you would say
“Put the cat between the pots”
“Put the cat under the sofa”
so the context we’re looking for,
cat ____ the
is expected to correlate quite well between the words.
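This context matching can be sketched with a toy count of (left, right) neighbor pairs. This is a simplified illustration, not the engine's actual data structures:

```python
from collections import Counter, defaultdict

def context_counts(tokens):
    """Count (left, right) context pairs for every word in a token list."""
    counts = defaultdict(Counter)
    for i in range(1, len(tokens) - 1):
        counts[tokens[i]][(tokens[i - 1], tokens[i + 1])] += 1
    return counts

tokens = "put the cat between the pots put the cat under the sofa".split()
counts = context_counts(tokens)
# "between" and "under" share the context ("cat", "the")
shared = set(counts["between"]) & set(counts["under"])
print(shared)  # {('cat', 'the')}
```

Words that share many such contexts are candidates for the same category.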
This is exactly the kind of similarity that Syntactic is looking for. So the program builds a table of the common words in the text and creates a little (actually, enormous) scorecard for each word. To compare two words, we use a method called the Kullback-Leibler Information Correlation, or KLIC, based on the Kullback-Leibler divergence. Here's how it works.
Suppose we have scorecards for two different words, Queue and Line. ‘A long queue’ appears a lot, and so does ‘a long line’. Much of the time we could even say line instead of queue, so we’d like to mark these words as similar. Any word that has good scores in the same squares as queue is probably a good replacement for queue in a sentence; in other words, queue is similar to it, so we mark queue as close to that word. If queue has some use that the other word doesn’t, well, then they’re not that close.
Careful here: Queue and line are not synonymous. You wouldn’t say
“draw a queue between these two dots”
because that takes forever. That is exactly a use that line has and queue doesn’t, so line is not similar to queue.
So to sum up, the KLIC rewards a word for having similar scores in the same places as our chosen word, and punishes it where it doesn’t. That’s why queue is close to line, while line is quite far from queue.
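A minimal sketch of this asymmetric comparison, using the Kullback-Leibler divergence with toy numbers invented for illustration:

```python
import math

def kl(p, q, eps=1e-9):
    """Kullback-Leibler divergence D(p || q); asymmetric by design."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy scorecards over three contexts, e.g. ("a long _", "wait in _", "draw a _");
# the numbers are hypothetical usage frequencies, not real data.
queue = [0.6, 0.4, 0.0]  # "queue" never appears in "draw a _"
line  = [0.5, 0.3, 0.2]  # "line" appears in all three contexts

print(kl(queue, line))  # small: every use of "queue" is also a use of "line"
print(kl(line, queue))  # large: "draw a line" has no counterpart for "queue"
```

The asymmetry is the point: queue comes out close to line, while line comes out far from queue, just as described above.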
Now that we know the distance from word A to word B, let’s think about the distance between a word and a group. That’s easy: make a scorecard for the entire group (the average of the scorecards of its members) and measure the distance to it. So the distance from Between to “Beside, Opposite, Under, Over, Inside” is quite small, even though the opposite isn’t necessarily true.
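The group scorecard described above can be sketched as a simple element-wise average, with toy context counts invented for illustration:

```python
def average(scorecards):
    """Cluster scorecard: the element-wise mean of the members' scorecards."""
    n = len(scorecards)
    return [sum(col) / n for col in zip(*scorecards)]

# hypothetical context counts for three prepositions over the same contexts
beside, under, inside = [5, 3, 2], [4, 4, 2], [3, 5, 2]
cluster = average([beside, under, inside])
print(cluster)  # [4.0, 4.0, 2.0]
```

The distance from a word to the cluster is then just the ordinary word-to-word distance, measured against this averaged scorecard.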
So at first, we take the most common words in the text and throw them into groups (let’s call them clusters). We use the most common words because they’re usually also quite general. Then we measure the distance between each of the remaining words and these clusters, take the closest words, and add them to their target clusters.
This process repeats a few times, until most of the words have been categorized.
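Put together, one assignment step might look like this minimal sketch; the cluster names and scorecard numbers are invented for illustration:

```python
import math

def kl(p, q, eps=1e-9):
    """D(p || q): distance from a word's scorecard p to a cluster's scorecard q."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def assign(word_cards, cluster_cards):
    """Attach each uncategorized word to its nearest cluster."""
    return {word: min(cluster_cards, key=lambda name: kl(card, cluster_cards[name]))
            for word, card in word_cards.items()}

# Hypothetical scorecards over two contexts, e.g. ("a long _", "_ the sofa");
# the cluster names and numbers are made up for this example.
clusters = {"noun-like": [0.9, 0.1], "preposition-like": [0.1, 0.9]}
words = {"queue": [0.8, 0.2], "under": [0.2, 0.8]}
print(assign(words, clusters))  # {'queue': 'noun-like', 'under': 'preposition-like'}
```

Repeating this step, with cluster scorecards recomputed as new members arrive, gradually categorizes most of the vocabulary.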
On the visualization page, you will find a list of words and a list of clusters. A click on a word loads its scorecard onto the monitor, and a click on a cluster loads the cluster’s scorecard onto the monitor just below it.
The Syntactic Engine is open source. Fork it on GitHub.
- Develop. You can help develop the project. Grab a task on GitHub.
- Run. You can run Syntactic on your own server or PC. Check out the running instructions page to see how to run it and how to mash your corpus into plaintext. If you want to post the results to the visualization engine here, head over to the contact form or email Omer Shapira.
- Include. The Syntactic Engine is free under the MIT license, which means anyone can use it for anything, with attribution as the only requirement. If you know a project that could use an engine like Syntactic, feel free to add it.
The Syntactic Engine was written by Omer Shapira during a seminar on learnability taught by Roni Katzir in the Linguistics Department at Tel Aviv University. Syntactic is based on an algorithm described by Alexander Clark.