Link Grammar Parser

by Davy Temperley, John Lafferty and Daniel Sleator
(this variant maintained by Dom Lachowicz - <domlachowicz@gmail.com> and Linas Vepstas - <linasvepstas@gmail.com> )

News

November, 2009: link-grammar 4.6.5 released! See below for a description of recent changes.

What is the Link Grammar?

The Link Grammar Parser is a syntactic parser of English (and other languages as well), based on link grammar, an original theory of English syntax. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words. The parser also produces a "constituent" (Penn tree-bank style phrase tree) representation of a sentence (showing noun phrases, verb phrases, etc.). The RelEx extension provides dependency-parse output.
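
As a rough illustration of what "a set of labeled links connecting pairs of words" means, the Python sketch below models a linkage as a list of (left word, right word, label) triples. The Link class, the render function and the example links are assumptions made for illustration only; they are not the parser's API or its actual output. The labels D, S and O follow the link-naming convention described in the dictionary documentation (determiner, subject, object).

```python
from typing import List, NamedTuple


class Link(NamedTuple):
    """One labeled link between two words, identified by position."""
    left: int    # index of the left word
    right: int   # index of the right word
    label: str   # link type, e.g. "S" for subject-verb


def render(words: List[str], links: List[Link]) -> List[str]:
    """Return one human-readable line per link, ordered left to right."""
    ordered = sorted(links, key=lambda l: (l.left, l.right))
    return [f"{words[l.left]} --{l.label}-- {words[l.right]}" for l in ordered]


# Hand-written example, not produced by the parser.
words = ["the", "cat", "chased", "a", "mouse"]
links = [
    Link(0, 1, "D"),  # determiner "the" to noun "cat"
    Link(1, 2, "S"),  # subject "cat" to verb "chased"
    Link(2, 4, "O"),  # verb "chased" to object "mouse"
    Link(3, 4, "D"),  # determiner "a" to noun "mouse"
]

for line in render(words, links):
    print(line)
```

Each word pair appears once with its label; the real parser attaches further subscripts to the labels and draws the links as arcs above the sentence.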

Did the AbiWord team write Link Grammar?

In large part, no. The project is the brainchild of Davy Temperley, John Lafferty and Daniel Sleator, all university professors. It is the product of a decade of academic research into grammar, and is founded on a theory backed by numerous publications. Its canonical homepage is hosted by Carnegie Mellon University.

So, then what is it doing @ AbiSource.com?

The AbiWord team had a concrete need - to integrate a grammar checking feature into AbiWord. The best choice, they felt, was to build upon Temperley et al.'s successful Link Grammar project.

However, in order for the link-grammar project to be useful to them and to the greater Free Software world, the AbiWord community felt that a variety of changes to the project would be necessary. While they did have success (a few years ago) convincing the authors to release Link Grammar under a GPL-compatible license, there was no practical way to continue project development and maintenance at the CMU website. So the AbiWord community took it under its wing and has nurtured the project since.

Ongoing development by OpenCog

Ongoing development of link-grammar is being primarily guided by the Open Cognition project, where the parser plays an important role in the OpenCog natural language processing subsystem. Research and implementation is ongoing; current work includes investigations into statistically guided parse ranking, grammatically induced word-sense disambiguation using statistical results from the Mihalcea all-words WSD algorithm, and work on automatically learning new parse rules based on corpus statistics.

A sibling project, RelEx, uses constraint-grammar-like techniques to extract dependency relations and assorted additional linguistic information, including FrameNet-style framing and reference (anaphora) resolution. The dependency output is similar to that of the Stanford parser. Its performance is comparable to that of the Stanford PCFG parsing model, and it is more than three times faster than the Stanford "lexicalized" (factored) model.

The NLGen and NLGen2 projects provide natural language generation modules, based on, and compatible with, link-grammar and RelEx. They implement the SegSim ideas for NL generation. See the following NLGen demos: Demo of Virtual Dog Learning to Play Fetch via Imitation and Reinforcement, AI Virtual Dog's Emotions Fluctuate Based on Its Experiences, Demo of Embodied Anaphor Resolution, and AI Virtual Dog Answers Simple Questions about Itself and Its Environment.

Notable changes from the upstream Link Grammar package include:

  • Actively maintained! New releases typically every few months.
  • Numerous bug fixes and performance improvements; expanded dictionaries with thousands of new words; improved parse coverage for a wide variety of constructions.
  • Merger of BioLG project changes, for improved parsing of biomedical text.
  • New bindings, including Ruby, Python, Perl, Java and OCaml.
  • Support for UTF8 Unicode; Arabic and Persian dictionaries; prototype German dictionary.
  • Multi-threading support; a standard build system; pkg-config integration; dynamic/shared library support; fixes for non-Linux platforms such as Windows, MacOSX and FreeBSD.

Downloading Link Grammar

The system can be downloaded either as a tarball, or via SVN. The current stable version is Link Grammar 4.6.5 (November, 2009). Older versions are available here.

Unstable, development versions are available through AbiWord's SVN repository. Anonymous read-only access is available by issuing the command:

svn co http://svn.abisource.com/link-grammar/trunk link-grammar

General instructions for AbiWord's anonymous SVN can be found here.

The Link Grammar source can be browsed online here.

Documentation

One of the best ways to obtain a solid, easy-to-understand overview of the parser is to review the original papers describing it, here, here, here and here. There is an extensive set of pages documenting the dictionary; specifically, the names of links and their meanings, as well as how to write new rules. There is also a short primer for creating dictionaries for new languages. The documentation for the programming API is here. Documentation for additions made in the 4.0 release is on the improvements page. A fairly comprehensive bibliography of papers written before 2004 is here.

Mailing Lists

The current list for Link Grammar discussion is the link-grammar Google group.

Bug Tracker

Bug reports, patches, RFEs, etc. are gladly welcomed.

Disclaimer

Link grammar is a natural language parser, not an artificial intelligence. This means that there are many sentences that it cannot parse correctly, and many others for which it generates multiple parses. There are also entire classes of speech that it cannot parse, such as Valley-girl speak. Link grammar does best on "newspaper English": medium-length sentences written with good grammar, proper punctuation, and proper capitalization. It don't do 733t speek, etc. In particular, it has problems with the following "registers" and types of writing:

  • Phrases (that are not a part of a complete sentence)
  • Bulleted lists, such as this.
  • Quotations within sentences (and parenthetical remarks). These can be handled by an appropriate front-end that separates out the quotations from the rest of the text.
  • Slang speech, words, like 733t warez d00dz, although it can certainly guess from context if the slang is sufficiently grammatical.
  • Long run-on sentences. These can generate thousands of alternative parses in a combinatorial explosion.
  • Certain "registers", such as newspaper headlines; for example, "Thieves rob bank."
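
The front-end mentioned above for quotations could look something like the following Python sketch. It is only an illustration under stated assumptions: the function name split_quotations is hypothetical, only straight double quotes are handled, and the text left over after removing a quotation may itself no longer be a complete sentence, so further clean-up may be needed before it is handed to the parser.

```python
import re
from typing import List, Tuple

# Matches text between a pair of straight double quotes (assumption:
# no nested or typographic quotes).
QUOTED = re.compile(r'"([^"]*)"')


def split_quotations(text: str) -> Tuple[str, List[str]]:
    """Separate quoted spans from the surrounding text.

    Returns the text with each quotation removed, plus the list of
    quotations in order, so that each piece can be parsed on its own.
    """
    quotations = QUOTED.findall(text)
    remainder = QUOTED.sub(" ", text)
    # Collapse the whitespace left behind by the removed spans.
    remainder = re.sub(r"\s+", " ", remainder).strip()
    return remainder, quotations


outer, quoted = split_quotations('He said "I will be late" and left.')
print(outer)
print(quoted)
```

The quotation and the surrounding text can then be parsed separately, which avoids asking the parser to link words across the quotation boundary.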

In addition, it has a variety of "bugs": it currently has trouble with "if...then..." constructs, compound queries ("who did it, and why?"), lists, "...not only...but also..." constructs, certain types of idiomatic phrases, certain types of "institutional utterances", and so on. The goal of the project is to eventually fix all of these cases; progress is ongoing.


Adjunct Projects

RelEx Semantic Relation Extractor
RelEx is an English-language semantic relationship extractor, built on the Carnegie-Mellon link parser. It can identify subject, object, indirect object and many other relationships between words in a sentence. It will also provide part-of-speech tagging, noun-number tagging, verb tense tagging, gender tagging, and so on. RelEx includes a basic implementation of the Hobbs anaphora (pronoun) resolution algorithm. Optionally, it can use GATE for entity detection.
Ruby bindings
There are two different packages providing Ruby bindings: Ruby Link Grammar, which is up-to-date and currently maintained, and Link Grammar 4 Ruby, which is wildly out-of-date (it's for version 4.2.2) and is unmaintained. You only need one!
Python bindings (New!)
New Python bindings are in development. Development snapshots are available on Launchpad. Install instructions are here.
Perl bindings (New!)
The Perl bindings, created by Danny Brian, have been updated. See the Lingua-LinkParser page on CPAN. There is also a tutorial written against an older version of the bindings; some details may be different.
Objective Caml bindings
OCaml interface to Link Grammar
.Net Framework bindings
.Net interface to Link Grammar from Leonard Chalk/ProAI.
Alternative Java bindings
Another, completely different set of Java bindings has been developed: a tarball is here. These are for the old version 4.1 only. Note that these are not compatible with the bindings that ship, by default, with the main link-grammar package.
Persian dictionaries
Persian dictionaries, by Jon Dehdari. These require the Persian stemming engine, as significant morphology analysis needs to be performed to parse Persian.
Arabic dictionaries
Arabic dictionaries, by Jon Dehdari. [download] These require the Aramorph stemming package, which is included.
French dictionary, Luthor
The Luthor project aims to develop a set of scripts to automatically construct Link Grammar linkage dictionaries by mining Wiktionary data. Current efforts are focusing on French.
Russian parser
Located at http://slashzone.ru/parser/. By Sergey Protasov. Includes link documentation and subscript (morphology) documentation. Russian morpheme dictionaries can be had at http://aot.ru.
English dictionary extensions
LinkGrammar-WN is a lexicon expansion for the English language Link Grammar Parser. This project adds 14K new words to the dictionaries. The extended lexicon is provided under the GPL license, and thus cannot be merged back into the current project.
Medical Text Analysis
The MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) Clinical Decision Making Group has done work to extend the Link Grammar dictionaries by adding many new words. All but the six largest of these dictionaries have been merged into link-grammar, since version 4.3.1. The large dictionaries EXTRA.2, EXTRA.3, EXTRA.8, EXTRA.9, EXTRA.12, and EXTRA.17 have not been merged. These dictionaries contain 180K assorted medical, biological and biochemical terms and phrases.
BioLG
The BioLG project is a modification of the Link Grammar Parser adapted for the biomedical domain, as described in Lexical Adaptation of Link Grammar to the Biomedical Sublanguage: a Comparative Evaluation of Three Approaches (Sampo Pyysalo, Tapio Salakoski, Sophie Aubin and Adeline Nazarenko; BMC Bioinformatics 2006). Almost all of the BioLG changes have been merged back into the main line, as of version 4.5.0 (April 2009), with scattered bug-fixes after that.

Of related interest

Genia tagger
The Genia tagger is useful for named entity extraction. BSD license source.

Recent Applications and Publications

Some recent uses and applications of the Link Grammar Parser are shown below. There is also an older bibliography on the CMU website (mirror) referencing several dozen papers pertaining to the Link Grammar Parser.

Some miscellaneous facts:

  • Any categorial grammar can be easily converted to a link grammar; see section 6 of Daniel Sleator and Davy Temperley. 1993. "Parsing English with a Link Grammar." Third International Workshop on Parsing Technologies.
  • Link grammars can be learned by performing a statistical analysis on a large corpus: see John Lafferty, Daniel Sleator, and Davy Temperley. 1992. "Grammatical Trigrams: A Probabilistic Model of Link Grammar." Proceedings of the AAAI Conference on Probabilistic Approaches to Natural Language, October, 1992.

Recent Changes

Version 4.6.5 (3 November 2009)

  • Fix: Superlatives without preceding determiners ("... likes you best")
  • Fix: Take more care in distinguishing mass and count nouns.
  • Fix: Old bug w/relative clauses: Rw+ is optional, not mandatory.
  • Provide tags identifying relative, superlative adjectives.
  • Remove BioLG NUMBER-AND-UNIT handling, it's been superseded.
  • Fix handling of parenthetical phrases/clauses.
  • Fix: handling of capitalized first words ending in letter "s".
  • Fix: support "filler-it" SF link for "It was reasoned that..."
  • Fix: certain WH-word constructions: "I did not know why until recently"
  • Fix: go: "there goes the greatest guy ever"
  • Fix: opening coordinating conjunctions: "And you can also ..."
  • Configurable Hunspell spell-checker dictionary location.
  • Fix: Misc ordinal usage.
  • Add support for aspell spell-checker.

Version 4.6.4 (11 October 2009)

  • Restore nouns starting w/letters x-z, elided in version 4.5.9 ff.
  • Add support for single-word interjections/exclamations!
  • Fix: sometimes command line client fails to show all valid linkages.
  • Misc fixes: such_that, upon, acted.v
  • Fix: impersonal "be" linking to passive participle.
  • Fix: handling of capitalized first words.
  • Fix: duplication of certain parses involving transitive verbs.

Version 4.6.3 (4 October 2009)

  • Fix compilation bug on FreeBSD.
  • Fix: allow MX link to post-nominal ", to be ..., "
  • Fix: add idiom "time and again"
  • Fix: another BioLG regression in handling of possessives.
  • Fix: handling of period at end of number at end of sentence.
  • Fix: Capitalized words ending in s at start of sentence.
  • Use corpus-statistics-based ranking by default, if available.
  • Fix difficulties in build of corpus statistics module.

Version 4.6.2 (21 September 2009)

  • Fix: "come across as authoritative".
  • Improve java location guessing in FreeBSD
  • Fix for assert triggered by long sentences.
  • Fix: long sequence of periods treated as unknown word.
  • Add informational print showing dictionary location on startup.
  • Remove duplicated {@MV+} in tend.v
  • Automatically resize the display size to fit the current window size.
  • Fix handling of punctuation at the end of a capitalized word.
  • Fix misc verbs acting as adjectival modifiers: e.g. "given", "allied"
  • Fix bug in BioLG code regarding the handling of possessives.
  • Fix a (rare) crash in sentences with many conjunctions.
  • Fix a crash involving long sequences of UTF8 punctuation marks.

Version 4.6.1 (31 August 2009)

  • Stop printing annoying warning when !vars are used.
  • Fix missing dict file units.2 problem
  • Fix compilation bug on FreeBSD.

Version 4.6.0 (29 August 2009)

  • Avoid use of bzero, add missing include directives (MacOSX problem)
  • Reclassify a number of "medical" prepositions as adverbs.
  • Add approx 100 adverbs & 300 adjectives.
  • Add approx 250 verbs.
  • Add approx 300 nouns.
  • Add misc units.
  • Add misc European connector words/patronymics.
  • Reclassify 100's of transitive verbs as optionally-transitive.
  • Add distinct tokenization step ("sentence_split") to public API. This last change forces the minor-version-number bump.

Version 4.5.10 (25 August 2009)

  • Be sure to link with -lm

Version 4.5.9 (25 August 2009)

  • Modify error messages to indicate that they are from link-grammar.
  • Add missing Java files that were forgotten last time around.
  • Add greeting to command-line client startup.
  • Print disjunct cost also, when requesting disjunct printing.
  • Add missing color names as mass nouns.
  • Fix: Reclassify musical instruments: "He plays piano"
  • Add experimental word-clustering system.
  • Add CMake build file
  • Fix: "It takes longer than that."
  • Fix: "He has done very well."
  • Fix: a dozen optionally transitive verbs (swim, kill, etc.)
  • Fix: "He's out running."
  • Fix: "suddenly" is a "manner adverb", not a clausal adverb.
  • Fix: Use Pg links to gerunds: "He feared hitting the wall."
  • Fix: assorted numerical-range bugs.
  • Fix: prep modifiers with distances: "It is a few miles out"
  • Fix: Spelled-out dates: "It started in nineteen twelve"
  • Fix: Misc date, time expression parsing e.g "Zero hour is here."
  • Fix: Misc words, "ordered list", "screened out"
  • Fix: Post-fixed numbers can act as determiners.
  • Fix: "We bought the last 50 ft. of cable."
  • Fix: opening directives to imperatives: "Finally, move it back."
  • Fix: Improved simple equation parsing support.
  • Fix: Add misc fixes from BioLG that were previously overlooked.
  • Fix: "favorite" can take determiner "a" ("a favorite place")
  • Fix: assorted clausal complements: "The emperor ordered it done."
  • Fix: ordinals: "First on our list is ..."
  • Fix: verb modifier "some of the time", "most places"
  • Fix: Sit, stand take modifiers: "he stood still"

Version 4.5.8 (2 July 2009) includes the following changes:

  • Fix: 'than anticipated', 'than was anticipated', etc.
  • Fix: 'saw the wood'
  • Fix: sometimes commas are used as if they were semicolons.
  • Fix: 'We have quite enough work already, thank you!'
  • Fix: allow 'and' as conjunction in entity names.
  • Fix: 'I stared him down', 'They shouted him down', 'booed off'
  • Fix: 'sound him out', 'look him over'
  • Fix: 'Somewhere in the distance'
  • Stub out list of names given to both men and women, to avoid duplication.
  • Fix: 'I think so, too'
  • Fixes for compilation under Cygwin.
  • From Boris Iordanov: fixes to JSON java code.
  • From Boris Iordanov: new java remote client code.
  • Fix: Biological texts commonly have adj-noun-adj-noun chains

Version 4.5.7 (4 June 2009) includes the following changes:

  • Fix 'make install' for windows (abi bug 12049)
  • Fix multi-threaded bug when TRACK_SPACE_USAGE is defined.
  • Add './configure --enable-mudflap' just for fun...
  • Fix: "Walk tall", "Think quick"
  • Fix: "... part no. 1234-56A"
  • Fix regression from BioLG merge: "It cost $14 million."
  • Fix come/came: "The dog came running..."
  • Fix year abbreviations: "He drove a souped-up '57 Chevy"
  • Fix sit, stand: "The dog stood still"
  • Fix act up, act out: "He is acting out." "The motor is acting up."
  • Fix notoriously, poorly: "The store was poorly stocked".
  • Fix: "strong" can be adverb
  • Add support for recognizing basic time zones during parsing.
  • Fix: verbs acting as adjectival modifiers: "a very politicized deal."
  • Fix: ".. nearly so well", "...almost so well".
  • Fix financial ranges: "It will cost $10 million to $20 million to build."
  • Expand handling of capitalized words that appear in entity names.
  • Expand the list of characters that are recognized as quotes.
  • Support usage of yes, no as sentence openers.
  • Better support for directives, commands.
  • Fix: "Ash Wednesday", "Fat Tuesday", etc.
  • Fix: post-verbal adj: "she wiped the table dry"
  • Fix: wish: "she wished me a happy birthday"

Version 4.5.6 (24 May 2009) includes the following changes:

  • Bugfix: fix non-thread-safe usage.
  • Changes to enable MinGW/Windows to compile.
  • Update of MSVC6 build files
  • Fix: pizza, fries, chopsticks.
  • Export word-sense database to Java apps.
  • Fix: "Was the man drunk or crazy or both?"

Version 4.5.5 (10 May 2009) includes the following changes:

  • Bugfix: crash for zero-length sentences.

Version 4.5.4 (9 May 2009) includes the following changes:

  • Fix: "sleep in": "A bed is something you sleep in."
  • Fix: "drinking": "Let's go drinking."
  • Fix go+bare infinitive: "Let's go shop", "Let's go swim"
  • Fix: "Let's go for a swim." "Let's go for a smoke".
  • Fix: "Let's not" "Let's not go" "Let's not cry"
  • Fix: ... is : "All he ever does is complain."
  • Fix: "You will die young/happy/unhappy"
  • Fix: "You should exercise to stay fit."
  • Fix: "We danced 'til dawn."
  • Fix: "tell off": "She had told him off."
  • Bugfix: sometimes spell checker would run even if turned off.

Version 4.5.3 (14 April 2009) includes the following changes:

  • Haste makes waste! Revert a recent 'fix'.

Version 4.5.2 (14 April 2009) includes the following changes:

  • Use re-entrant version of mbtowc in all code.
  • Fix run-time breakage on Mac OSX and FreeBSD.

Version 4.5.1 (13 April 2009) includes the following changes:

  • Fix Assertion failed: negative constituent length!
  • Fix build break for Mac OSX.
  • Force use of UTF-8 locale in the command-line tool.

Version 4.5.0 (10 April 2009) includes the following changes:

  • Hack around missing SQLite3 pkgconfig on MacOS
  • Fix adverbs: 'The motor ran hot', 'the door swung wide open', etc.
  • Fix: 'at risk of breakdown', 'under threat of fire'
  • Add regular-expression-based word guessing, from BioLG project. This provides support for many scientific/biomedical terms.
  • Add spell-guessing for unknown words.
  • Fix UTF8 support to be correctly thread-safe.
  • BioLG: fix post-numbering: 'it started on day one'
  • BioLG: add number ranges: 'it takes 2 to 3 times the effort'
  • BioLG: assorted adverb fixes, typical of scientific prose.
  • BioLG: initiate, attach, localize etc are optionally transitive.
  • BioLG: allow fork, branch, splice, export, etc to take particles.
  • BioLG: extended use of greek letters in biomedical text.
  • BioLG: support parsing of Roman numerals.
  • BioLG: support greek-letter-number combinations.
  • Fix: 'she was singing', etc.
  • Enable WordNet word-sense identification based on syntactical usage.

A summary of older changes can be found here.

License

The Link Grammar license is essentially the BSD license. A copy of this license can be found below, and at the original authors' CMU site.

Copyright (c) 2003-2004 Daniel Sleator, David Temperley, and John Lafferty. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  3. The names "Link Grammar" and "Link Parser" must not be used to endorse or promote products derived from this software without prior written permission. To obtain permission, contact sleator@cs.cmu.edu

THIS SOFTWARE IS PROVIDED BY DANIEL SLEATOR, DAVID TEMPERLEY, JOHN LAFFERTY AND OTHER CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.