Link Grammar Parser
by Davy Temperley, John Lafferty and Daniel Sleator (this variant maintained by Dom Lachowicz - <domlachowicz@gmail.com> and Linas Vepstas - <linasvepstas@gmail.com>)
News
July, 2009: link-grammar 4.5.8 released! See below for a description of recent changes.
What is the Link Grammar?
The Link Grammar Parser is a syntactic parser of English, based on link grammar, an original theory of English syntax. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words. The parser also produces a "constituent" (Penn tree-bank style phrase tree) representation of a sentence (showing noun phrases, verb phrases, etc.).
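As a rough illustration of what "a set of labeled links connecting pairs of words" looks like as data, here is a minimal Python sketch. It does not call the parser; the `Link` class, the helper functions and the example sentence are hypothetical, and the labels `D`, `S` and `O` are simplified stand-ins for the determiner, subject and object link types, written by hand rather than taken from parser output.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Link:
    """One labeled link between two word positions (left < right)."""
    left: int
    right: int
    label: str


def links_for_word(links: List[Link], index: int) -> List[Link]:
    """Return every link that touches the word at the given position."""
    return [link for link in links if index in (link.left, link.right)]


def format_links(words: List[str], links: List[Link]) -> List[str]:
    """Render each link as 'left-word --LABEL-- right-word'."""
    return [f"{words[l.left]} --{l.label}-- {words[l.right]}" for l in links]


# Hand-written illustration only; this is NOT output from the parser.
WORDS = ["the", "cat", "chased", "a", "mouse"]
LINKS = [
    Link(0, 1, "D"),  # determiner to noun
    Link(1, 2, "S"),  # subject noun to verb
    Link(2, 4, "O"),  # verb to object noun
    Link(3, 4, "D"),  # determiner to noun
]

if __name__ == "__main__":
    for line in format_links(WORDS, LINKS):
        print(line)
```

The structure is just a list of (left word, right word, label) triples, which is why a linkage is usually drawn as arcs over the sentence rather than as a tree.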
Did the AbiWord team write Link Grammar?
In large part, no. The project is the brainchild of Davy Temperley, John Lafferty and Daniel Sleator, all university professors. It is the product of a decade of academic research into grammar, and is founded on a theory backed by numerous publications. Its canonical homepage is hosted by Carnegie Mellon University.
So, then what is it doing @ AbiSource.com?
The AbiWord team had a concrete need - to integrate a grammar checking feature into AbiWord. The best choice, they felt, was to build upon Temperley et al.'s successful Link Grammar project.
However, in order for the link-grammar project to be useful to them and to the greater Free Software world, the AbiWord community felt that a variety of changes to the project would be necessary. While they did have success (a few years ago) convincing the authors to release Link Grammar under a GPL-compatible license, there was no practical way to continue project development and maintenance at the CMU website. So the AbiWord community took it under its wing and has nurtured the project since.
Ongoing development by OpenCog
Ongoing development of link-grammar is being primarily guided by the Open Cognition project, where the parser plays an important role in the OpenCog natural language processing subsystem. Research and implementation are ongoing; current work includes investigations into statistically guided parse ranking, grammatically induced word-sense disambiguation using statistical results from the Mihalcea all-words WSD algorithm, and work on automatically learning new parse rules based on corpus statistics. The link-grammar project participates in the Google Summer of Code, and is currently accepting applications for GSOC 2009. See the Ideas Page for details.
A sibling project, RelEx, uses constraint-grammar-like techniques to extract dependency relations and assorted additional linguistic information, including FrameNet-style framing and reference resolution.
Notable changes from the upstream Link Grammar package include:
- Actively maintained! New releases typically every few months.
- Numerous bug fixes and performance improvements; expanded dictionaries and parse coverage.
- New bindings, including Ruby, Python, Perl, Java and OCaml.
- Support for UTF8 Unicode; Arabic and Persian dictionaries; prototype German dictionary.
- Multi-threading support; a standard build system; pkg-config integration; dynamic/shared library support; fixes for non-Linux platforms such as Windows and Macintosh.
Downloading Link Grammar
The system can be downloaded either as a tarball, or via SVN. The current stable version is Link Grammar 4.5.8 (July, 2009). Older versions are available here.
Unstable, development versions are available through AbiWord's SVN repository. Anonymous read-only access is available by issuing the command:
svn co http://svn.abisource.com/link-grammar/trunk link-grammar
General instructions for AbiWord's anonymous SVN can be found here.
The Link Grammar source can be browsed online here.
Documentation
One of the best ways to obtain a solid, easy-to-understand overview of the parser is to review the original papers describing it, here, here, here and here. There is an extensive set of pages documenting the dictionary; specifically, the names of links and their meanings, as well as how to write new rules. There is also a short primer for creating dictionaries for new languages. The documentation for the programming API is here. Documentation for additions made in the 4.0 release is on the improvements page. A fairly comprehensive bibliography of papers written before 2004 is here.
Mailing Lists
The current list for Link Grammar discussion is at the link-grammar google group.
Bug Tracker
Bug reports, patches, RFEs, etc. are gladly welcomed.
- Bug reports should be filed at the Google code bug tracker.
- General issue discussion, requests for enhancement, and related matters should be discussed on the Link Grammar mailing list.
Disclaimer
Link grammar is a natural language parser, not an artificial intelligence. This means that there are many sentences that it cannot parse correctly, and many others for which it generates multiple parses. There are also entire classes of speech that it cannot parse, such as Valley-girl speak. Link grammar does best on "newspaper English": medium-length sentences written with good grammar, proper punctuation, and proper capitalization. It don't do 733t speek, etc. In particular, it has problems with the following "registers" and types of writing:
- Phrases (that are not a part of a complete sentence)
- Bulleted lists, such as this.
- Quotations within sentences (and parenthetical remarks). These can be handled by an appropriate front-end that separates out the quotations from the rest of the text.
- Slang speech and words, like 733t warez d00dz, although the parser can certainly guess from context if the slang is sufficiently grammatical.
- Long run-on sentences. These can generate thousands of alternative parses in a combinatorial explosion.
- Certain "registers", such as newspaper headlines; for example, "Thieves rob bank."
In addition, it has a variety of "bugs": it currently has trouble with "if...then..." constructs, compound queries ("who did it, and why?"), lists, "...not only...but also..." constructs, certain types of idiomatic phrases, certain types of "institutional utterances", and so on. The goal of the project is to eventually fix all of these cases; progress is ongoing.
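The list above suggests handling quotations and parenthetical remarks with a front-end that separates them from the rest of the text. The following Python sketch shows one way such a front-end might look. It is an assumption-laden illustration, not part of link-grammar: the function name `split_asides` is hypothetical, only straight double quotes and non-nested parentheses are recognized, and each extracted span would then be handed to the parser as its own sentence.

```python
import re
from typing import List, Tuple

# Matches a double-quoted span or a parenthesized span (no nesting).
_ASIDE = re.compile(r'"([^"]*)"|\(([^()]*)\)')


def split_asides(text: str) -> Tuple[str, List[str]]:
    """Return (main_text, asides), with asides pulled out for separate parsing."""
    asides: List[str] = []

    def _collect(match: "re.Match") -> str:
        # Group 1 is quoted text, group 2 is parenthesized text.
        inner = match.group(1) if match.group(1) is not None else match.group(2)
        asides.append(inner.strip())
        return " "

    main = _ASIDE.sub(_collect, text)
    # Collapse the whitespace left behind, then remove space before punctuation.
    main = re.sub(r"\s+", " ", main).strip()
    main = re.sub(r"\s+([,.;:!?])", r"\1", main)
    return main, asides


if __name__ == "__main__":
    main, asides = split_asides(
        'The parser (a C library) handles "newspaper English" well.'
    )
    print(main)
    print(asides)
```

Note that removing a quotation can leave the main sentence incomplete when the quotation is itself a grammatical argument (for example, the object of "said"); a more careful front-end might substitute a placeholder word instead of deleting the span.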
Recent Changes
Version 4.5.8 (2 July 2009) includes the following changes:
- Fix: 'than anticipated', 'than was anticipated', etc.
- Fix: 'saw the wood'
- Fix: sometimes commas are used as if they were semicolons.
- Fix: 'We have quite enough work already, thank you!'
- Fix: allow 'and' as conjunction in entity names.
- Fix: 'I stared him down', 'They shouted him down', 'booed off'
- Fix: 'sound him out', 'look him over'
- Fix: 'Somewhere in the distance'
- Stub out list of names given to both men and women, to avoid duplication.
- Fix: 'I think so, too'
- Fixes for compilation under Cygwin.
- From Boris Iordanov: fixes to JSON java code.
- From Boris Iordanov: new java remote client code.
- Fix: Biological texts commonly have adj-noun-adj-noun chains
Version 4.5.7 (4 June 2009) includes the following changes:
- Fix 'make install' for windows (abi bug 12049)
- Fix multi-threaded bug when TRACK_SPACE_USAGE is defined.
- Add './configure --enable-mudflap' just for fun...
- Fix: "Walk tall", "Think quick"
- Fix: "... part no. 1234-56A"
- Fix regression from BioLG merge: "It cost $14 million."
- Fix come/came: "The dog came running..."
- Fix year abbreviations: "He drove a souped-up '57 Chevy"
- Fix sit, stand: "The dog stood still"
- Fix act up, act out: "He is acting out." "The motor is acting up."
- Fix notoriously, poorly: "The store was poorly stocked".
- Fix: "strong" can be adverb
- Add support for recognizing basic time zones during parsing.
- Fix: verbs acting as adjectival modifiers: "a very politicized deal."
- Fix: ".. nearly so well", "...almost so well".
- Fix financial ranges: "It will cost $10 million to $20 million to build."
- Expand handling of capitalized words that appear in entity names.
- Expand the list of characters that are recognized as quotes.
- Support usage of yes, no as sentence openers.
- Better support for directives, commands.
- Fix: "Ash Wednesday", "Fat Tuesday", etc.
- Fix: post-verbal adj: "she wiped the table dry"
- Fix: wish: "she wished me a happy birthday"
Version 4.5.6 (24 May 2009) includes the following changes:
- Bugfix: fix non-thread-safe usage.
- Changes to enable MinGW/Windows to compile.
- Update of MSVC6 build files
- Fix: pizza, fries, chopsticks.
- Export word-sense database to Java apps.
- Fix: "Was the man drunk or crazy or both?"
Version 4.5.5 (10 May 2009) includes the following changes:
- Bugfix: crash for zero-length sentences.
Version 4.5.4 (9 May 2009) includes the following changes:
- Fix: "sleep in": "A bed is something you sleep in."
- Fix: "drinking": "Let's go drinking."
- Fix go+bare infinitive: "Let's go shop", "Let's go swim"
- Fix: "Let's go for a swim." "Let's go for a smoke".
- Fix: "Let's not" "Let's not go" "Let's not cry"
- Fix: ... is: "All he ever does is complain."
- Fix: "You will die young/happy/unhappy"
- Fix: "You should exercise to stay fit."
- Fix: "We danced 'til dawn."
- Fix: "tell off": "She had told him off."
- Bugfix: sometimes spell checker would run even if turned off.
Version 4.5.3 (14 April 2009) includes the following changes:
- Haste makes waste! Revert a recent 'fix'.
Version 4.5.2 (14 April 2009) includes the following changes:
- Use re-entrant version of mbtowc in all code.
- Fix run-time breakage on Mac OSX and FreeBSD.
Version 4.5.1 (13 April 2009) includes the following changes:
- Fix Assertion failed: negative constituent length!
- Fix build break for Mac OSX.
- Force use of UTF-8 locale in the command-line tool.
Version 4.5.0 (10 April 2009) includes the following changes:
- Hack around missing SQLite3 pkgconfig on MacOS
- Fix adverbs: 'The motor ran hot', 'the door swung wide open', etc.
- Fix: 'at risk of breakdown', 'under threat of fire'
- Add regular-expression-based word guessing, from BioLG project. This provides support for many scientific/biomedical terms.
- Add spell-guessing for unknown words.
- Fix UTF8 support to be correctly thread-safe.
- BioLG: fix post-numbering: 'it started on day one'
- BioLG: add number ranges: 'it takes 2 to 3 times the effort'
- BioLG: assorted adverb fixes, typical of scientific prose.
- BioLG: initiate, attach, localize etc are optionally transitive.
- BioLG: allow fork, branch, splice, export, etc to take particles.
- BioLG: extended use of greek letters in biomedical text.
- BioLG: support parsing of Roman numerals.
- BioLG: support greek-letter-number combinations.
- Fix: 'she was singing', etc.
- Enable WordNet word-sense identification based on syntactical usage.
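The 4.5.0 notes above mention regular-expression-based word guessing, along with support for Roman numerals and Greek-letter-number combinations. The Python sketch below illustrates the general idea of assigning a class to a token that is missing from the dictionary by trying patterns in order. The class names, the patterns and the `-ase` suffix rule are all hypothetical and are not taken from the link-grammar or BioLG sources.

```python
import re
from typing import Optional

# Hypothetical class names and patterns, tried in order; first match wins.
_GUESS_RULES = [
    # Uppercase Roman numerals such as "XIV" (lookahead rejects the empty string).
    ("ROMAN-NUMERAL", re.compile(
        r"^(?=[IVXLCDM])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")),
    # Spelled-out Greek letter followed by a number, e.g. "beta-2" or "alpha1".
    ("GREEK-NUMBER", re.compile(
        r"^(?:alpha|beta|gamma|delta|kappa)-?\d+$", re.IGNORECASE)),
    # Illustrative suffix rule: treat unknown words ending in "ase" as nouns.
    ("ASE-NOUN", re.compile(r"^[A-Za-z][A-Za-z-]*ase$")),
]


def guess_word_class(token: str) -> Optional[str]:
    """Guess a class for a token that is NOT in the dictionary, or return None."""
    for name, pattern in _GUESS_RULES:
        if pattern.match(token):
            return name
    return None


if __name__ == "__main__":
    for token in ["XIV", "beta-2", "kinase", "protein"]:
        print(token, guess_word_class(token))
```

Such rules are only sensible as a fallback for unknown words: applied to every token, the Roman-numeral pattern would also claim the pronoun "I", and the suffix rule would claim ordinary words like "phrase".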
Adjunct Projects
- RelEx Semantic Relation Extractor
- RelEx is an English-language semantic relationship extractor, built on the Carnegie Mellon link parser. It can identify subject, object, indirect object and many other relationships between words in a sentence. It will also provide part-of-speech tagging, noun-number tagging, verb tense tagging, gender tagging, and so on. RelEx includes a basic implementation of the Hobbs anaphora (pronoun) resolution algorithm. Optionally, it can use GATE for entity detection.
- BioLG (New!)
- The BioLG project is a modification of the Link Grammar Parser adapted for the biomedical domain, as described in Lexical Adaptation of Link Grammar to the Biomedical Sublanguage: a Comparative Evaluation of Three Approaches (Sampo Pyysalo, Tapio Salakoski, Sophie Aubin and Adeline Nazarenko; BMC Bioinformatics 2006).
- Ruby bindings
- There are two different packages providing Ruby bindings: Ruby Link Grammar, which is up-to-date and currently maintained, and Link Grammar 4 Ruby, which is wildly out-of-date (it's for version 4.2.2) and is unmaintained. You only need one!
- Python bindings
- New Python bindings are in development. Development snapshots are available on Launchpad.
- Perl bindings (New!)
- The Perl bindings, created by Danny Brian, have been updated. See the Lingua-LinkParser page on CPAN. There is also a tutorial written against an older version of the bindings; some details may be different.
- Objective Caml bindings
- OCaml interface to Link Grammar
- .Net Framework bindings
- .Net interface to Link Grammar from Leonard Chalk/ProAI.
- Alternative Java bindings
- Another, completely different set of Java bindings has been developed: a tarball is here. These are for the old version 4.1 only. Note that these are not compatible with the bindings that ship, by default, with the main link-grammar package.
- Persian dictionaries
- Persian dictionaries, by Jon Dehdari. These require the Persian stemming engine, as significant morphology analysis needs to be performed to parse Persian.
- Arabic dictionaries
- Arabic dictionaries, by Jon Dehdari. [download] These require the Aramorph stemming package, which is included.
- French dictionary, Luthor
- The Luthor project aims to develop a set of scripts to automatically construct Link Grammar linkage dictionaries by mining Wiktionary data. Current efforts are focusing on French.
- Russian parser
- Located at http://slashzone.ru/parser/. By Sergey Protasov. Russian morpheme dictionaries can be had at http://aot.ru.
- English dictionary extensions
- LinkGrammar-WN is a lexicon expansion for the English language Link Grammar Parser. This project adds 14K new words to the dictionaries. The extended lexicon is provided under the GPL license, and thus cannot be merged back into the current project.
- Medical Text Analysis
- The MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) Clinical Decision Making Group has done work to extend the Link Grammar dictionaries by adding many new words. All but the six largest of these dictionaries have been merged into link-grammar, since version 4.3.1. The large dictionaries EXTRA.2, EXTRA.3, EXTRA.8, EXTRA.9, EXTRA.12, and EXTRA.17 have not been merged. These dictionaries contain 180K assorted medical, biological and biochemical terms and phrases.
Of related interest
- Genia tagger
- The Genia tagger is useful for named entity extraction. BSD license source.
Recent Applications and Publications
Some recent uses and applications of the Link Grammar Parser are shown below. There is also an older bibliography on the CMU website (mirror) referencing several dozen papers pertaining to the Link Grammar Parser.
- Sampo Pyysalo, Tapio Salakoski, Sophie Aubin and Adeline Nazarenko, "Lexical Adaptation of Link Grammar to the Biomedical Sublanguage: a Comparative Evaluation of Three Approaches". BMC Bioinformatics 2006.
- Schneider, Gerold (1998). "A Linguistic Comparison of Constituency, Dependency, and Link Grammar". Master's Thesis, University of Zurich.
- Özlem Istek, "A Link Grammar for Turkish", Thesis, 2006
- Shailly Goyal and Niladri Chatterjee, "Study of Hindi Noun Phrase Morphology for Developing a Link Grammar Parser", Language in India, Volume 5 : 8 August 2005
- Fabian M. Suchanek, Georgiana Ifrim, Gerhard Weikum, "Combining Linguistic and Statistical Analysis to Extract Relations from Web Documents" (2006)
- P. Szolovits, "Adding a Medical Lexicon to an English Parser". Proc. AMIA 2003 Annual Symposium. Pages 639-643. 2003.
- Jing Ding, Daniel Berleant, Jun Xu, Andy W. Fulmer, "Extracting Biochemical Interactions from MEDLINE Using a Link Grammar Parser"
- Rania A. Abul Seoud, Nahed H. Solouma, Abou-Baker M. Youssef, and Yasser M. Kadah, "PIELG: A Protein Interaction Extraction System using a Link Grammar Parser from Biomedical Abstracts". International Journal of Biological, Biomedical and Medical Sciences 3;3 www.waset.org Summer 2008
- I. Marshall and E. Safar, "Extraction of semantic representations from syntactic CMU link grammar linkages"
Some miscellaneous facts:
- Any categorial grammar can be easily converted to a link grammar; see section 6 of Daniel Sleator and Davy Temperley. 1993. "Parsing English with a Link Grammar." Third International Workshop on Parsing Technologies.
- Link grammars can be learned by performing a statistical analysis on a large corpus: see John Lafferty, Daniel Sleator, and Davy Temperley. 1992. "Grammatical Trigrams: A Probabilistic Model of Link Grammar." Proceedings of the AAAI Conference on Probabilistic Approaches to Natural Language, October, 1992.
License
The Link Grammar license is essentially the BSD license. A copy of this license can be found below, and at the original authors' CMU site.
Copyright (c) 2003-2004 Daniel Sleator, David Temperley, and John Lafferty. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- The names "Link Grammar" and "Link Parser" must not be used to endorse or promote products derived from this software without prior written permission. To obtain permission, contact sleator@cs.cmu.edu
THIS SOFTWARE IS PROVIDED BY DANIEL SLEATOR, DAVID TEMPERLEY, JOHN LAFFERTY AND OTHER CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
