Google Open Sources SyntaxNet: A Neural Network System For Parsing

Google has open sourced SyntaxNet, a neural network framework implemented in TensorFlow that provides a foundation for Natural Language Understanding (NLU) systems. The release includes all the code required to train new SyntaxNet models on your own data, along with Parsey McParseface, an English parser that has already been trained and that you can use to analyze English text.
 
The company states,
 
“At Google, we spend a lot of time thinking about how computer systems can read and understand human language in order to process it in intelligent ways. Today, we are excited to share the fruits of our research with the broader community by releasing SyntaxNet, an open-source neural network framework implemented in TensorFlow that provides a foundation for Natural Language Understanding (NLU) systems. Our release includes all the code needed to train new SyntaxNet models on your own data, as well as Parsey McParseface, an English parser that we have trained for you and that you can use to analyze English text.”
 
Parsey McParseface is built on powerful machine learning algorithms that learn to analyze the linguistic structure of language and can explain the functional role of each word in a sentence. As Parsey McParseface is the ‘most accurate such model in the world’, Google hopes it will prove useful to developers and researchers interested in the automatic extraction of information, translation, and other core applications of NLU.
 
SyntaxNet is a framework for what is known in academic circles as a syntactic parser, a key first component in many NLU systems. Given a sentence as input, it tags each word with a part-of-speech (POS) tag describing the word’s syntactic function, and it determines the syntactic relationships between the words in the sentence, represented in a dependency parse tree. These syntactic relationships relate directly to the underlying meaning of the sentence. As a simple example, consider the following dependency tree for Alice saw Bob, as provided in the official blog:
 
Image Source: googleresearch.blogspot.in 
 
This structure encodes that Alice and Bob are nouns and saw is a verb. The main verb saw is the root of the sentence; Alice is the subject (nsubj) of saw, while Bob is its direct object (dobj). You can see that Parsey McParseface analyzes this sentence correctly.
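To make that representation concrete, here is a minimal Python sketch of the information such a parse carries; the data structure is an illustration, not SyntaxNet’s actual output format.

# A hand-written sketch (not SyntaxNet's actual data structures) of the
# dependency parse for "Alice saw Bob": each entry records a word, its
# part-of-speech tag, its head word, and the relation to that head.
parse = [
    ("Alice", "NOUN", "saw", "nsubj"),  # Alice is the subject of saw
    ("saw",   "VERB", None,  "ROOT"),   # saw is the root of the sentence
    ("Bob",   "NOUN", "saw", "dobj"),   # Bob is the direct object of saw
]

for word, pos, head, relation in parse:
    if head is None:
        print(f"{word}/{pos} is the ROOT")
    else:
        print(f"{word}/{pos} --{relation}--> {head}")

Parsey McParseface also understands the following more complex example: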
 
 Image Source: googleresearch.blogspot.in
 
The structure again encodes the fact that Alice and Bob are the subject and object, respectively, of saw; that Alice is modified by a relative clause with the verb reading; that saw is modified by the temporal modifier yesterday; and so on. The grammatical relationships encoded in dependency structures make it easy to recover the answers to various questions, for example whom did Alice see?, who saw Bob?, what had Alice been reading about?, or when did Alice see Bob?
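As a sketch of how such questions can be answered from a parse, the snippet below stores hand-written (dependent, relation, head) triples matching the structure just described and looks them up. The relation labels are illustrative, and the find helper is hypothetical, not part of SyntaxNet.

# Hand-written triples mirroring the dependency structure in the figure.
triples = [
    ("Alice",     "nsubj",      "saw"),      # subject of saw
    ("Bob",       "dobj",       "saw"),      # direct object of saw
    ("yesterday", "tmod",       "saw"),      # temporal modifier of saw
    ("reading",   "rcmod",      "Alice"),    # relative clause modifying Alice
    ("SyntaxNet", "prep_about", "reading"),  # what the reading was about
]

def find(relation, head):
    # Return the dependents attached to `head` by `relation`.
    return [dep for dep, rel, h in triples if rel == relation and h == head]

print("Who saw Bob?            ", find("nsubj", "saw"))
print("Whom did Alice see?     ", find("dobj", "saw"))
print("When did Alice see Bob? ", find("tmod", "saw"))
print("What had Alice been reading about?", find("prep_about", "reading"))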
Google states,
 
“One of the main problems that makes parsing so challenging is that human languages show remarkable levels of ambiguity. It is not uncommon for moderate length sentences - say 20 or 30 words in length - to have hundreds, thousands, or even tens of thousands of possible syntactic structures. A natural language parser must somehow search through all of these alternatives, and find the most plausible structure given the context.” As a very simple example, the sentence Alice drove down the street in her car has at least two possible dependency parses, as given in the official blog:
 
 
Image Source: googleresearch.blogspot.in
 
The first corresponds to the (correct) interpretation where Alice is driving in her car; the second corresponds to the (absurd, but possible) interpretation where the street is located in her car. The ambiguity arises because the preposition in can modify either drove or street; this example is an instance of what is known as prepositional phrase attachment ambiguity.
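Written out as data, the two analyses differ in a single attachment decision. A minimal sketch, with illustrative relation labels rather than SyntaxNet’s exact label set:

# The two competing parses as (dependent, relation, head) triples; they
# share everything except which word the preposition "in" attaches to.
shared = [
    ("Alice",  "nsubj",     "drove"),
    ("street", "prep_down", "drove"),
    ("car",    "pobj",      "in"),
]
correct_parse = shared + [("in", "prep", "drove")]   # Alice drove ... in her car
absurd_parse  = shared + [("in", "prep", "street")]  # the street is in her car

A parser has to score such alternatives against each other and choose the more plausible one.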
 
The company states,
 
“SyntaxNet applies neural networks to the ambiguity problem. An input sentence is processed from left to right, with dependencies between words being incrementally added as each word in the sentence is considered. At each point in processing many decisions may be possible—due to ambiguity—and a neural network gives scores for competing decisions based on their plausibility. For this reason, it is very important to use beam search in the model. Instead of simply taking the first-best decision at each point, multiple partial hypotheses are kept at each step, with hypotheses only being discarded when there are several other higher-ranked hypotheses under consideration.”
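The beam search described in the quote can be sketched in a few lines of generic Python. The expand function below is a hypothetical stand-in for SyntaxNet’s transition system and neural-network scorer; the point is only the control flow: keep the beam_size highest-scoring partial hypotheses at each step instead of committing to a single best decision.

def beam_search(initial_state, expand, n_steps, beam_size=8):
    # Generic beam search. `expand(state)` must yield
    # (decision_score, next_state) pairs for every legal decision,
    # e.g. adding one labeled dependency arc.
    beam = [(0.0, initial_state)]                    # (cumulative score, state)
    for _ in range(n_steps):
        candidates = [
            (score + step_score, next_state)
            for score, state in beam
            for step_score, next_state in expand(state)
        ]
        # A hypothesis is discarded only when beam_size higher-ranked
        # hypotheses are under consideration.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return max(beam, key=lambda c: c[0])             # best surviving hypothesis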
 
Parsey McParseface and other SyntaxNet models are some of the most complex networks that Google has trained with the TensorFlow framework. Using data from the Google-supported Universal Treebanks project, you can now train a parsing model on your own machine.
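Treebank data of this kind is typically distributed in the tab-separated CoNLL-U format used by Universal Dependencies releases. A hedged sketch of reading such a file, with a hypothetical filename, might look like this:

def read_conllu(path):
    # Yield sentences as lists of (index, form, POS tag, head, relation).
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                  # a blank line ends a sentence
                if sentence:
                    yield sentence
                sentence = []
            elif line.startswith("#"):    # comment/metadata line
                continue
            else:
                cols = line.split("\t")   # ten tab-separated fields per token
                token_id, form, upos, head, deprel = (
                    cols[0], cols[1], cols[3], cols[6], cols[7],
                )
                # Skip multiword-token and empty-node lines (ids like 1-2, 1.1).
                if "-" not in token_id and "." not in token_id:
                    sentence.append((int(token_id), form, upos, int(head), deprel))
    if sentence:
        yield sentence

for sent in read_conllu("en-train.conllu"):   # hypothetical filename
    print(sent[:3])
    break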
 
Google concluded by saying,
 
“On a standard benchmark consisting of randomly drawn English newswire sentences (the 20 year old Penn Treebank), Parsey McParseface recovers individual dependencies between words with over 94% accuracy, beating our own previous state-of-the-art results, which were already better than any previous approach. While there are no explicit studies in the literature about human performance, we know from our in-house annotation projects that linguists trained for this task agree in 96-97% of the cases. This suggests that we are approaching human performance—but only on well-formed text. Sentences drawn from the web are a lot harder to analyze, as we learned from the Google WebTreebank (released in 2011). Parsey McParseface achieves just over 90% of parse accuracy on this dataset.”
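The 94% figure in the quote is per-word dependency accuracy, the fraction of words whose head is recovered correctly (the unlabeled version is commonly called unlabeled attachment score, or UAS). A toy sketch of the metric, with made-up gold and predicted heads:

def attachment_score(gold_heads, predicted_heads):
    # Fraction of words whose predicted head matches the gold head.
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)

gold      = [2, 0, 2]   # heads for "Alice saw Bob" (0 marks the root)
predicted = [2, 0, 2]
print(f"UAS: {attachment_score(gold, predicted):.1%}")   # -> UAS: 100.0%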