IADS Revision Note: Language Processing Basic

CFG and syntax tree

context-free languages are common to both artificial languages and natural languages, can be resolved by syntax tree, which displays the grammatical consistent structure of a language text and constructing the tree is a step in many language processing (LP) tasks.

grammar specify what syntax tree is possible, consist of terminals of grammar or the leaf of syntax and non-terminals that not appear in sentences of language.

generating syntax tree from grammar, beginning with start symbol, repeatedly expanding non-terminal symbols, and we can generate infinitely many strings from this finite grammar, and the set of all strings generated by this grammar is defined as language, in term of language, we can also define it as set of derivations involving sentential forms, which is any sequence of terminals and non-terminals, that can appear in a derivation starting from the start symbol.

syntax trees may have structural ambiguity, which means one sentence several tree, and to avoid this we need carefully design grammar, but in real life, the natural language, structural ambiguity is a fact of life.

A context free grammar (CFG) consists of a finite set of terminals, a finite set of non-terminals, a choice of start symbols, a finite set of productions.

In terms of context free, we mean the derivations not depend on follow up derivations, there also something called context sensitive grammar, which depend on future derivation, for example, VB is context sensitive programming language.

parsing problem，CYK and CNF

parsing problem is the one convert the string of terminals to syntax tree, to solve this problem, we can use CYK algorithm, with Chomsky normal form (CNF, notice that CNF also an abbreviation of conjunctive normal form), and the impressing thing is that this algorithm is dynamic, and can operate in $O(n^3)$ or, if we allow grammar to vary, the runtime would be $O(mn^3)$, where m is the size of grammar, if we allow ternary rules (conflict with CNF) in CYK, the runtime would be $O(n^4)$ and this is one major reason using CNF in CYK.

We know that any CFG can convert to CNF, which is a grammar form that each RHS consists of either just two non-terminals or just one terminal, just like a binary tree, by adding new non-terminals.

CYK algorithm can identify all possible parses of a string of terminals by a recognizer, which determines if a given string belongs to the given language, also the one has no parses, called phantom constituents, can be recognized.

Getting the smallest CNF grammar equivalent to a CFG is NP-hard, in other words, there are no known polynomial-time algorithm, where NP denotes nondeterministic polynomial time.

Versions of CYK are quite widely used in Natural language context, where sentences typically have <100 words, but $O(n^3)$ runtime is not friendly to parse computer languages as programs may be massive.

LL(1) parsing

The thought behind LL(1) predictive parsing is predicting by rule from just current token, by token we mean the word identify the non-terminal expansion, and ensure unique rule in each cell of a parse table, which is a table telling us which rule to apply in given situation denoted by non-terminal and terminal pair, also there is a new one called Blank entries, which is situations that never arise for a legal input.

LL(1) predictive parsing is done very efficiently with stack, which records the predicted form for the remaining input, with 3 type error messages, Error if no rule, Error if stack symbol is any other terminal, Error if stack empties sooner, moreover, the stack input has two functions, if the next peek is a terminal, it should pop this terminal, if the next peek is a non-terminal, it should check the parse table for the rule.

Notice that the stack pops from left to right, or growing from right to left.

LL(1) is an example of a top-down parser, builds syntax trees from the root, while CYK is bottom-up method, and LL(1) can runs in time $\theta(n)$ with small hidden constants, so it would be great if we try to ensure a CF language has LL(1) grammar.

LL(1) is nice for simple ‘command languages’ and the ‘lightweight’ parsing algorithm is a plus, but for large-scale languages, which want a bit more flexibility, LR(1) parsing is used.