Semere Bitew graduates Cum Laude on Logical Structure Extraction of Electronic Documents

Logical Structure Extraction of Electronic Documents Using Contextual Information

by Semere Bitew

Logical document structure extraction refers to the process of coupling semantic meanings (logical labels), such as title, authors, and affiliation, to the physical sections of a document. For example, in scientific papers the first paragraph is usually the title. Logical document structure extraction is a challenging natural language processing problem. Elsevier, as one of the biggest scientific publishers in the world, is working on recovering logical structure from article submissions in its Apollo project. The current process requires human annotators to make sure that logical entities in articles are labelled with the correct tags, such as title, abstract, heading, reference-item, and so on. Automating this process would make it more efficient at producing correct tags and at delivering high-quality, consistent, publishable articles. A lot of research has been done on automatically extracting the logical structure of documents. In this thesis, a document is defined as a sequence of paragraphs, and recovering the label of each paragraph yields the logical structure of the document. For this purpose, we propose a novel approach that combines random forests with conditional random fields (RF-CRFs) and long short-term memory with CRFs (LSTM-CRFs). Two variants of CRFs, linear-chain CRFs (LCRFs) and dynamic CRFs (DCRFs), are used in both of the proposed approaches. These approaches take the label information of surrounding paragraphs into account when classifying a paragraph. Three categories of features, namely textual, linguistic, and markup features, are extracted to build the RF-CRF models. Word embeddings are used as input to build the LSTM-CRF models. Our models were evaluated on extracting reference-items from Elsevier’s Apollo dataset of 146,333 paragraphs. Our results show that the LSTM-CRF models trained on this dataset outperform the RF-CRF models and existing approaches. We show that the LSTM component efficiently uses past feature inputs within a paragraph, while the CRF component exploits contextual information through the tag information of surrounding paragraphs. We observed that the feature categories are complementary: they produce the best performance when all features are used. This manual feature extraction can, however, be replaced with an LSTM that uses no handcrafted features, achieving better performance. Additionally, including features generated for the previous and next paragraph in the feature vector of the current paragraph improved the performance of all models.
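
As a rough sketch of the neighbour-aware feature vectors described above (all names and features here are hypothetical illustrations, not taken from the thesis), the following Python fragment builds a feature dictionary per paragraph and copies the features of the previous and next paragraph into it, before the sequence is passed to a paragraph classifier such as a random forest:

# Hypothetical sketch: build neighbour-aware feature vectors for paragraphs.
def paragraph_features(par):
    # A few illustrative textual and markup features; the thesis uses a richer set.
    text = par["text"]
    return {
        "n_words": len(text.split()),
        "starts_with_digit": text[:1].isdigit(),
        "ends_with_period": text.rstrip().endswith("."),
        "is_bold": par.get("bold", False),
    }

def document_features(paragraphs):
    feats = [paragraph_features(p) for p in paragraphs]
    rows = []
    for i, f in enumerate(feats):
        row = dict(f)
        if i > 0:                      # add the previous paragraph's features
            row.update({"prev_" + k: v for k, v in feats[i - 1].items()})
        if i + 1 < len(feats):         # add the next paragraph's features
            row.update({"next_" + k: v for k, v in feats[i + 1].items()})
        rows.append(row)
    return rows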

[download pdf]

Linear Co-occurrence Rate Networks for Sequence Labeling

by Zhemin Zhu, Djoerd Hiemstra, and Peter Apers

Sequence labeling has wide applications in natural language processing and speech processing. Popular sequence labeling models suffer from some known problems: hidden Markov models (HMMs) are generative models and cannot encode transition features; conditional Markov models (CMMs) suffer from the label bias problem; and training conditional random fields (CRFs) can be expensive. In this paper, we propose Linear Co-occurrence Rate Networks (L-CRNs) for sequence labeling, which avoid the problems of these existing models. The factors of L-CRNs can be locally normalized and trained separately, which leads to a simple and efficient training method. Experimental results on real-world natural language processing datasets show that L-CRNs reduce the training time by orders of magnitude while achieving results that are very competitive with CRFs.
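
As a sketch of the factorization behind these models (reconstructed from the abstracts on this page, not quoted from the paper), a linear-chain label sequence can be written as a product of locally normalized node factors and pairwise co-occurrence rate factors:

P(y_1, \ldots, y_n \mid x) \;=\; \prod_{i=1}^{n} P(y_i \mid x) \prod_{i=1}^{n-1} \mathrm{CR}(y_i; y_{i+1} \mid x),
\qquad
\mathrm{CR}(y_i; y_{i+1} \mid x) \;=\; \frac{P(y_i, y_{i+1} \mid x)}{P(y_i \mid x)\, P(y_{i+1} \mid x)}

Because each factor is a (ratio of) local probabilities, it can be normalized and estimated on its own, without computing the global partition function that makes CRF training expensive.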

[download pdf]

The paper will be presented at the International Conference on Statistical Language and Speech Processing (SLSP) in Grenoble, France on October 14-16, 2014.

Our C++ implementation of L-CRNs and the datasets used in this paper can be found on GitHub.

Comparison of Local and Global Undirected Graphical Models

by Zhemin Zhu, Djoerd Hiemstra, Peter Apers, and Andreas Wombacher

Conditional Random Fields (CRFs) are discriminative undirected models which are globally normalized. Global normalization protects CRFs from the label bias problem (LBP), from which most local models suffer. The recently proposed co-occurrence rate networks (CRNs) are also discriminative undirected models. In contrast to CRFs, CRNs are locally normalized. It was established that CRNs are immune to the LBP even though they are local models. In this paper, we compare these two models further. The connection between CRNs and copulas is established for the continuous case, and their strengths and weaknesses are evaluated statistically in experiments.
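
One way to read the copula connection mentioned above (a sketch based on Sklar's theorem, not taken from the paper itself): in the continuous case, writing the joint density of two variables via a copula density c gives

f(x, y) \;=\; c\bigl(F_X(x), F_Y(y)\bigr)\, f_X(x)\, f_Y(y)
\quad\Longrightarrow\quad
\mathrm{CR}(x; y) \;=\; \frac{f(x, y)}{f_X(x)\, f_Y(y)} \;=\; c\bigl(F_X(x), F_Y(y)\bigr)

so the co-occurrence rate of two continuous variables is exactly the copula density evaluated at their marginal CDF values.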

[download pdf]

The paper was presented at the 22nd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN) in Bruges (Belgium) on 23-25 April 2014.

Empirical Co-occurrence Rate Networks For Sequence Labeling

by Zhemin Zhu, Djoerd Hiemstra, Peter Apers, and Andreas Wombacher.

Structured prediction has wide applications in many areas, and powerful and popular models for structured prediction have been developed. Despite their successes, these models suffer from some known problems: (i) hidden Markov models (HMMs) are generative models which suffer from the mismatch problem, and it is also difficult to explicitly incorporate overlapping, non-independent features into an HMM; (ii) conditional Markov models suffer from the label bias problem; (iii) Conditional Random Fields (CRFs) overcome the label bias problem by global normalization, but global normalization can be expensive, which prevents CRFs from being applied to big data. In this paper, we propose Empirical Co-occurrence Rate Networks (ECRNs) for sequence labeling. ECRNs are discriminative models, so they overcome the problems of HMMs. ECRNs are also immune to the label bias problem even though they are locally normalized. To make the estimation of ECRNs as fast as possible, we simply use empirical distributions as the parameter estimates. Experiments on two real-world NLP tasks show that ECRNs reduce the training time radically while obtaining accuracy competitive with state-of-the-art models.
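
A minimal sketch of what "using the empirical distributions as the parameter estimates" could look like (hypothetical code, not from the paper): the node and pairwise factors are simply relative frequencies counted from the labelled data, with no iterative optimizer involved.

from collections import Counter

def estimate_factors(sentences):
    # sentences: list of [(observation, label), ...] sequences (hypothetical format).
    node, obs, label, pair = Counter(), Counter(), Counter(), Counter()
    n_pairs = 0
    for sent in sentences:
        labels = [y for _, y in sent]
        for x, y in sent:
            node[(x, y)] += 1
            obs[x] += 1
            label[y] += 1
        for y1, y2 in zip(labels, labels[1:]):
            pair[(y1, y2)] += 1
            n_pairs += 1
    n_tokens = sum(label.values())
    # Empirical node factor P(y | x): relative frequency of label y given observation x.
    p_y_given_x = {(x, y): c / obs[x] for (x, y), c in node.items()}
    # Empirical co-occurrence rate CR(y1; y2) = P(y1, y2) / (P(y1) P(y2)).
    cr = {(y1, y2): (c / n_pairs) / ((label[y1] / n_tokens) * (label[y2] / n_tokens))
          for (y1, y2), c in pair.items()}
    return p_y_given_x, cr

Because every factor is estimated in a single counting pass over the data, training amounts to reading the corpus once, which is where the radical reduction in training time comes from.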

Presented at the International Conference on Machine Learning and Applications (ICMLA) in Miami, Florida.

[download pdf]

Empirical Training for Conditional Random Fields

A Closed Form Maximum Likelihood Estimator Of Conditional Random Fields

by Zhemin Zhu, Djoerd Hiemstra, Peter Apers and Andreas Wombacher

Training Conditional Random Fields (CRFs) can be very slow for big data. In this paper, we present a new training method for CRFs called empirical training, which is motivated by the concept of the co-occurrence rate. We show that standard (unregularized) training can have many maximum likelihood estimates (MLEs). Empirical training has a unique closed-form MLE which is also an MLE of standard training. We are the first to identify the Test Time Problem of standard training, which may lead to low accuracy. Empirical training is immune to this problem, and it is also unaffected by the label bias problem even though it is locally normalized. All of these claims have been verified by experiments. Experiments also show that empirical training reduces the training time from weeks to seconds, and obtains results competitive with standard and piecewise training on linear-chain CRFs, especially when data are insufficient.
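
As a generic illustration (not code from the paper) of how separately estimated node and transition factors might be combined at test time, a standard Viterbi decoder can pick the label sequence that maximizes the sum of log node scores and log transition scores; the dictionary formats and the eps smoothing for unseen events are assumptions of this sketch.

import math

def viterbi(observations, labels, p_y_given_x, cr, eps=1e-12):
    # p_y_given_x: dict {(x, y): prob}; cr: dict {(y1, y2): score}.
    def node(x, y):
        return math.log(p_y_given_x.get((x, y), eps))
    def edge(y1, y2):
        return math.log(cr.get((y1, y2), eps))

    scores = [{y: node(observations[0], y) for y in labels}]
    back = []
    for x in observations[1:]:
        prev = scores[-1]
        col, ptr = {}, {}
        for y in labels:
            best_y1 = max(labels, key=lambda y1: prev[y1] + edge(y1, y))
            col[y] = prev[best_y1] + edge(best_y1, y) + node(x, y)
            ptr[y] = best_y1
        scores.append(col)
        back.append(ptr)
    # Trace back the best label sequence from the last position.
    y = max(labels, key=lambda l: scores[-1][l])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return list(reversed(path))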

[download pdf]

Conditional Random Fields on Steroids

I have never been more excited about a paper that I contributed to! In this technical report Zhemin Zhu introduces a new theory for factorizing undirected graphical models, with astonishing results, reducing the training time for conditional random fields from weeks to seconds on a part-of-speech tagging task. Reducing the training time from weeks to seconds is like approaching the moon up to a distance of about 100 meters, or buying a Ferrari F12 for 10 cents!!

Separate Training for Conditional Random Fields Using Co-occurrence Rate Factorization

by Zhemin Zhu, Djoerd Hiemstra, Peter Apers, and Andreas Wombacher

The standard training method of Conditional Random Fields (CRFs) is very slow for large-scale applications. In this paper, we present separate training for undirected models based on the novel Co-occurrence Rate Factorization (CR-F). Separate training is a local training method. In contrast to piecewise training, separate training is exact. In contrast to MEMMs, separate training is unaffected by the label bias problem. Experiments show that separate training (i) is unaffected by the label bias problem; (ii) reduces the training time from weeks to seconds; and (iii) obtains results competitive with standard and piecewise training on linear-chain CRFs.
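
A sketch of why separate training is possible under a factorization of this kind (reconstructed from the abstract, not quoted from the paper): if the model is a product of locally normalized factors \phi_k, each with its own parameters \theta_k over its own label clique y_{c_k}, then taking logarithms turns the training objective into a sum of terms that do not share parameters,

\sum_{(x, y)} \log P(y \mid x; \theta)
\;=\; \sum_{k} \; \sum_{(x, y)} \log \phi_k\bigl(y_{c_k}, x; \theta_k\bigr),

so each factor can be fitted to the data on its own, without the global partition function that couples all parameters in standard CRF training.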

[download pdf]