Entity Centric Neural Models for Natural Language Processing
Nicola De Cao

Abstract: Entities are at the center of how we represent and aggregate knowledge. For instance, in encyclopedias such as Wikipedia, information is grouped according to entities (e.g., one entity per article). However, although contemporary NLP technology has become remarkably successful at machine-driven question answering, modern neural network models struggle to incorporate structured information about entities into their decision process. In this thesis, "Entity Centric Neural Models for Natural Language Processing", we investigate how to build effective neural network models that exploit entity information for natural language understanding. We consider three main research questions.

How can we exploit entities to tackle Natural Language Understanding tasks? We introduce a neural model that reasons over information spread within and across multiple documents (chapter 3). We frame the task as an inference problem on a graph: mentions of entities are the nodes, while edges encode relations between different mentions (e.g., within- and cross-document co-reference). Graph convolutional networks (GCNs) are applied to these graphs and trained to perform multi-step reasoning. Our Entity-GCN method is scalable and compact, and it achieved state-of-the-art results at the time of writing (i.e., 2018) on WikiHop, a popular multi-document question-answering dataset.

How can we exploit large pre-trained language models to identify and disambiguate entities in text? We propose the first system that retrieves entities by generating their unique names, left to right, token by token, in an autoregressive fashion (chapter 4). Our model mitigates the limitations of well-established two-tower dot-product models, which can miss fine-grained interactions between the text and the entities in a Knowledge Base. Additionally, we significantly reduce the memory footprint of current systems (by up to 15 times) because the parameters of our encoder-decoder architecture scale with vocabulary size rather than with the number of entities. We also extend our approach to a large multilingual setting covering more than 100 languages (chapter 5). In this setting, we match against entity names in as many languages as possible, which allows exploiting connections between the language of the source input and that of the target name. Finally, we propose a highly efficient approach that parallelizes autoregressive linking across all potential mentions and relies on a shallow decoder, yielding a model that is more than 70 times faster with no drop in performance (chapter 6).

How can we interpret and control a model's internal knowledge about entities? We introduce a novel post-hoc interpretation technique for inspecting how decisions emerge across layers in neural models (chapter 7). Our system learns to mask out subsets of hidden vectors while maintaining differentiability. This lets us not only plot attribution heatmaps but also analyze how decisions are formed across network layers. We use this system to study BERT models on sentiment classification and question answering, and we additionally show that the technique can be applied to the graph-based model presented in chapter 3. Finally, we propose a method to edit this factual knowledge about entities and thus fix 'bugs' or unexpected predictions without the need for expensive re-training or fine-tuning (chapter 8).
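
To make the graph-based reasoning of chapter 3 concrete, the following is a minimal sketch of one gated, relation-typed graph convolution step over a mention graph, written in PyTorch. The tensor shapes, the gating formulation, and the per-relation weight sharing are illustrative assumptions, not the exact Entity-GCN implementation.

import torch
import torch.nn as nn

class RelationalGCNLayer(nn.Module):
    """One message-passing step over a mention graph.

    Nodes are entity-mention vectors; each relation type (e.g. within-document
    or cross-document co-reference) gets its own transformation, and a gate
    decides how much of the aggregated message updates each node.
    """

    def __init__(self, hidden_dim: int, num_relations: int):
        super().__init__()
        self.rel_weights = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_relations)]
        )
        self.self_loop = nn.Linear(hidden_dim, hidden_dim)
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, nodes: torch.Tensor, adjacency: torch.Tensor) -> torch.Tensor:
        # nodes: (num_nodes, hidden_dim)
        # adjacency: (num_relations, num_nodes, num_nodes), row-normalised
        message = self.self_loop(nodes)
        for rel, linear in enumerate(self.rel_weights):
            message = message + adjacency[rel] @ linear(nodes)
        update = torch.tanh(message)
        gate = torch.sigmoid(self.gate(torch.cat([nodes, update], dim=-1)))
        return gate * update + (1.0 - gate) * nodes

Stacking a few such layers lets information hop between mentions of the same entity in different documents, which is what enables multi-step reasoning over the document collection.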
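The autoregressive entity retrieval of chapter 4 constrains decoding so that only valid entity names can be generated. Below is a minimal sketch of that constraint mechanism using a prefix trie together with the prefix_allowed_tokens_fn hook of Hugging Face's generate; the tiny entity catalogue, the checkpoint name, and the special-token handling are assumptions for illustration, and in practice the model would first be fine-tuned to generate entity names.

from transformers import BartForConditionalGeneration, BartTokenizer

# Hypothetical entity catalogue; in practice this would be the full Knowledge Base.
entity_names = ["Barack Obama", "Barack Obama Sr.", "Michelle Obama"]

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Build a prefix trie: every prefix of a valid entity name maps to the set of
# token ids that may follow it, with EOS allowed once a full name is complete.
trie = {}
for name in entity_names:
    ids = tokenizer(name, add_special_tokens=False).input_ids
    for i in range(len(ids)):
        trie.setdefault(tuple(ids[:i]), set()).add(ids[i])
    trie.setdefault(tuple(ids), set()).add(tokenizer.eos_token_id)

special_ids = {
    model.config.decoder_start_token_id,
    tokenizer.bos_token_id,
    tokenizer.pad_token_id,
}

def allowed_tokens(batch_id, generated):
    # Drop the special tokens the decoder prepends; what remains is the name prefix.
    prefix = tuple(t for t in generated.tolist() if t not in special_ids)
    return sorted(trie.get(prefix, {tokenizer.eos_token_id}))

inputs = tokenizer("Who was the 44th president of the United States?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,
    prefix_allowed_tokens_fn=allowed_tokens,
    max_new_tokens=16,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Because the constraint lives entirely in the decoding loop, the memory needed at inference time is the trie over entity names rather than a dense vector index over all entities.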
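The interpretation technique of chapter 7 relies on stochastic gates that can be exactly zero or one while remaining trainable. Below is a minimal sketch of such a gate, a stretched-and-clipped Hard Concrete style relaxation, applied to a frozen model's hidden states; the module names, shapes, baseline handling, and hyperparameters are illustrative assumptions rather than the thesis's exact formulation.

import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """Stochastic gate in [0, 1] that can be exactly 0 or 1 yet stays differentiable.

    Used here to learn which hidden states of a (frozen) model can be masked out
    while its prediction is preserved.
    """

    def __init__(self, hidden_dim: int, temperature: float = 0.2,
                 stretch: tuple = (-0.1, 1.1)):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)   # one gate logit per hidden state
        self.temperature = temperature
        self.left, self.right = stretch

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim) -> gates: (batch, seq_len, 1)
        logits = self.scorer(hidden)
        if self.training:
            u = torch.rand_like(logits).clamp(1e-6, 1.0 - 1e-6)
            s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + logits) / self.temperature)
        else:
            s = torch.sigmoid(logits)
        s = s * (self.right - self.left) + self.left   # stretch beyond [0, 1]
        return s.clamp(0.0, 1.0)                       # clip, so exact 0/1 is reachable

def apply_mask(hidden: torch.Tensor, gate: HardConcreteGate,
               baseline: torch.Tensor) -> torch.Tensor:
    """Replace masked-out hidden states with a learned baseline vector."""
    z = gate(hidden)
    return z * hidden + (1.0 - z) * baseline

Training such gates to mask as much as possible while keeping the frozen model's output (nearly) unchanged yields per-layer attribution maps, which is how the analysis across network layers is obtained.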