Abstract | With the advent of the Web, textual information has grown at an explosive rate. To digest
this enormous amount of data, an automatic solution, Information Extraction (IE), has become
necessary. Information extraction is the task of converting unstructured text strings into structured,
machine-readable data. The first key step of a general IE pipeline is often to analyze the entities
mentioned in the text before drawing holistic conclusions. To fully understand each entity, one needs
to detect its mentions, categorize it into semantic types, connect it with its knowledge
base entry, and identify its attributes as well as its relationships with other entities.
In this dissertation, we first present the problem of fine-grained entity recognition. Whereas
most traditional named entity recognition systems use a small set of entity classes, e.g., person,
organization, location, or miscellaneous, we define a novel set of over one hundred fine-grained
entity types. In order to intelligently understand text and extract a wide range of information, it
is useful to more precisely determine the semantic classes of entities mentioned in unstructured
text. We formulate the recognition problem as multi-class, multi-label classification, describe an
unsupervised method for collecting training data, and present the FIGER implementation.
Next, we demonstrate that fine-grained entity types are closely connected with other entity
analysis tasks. We describe an entity linking system whose prediction heavily relies on these types
and present a simple yet effective implementation, called VINCULUM. An extensive evaluation
on nine data sets, comparing VINCULUM with two state-of-the-art systems, elucidates key aspects
of the system that include mention extraction, candidate generation, entity type prediction, entity
coreference, and coherence.
Finally, we describe an approach to acquiring commonsense knowledge from a massive amount
of text on the Web. In particular, we develop a system called SIZEITALL to extract numerical
attribute values for various classes of entities. To resolve the ambiguity of surface-form text,
we canonicalize the extractions with respect to WordNet senses and build a knowledge base of
physical sizes for thousands of entity classes.
Across all three entity analysis tasks, we show that it is feasible to build sophisticated IE
systems without a significant investment of human effort in creating sufficient labeled data. |