Tuesday, November 19, 3 pm to 5 pm
Ph.D. Committee: Drs. Krishnaprasad Thirunarayan (Advisor), Valerie L. Shalin (Psychology), Keke Chen, Guozhu Dong, Srinivasan Parthasarathy (The Ohio State University), and Steven Gustafson (noonum Inc.)
Information Extraction (IE) techniques extract entities, relationships, and other detailed information from unstructured text. The majority of methods in the literature rely on supervised machine learning, which is often impractical because annotations are costly to obtain and a high-quality gold standard (in terms of reliability and coverage) is difficult to create. Semi-supervised and distantly supervised techniques have therefore gained traction lately as a way to overcome some of these challenges, for example by bootstrapping the learning process more quickly.
This dissertation focuses on information extraction, and in particular on extracting entities, i.e., Named Entity Recognition (NER), from multiple domains, including social media and more grammatical text such as news and medical documents. This work explores ways to lower the cost of building NER pipelines with the help of available knowledge, without compromising extraction quality, while also taking into account feasibility and other concerns such as user experience. I present three kinds of NER approaches: distantly supervised (dictionary-based), supervised (with cost reduced through entity set expansion and active learning), and minimally supervised. In addition, I discuss the various aspects of my knowledge-enabled NER approaches, and how and why they are a better fit for today's real-world NER pipelines in dealing with, and partially overcoming, the difficulties mentioned above.
I present two dictionary-based NER approaches. The first extracts locations from text streams; it proved very effective for stream processing, with competitive performance in comparison with ten other techniques. The second is a generic NER approach that scales to multiple domains and is minimally supervised, with a human-in-the-loop providing online feedback. Both techniques augment and filter the dictionaries, to compensate for their incompleteness (due to lexical variation between dictionary records and mentions in the text) and to eliminate the noise and spurious content in them. The third technique I present is a supervised approach with a reduced cost. The cost reduction was achieved with the help of a human-in-the-loop and smart instance samplers implemented using entity set expansion and active learning. The use of knowledge, keeping tabs on the NER models' accuracy, and the full exploitation of inputs from the human-in-the-loop were the keys to overcoming the practical, technical, and monetary challenges. I make the data and code of the approaches presented in this dissertation publicly available.
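As a rough illustration of the augment-and-filter idea described above, the following minimal Python sketch may help. It is not the dissertation's implementation: the function names (`augment`, `filter_entries`, `extract`), the single "saint" to "st" abbreviation rule, the stopword filter, and the toy gazetteer are all assumptions made here for illustration only.

```python
import re

def augment(dictionary):
    """Add simple lexical variants of each entry (hypothetical rules)."""
    variants = set()
    for name in dictionary:
        lowered = name.lower()
        variants.add(lowered)
        # Assumed abbreviation rule: "saint louis" -> "st louis"
        variants.add(re.sub(r"\bsaint\b", "st", lowered))
        # Variant with punctuation stripped
        variants.add(re.sub(r"[^\w\s]", "", lowered))
    return variants

def filter_entries(entries, stopwords):
    """Drop noisy entries: common words and single characters."""
    return {e for e in entries if e not in stopwords and len(e) > 1}

def extract(text, entries, max_len=3):
    """Greedy longest-match lookup of token n-grams in the dictionary."""
    tokens = re.findall(r"\w+", text.lower())
    mentions = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shrink it
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in entries:
                mentions.append(candidate)
                i += n
                break
        else:
            i += 1  # no match starting at this token
    return mentions

# Toy gazetteer containing one noisy record ("Of")
gazetteer = {"Saint Louis", "Dayton", "Of"}
entries = filter_entries(augment(gazetteer), stopwords={"of"})
print(extract("Flooding north of St. Louis and Dayton", entries))
# prints ['st louis', 'dayton']
```

In this sketch, augmentation lets the abbreviated mention "St. Louis" match the record "Saint Louis", while filtering keeps the noisy record "Of" from producing a spurious match.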