An Automatic Approach for Discovering and Geocoding Locations in Domain-Specific Web Data (Application Paper)
- Resource Type
- Conference
- Authors
- Mattmann, Chris A.; Sharan, Madhav
- Source
- 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI), pp. 87-93, Jul 2016
- Subject
- Computing and Processing
Engineering Profession
General Topics for Engineers
Power, Energy and Industry Applications
Robotics and Control Systems
Metadata
Indexes
World Wide Web
Data mining
US Government
Science - general
Taxonomy
lucene geotopic polar memex apache
- Language
- Abstract
We present an automatic approach for discovering location names in web data culled from diverse domains. Our approach builds upon the Apache Tika, Apache OpenNLP, and Apache Lucene frameworks. Tika is used to extract text and metadata from any file. The text and metadata are provided to Apache OpenNLP and its location entity extraction model. The discovered location entities are then passed to a gazetteer that is derived from the Geonames.org dataset and indexed in Apache Lucene. This paper describes the overall approach and then explains in detail the challenges we faced and the methodology we employed to overcome them. We describe the evolution of our geo gazetteer process and algorithm and demonstrate the approach's accuracy on data collected in the DARPA MEMEX and NSF Polar Cyber Infrastructure efforts.
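The three stages named in the abstract (text extraction, location entity extraction, gazetteer lookup) can be illustrated with a minimal, standard-library-only Python sketch. It does not call Tika, OpenNLP, or Lucene, and it is not the authors' implementation: `extract_text`, `find_candidate_names`, `geocode`, and the in-memory `GAZETTEER` are hypothetical stand-ins, the coordinates and populations are rounded illustrative values, and choosing the most populous match is an assumed disambiguation heuristic rather than the paper's algorithm.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GazetteerEntry:
    name: str
    country: str
    latitude: float
    longitude: float
    population: int


# Hypothetical toy gazetteer keyed by lowercased name. The paper instead uses
# a Lucene index built from Geonames.org; values here are rounded and
# illustrative only.
GAZETTEER = {
    "paris": [
        GazetteerEntry("Paris", "US", 33.66, -95.56, 25_000),
        GazetteerEntry("Paris", "FR", 48.86, 2.35, 2_100_000),
    ],
    "los angeles": [
        GazetteerEntry("Los Angeles", "US", 34.05, -118.24, 3_900_000),
    ],
}


def extract_text(raw):
    """Stand-in for the extraction stage: strip markup, collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", without_tags).strip()


def find_candidate_names(text):
    """Stand-in for the entity stage: runs of capitalized words.

    A trained location model would be far more selective; this naive
    pattern over-generates and relies on the gazetteer to filter.
    """
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)


def geocode(name):
    """Stand-in for the gazetteer stage: pick the most populous match.

    Ranking by population is an assumed heuristic for ambiguous names.
    Returns None when the name is not in the gazetteer.
    """
    matches = GAZETTEER.get(name.lower(), [])
    return max(matches, key=lambda entry: entry.population, default=None)


def geocode_document(raw):
    """Run the three stages in order and keep only resolved locations."""
    text = extract_text(raw)
    resolved = []
    for name in find_candidate_names(text):
        entry = geocode(name)
        if entry is not None:
            resolved.append(entry)
    return resolved


if __name__ == "__main__":
    sample = "<p>Researchers in Paris and Los Angeles shared data.</p>"
    for entry in geocode_document(sample):
        print(entry.name, entry.country, entry.latitude, entry.longitude)
```

Keeping each stage behind its own function mirrors the division of labor described in the abstract, so any stand-in could in principle be swapped for a real extractor, entity model, or index without changing the others.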