Textizing Statistical Tables using OCR at Scale

Authors: Arimoto, Yutaka
Source: 経済研究. 73(1):15-28
Language: Japanese
ISSN: 0022-9733

Abstract

This study describes the requirements and methods for textizing statistical tables systematically and at scale using OCR (optical character recognition). A major challenge in textizing statistical tables with OCR is analyzing the table layout with high accuracy. I am developing a Python toolkit, ocrstats, which improves work efficiency by providing batch processing, automation of routine processes, use of external OCR, and table layout analysis with practical accuracy. In addition, I explain the practical know-how gained from textizing the Japan Imperial Statistical Yearbook (『日本帝国統計年鑑』) with ocrstats, as well as a method for linking variables across years when constructing panel data.
