This tutorial will cover the algorithms, design, and implementation of the open-source OCR engine known as Tesseract.
Because Tesseract was designed largely in secret, its methods are not well known, yet it remains a formidable force in OCR and continues to improve. Its layout analysis placed second in the 2009 ICDAR competition; it supports more than 60 languages, including Chinese and several Indic languages; and recent changes have allowed easy plug-in of new classifiers.
This tutorial will lay all the cards on the table, covering the following topics:
Background/history
Overall architecture
Internal data structures
Layout analysis
Character classification
How to add a new character classifier
Integration of LSTM, which brings Tesseract right up to date
Segmentation and language models
Training
Challenges of truly multilingual OCR
Live demos
Optional hands-on opportunity:
This tutorial aims to provide a hands-on experience in which you get to build and run the latest Tesseract on your own machine, to follow along with the demos, and possibly even make your own modifications! Bring along your own laptop with the following configuration to take part:
Hardware: Laptop with an external mouse (scroll wheel needed).
RAM/Disk: Most laptops under 10 years old should handle it with ease.
Operating system: Linux or Windows. No specific version needed.
C++ compiler: gcc on Linux; Visual Studio Express 2010 or MinGW on Windows.
Java runtime: Version not important.
Working WiFi or USB port for downloading software and data.
Mac users: The Tesseract demos are very hard to use without a scroll wheel. If you can emulate one somehow, bring your machine along and we will give it a try! No guarantees though.