We are ready to start a limited OCR grid test at: https://www.zooniverse.org/projects/zooniverse/oldweather-ocr/
There are 8 data pages loaded from 7 source sets, including 4 ships and 3 samples from the Indian Daily Weather Record (IDWR).
The purpose is to generate a set of test data from different kinds of sources, from De Long's idiosyncratic hand to printed material. The general approach is to separate a page into single-variable cells with x-y coordinates that map back to the high-resolution (and possibly reprocessed) images. These reference snippets will be submitted to a set of transcription engines, including handwriting-capable ones. The reconstructed page could then be presented to OW with high-confidence results blocked out, leaving only certified 'waste no-one's time' work.
Important note: when marking a grid, follow the printed lines of the table as closely as possible, even if that cuts off part of a character. The system can be set to add a regular buffer to account for dragging tails and misaligned typing.
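To make the coordinate mapping concrete, here is a minimal sketch of how grid marks could be turned into per-cell bounding boxes on the source image. All names (`cell_boxes`, `buffer_px`, `scale`) are illustrative assumptions, not the project's actual API; it simply shows the idea of following the printed lines tightly and letting the system add the buffer afterwards:

```python
def cell_boxes(col_edges, row_edges, buffer_px=0, scale=1.0):
    """Compute per-cell bounding boxes from marked grid lines.

    col_edges / row_edges: pixel positions of the table's printed lines,
    as marked on the display image, in increasing order.
    scale: factor mapping display coordinates back to the
    high-resolution (possibly reprocessed) source image.
    buffer_px: regular margin added around each cell, in source-image
    pixels, to catch dragging tails and misaligned typing.
    Returns (left, top, right, bottom) tuples, row by row.
    """
    boxes = []
    for top, bottom in zip(row_edges, row_edges[1:]):
        for left, right in zip(col_edges, col_edges[1:]):
            boxes.append((
                left * scale - buffer_px,
                top * scale - buffer_px,
                right * scale + buffer_px,
                bottom * scale + buffer_px,
            ))
    return boxes

# Two columns, one row, marked on a half-size display image:
# scale=2.0 maps back to the full-resolution scan.
snippets = cell_boxes([0, 10, 20], [0, 5], buffer_px=1, scale=2.0)
```

Because the marks hug the printed lines, the buffer can be applied uniformly at extraction time rather than guessed at by each volunteer.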
From this test we hope to learn exactly how well this approach can increase data output *from regular tables* compared to purely manual techniques. We already understand there is much that off-the-shelf transcription engines cannot do, especially with blocks of writing, but it may be possible to significantly reduce the amount of numeric material that must be transcribed manually. Perhaps this will allow OW to focus more on the descriptive and historical aspects of the logbooks too, especially if the grid task is done as part of image processing.