Author Topic: OCR limited grid test  (Read 1948 times)

Kevin

  • Old Weather Team
  • Hero Member
  • *****
  • Posts: 558
OCR limited grid test
« on: January 07, 2017, 06:22:18 pm »
We are ready to start a limited OCR grid test at https://www.zooniverse.org/projects/zooniverse/oldweather-ocr/. There are 8 data pages loaded from 7 source sets: 4 ships and 3 samples from the Indian Daily Weather Record (IDWR).

The purpose is to generate a set of test data from different kinds of sources, from De Long's idiosyncratic hand to printed material. The general approach is to separate a page into single-variable cells with x-y coordinates that map back to the high-resolution (and possibly reprocessed) images. These reference snippets will be submitted to a set of transcription engines, including handwriting-capable ones. The reconstructed page could then be shown to OW with the high-confidence results displayed but blocked out, leaving only certified 'waste no-one's time' work.
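As a rough illustration of the cell idea, here is a minimal Python sketch. It assumes the grid is marked on a downsized display image as lists of column (x) and row (y) line positions, and that a single scale factor relates the display image to the high-resolution original. The names `Cell`, `grid_cells`, and `to_full_resolution` are hypothetical and are not taken from the project's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One table cell, identified by row/column index and pixel bounds."""
    row: int
    col: int
    left: float
    top: float
    right: float
    bottom: float


def grid_cells(x_lines, y_lines):
    """Build one Cell per table cell from column (x) and row (y) line positions."""
    xs, ys = sorted(x_lines), sorted(y_lines)
    return [
        Cell(r, c, xs[c], ys[r], xs[c + 1], ys[r + 1])
        for r in range(len(ys) - 1)
        for c in range(len(xs) - 1)
    ]


def to_full_resolution(cell, scale):
    """Map a cell marked on the display image onto the high-resolution image.

    `scale` is the assumed ratio of high-resolution size to display size.
    """
    return Cell(
        cell.row,
        cell.col,
        cell.left * scale,
        cell.top * scale,
        cell.right * scale,
        cell.bottom * scale,
    )


# Three vertical lines and four horizontal lines give 2 columns x 3 rows.
cells = grid_cells(x_lines=[10, 60, 110], y_lines=[20, 40, 60, 80])
first_full = to_full_resolution(cells[0], scale=4.0)
print(len(cells))
print(first_full)
```

Each resulting cell keeps its row and column index, so a transcribed value can be placed back into the reconstructed page.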

Important note: when marking a grid, follow the printed lines of the table as closely as possible, even if that cuts off part of a character. The system can be set to add a regular buffer to account for dragging tails and misaligned typing.
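To show what a regular buffer could look like, here is a small sketch in Python. It assumes each cell is a `(left, top, right, bottom)` box in pixels and that the buffer is the same number of pixels on every side, clamped to the image edges. The function name `add_buffer` and the buffer size are illustrative assumptions, not the system's real settings.

```python
def add_buffer(box, buffer_px, image_width, image_height):
    """Grow a (left, top, right, bottom) box by buffer_px on every side.

    The result is clamped so it never extends past the image edges.
    """
    left, top, right, bottom = box
    return (
        max(0, left - buffer_px),
        max(0, top - buffer_px),
        min(image_width, right + buffer_px),
        min(image_height, bottom + buffer_px),
    )


# A cell near the left edge: the left side is clamped at 0.
padded = add_buffer((2, 50, 100, 80), buffer_px=5, image_width=400, image_height=300)
print(padded)  # (0, 45, 105, 85)
```

Because the padding is applied afterwards, the grid itself can follow the printed lines exactly, as requested above.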

From this test we hope to learn how much this approach can increase data output *from regular tables* compared to purely manual techniques. We already understand there is much that off-the-shelf transcription engines cannot do, especially with blocks of writing, but it may be possible to significantly reduce the amount of numeric material that must be transcribed manually. Perhaps this will also allow OW to focus more on the descriptive and historical aspects of the logbooks, especially if the grid task is done as part of image processing.

mapurves

  • Shipherd
  • Hero Member
  • *****
  • Posts: 1801
Re: OCR limited grid test
« Reply #1 on: January 07, 2017, 06:52:57 pm »
 8) 8) 8)

and

Hooray!!!

Randi

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 13136
Re: OCR limited grid test
« Reply #2 on: January 07, 2017, 07:26:09 pm »
It would help if we could rotate the page just a little bit ;)

Larger images would help too. Jeannette's and Pennsylvania's log pages are very small. I will do all 8 pages, but I am finding this very hard on my eyes and neck :'(
« Last Edit: January 07, 2017, 08:14:24 pm by Randi »

Craig

  • Shipherd
  • Hero Member
  • *****
  • Posts: 3360
Re: OCR limited grid test
« Reply #3 on: January 07, 2017, 07:58:18 pm »
I agree, Randi. It has to be perfectly straight to reuse a saved grid.

Randi

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 13136
Re: OCR limited grid test
« Reply #4 on: January 07, 2017, 08:01:40 pm »
It has to be perfectly straight to draw a usable grid ;)


Pommy Stuart

  • Shipherd
  • Hero Member
  • *****
  • Posts: 3593
  • A closed mouth gathers no foot.
Re: OCR limited grid test
« Reply #5 on: January 07, 2017, 08:04:27 pm »
This is my first time on this project (a break from the Albatross 1900).

I found drawing boxes from bottom right to top left worked better for me, and rows from bottom to top.
I magnify the image and slide it across when needed.

How do you adjust the height of a single row, or delete it and re-draw it, after you have saved the template?

I think they may need a micro-rotate function to rotate the page by one or two degrees, not 90 degrees.

Randi

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 13136
Re: OCR limited grid test
« Reply #6 on: January 07, 2017, 08:21:46 pm »
Quote from Pommy Stuart:
I found drawing boxes from bottom right to top left worked better for me, and rows from bottom to top.
Definitely - although I generally do heading boxes from top right to bottom left ;)
I normally do rows from bottom to top too. However, I have found that on the land stations, with no horizontal lines, it is easier to draw the rows from top to bottom (I draw in the margin outside the table area).

Quote from Pommy Stuart:
I magnify the image and slide it across when needed.
Good point :-[
« Last Edit: January 08, 2017, 02:33:18 pm by Randi »

Caro

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7336
  • Our end is Life. Put out to sea. Louis MacNeice
Re: OCR limited grid test
« Reply #7 on: January 07, 2017, 08:35:25 pm »
Page rotation needed, definitely.

Clarification: nudge left or right, I mean.  :)
« Last Edit: January 07, 2017, 08:59:46 pm by Caro »

Bob

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1303
Re: OCR limited grid test
« Reply #8 on: January 07, 2017, 09:02:22 pm »
And sized (camera zoom) identically.

Quote from Craig:
I agree, Randi. It has to be perfectly straight to reuse a saved grid.

Maybe a way to scale the saved grid when using it on a new page? Or set the image zoom to match, if that setting gets saved with the data?
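A rough sketch of the scaling idea in Python, assuming the saved grid is stored as line positions in pixels and that the new page differs from the saved one only by zoom along each axis. The name `rescale_grid` and the sizes used are hypothetical.

```python
def rescale_grid(lines, saved_size, new_size):
    """Scale saved grid line positions from one image extent to another.

    `saved_size` and `new_size` are the widths (for x lines) or heights
    (for y lines) of the saved page and the new page, in pixels.
    """
    factor = new_size / saved_size
    return [position * factor for position in lines]


# Column lines saved on a 1000 px wide page, reused on a 1500 px wide page.
saved_x = [100, 250, 400]
new_x = rescale_grid(saved_x, saved_size=1000, new_size=1500)
print(new_x)  # [150.0, 375.0, 600.0]
```

This only handles a difference in zoom. A page that is also shifted or slightly rotated would still need repositioning.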
« Last Edit: January 07, 2017, 09:07:35 pm by Bob »

Kevin

  • Old Weather Team
  • Hero Member
  • *****
  • Posts: 558
Re: OCR limited grid test
« Reply #9 on: January 07, 2017, 09:03:15 pm »
For slightly crooked pages you can re-click on a drawn row and reposition it to fit better. We definitely need a fine rotation on the page.
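For what a fine rotation might involve, here is a minimal Python sketch. It assumes the skew is estimated from two points clicked along one printed table line, and that points are then rotated about a chosen pivot. The function names are hypothetical, and this is not how the Zooniverse tool is known to work.

```python
import math


def skew_angle_degrees(p1, p2):
    """Angle of the line p1 -> p2 relative to horizontal, in degrees."""
    return math.degrees(math.atan2(p2[1] - p1[1], p2[0] - p1[0]))


def rotate_point(x, y, angle_degrees, cx, cy):
    """Rotate (x, y) about the pivot (cx, cy) by angle_degrees.

    In image coordinates (y pointing down) a positive angle appears
    clockwise on screen.
    """
    theta = math.radians(angle_degrees)
    dx, dy = x - cx, y - cy
    return (
        cx + dx * math.cos(theta) - dy * math.sin(theta),
        cy + dx * math.sin(theta) + dy * math.cos(theta),
    )


# Two points on a printed line that should be horizontal but drifts down
# by 17.5 px over 1000 px, which is roughly one degree of skew.
start, end = (0, 100), (1000, 117.5)
angle = skew_angle_degrees(start, end)

# Rotating by the negative angle about the start point levels the line.
levelled_end = rotate_point(end[0], end[1], -angle, start[0], start[1])
print(round(angle, 2), levelled_end)
```

The same rotation could be applied to grid points instead of the image, so a saved grid could follow a slightly crooked page.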

Kevin

  • Old Weather Team
  • Hero Member
  • *****
  • Posts: 558
Re: OCR limited grid test
« Reply #10 on: January 07, 2017, 09:24:36 pm »
I was able to reuse a Pennsylvania grid with minor repositioning. On a production scale it would be nice to be able to share grids among users, even if this was being done as part of the imaging process.

Randi

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 13136
Re: OCR limited grid test
« Reply #11 on: January 07, 2017, 09:40:37 pm »
On Pennsylvania, do you want us to mark the row (with 1, 2, 3, ...) between the header and 1am, and the blank row between AM and PM?

Kevin

  • Old Weather Team
  • Hero Member
  • *****
  • Posts: 558
Re: OCR limited grid test
« Reply #12 on: January 07, 2017, 09:50:04 pm »
Row numbers are not needed on WW2 logs. Only wx columns (though for this test it doesn't matter what is in the column, if three mark it up - a number is a number as far as a recognition test goes).

Randi

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 13136
Re: OCR limited grid test
« Reply #13 on: January 07, 2017, 10:17:16 pm »
I'm not quite sure if we are talking about the same thing :-\
I do not mean the column with the times. Given that hours may be skipped or repeated as time zones are crossed and extra readings are occasionally added, it may be safest to keep the times. I mean the row with the column numbers 1-21.

The example marks all the columns with course and speed information. Do you mean that we should not mark those columns?

Thanks!

Kevin

  • Old Weather Team
  • Hero Member
  • *****
  • Posts: 558
Re: OCR limited grid test
« Reply #14 on: January 07, 2017, 10:30:20 pm »
I didn't mark course and speed since we don't usually collect those. The same goes for the column numbers across the top, but the hours are definitely needed.