Old Weather Forum

Old Weather: Arctic => Interface Development => Topic started by: Kevin on January 07, 2017, 06:22:18 pm

Title: OCR limited grid test
Post by: Kevin on January 07, 2017, 06:22:18 pm
We are ready to start a limited OCR grid test at: https://www.zooniverse.org/projects/zooniverse/oldweather-ocr/  There are 8 data pages loaded from 7 source sets, including 4 ships and 3 samples from the Indian Daily Weather Record (IDWR).

The purpose is to generate a set of test data from different kinds of sources, from De Long's idiosyncratic hand to printed material. The general approach is to separate a page into single variable cells with x-y coordinates that map back to the high-resolution and possibly reprocessed images. These reference snippets will be submitted to a set of transcription engines (including handwriting capable). The reconstructed page could then be shown to OW with high confidence results shown but blocked out, leaving only certified 'waste no-one's time' work.
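As a rough illustration of the cell-splitting idea above, here is a minimal sketch that turns drawn column and row boundaries into one box per cell. The `Cell` type, the `grid_cells` function, and the pixel values are hypothetical and are not taken from the test project's code.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    row: int       # row index, counted from the top of the table
    col: int       # column index, counted from the left
    left: float    # x of the left edge, in image pixels
    top: float     # y of the top edge, in image pixels
    right: float   # x of the right edge
    bottom: float  # y of the bottom edge


def grid_cells(col_edges, row_edges):
    """Build one Cell per (row, column) pair from boundary positions."""
    xs = sorted(col_edges)
    ys = sorted(row_edges)
    cells = []
    for r in range(len(ys) - 1):
        for c in range(len(xs) - 1):
            cells.append(Cell(r, c, xs[c], ys[r], xs[c + 1], ys[r + 1]))
    return cells


# Hypothetical boundaries: 3 vertical lines and 4 horizontal lines.
cells = grid_cells(col_edges=[100, 180, 260], row_edges=[50, 80, 110, 140])
print(len(cells))  # 2 columns x 3 rows = 6 cells
print(cells[0])
```

Each cell keeps its row and column index alongside its coordinates, so a recognised value can be placed back into the reconstructed page.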

Important note: when marking a grid follow the printed lines of the table as closely as possible, even if it cuts off part of a character. The system can be set to add a regular buffer to account for dragging tails and misaligned typing.
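The regular buffer mentioned above could work along these lines: each tightly drawn box is grown by a fixed margin before the snippet is cut out, and clamped so it never leaves the image. This is only a sketch; `pad_box`, the 4-pixel buffer, and the image size are assumptions for illustration, not the system's actual settings.

```python
def pad_box(box, buffer_px, image_width, image_height):
    """Grow a (left, top, right, bottom) box by buffer_px on every side,
    keeping the result inside the image bounds."""
    left, top, right, bottom = box
    return (
        max(0, left - buffer_px),
        max(0, top - buffer_px),
        min(image_width, right + buffer_px),
        min(image_height, bottom + buffer_px),
    )


# A box drawn exactly on the printed lines, then padded by 4 pixels.
padded = pad_box((100, 50, 180, 80), buffer_px=4,
                 image_width=1200, image_height=1600)
print(padded)  # (96, 46, 184, 84)

# A box touching the page corner is clamped rather than going negative.
edge = pad_box((0, 0, 60, 30), 4, 1200, 1600)
print(edge)  # (0, 0, 64, 34)
```

Because the padding is applied uniformly afterwards, boxes drawn tightly on the printed lines give a consistent starting point for every cell.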

From this test we hope to learn exactly how well this approach will be able to increase data output *from regular tables* compared to purely manual techniques. We already understand there is much that state-of-the-shelf transcription engines cannot do, especially with blocks of writing, but it may be possible to significantly reduce the amount of numeric material that must be transcribed manually. Perhaps this will allow OW to focus more on the descriptive and historical aspects of the logbooks too, especially if the grid task is done as part of image processing.
Title: Re: OCR limited grid test
Post by: mapurves on January 07, 2017, 06:52:57 pm
 8) 8) 8)

and

Hooray!!!
Title: Re: OCR limited grid test
Post by: Randi on January 07, 2017, 07:26:09 pm
It would help if we could rotate the page just a little bit ;)

Larger images would help too. Jeannette's and Pennsylvania's log pages are very small. I will do all 8 pages, but I am finding this very hard on my eyes and neck :'(
Title: Re: OCR limited grid test
Post by: Craig on January 07, 2017, 07:58:18 pm
I agree, Randi. It has to be perfectly straight to reuse a saved grid.
Title: Re: OCR limited grid test
Post by: Randi on January 07, 2017, 08:01:40 pm
It has to be perfectly straight to draw a usable grid ;)

Title: Re: OCR limited grid test
Post by: Pommy Stuart on January 07, 2017, 08:04:27 pm
My first time on this project (a break from the Albatross 1900)

I found drawing boxes from bottom right to top left worked better for me, and rows from bottom to top.
I magnify the image and slide it across when needed.

How do you adjust the height of a single row or delete it and re-draw after you have saved the template?

I think they may need a micro rotate function to rotate the page by one or two degrees not 90 degrees
Title: Re: OCR limited grid test
Post by: Randi on January 07, 2017, 08:21:46 pm
I found drawing boxes from bottom right to top left worked better for me, and rows from bottom to top.
Definitely - although I generally do heading boxes from top right to bottom left ;)
I normally do rows from bottom to top too. However, I have found that on the land stations, with no horizontal lines, it is easier to draw the rows from top to bottom (I draw in the margin outside the table area).

I magnify the image and slide it across when needed.
Good point :-[
Title: Re: OCR limited grid test
Post by: Caro on January 07, 2017, 08:35:25 pm
Page rotation needed, definitely.

Clarification: nudge left or right, I mean.  :)
Title: Re: OCR limited grid test
Post by: Bob on January 07, 2017, 09:02:22 pm
And sized (camera zoom) identically.

I agree, Randi. It has to be perfectly straight to reuse a saved grid.

Maybe a way to scale the saved grid when using it on a new page? Or set the image zoom to match, if that setting gets saved with the data?
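One way to picture the scaling idea: if the saved grid is stored with the size of the image it was drawn on, each box can be multiplied by the ratio of the new image size to the old one. This is a hypothetical sketch (`scale_grid` and the sizes are made up), and it assumes the table sits in the same relative position on both pages; a different camera framing would also need an offset.

```python
def scale_grid(boxes, saved_size, new_size):
    """Rescale (left, top, right, bottom) boxes from one image size to another.

    saved_size and new_size are (width, height) tuples in pixels.
    """
    sx = new_size[0] / saved_size[0]
    sy = new_size[1] / saved_size[1]
    return [(l * sx, t * sy, r * sx, b * sy) for (l, t, r, b) in boxes]


# A grid saved on a 1000 x 1500 image, reused on a 2000 x 3000 image.
saved = [(100, 50, 180, 80)]
scaled = scale_grid(saved, saved_size=(1000, 1500), new_size=(2000, 3000))
print(scaled)  # [(200.0, 100.0, 360.0, 160.0)]
```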
Title: Re: OCR limited grid test
Post by: Kevin on January 07, 2017, 09:03:15 pm
For slightly crooked pages you can re-click on a drawn row, and re-position it to fit better. Definitely need a fine rotation on the page.
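To show why a fine rotation matters, the sketch below estimates how far a printed row drifts vertically across the page for a small skew, and rotates a grid point about a centre by a small angle. The function names and numbers are hypothetical; this is plain geometry, not the project's rotation code.

```python
import math


def row_drift(page_width_px, skew_degrees):
    """Vertical drift of a printed line across the page for a given skew."""
    return page_width_px * math.tan(math.radians(skew_degrees))


def rotate_point(x, y, cx, cy, degrees):
    """Rotate (x, y) about (cx, cy). With image coordinates (y pointing
    down), a positive angle appears clockwise on screen."""
    theta = math.radians(degrees)
    dx, dy = x - cx, y - cy
    return (
        cx + dx * math.cos(theta) - dy * math.sin(theta),
        cy + dx * math.sin(theta) + dy * math.cos(theta),
    )


drift = row_drift(page_width_px=1000, skew_degrees=1.0)
print(round(drift, 1))

# Rotate the right-hand end of a row by one degree about the page centre.
corner = rotate_point(1000, 750, cx=500, cy=750, degrees=1.0)
print(corner)
```

On a 1000-pixel-wide page, a one-degree skew moves a row by roughly 17 pixels from one side to the other, which can easily be more than half a row height on a small image.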
Title: Re: OCR limited grid test
Post by: Kevin on January 07, 2017, 09:24:36 pm
I was able to reuse a Pennsylvania grid with minor repositioning. On a production scale it would be nice to be able to share grids among users, even if this was being done as part of the imaging process.
Title: Re: OCR limited grid test
Post by: Randi on January 07, 2017, 09:40:37 pm
On Pennsylvania, do you want the row (with 1, 2, 3, ...) between the header and 1am and the blank row between AM and PM marked?
Title: Re: OCR limited grid test
Post by: Kevin on January 07, 2017, 09:50:04 pm
Row numbers are not needed on WW2 logs. Only wx columns (though for this test it doesn't matter what is in the column, if three mark it up - a number is a number as far as a recognition test goes).
Title: Re: OCR limited grid test
Post by: Randi on January 07, 2017, 10:17:16 pm
I'm not quite sure if we are talking about the same thing :-\
I do not mean the column with the times. Given that hours may be skipped or repeated as time zones are crossed and extra readings are occasionally added, it may be safest to keep the times. I mean the row with the column numbers 1-21.

The example marks all the columns with course and speed information. Do you mean that we should not mark those columns?

Thanks!
Title: Re: OCR limited grid test
Post by: Kevin on January 07, 2017, 10:30:20 pm
I didn't mark course and speed since we don't usually collect that. The column numbers across the top likewise, but the hours are for sure needed.
Title: Re: OCR limited grid test
Post by: jil on January 07, 2017, 10:38:32 pm
Although it's useful to have the example grid and all the instructions next to the page you're working on, it does make the image very small. Would it be possible to have these in a separate window that you could refer to when needed and get rid of once you know what you're doing, so that the page image can be larger?
Title: Re: OCR limited grid test
Post by: Randi on January 07, 2017, 10:47:22 pm
An excellent idea, jil!

Perhaps an example grid for each log type ;)
(If I understand correctly, the current one is misleading.)
Title: Re: OCR limited grid test
Post by: Bob on January 07, 2017, 10:51:53 pm
They could use the slide-out Field Guide tab.  ;)
Title: Re: OCR limited grid test
Post by: Pommy Stuart on January 07, 2017, 11:16:45 pm
I didn't mark course and speed since we don't usually collect that. The column numbers across the top likewise, but the hours are for sure needed.

I agree with you, Kevin, BUT the instructions are as below.

Draw boxes around each bottom row header cell, as shown above. Remember to include the first column. Click done.

This covers all but can confuse old users who know what is collected.

Just a thought.
Title: Re: OCR limited grid test
Post by: Randi on January 08, 2017, 12:03:42 am
It confuses me, but that doesn't take much ::)
Title: Re: OCR limited grid test
Post by: Pommy Stuart on January 08, 2017, 12:10:31 am
They could use the slide-out Field Guide tab.  ;)

What is and where is the 'slide-out Field Guide tab'?

I have only got as far as the first page.
Title: Re: OCR limited grid test
Post by: Kevin on January 08, 2017, 12:15:27 am
Yes, for an operational system we would have more specific instructions (and probably I should follow the ones that happen to be there : ) ). All comments will be passed along to the developer. Thanks to all for contributing.
Title: Re: OCR limited grid test
Post by: Bob on January 08, 2017, 12:18:27 am
Oh, sorry, it's a feature available in the Zooniverse Project Builder. It's a tab that slides out from the right side of the screen for containing help and other info. It's not used here...  ;)

What is and where is the 'slide-out Field Guide tab'?
Title: Re: OCR limited grid test
Post by: Pommy Stuart on January 08, 2017, 12:23:48 am
Yes, for an operational system we would have more specific instructions (and probably I should follow the ones that happen to be there : ) ). All comments will be passed along to the developer. Thanks to all for contributing.

What do I do for the Lat/Long readings and do you want all three when available?

Title: Re: OCR limited grid test
Post by: Randi on January 08, 2017, 12:47:27 am
What is and where is the 'slide-out Field Guide tab'?

If this page (https://www.zooniverse.org/projects/zooniverse/gravity-spy/classify) comes up correctly for you, you can see it on the far right. You may need to log in.



I don't think we are worrying about the Lat/Long readings for now.
Title: Re: OCR limited grid test
Post by: leelaht on January 08, 2017, 12:50:24 am
It took a few tries, but I think I got the system figured out. At this stage is it just drawing the grids? I didn't see where to fill in the data... to see how well the grids were drawn.
Title: Re: OCR limited grid test
Post by: leelaht on January 08, 2017, 12:52:09 am
I like to draw the rows generously to account for sloppy placement, but don't know if that approach will work in practice.
Title: Re: OCR limited grid test
Post by: Randi on January 08, 2017, 01:01:40 am
I like to draw the rows generously to account for sloppy placement, but don't know if that approach will work in practice.

That is what I normally do too, but I'm not quite sure that is what they want here :-\
Important note: when marking a grid follow the printed lines of the table as closely as possible, even if it cuts off part of a character. The system can be set to add a regular buffer to account for dragging tails and misaligned typing.
Of course, when the printed logbook lines are not vertical or horizontal, it's pretty hard to follow them with lines that are :'(
Title: Re: OCR limited grid test
Post by: Kevin on January 08, 2017, 01:34:17 am
If the page is crooked try to adjust the boxes as best as you can with the tools available. We do not need anything other than the tabled data at this point (no dates, lat/long etc). The purpose is to test the grid method with respect to the OCR engines. (There are only 8 x 7 pages to do).
Title: Re: OCR limited grid test
Post by: Randi on January 08, 2017, 01:35:34 am
Is overlapping the boxes OK?
Title: Re: OCR limited grid test
Post by: Pommy Stuart on January 08, 2017, 03:19:37 am
Where are the data pages to test the grids?

Title: Re: OCR limited grid test
Post by: Bob on January 08, 2017, 03:22:20 am
I think I might be done, all the pages I get now have the 'Already Seen' tag on them.  :D
Title: Re: OCR limited grid test
Post by: Pommy Stuart on January 08, 2017, 03:30:17 am
I think I might be done, all the pages I get now have the 'Already Seen' tag on them.  :D

Hi bob.
Where did you find those pages so I can see if I get the same message.
Title: Re: OCR limited grid test
Post by: Bob on January 08, 2017, 03:45:15 am
Hi, Stuart -

I'm pretty sure we're only drawing grids here. Are you logged into Zooniverse when you're working the pages? Look for a red 'banner' in the upper left corner of the image saying 'Already Seen'. The way the workflow is set up here you'll need to reload the page to cycle past one you've already done.

Hi bob.
Where did you find those pages so I can see if I get the same message.
Title: Re: OCR limited grid test
Post by: Randi on January 08, 2017, 03:45:38 am
It definitely gets easier with practice! ;D
Title: Re: OCR limited grid test
Post by: Bob on January 08, 2017, 03:46:59 am
True that. I was able to use saved grids on about half the pages.  :D

It definitely gets easier with practice! ;D
Title: Re: OCR limited grid test
Post by: Pommy Stuart on January 08, 2017, 03:59:25 am
Hi, Stuart -

I'm pretty sure we're only drawing grids here. Are you logged into Zooniverse when you're working the pages? Look for a red 'banner' in the upper left corner of the image saying 'Already Seen'. The way the workflow is set up here you'll need to reload the page to cycle past one you've already done.

Hi bob.
Where did you find those pages so I can see if I get the same message.

This is the page I drew the grids on.
https://www.zooniverse.org/projects/zooniverse/oldweather-ocr/classify
Title: Re: OCR limited grid test
Post by: Randi on January 08, 2017, 05:58:49 pm
As far as I can tell, when you select a saved grid nothing happens until you click on the image :-\

It's working quite well for Farragut - so far ;) ;D

It worked for Burma too - although adding the two extra rows required in one case was a bit tricky.
Title: Re: OCR limited grid test
Post by: Bob on January 08, 2017, 08:03:54 pm
As far as I can tell, when you select a saved grid nothing happens until you click on the image :-\

It says that on the page, but needs to be highlighted somehow I think...  ;)

"Or, if you have a grid template saved from a past annotation, click on the image to reuse it."
Title: Re: OCR limited grid test
Post by: Kevin on January 08, 2017, 08:56:19 pm
Is overlapping the boxes OK?

Overlapping boxes should be fine --- and if not we'll find out about that too!
Title: Re: OCR limited grid test
Post by: Pommy Stuart on January 08, 2017, 09:07:20 pm
I clicked on the image and the green boxes came up.
Clicked the Annotate arrow and tried clicking a box. Nothing Happened.

Going away for a week's caravanning, will pick this up when I get back.
Title: Re: OCR limited grid test
Post by: Kevin on January 12, 2017, 02:59:34 am
We're about half way there on the grid test. I'll go for a complete 40 grids.
Title: Re: OCR limited grid test
Post by: Randi on January 12, 2017, 04:00:04 am
I'm still working on it.
Title: Re: OCR limited grid test
Post by: mapurves on January 12, 2017, 04:54:17 am
I did eight or ten - didn't count - and got a message that the page had already been seen, which I thought meant that I was done. If my assumption was incorrect, please let me know and I'll do some more.
Title: Re: OCR limited grid test
Post by: Bob on January 12, 2017, 12:37:30 pm
There's more, I think. You have to reload the page to get a new image here (no ability to skip a page?). When all you get is 'already seen' images, then you're done. There will be a 'retirement' count for each image that's set in the project builder, don't know what that is for this demo.

I did eight or ten - didn't count - and got a message that the page had already been seen, which I thought meant that I was done. If my assumption was incorrect, please let me know and I'll do some more.
Title: Re: OCR limited grid test
Post by: mapurves on January 12, 2017, 06:49:52 pm
I did my morning chores and went to add to the OCR test. This is what I got, after logging in:

Quote
Great work! Looks like this project is out of data at the moment!
Title: Re: OCR limited grid test
Post by: Kevin on January 12, 2017, 06:52:31 pm
I think the retirement is set to 4.
Title: Re: OCR limited grid test
Post by: Randi on January 12, 2017, 06:57:33 pm
I just did a Farragut page (9 Jan 1942) and was presented with a land station page starting with Ceylon.
When I clicked on OLDWEATHER OCR, I got the main screen with the message "Great work! Looks like this project is out of data at the moment!".
If I click on Get Started, I go back to the Ceylon page.
Title: Re: OCR limited grid test
Post by: Hanibal94 on January 12, 2017, 07:08:27 pm
The site statistics say:

Retirement limit: 3
Images retired: 23 / 20
Classifications: 147 / 60
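The 60 in the classifications line is consistent with 20 images at a retirement limit of 3, assuming each image is retired once it has been classified that many times. The sketch below only spells out that arithmetic; the function names are hypothetical and it does not reproduce Zooniverse's actual retirement logic.

```python
def classifications_needed(n_images, retirement_limit):
    """Total classifications required to retire every image."""
    return n_images * retirement_limit


def is_retired(classification_count, retirement_limit):
    """An image counts as retired once it reaches the limit (assumed rule)."""
    return classification_count >= retirement_limit


needed = classifications_needed(n_images=20, retirement_limit=3)
print(needed)  # 60
print(is_retired(2, 3), is_retired(3, 3))
```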
Title: Re: OCR limited grid test
Post by: Kevin on January 15, 2017, 04:31:24 pm
I'll be in touch with Laura about the next step this week. Thanks everyone for pitching in, and I'll let you know if there was a glitch in the closeout. Also, for your information, we hope the capabilities we are developing will be transferable to any project working on tabular data. How best to use them may include grid or computer vision analysis as part of the initial processing stage rather than as a citizen-science activity.
Title: Re: OCR limited grid test
Post by: mapurves on January 15, 2017, 04:40:38 pm
How much severance pay will we be getting when the computer takes over and we're all laid off?  ;D ;D ;D
Title: Re: OCR limited grid test
Post by: Bob on January 15, 2017, 04:46:18 pm
I heard it would be at least 1.5 times our current rate.  ;)

How much severance pay will we be getting when the computer takes over and we're all laid off?  ;D ;D ;D
Title: Re: OCR limited grid test
Post by: Randi on January 15, 2017, 09:30:54 pm
Thanks for the update, Kevin!



I sure hope that that doesn't kick me into a higher tax bracket ;D
Title: Re: OCR limited grid test
Post by: Kevin on January 15, 2017, 11:13:30 pm
Well, for those concerned with getting laid off I can assure you that won't be happening. Testing so far suggests there will be an 'eyes only' requirement - you just won't be asked to transcribe the fraction of numbers that can be OCR'd. As a best-case guess, 10-40% will require review and correction. For logbooks the need will also remain for data and information on the remarks page. Hopefully the system will be good enough to allow bulk processing of pure data tables like the Indian Daily (IDWR) examples in the test.
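The split between cells that can be accepted automatically and cells that need human eyes could be pictured as a simple confidence threshold. The sketch below is hypothetical: `triage`, the 0.9 threshold, and the sample values are invented for illustration and are not results from the test.

```python
def triage(results, threshold=0.9):
    """Split (cell_id, text, confidence) results into auto-accepted cells
    and cells that still need human review."""
    accepted, review = [], []
    for cell_id, text, confidence in results:
        target = accepted if confidence >= threshold else review
        target.append((cell_id, text))
    return accepted, review


# Made-up recogniser output for four cells.
sample = [
    ("r0c0", "29.92", 0.97),
    ("r0c1", "7l", 0.41),
    ("r1c0", "30.01", 0.93),
    ("r1c1", "68", 0.88),
]
accepted, review = triage(sample)
review_fraction = len(review) / len(sample)
print(accepted)
print(review)
print(review_fraction)
```

Raising the threshold sends more cells to review and lowers the risk of accepting a wrong value, so the review fraction depends on how much error is tolerable.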

FYI, we are about to start writing a 3-year proposal to image the remainder of the pre-WW2 logs in the US National Archives. Not counting the many we've already done that's 10,427 volumes in 118-A and 7,119 boxes in 118G-A..Z. Probably we will look at prioritizing early 20th c. typed material and the civil war era. For the latter Mark M is interested in also imaging a related collection of oversize muster rolls, which should open up new opportunities for historical scholarship - especially on the lives of the ordinary sailor. Up to now a researcher would have to visit A1 to work with these records but hopefully they'll be online if we are successful.
Title: Re: OCR limited grid test
Post by: AvastMH on January 17, 2017, 07:06:36 pm
Well, for those concerned with getting laid off I can assure you that won't be happening.

All holidays cancelled then  :'( :'( :'( ( ;) ;D )
Title: Re: OCR limited grid test
Post by: Kevin on April 08, 2017, 10:14:39 pm
To keep everyone up to date: we have now tested several OCR and script recognition systems and have found that the current state of the shelf does not produce reliable enough results for us. So far auto-recognition doesn't increase the efficiency of data conversion because the manual correction component is high and too variable. However, we did learn some important things about what is possible and worth developing, and we will continue to work on these. There will be no layoffs this year.
Title: Re: OCR limited grid test
Post by: Craig on April 08, 2017, 10:49:44 pm
I guess that's not too surprising, Kevin, but as you say, it was worth a try. I suppose using neural network algorithms is too expensive?
Title: Re: OCR limited grid test
Post by: Randi on April 08, 2017, 11:05:27 pm
There will be no layoffs this year.

Whew!
Title: Re: OCR limited grid test
Post by: AvastMH on April 08, 2017, 11:45:34 pm
There will be no layoffs this year.

Whew!

I'll second Randi!
Title: Re: OCR limited grid test
Post by: mapurves on April 09, 2017, 12:16:49 am
There will be no layoffs this year.

Great relief here, although the pain of layoffs can be relieved somewhat by a generous separation bonus...  ;D  (I'm thinking something along the lines of Goldman Sachs and other such organizations...)  ;)
Title: Re: OCR limited grid test
Post by: Craig on April 09, 2017, 01:32:32 am
Will OW stock options satisfy you?  ;D
Title: Re: OCR limited grid test
Post by: mapurves on April 09, 2017, 01:51:48 am
Will OW stock options satisfy you?  ;D

You betcha!  ;D ;D ;D ;D

Speaking of whom, I can see Russia from the top of the mast!   ;)