Since 2015, 827 CHNC users have corrected over 5,521,820 lines of text in the Colorado Historic Newspapers Collection! That’s an impressive number, and with that in mind we have recently been reviewing our Text Correcting FAQs. We thought it was a good time to share some new tips we have learned over the last year. First, let’s go over some OCR basics.
How is the text of the historic newspapers searchable in the CHNC?
The searchable text and titles in this collection have been automatically generated using Optical Character Recognition (OCR) software. OCR is a process by which software reads a page image and translates it into a text file by recognizing the shapes of the letters. OCR enables searching of large quantities of full-text data, but it is never 100% accurate. The level of accuracy depends on the print quality of the original issue, its condition at the time of microfilming, the level of detail captured by the microfilm scanner, and the quality of the OCR software.
Why is text correcting so important?
OCR technology alone can only do so much. Newspaper issues with poor-quality paper, small print, mixed fonts, multiple-column layouts, or damaged pages may have poor OCR accuracy. Because CHNC searches only the translated text, an incorrect translation means relevant articles can be missing from your search results.
For example, “Oak Creek” may be translated into text as “Oak Crook.” A mistake like this in the OCR-created text can only be corrected manually. It’s this human intervention that makes the collection even more valuable to the tens of thousands of students, genealogists, and researchers who use it every month. Text correcting helps improve your searching and the searching of every other CHNC user.
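To make this concrete, here is a minimal Python sketch of why an OCR mistake hides an article from a search. It is an illustration only, not the CHNC search engine: the search_lines function and the sample sentence are made up for this example, and we assume a simple case-insensitive text match.

```python
def search_lines(lines, term):
    """Return the indexes of lines that contain the term (case-insensitive)."""
    needle = term.lower()
    return [i for i, line in enumerate(lines) if needle in line.lower()]


# Hypothetical sample text: the same line before and after a manual correction.
ocr_text = ["A new mine opened near Oak Crook last week."]
corrected_text = ["A new mine opened near Oak Creek last week."]

# The search only sees the stored text, so the OCR version finds nothing.
ocr_hits = search_lines(ocr_text, "Oak Creek")
corrected_hits = search_lines(corrected_text, "Oak Creek")

print("Hits in OCR text:", ocr_hits)
print("Hits in corrected text:", corrected_hits)
```

Under these assumptions, the search for “Oak Creek” only finds the line once a corrector has fixed it.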
Can anyone correct the OCR errors in the newspaper text?
Yes, anyone can correct text. All you have to do is register for a free user account.
How do I correct text?
The text correction functionality is accessed through the “Correct this text” link on the left menu when viewing the individual article. You must be logged into your personal account in order to edit/correct text. This view is split into two parts: the right side shows the page images that make up the document, and the left side is used for editing the lines of text.
When you move your mouse over the page images in the right pane, the blocks making up the pages will highlight. You can scroll this view by dragging with the mouse, or zoom in/out using the buttons above the viewer. Clicking a highlighted block will select it and load a form for editing that block into the left pane.
Correct the text line by line. A red box is displayed in the right pane to help you determine what text should be included in the line. Once you have finished correcting text, click “Save”. The changes you make will take effect immediately. Alternatively, clicking the “Cancel” button will discard any unsaved changes you have made.
You can then make further corrections to the same block, move onto the next block by clicking the “Next” button, select another block in the right pane, or exit the text correction view by clicking the “Return to viewing mode” link. Clicking “Save & exit” instead of “Save” will save the changes and then return you to the normal viewing mode automatically.
Frequently Asked OCR Questions
Where do I start?
Many correctors start with newspapers from their chosen area – either they live there or their ancestors lived there. However, some correctors don’t base their corrections on a geographic area but rather on a subject they are researching – so they will correct articles on that subject regardless of which newspaper title the article appears in. All users are welcome to correct in whatever way they would like. However, if you are looking for titles in need, any recent addition to CHNC (see our home page – New News) will likely need some amount of correcting. Those titles will always be a good place to start.
The CHNC database also includes statistics on articles, pages and issues that are mostly complete. This information can help guide your corrections as well as help us complete the corrections.
How do I know what has already been corrected?
In many cases it’s fairly obvious that the text has not been corrected. In the example below, the OCR-transcribed text is gibberish due to a poor microfilming process. Even with good OCR-transcribed text, if you see random characters in the text (such as / | /), it likely hasn’t been corrected, since most correctors will remove those characters.
In the example of the Aurora Democrat below, one can tell with just a cursory look that the transcribed text looks pretty good. Also, notice above the article title the list of “Contributors.” This means those 3 users have done some amount of correcting in the article.
When you are in text correcting mode, you will see a checkbox that notes whether the block of text has been completely corrected or not. Currently this information only displays when you are in text correcting mode.
Should I correct all the text of a given paragraph by typing the text into one line of the text correcting interface? It is easier for me if I do it this way.
We understand that it may seem time-consuming to correct the text line by line. However, there is a reason the system asks you to do it this way. Correcting the text line by line ensures that the search results highlight the correct line of text.
For example – Correcting line by line and the resulting search results.
Example – Typing all of the corrected text into one line and the resulting search results. All the corrected text of the two paragraphs has been placed in the 3rd line of the text correcting window.
The system then believes the requested search term “scalp almost torn” is in the 3rd line of the first paragraph rather than in the 2nd to last line of the second paragraph. This results in an error in the highlighted search results and causes user confusion.
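The effect can be sketched in a few lines of Python. This is a simplified model under our own assumptions, not the actual CHNC software: we assume each line in the correcting window is tied to one line box on the page image, and that the highlight goes to the first line whose text contains the search phrase. The find_highlight_line function and the sample article text are hypothetical.

```python
def find_highlight_line(block_lines, phrase):
    """Return the index of the first line containing the phrase, or None."""
    needle = phrase.lower()
    for index, text in enumerate(block_lines):
        if needle in text.lower():
            return index
    return None


# Hypothetical article text, corrected line by line (one entry per printed line).
line_by_line = [
    "The runaway team dragged the driver",
    "for nearly a block before bystanders",
    "could stop the horses. He was found",
    "with his scalp almost torn from his",
    "head and was carried to the doctor.",
]

# The same text typed entirely into the 3rd line; the other lines are left empty.
all_in_one = ["", "", " ".join(line_by_line), "", ""]

correct_line = find_highlight_line(line_by_line, "scalp almost torn")
wrong_line = find_highlight_line(all_in_one, "scalp almost torn")

print("Line by line, highlighted line index:", correct_line)
print("All in one line, highlighted line index:", wrong_line)
```

In this sketch the line-by-line version points at the printed line that really contains the phrase, while the all-in-one version points at the 3rd line, where the phrase does not appear on the page.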
Should I correct misspelled words?
It is a bit of a judgement call. Generally we encourage users to correct misspellings because the system retrieves articles by searching the computer-generated text. For example, if the town Uravan is misspelled in the original newspaper as “Urvana,” it may affect users’ search results. Most people are going to search for Uravan as “Uravan,” so their search would not retrieve the article on “Urvana.”
Should I correct an incomplete word or word connected by hyphen, for example “depart-ment”?
If a word is split across two lines with a hyphen, the system searches it by removing the hyphen and joining the two parts together, giving the expected “department”. This only happens if the hyphen is at the end of the line, as in “depart-”, so you do not want to remove the hyphen in these instances. If the word is split between two lines without a hyphen, please correct it by joining the two parts together.
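Here is a rough Python sketch of the joining behavior described above. It is our own simplified assumption of how such a rule could work, not the code CHNC runs, and the searchable_words function and sample lines are hypothetical.

```python
def searchable_words(lines):
    """Build a word list, joining a word that ends a line with a hyphen
    to the first word of the next line."""
    words = []
    carry = ""  # first half of a hyphenated word waiting for its second half
    for line in lines:
        tokens = line.split()
        if carry and tokens:
            tokens[0] = carry + tokens[0]
            carry = ""
        # Only a hyphen at the very end of the line triggers the join.
        if tokens and len(tokens[-1]) > 1 and tokens[-1].endswith("-"):
            carry = tokens.pop()[:-1]
        words.extend(tokens)
    if carry:
        words.append(carry)
    return words


with_hyphen = searchable_words(["the fire depart-", "ment responded"])
without_hyphen = searchable_words(["the fire depart", "ment responded"])

print(with_hyphen)
print(without_hyphen)
```

In this sketch, keeping the end-of-line hyphen lets “department” be found, while the version without the hyphen leaves two fragments, which is why a word split without a hyphen should be joined by hand.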
Do I have to correct all the blank spaces and miscellaneous punctuation and symbols?
You can fix those if you want, but those issues do not affect searching (except in question #2), so it is not necessary. Some users like to clean up those types of OCR mistakes if only for the sake of appearance. Plus, it’s sometimes easy enough to do while you’re correcting other mistakes.
Is it important to “exit” periodically? I have been saving every five minutes, and returning to the site nearly daily to find myself still logged in and ready to go.
No, you do not have to exit and log out of your account. You do have to save your progress periodically, but the system will remind you to save every five minutes.
If I absolutely cannot read a line or several words, what should I do? Erase the garbled OCR and leave a blank space, leave the garbled OCR, or what?
I would just leave the garble. Another user might come along and be able to read it.
Will I ever become a Top Journeyman Editor?
One can only hope! The system keeps a tally of the words that you have corrected. The more lines you correct the higher you will rise on the ladder to Top Journeyman Editor. Our current Top Journeyman Editors have collectively corrected over 2 million lines of text! Regardless of your ranking, always remember that it’s this human intervention that makes the collection even more valuable to the tens of thousands of students, genealogists, and researchers who use it every month.
Text correcting is a wonderful way to get involved in CHNC and volunteer. Best of all, it’s free to sign up and you can do it from your home.