Reducing defects means having the tools for the job and automating what can and should be automated, as long as no machine process introduces or misses defects. Many documents and books have regions that are not required in the output or that, for some special reason, need zone controls. Content that is not required includes headers and footers, perhaps images that contain extensive text, advertisements in magazines and similar. It is easier to identify content to be eliminated from a picture of the page than from a cloud of OCR'ed text. Examples that need zone-controlled OCR extraction are academic articles and magazines with columns, indexes, floated figures, tables and sidebars; self-help books with margin notes and strange text layouts; and images containing extensive text. Save hours by zoning the content to avoid an OCR content mash-up.
The IGP:OCR-CG Latin network-linked desktop application allows images to be zoned before they go through OCR. The editor can place exclusion lines for headers and footers, plus zone instructions for the assembly processor. This set of pixel-accurate classification OCR zones is then saved and used by the OCR post-processor when creating the final output XHTML for proofing.
For a simple novel it is only a matter of setting the header and footer exclusion lines once and applying them to the whole document.
This is the IGP:OCR-CG Latin desktop application in FixIt mode showing region lines. The two horizontal purple lines are the header and footer boundaries. Content above the header line and below the footer line will not come out in the final OCR'ed text. The blue region lines in the table ensure it can be assembled correctly into cells after OCR. These markers are stored as metrics and used to modify the XML area map. OCR solutions give access to an X(H)ML document with the X position, Y position, height and pixel layout metrics of every character, word and line, plus font style, weight and family properties.
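The header and footer exclusion lines described above can be thought of as a simple filter on the Y position of each OCR'ed word. Below is a minimal Python sketch of that idea. It is not the IGP:OCR-CG implementation: the `<page>`/`<word>` XML layout, the attribute names `x`, `y` and `h`, the function name `body_words` and the sample coordinates are all assumptions made for illustration, and Y is assumed to be measured in pixels from the top of the page.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR output: one <word> per recognised word, with pixel metrics.
# The element and attribute names are assumptions, not a real OCR schema.
SAMPLE_PAGE_XML = """
<page>
  <word x="300" y="20" h="14">CHAPTER</word>
  <word x="380" y="20" h="14">ONE</word>
  <word x="100" y="120" h="16">It</word>
  <word x="130" y="120" h="16">was</word>
  <word x="180" y="120" h="16">late.</word>
  <word x="400" y="960" h="12">17</word>
</page>
"""


def body_words(page_xml: str, header_y: float, footer_y: float) -> list[str]:
    """Return the words lying fully between the header and footer lines.

    header_y and footer_y are the Y positions (pixels from the top of the
    page) of the two exclusion lines an editor would set while zoning.
    """
    root = ET.fromstring(page_xml)
    kept = []
    for word in root.iter("word"):
        top = float(word.get("y"))
        bottom = top + float(word.get("h"))
        # Keep a word only if its whole box sits inside the body zone.
        if top >= header_y and bottom <= footer_y:
            kept.append(word.text)
    return kept


# Running header ("CHAPTER ONE") and page number ("17") are excluded.
print(" ".join(body_words(SAMPLE_PAGE_XML, header_y=60, footer_y=940)))
```

For a simple novel the same two values could be reused for every page, which is the "set once, apply to the whole document" case; pages with columns, tables or sidebars would need additional zone rules that this sketch does not attempt.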
Rotation can introduce (barely discernible) disruptions in text lines in the middle of a page. The rule is that OCR never gets better than the quality of the inputs.
From time to time clients say "we can provide scans and they are really good". They are inevitably 300 dpi JPG scans with gutter distortion and all sorts of left-right page rotation problems. These types of inputs can increase OCR defects tremendously. JPG noise especially can have a catastrophic effect, particularly on diacritics and punctuation, but also on other letters. In 2000, with an OCR and proofing department of 500+ innocent young proofers feeding the tagging and format teams, it was important to give them a chance to do good work and be proud of the output quality generated. In digital content production quality-speak, everything is a value-addition process except proofing of OCR text, which is a defect-reduction process.
The OCR Problem
Brilliant OCR technology sucks! No matter how good it is claimed to be, the quirks of typography and fonts laugh at it and make amazing OCR technology look mediocre. Perhaps you have to see tens of thousands of pages across thousands of books to really understand the murkiness of OCR technology. However, given the analogue-world problem it is solving, it is probably as good as it will get. When the minimum proofing quality standard is 99.995% (5 defects per 100,000 characters) there is not a lot of room for error. Depending on a lot of things, native OCR accuracy can be anywhere from 95% to 99.9% (OCR manufacturers will claim 99.99% when the Moon is in the seventh house and Jupiter aligns with Mars). That is still miles away from the required minimum quality measurement (and even that has conditions). The characteristics of the fonts, the quality of scanning and the quality of typesetting all contribute to the accuracy of OCR output. In our production systems we will never send an image to OCR at less than 600 dpi, and it must have been scanned geometrically accurately, without rotation.
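To put the accuracy figures above in perspective, it helps to convert each percentage into defects per 100,000 characters, the unit used for the proofing standard. The short Python sketch below does that arithmetic. The figure of 2,000 characters per page is purely an assumption for illustration and does not come from the text; the function names are likewise illustrative.

```python
# Assumed average page size, for illustration only (not from the article).
CHARS_PER_PAGE = 2000


def defects_per_100k(accuracy_percent: float) -> float:
    """Expected character defects per 100,000 characters at a given accuracy."""
    # (100 - accuracy) percent of 100,000 characters; rounded to hide float noise.
    return round((100.0 - accuracy_percent) * 1000.0, 6)


def defects_per_page(accuracy_percent: float, chars: int = CHARS_PER_PAGE) -> float:
    """Expected defects on one page of `chars` characters."""
    return defects_per_100k(accuracy_percent) * chars / 100_000


for label, accuracy in [
    ("native OCR, poor case", 95.0),
    ("native OCR, good case", 99.9),
    ("vendor claim", 99.99),
    ("proofing standard", 99.995),
]:
    print(f"{label}: {accuracy}% accuracy, "
          f"{defects_per_100k(accuracy):g} defects per 100,000 characters, "
          f"{defects_per_page(accuracy):g} per assumed page")
```

At 95% accuracy that is 5,000 defects per 100,000 characters and at 99.9% it is 100, against the 5 allowed by the 99.995% standard. Even the best native result quoted therefore carries 20 times the permitted defects, which is the gap that proofing has to close.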
We work in the semi-slo-mo, hyper-zing-zone combo areas of digital content. That means obscure Jawi manuscripts no one has read for 500 years, impossibly rare Chinese documents that Ho Chi Minh scribbled in his spare time, 800-year-old palm leaf manuscripts on how to cure warts and infertility, tens of thousands of academic books, thousands of textbooks and thousands of trade books on everything from fantasy to food in many languages.
In full disclosure, Infogrid Pacific is a high-volume, high-capacity, solution-driven operation that does awesome things with really old and really new content.