Welcome. Clara OCR is a free OCR, written for systems supporting the C library and the X Windows System. Clara OCR is intended for the cooperative OCR of books. There are some screenshots available at http://www.claraocr.org/.
This documentation is extracted automatically from the comments of the Clara OCR source code. It is known as "The Clara OCR Developer's Guide". It's currently unfinished. First-time users are invited to read "The Clara OCR Tutorial". There is also an advanced manual known as "The Clara OCR Advanced User's Manual".
This Guide is a collection of entry points to the Clara OCR source code. Some notes explain punctual details about how this or that feature was implemented. Others are higher-level descriptions about how one entire subsystem works.
Clara OCR is written in ANSI C (with some GNU extensions) and requires the services of the C library and the Xlib. The development is using 32-bit Intel GNU/Linux (various different distributions), GCC, Gnu Make, Bash, XFree86 and Perl 5 (required for producing the documentation).
Clara source code started, of course, as being one only file named clara.c. At some point we divided it into smaller pieces. Currently there are 18 files:
Note that there are only two headers: common.h and gui.h. It's complex to maintain one header for each module. Most functions are not prototyped, but we intend to prototype all them in the near future.
Clara OCR relies on the memory allocator both for allocation or resizing of some large blocks used by the main data structures, and for allocation of a large number of small arrays. Currently Clara OCR does not include or use an special memory allocator, but implements an interface to realloc. The alloca function is also used sometimes along the code, generally to allocate buffers for sorting arrays.
The interface is the function c_realloc. The function c_free must be used to free the blocks allocated or resized by c_realloc. In the near future, c_realloc will build a list of the currently allocated blocks, their sizes and some bits more in order to help to trace flaws.
Concerning security, the following criteria is being used:
1. string operations are generally performed using services that accept a size parameter, like snprint or strncpy, except when the code itself is simple and guarantees that a overflow won't occur.
2. The CGI clara.pl invokes write privileges through sclara, a program specially written to perform only a small set of simple operations required for the operation of the Clara OCR web interface.
The following should be done:
1. Memory blocks should be cleared before calling free().
A naive support for runtime index checking is provided through the macro checkidx. This checking is performed only if the code is compiled with the macro MEMCHECK defined and the command-line switch '-X 1' is used.
In fact, only those points on the source code where the macro checkidx is explicitly used will perform index checking. We've added calls to checkidx on some critical functions due to its complexity, or because segfaults were already were detected there.
Clara OCR decides at runtime if the GUI will be used or not. So even when using Clara OCR in batch mode (-b command-line switch), linking with the X libraries is required.
When the -b command-line switch is used, Clara OCR just won't make calls to X services. The source code tests the flag "batch_mode" before calling X services. So it won't create the application window on the X display, and automatically starts a full OCR operation on all pages found (as if the "OCR" button was pressed with the "work on all pages" option selected). Upon completion, Clara OCR will exit.
Clara OCR uses a lot of global variables. Large data structures, flags, paths, etc, use stored on global variables. In some cases we use a naming strategy to make the code more readable. The important cases are:
a. The main data structures of Clara OCR are global arrays that grow as required. The following a convention was created for the names associated with these arrays:
b. Menus are referred by their registration indexes. These indexes are stored on variables named CM_X. The menu items registration indexes are stored on variables named CM_X_SOMETHING (all capital). If the item has an associated flag, the flag is named cm_x_something (all small).
Most path variables are computed from the path given through the -f command line option. The variable "pagename" is the filename of the PBM image of the page being processed, not including the path eventually specified through the -f switch. For instance, if the OCR is started with
Clara stores on the variable pagelist the null-separated list of all names of pbm files found on this directory. Even in this case, the variable pagename will store the filename of the page being processed (at any moment Clara will be processing one and only one page).
The directory that contains the pbm files that Clara will process is stored on the variable pagesdir. In the example above, the value of the variable pagesdir is "mydocs/".
The variable workdir stores the path of the directory where Clara will create the files *.html, *.session, "patterns" and "acts". This path is assumed to be equal to pagesdir, unless another path is given through the -w switch. The variable doubtsdir will be the concatenation of workdir with the string "doubts/" (doubtsdir is ignored if -W is not used).
Clara stores bitmaps in a linear array of bytes, following closely the pbm raw format. The first line of a bitmap with width w is stored on the first (w/8)+((w%8)!=0) bytes of the array. The remaining bits (if any) are left blank, and so on. The leftmost bit on each byte is the most significative one (black, or "on", is 1, and white, or "off" is 0). An example follows:
In order to allow the GUI to refresh the application window while one OCR run is in course, Clara does not use multiple threads. The main function alternates calls to xevents() to receive input and to continue_ocr() to perform OCR. As the OCR operations may take long to complete, a very simple model was implemented to allow the OCR services to execute only partially.
Such services (for instance load_page()) accept a "reset" parameter to allow resetting all static data, and they're expected to return 0 when finished, or nonzero otherwise. Therefore, a call to such services must loop until completion. The continue_ocr() calls the OCR steps using this model, and some OCR steps call other services (like load_page()) that implement this model too.
When Clara OCR exits, the exit code will diagnose the finalization status:
Even for non-developers, a knowledge of the internal data structures used by Clara OCR is required for fine tuning and to make simple diagnostics.
The basic elements stored are the "closures". Sets of one or more closures are called "symbols". Symbols are arranged in lists forming "words". The words are arranged in lists forming "lines".
Closures of black pixels by contiguity are a first attempt to identify the atomic symbols of the document. The name "closure" is of course due to the consideration of the contiguity as a relation (in the mathematical sense of the word). Starting (for instance) from (i,j), we compute the set of black pixels ("X" and "*" in the figure). The limits (l,r,t,b) define the bounding box of the closure.
As one character of the document may be composed by two or more closures (for instance when it's broken), it's convenient to work not with closures, but with sets of closures. So we define the concept of "symbol" as being a set of one or more closures. Initially, the OCR generates one unitary symbol for each closure. Subsequent steps may define new symbols composed by two or more closures.
For instance, let's present three closures that do not correspond to atomic symbols: "a" and "i" linked (one closure) and a broken "u" (two closures). As a principle, Clara OCR do not try to break closures into smaller closures. Instead of that, the classification heuristic try to compose various patterns to resolve symbols like the "ai" in the figure. Concerning the "u", the classification heuristic is expected to merge the two closures into one symbol and apply a "u" pattern to resolve it.
Each symbol is stored in a sdesc structure. Those structures form the mc array. Once created, a symbol is never deleted. So it's index on the mc array identifies it (this is important for the web-based revision procedure). Note that closures and symbols are numbered on a document-related basis. The set of closures that define one symbol never changes. So the symbol bounding box and the total number of black pixels also won't change either.
So two different entries of the mc array never have the same set of closures. The entries of the mc array are created by the new_mc service. When some procedure tries to create a new symbol informing a list of closures for which already exists a symbol, the service new_mc detects it and returns to the caller not the index of a newly created symbol, but the index of that already created one.
One same closure may belong to more than one symbol. This is important in order to allow various heuristic trials. For instance, the left closure of the "u" on the preceding section could be identified as the body of an "i". In this case however we would not find its dot. So the heuristic could try by chance another solution, for instance to join it with the nearest closure (in that case, the right closure of the "u") and try to match it with some pattern of the font.
So the OCR will need to choose, from all symbols that contain a given closure, the one to be preferred. In fact, Clara OCR maintains dynamically a partition of the set of closures on "preferred" symbols. This is the ps array. Some manual operations, like fragment merging and symbol disassembling (activated by the context menu on the page tab), change that partition dinamically, as well as some automatic procedures, like the merge step on the OCR run.
The font size is important for classifying all book symbols on pattern "types". For instance, books generally use smaller letters for footnotes. This classification is performed automatically by Clara OCR and presented by the "PATTERN (types)" window.
Clara OCR generally uses millimeters for presenting sizes, but we'll soon express sizes in "points". Let's see an example. One inch corresponds to 72.27 printer's point (pt) (The METAFONTBook pg 21, note). So when using 600 dpi, each pt will correspond to 600/72.27 = 8.3 pixels. For 10 point roman characters, Knuth defines the height of lowercase letters as being 155/36 pt, so 35.7 pixels for us. Therefore, to compute the font size (f) from the height in pixels (h) of one lowercase letter, the formula is f = 10*h/35.7.
The vertical alignment of symbols is important for various heuristics. For instance, the vertical line from a broken "p" matches an "l", but using alignement tests we're able to refuse this match.
The current Clara OCR alignment support was developed for the Latin alphabet, and is being adapted for other alphabets. Four vertical alignemnt positions are considered. These positions are referred as usual (ascent, baseline and descent). We use the Knuth's identifier "x_height" to refer the height of lowercase letters without ascenders.
Clara OCR applies The concept of "symbol" to atomic symbols like letters, digits or punctuation signs. Words (as "house" or "peace"), are handled by Clara OCR as sequences of symbols.
It's very important to compute the words of the page. They provide a context both to the OCR and to the reviewer. For instance, if the known symbols of some word were identified as bold, then Clara will automatically make the bold button on when someone tries to review the unknown symbols of that word. The same applies to prefer the recognition of one symbol as the digit "1" instead of the character "l" if the known symbols of the "word" are digits. Words are also the basis for revision based on spelling. Each words is stored on a wdesc structure on the "word" array.
When building the OCR output, Clara will combine words in lines. Each line is a sequence of words (that is, wdesc structures). The array "line" is the sequence of the heads of the detected lines. Each entry of this array is a lndesc structure. The left and right limits of words must be carefully computed and compared in order to the OCR partitionate then in columns, when dealing with multi-column pages.
The "acts" or "revision acts" are the human interventions for training a symbol, merging a fragment to one symbol, etc.
As the human interventions are the more precious source of information, Clara logs all revision acts, in order to be able to reuse them.
The transliterations are obtained from the revision acts, so each transliteration refers one (or more) revision acts, and also inherits some properties from that act (or those acts).
The acts are on the book scope, and not on the page scope. The acts are stored on the file "acts" on the work directory.
Each act stores some data about the reviewer and also the submission date. As we plan to reuse revision data, each act also stores some data about the "original reviewer" and the "original submission date". These fields are meaningful only for reused acts.
Clara OCR maintains a list of 0 or more proposed or deduced transliterations for each symbol. Along the OCR process, each transliteration receives "votes" from reviewers (REVISION votes) or from machine deduction heuristics, based on shape similarity (SHAPE votes) or on spelling (SPELLING votes).
So the choice of the "best" transliteration is performed through election. Votes are stored on structures of type vdesc, and transliterations are stored on structures of type trdesc. Each symbol stores a pointer for a (possibly empty) list of transliterations and each transliteration stores a pointer for a (possibly empty) list of votes.
So, for instance, when one classifier deduces that one symbol is "a", it generates a "vote" for the transliteration of that symbol to be "a". At the same time, another heuristic could generate another vote for the transliteration to be, say, "o". The diagram illustrates this situation:
As the total stored information about one symbol may be large, Clara maintains for each symbol its "transliteration class", used by the heuristics to categorize each symbol and also to test the current transliteration status (is it known? is it dubious?), frequently used along the source code.
The election process used to choose the "best" transliteration for one symbol (from those obtained through human revision or heuristics based on shape similarity or spelling) consists in computing the "preference" of each transliteration and choosing the one with maximum preference.
The transliteration preference is the integer
U is 1 if the transliteration was confirmed by the arbiter, or 0 otherwise.
T is 0 if this transliteration was confirmed by no trusted source, 1 if it was confirmed by some trusted source.
S is 0 if this transliteration was not shape-deduced from trusted input, or 1 if it was shape-deduced from trusted input.
E is 1 if this transliteration was deduced from spelling, or 0 otherwise.
A is 0 if this transliteration was confirmed by no anon source, 1 if it was confirmed by some anon source.
N is 0 if this transliteration was not shape-deduced from anon input, or 1 if it was shape-deduced from anon input.
Once we have computed the "best" transliteration, we can compute its transliteration class, important for various heuristics. From the transliteration class it's possible test things like "do we know the transliteration of this symbol?" or "is it an alphanumeric character?" or "concerning dimension and vertical alignment could it be an alphanumeric character?", and others.
There are two moments where the transliteration class is computed. The first is when a transliteration is added to the symbol, and the second is when the CHAR class is propagated.
The first uses the following criteria to compute the transliteration class:
1. If the symbol has no transliteration at all, its class is UNDEF.
2. On all other cases, the transliteration with largest preference will be classified as DOT, COMMA, NOISE, ACCENT and others. This search is implemented by the classify_tr function in a straightforward way.
Just before the distribution of all symbols on words we propagate CHARs. All CHAR symbols are searched, and for each one we look its neighbours that seem to compose with it one same word. Such neighbours, if untransliterated, will be classified as SCHARs.
Clara OCR allows to create "zones". Zones are usually used to identify one text block in the page. For instance, a page containing two text columns should use one zone to limit each column. The zone limits are given by the "limits" array. The top left is (limits,limits) as presented by the figure:
The first method implemented by Clara OCR for symbol classification was skeleton fitting. Two symbols are considered similar when each one contains the skeleton of the other.
Clara OCR implements five heuristics to compute skeletons. The heuristic to be used is informed through the command-line option -k as the SA parameter. The value of SA may be 0, 1, 2, 3 or 4.
Heuristics 0, 1 and 2 considerer a pixel as being a skeleton pixel if it is the center of a circle inscribed within the closure, and tangent to the pattern boundary in more than one point.
The discrete implementation of this idea is as follows: for each pixel p of the closure, compute the minimum distance d from p to some boundary pixel. Now try to find two pixels on the closure boundary such that the distance from each of them to p does not differ too much from d (must be less than or equal to RR). These pixels are called "BPs".
To make the algorithm faster, the maximum distance from p to the boundary pixels considered is RX. In fact, if there exists a square of size 2*BT+1 centered at p, then p is considered a skeleton pixel.
As this criteria alone produces fat skeletons and isolated skeleton pixels along the closure boundary, two other conditions are imposed: the angular distance between the radiuses from p to each of those two pixels must be "sufficiently large" (larger than MA), and a small path joining these two boundary pixels (built only with boundary pixels) must not exist (the "joined" function computes heuristically the smallest boundary path between the two pixels, and that distance is then compared to MP).
The heuristics 1 and 2 are variants of heuristic 0:
1. (SA = 1) The minimum linear distance between the two BPs is specified as a factor (ML) of the square of the radius. This will avoid the conversion from rectangular to polar coordinates and may save some CPU time, but the results will be slightly different.
2. (SA = 2) No minimum distance checks are performed, but a minimum of MB BPs is required to exist in order to consider the pixel p a skeleton pixel.
The heuristic 3 is very simple. It computes the skeleton removing BT times the boundary.
The heuristic 4 uses "growing lines". For angles varying in steps of approximately 22 degrees, a line of lenght RX pixels is drawn from each pixel. The heuristic check if the line can or cannot be entirely drawn using black pixels. Depending on the results, it decides if the pixel is an skeleton pixel or not. For instance: if all lines could be drawn, then the pixel is center of an inscribed circle, so it's considered an skeleton pixels. All considered cases can be found on the source code.
The heuristic 5 computes the distance from each pixel to the border, for some definition of distance. When the distance is at least RX, it is considered a skeleton pixel. Otherwise, it will be considered a skeleton pixel if its distance to the border is close to the maximum distance around it (see the code for details).
All parameters for skeleton computation are informed to Clara through the -k command-line option, as a list in the following order: SA,RR,MA,MP,ML,MB,RX,BT. For instance:
Pairing applies to letters and digits. We say that the symbols a and b (in this order) are paired if the symbol b follows the symbol a within one same word. For instance, "h" and "a" are paired on the word "that", "3" and "4" are paired on "12345", but "o" and "b" are not paired on "to be" (because they're not on the same word).
The function s_pair tests symbol pairing, and returns the following diagnostics:
0 .. the symbols are paired 1 .. insuficcient vertical intersection 2 .. one or both symbols above ascent 3 .. one or both symbols below descent 4 .. maximum horizontal distance exceeded 5 .. incomplete data 6 .. different zones
If p is nonzero, then store the inferred alignment for each symbol (a and b) on the va field of these symbols, except when these symbols have the va field already defined.
If rd is non-null, returns the dot diameter in *rd. If an estimative for the dot diameter cannot be computed, does not change *rd.
The "build" OCR step, implemented by the "build" function, distributes the symbols on words (analysing the distance, heights and relative position for each pair of symbols), and the words on lines (analysing the distance, heights and relative position for each pair of words). Various important heuristics take effect here.
The first step of build is to distribute the symbols on words. This is achieved by:
a. Undefining the next-symbol ("E" field) and previous-symbol ("W" field) links for each symbol, the surrounding word ("sw" field) of each symbol, and the next signal ("sl" field) for each symbol.
Obs. The next-symbol and previous symbol links are used to build the list of symbols of each word. For instance, on the word "goal", "o" is the next for "g" and the previous for "a", "g" has no previous and "l" has no next).
b. Undefining the transliteration class of SCHARs and the uncertain alignment information.
2. Distributing symbols on words
The second step is, for each CHAR not in any word, define a unitary word around it, and extend it to right and left applying the symbol pairing test.
3. Computing the alignment using the words
Some symbols do not have a well-defined alignment by themselves. For instance, a dot may be baseline-aligned (a final dot) or 0-aligned (the "i" dot). So when computing their alignments, we need to analyse their neighborhoods. This is performed in this step.
4. Validating the recognition
Shape-based recognitions must be validated by special heuristics. For instance, the left column of a broken "u" may be recognized as the body of an "i" letter. A validation heuristic may refuse this recognition for instance because the dot was not found. These heuristics are per-alphabet.
5. Creating fake words for punctuation signs
To produce a clean output, symbols that do not belong to any word are not included on the OCR output. So we need to create fake words for punctuation signs like commas of final dots.
6. Aligning words
Words need to be aligned in order to detect the page text lines. This is perfomed as follows:
a. Undefine the next-word and previous-word links for each word. These are links for the previous and next word within lines. For instance, on the line "our goal is", "goal" is the next for "our" and the previous for "is", "our" has no previous and "is" has no next.
b. Distribution of the words on lines. This is just a matter of computing, for each word, its "next" word. So for each pair of words, we test if they're "paired" in the sense of the function w_pair. In affirmative case, we make the left word point to the right word as its "next" and the rigth point to the left as its "previous".
The function w_pair does not test the existence of intermediary words. So on the line "our goal is" that function will report pairing between "our" and "is". So after detecting pairing, our loop also checks if the detected pairing is eventually "better" than those already detected.
c. Sort the lines. The lines are sorted based on the comparison performed by the function "cmpln".
7. Computing word properties
Finally, word properties can be computed once we have detected the words. Some of these properties are applied to untransliterated symbols. The properties are:
1. The baseline left and right ordinates.
2. The italic and bold flags.
3. The alphabet.
4. The word bounding box.
The function list_cl lists all closures that intersect the rectangle of height h and width w with top left (x,y). The result will be available on the global array list_cl_r, on indexes 0..list_cl_sz-1. This service is used to clip the closures or symbols (see list_s) currently visible on the PAGE window. It's also used by OCR operations that require locating the neighbours of one closure or symbol (see list_s).
The parameter reset must be zero on all calls, except on the very first call of this function after loading one page.
Every time a new page is loaded, this service must be called informing a nozero value for the reset parameter. In this case, the other parameters (x, y, w and h) are ignored, and the effect will be preparing the page-specific static data structures used to speed up the operation.
Closures are located by list_cl from the static lists of closures clx and cly, ordered by leftmost and topmost coordinates. Small and large closures are handled separately. The number of closures with width larger than FS is counted on nlx. The number of closures with height larger than FS is counted on nly.
The clx array is divided in two parts. The left one contains (topcl+1)-nlx indexes for the closures with width not larger than FS, sorted by the leftmost coordinate. The right one contains the other indexes, in descending order.
The cly array is divided in two parts. The left one contains (topcl+1)-nly indexes for the closures with height not larger than FS, sorted by the topmost coordinate. The right one contains the other indexes, in descending order.
So the small closures on the rectangle (x,y,w,h) may be located through a combination of bynary searches on both axis. The large closures are located by a brute-force linear loop. As nlx and nly are expected to be very small, this brute force loop won't waste CPU time.
1. Clara OCR GUI uses only 5 colors: white, gray, darkgray, verydarkgray and black. The RGB value for each one is customizable at startup (-c command-line option). On truecolor displays, graymaps are displayed using more graylevels than the 5 listed above.
2. The X I/O is not buffered. Buffered X I/O is implemented but it's not being used.
3. Only one X font is used for all needs (button lables, menu entries, HTML renderization, and messages).
4. Asynchronous refresh. The OCR operations just set the redraw flags (redraw_button, redraw_wnd, redraw_int, etc) and let the redraw() function make its work.
5. No toolkit is used. The graphic code is very specific to Clara, and it was not written to be reusable. So it's very small. The disadvantage of this approach is that Clara look and behaviour will be slightly different from the typical ones found on popular environments like GNOME or KDE.
The source code frequently refers some global variables that define the position and size of the main componts (the plate, buttons, etc). Most of these variables are set by comp_wnd_size. The variables are:
The current window is informed through the CDW global variable (set by the setview function). The variable CDW is an index for the dw array of dwdesc structs. Some macros are used to refer the fields of the structure dw[CDW]. The list of all them can be found on the headers under the title "Parameters of the current window".
The portion of the document being displayed is defined by the macros X0, Y0, HR and VR, where (X0,Y0) is the top left and HR and VR are the width and heigth, measured in pixels (graphic documents) or characters (text documents):
The scrollbars inform the relative portion of the document being exhibited. The viewable region of the document (in the sense just defined) is defined by X0, Y0, HR and VR:
The Bitmaps on HTML windows and on the PAGE window are exhibited in "reduced" fashion (a square RFxRF of pixels from the bitmap is mapped to one display pixel). If RF=1, then each bitmap pixel will map to one display pixel.
The windows PATTERN, PAGE_FATBITS, and PAGE_MATCHES exhibit bitmaps in "zoomed" mode (one bitmap pixel maps to a ZPSxZPS square of display pixels). In this case a grid is displayed to make easier to distinguish each pixel. The variables GW and GS contain the grid width and the "grid separation" (GS=ZPS+GW).
Clara is able to read a piece of HTML code, render it, display the rendered code, and attend events like selection of an anchor, filling a text field, or submitting a form. Note that anchor selection and form submission activate internal procedures, and won't call external customizable CGI programs.
Most windows displayed by Clara are produced using this HTML support. When the "show HTML source" option on the "View" menu is selected, Clara will display unrendered HTML, and it will become easier to identify the HTML windows. Note that all HTML is produced by Clara itself. Clara won't read HTML from files or through HTTP.
Perhaps you are asking why Clara implements these things. Clara does not intend to be a web browser. Clara supports HTML because we were in need of a forms interface, and the HTML forms is around there, ready to be used, and extensively proved on practice as an easy and effective solution. Note that we're not trying to achieve completeness. Clara HTML support is partial. There is only one font available, tables cannot be nested and most options are unavailable, PBM is the only graphic format supported, etc. However, it attends our needs, and the code is surprisingly small.
Let's understand how the HTML windows work. First of all, note that there is a html flag on the structure that defines a window (structure dwdesc). For instance, this flag is on for the window OUTPUT (initializition code at function setview).
When the function redraw is called and the window OUTPUT is visible on the plate, the service draw_dw will be called informing OUTPUT through the global variable CDW (Current Window). However, before making that, redraw will test the flag RG to check if the HTML contents for the OUTPUT window must be generated again, calling a function specific to that window. For instance, when a symbol is trained, this flag must be set in order to notify asynchronously the need to recompute the window contents, and render it again.
HTML renderization is performed by the function html2ge. It will create an array of graphic entities. Each such entity is a structure informing the geometric position (x,y,width,height) of something, and this something (a piece of text, a button and its label and state, a PBM image, etc). Finally, the function draw_dw will search the elements currently visible on the portion of the document clipped by the window, and display them.
The rendering of each element on the HTML page creates one graphic element ("GE" for short).
Free text is rendered to one GE of type GE_TEXT per word. This is a "feature". The rendering procedures are currently unable to put more than one text word per GE.
IMG tags are rendered to one GE of type GE_IMG. Note that the value of the SRC element cannot be the name of a file containing a image, but must be "internal" or "pattern/n". These are keywords to the web clip and the bitmap the pattern "n". The value of the SRC attribute is stored on the "txt" field of the GE.
INPUT tags with TYPE=TEXT are rendered to one GE of type GE_INPUT. The predefined value of the field (attribute VALUE) is stored on the field "txt" of the GE. The name of the field (attribute NAME) is stored on the field "arg" of the GE.
The Clara OCR HTML support added INPUT tags with TYPE=NUMBER. They're rendered like TYPE=TEXT, but two steppers are added to faster selection. So such tags will create three GEs (left stepper, input field, and right stepper).
INPUT tags with TYPE=CHECKBOX are rendered to one GE of type GE_CBOX. The variable name (attribute NAME) is stored on the "arg" field. The argument to VALUE is stored on the field "txt". The status of the checkbox is stored on the "iarg" field (2 means "checked", 0 means "not checked").
INPUT tags with TYPE=RADIO are rendered just like CHECKBOX. The only difference is the type GE_RADIO instead GE_CBOX.
SELECT tags (starting a SELECT element) are rendered to one GE of type SELECT. In fact, the entire SELECT element is stored on only one GE. Each SELECT element originates one standard context menu, as implemented by the Clara GUI. The "iarg" field stores the menu index. The free text on each OPTION element is stored as an item label on the context menu. The implementation of the SELECT element is currently buggy: (a) for each renderization, one entry on the array of context menus will be allocated, and will never be freed, and (b) The attribute NAME of the SELECT won't be stored anywhere.
INPUT tags with TYPE=SUBMIT are rendered to one GE of type GE_SUBMIT. The value of the attribute VALUE is stored on the "txt" field. The value of the ACTION attribute is stored on the field "arg". The field "a" will store HTA_SUBMIT.
TD tags are rendered to one GE of type GE_RECT. The value of the BGCOLOR attribute is stored on the "bg" field as a code (only the colors known by the Clara GUI are supported: WHITE, BLACK, GRAY, DARKGRAY and VDGRAY). The coordianates of the cell within the table are stored on the fields "tr" and "tc".
All other supported tags do not generate GEs.
We decided to use XML because of the facilities of using non-binary encodings to store, analyse, change and transmit information, and also because XML is a standard. Currently we do not have DTDs, and until now we didn't try to load, using the Clara parser, XML code not produced by Clara itself.
The Clara OCR GUI tries to apply immediately all actions taken by the user. So the HTML forms (e.g. the PATTERN window) do not contain SUBMIT buttons, because they're not required (some forms contain a SUBMIT button disguised as a CONSIST facility, but this is just for the user's convenience).
The editable input fields make auto-submission mechanisms a bit harder, because we cannot apply consistency tests and process the form before the user finishes filling the field, so auto-submission must be triggered on selected events. The triggers must be a bit smart, because some events must be attended before submission (for instance toggle a CHECKBOX), while others must be attended after submission (for instance changing the current tab). So auto-submission must be carefully studied. The current strategy follows:
a. When the window PAGE (symbol) or the window PATTERN is visible, auto-submit just after attending the buttons that change the current symbol/pattern data (buttons BOLD, ITALIC, ALPHABET or PTYPE).
b. When the window PAGE (symbol) or the window PATTERN is visible, auto-submit just before attending the left or right arrows.
c. When the user presses ENTER and an active input field exists, auto-submit.
d. Auto-submit as the first action taken by the setview service, in order to flush the current form before changing the current tab or tab mode.
e. Auto-submit just after opening any menu, in order to flush data before some critic action like quitting the program or starting some OCR step.
f. Auto-submit just after attending CHECKBOX or RADIO buttons.
Auto-submission happens when the service auto_submit_form is called, so it's easy to locate all triggering points (just search the string auto_submit_form). This service takes no action when the current form is unchanged.
This section describes the variables and functions exported by Clara OCR for extensionability purpuses. Note that Clara OCR currently does not have an interface for extensions. The first such interface planned to be added will use the Guile interpreter, available from the GNU Project.
The redraw flags inform the function redraw about which portions of the application window must be redraw. The precise meaning of each flag depends on the implementation of the redraw function, that can be analysed directly on the source code.
If redraw_menu is 2, the menu is entirely redrawn. If redraw_menu is 1, then the draw_menu function will redraw only the last selected item and the newly selected item, except if the menu is being drawn by the first time.
The progress bar is displayed on the bottom of the window to reflect the progress of some slow operation. By now, the progress bar is unused.
The OCR run in course (if any) stores various statuses on global variables. For instance, the ocring macro will be nonzero if one OCR run is in course. The GUI informs the OCR control routines about what to do along the OCR run using various global variables. Some of them drive the classification procedures:
The stop_ocr variable is set by the GUI when the STOP button is pressed. Its status will be tested by the routines that control the OCR run in course. Note that the variable cannot_stop may be set by the current OCR step in course. It's effect is to inhibit the GUI setting the stop_ocr status. It's used by routines that cannot be stopped, otherwise the data structures they're handling would rest in a irrecuperable inconsistency.
The OCR control routines handle the following statuses:
The other to_* variables together with nopropag store information about the revision operation requested from the GUI:
The to_arg variable stores the flagment to merge to the current symbol or the symbol to add to the current word.
The variable ocr_other stores which operation to perform by the OCR_OTHER step. This step is reserved to operations that are outside the OCR run main stream, but require the control provided by the continue_ocr function.
As each window is displayed on only one mode and each mode belongs to only one tab, in order to set a given mode or a given tab, just call setview informing one window present on that mode as parameter. That is the only parameter received by setview. The geometry of each window will be re-computed by setview, so setview is not called only to change the current mode, but also after operations that change the geometry of the windows, just like resizing the application X window or hiding the scrollbars, or resizing the PAGE window, etc.
Messages are put on the status line (on the bottom of the application X window) using the show_hint service. The show_hint service receives two parameters: an integer f and a string (the message).
If f is 0, then the message is "discardable". It won't be displayed if a permanent message is currently being displayed.
If f is 1, then the message is "fixed". It won't be erased by a subsequent show_hint call informing as message the empty string (in practical terms, the pointer motion won't clear the message).
If f is 2, then the message is "permanent" (the message will be cleared only by other fixed or permanent message).
If f is 3, clear any current message.
Starts a complete OCR run or some individual OCR step on one given page, or on all pages. For instance, start_ocr is called by the GUI when the user presses the "OCR" button or when the user requests loading of one specific page.
In fact, almost all user requested operation is performed as an "ocr step"in order to take advantage from the execution model implemented by the function continue_ocr. So start_ocr is the starting point for attending almost all user requests.
If p is -1, process all pages, if p < -1, process only the current page (cpage) otherwise process only the page p. If s>=0 then run only step s, otherwise run all steps.
If the flag r is nonzero, will ignore the current classes (if any) and recompute them again (this is meaningful only to the symbol classification step).
It's not hard to add a bitmap comparison method to Clara OCR. This may become very important when the available heuristics are unable to classify the symbols of some book, so a new heuristic must be created. In order to exemplify that, we'll add a naive bitmap comparison method. It'll just compare the number of black pixels on each bitmap, and consider that the bitmaps are similar when these numbers do not differ too much.
Please remember that the code added or linked to Clara OCR must be GPL.
In order to add the new bitmap comparison method, we need to write a function that compares two bitmaps returning how similar they are, add this function as an alternative to the Options menu, and call it when classifying the page symbols. We'll perform all these steps adding a naive comparison method, step by step. The more difficult one is to write the bitmap comparison method. This step is covered on the subsection "How to write a bitmap comparison function".
Let's present the other two steps. To add the new method to the Options menu, we need:
a. Declare the macro CL_NBP:
To use the new method, recompile the sources, start Clara OCR and select the new method on the tune tab.
The bitmap comparison function required for the example we're presenting has the following prototype:
Clara OCR will call bmpcmp_nbp once informing status 1 every time a new symbol c is chosen, so bmpcmp_cp will be able to bufferize symbol data on static areas. Note that to classify each symbol, Clara will perform various calls to the bitmap comparison function, because it will check X events (like the STOP button), and, when visual modes are enabled, Clara will need to refresh the screen displaying the progress of the classification.
The block of bmpcmp_nbp corresponding to status 1 will merely store on the static variable np the value of the nbp field of the symbol structure.
When Clara OCR wants to effectively ask if the pattern k matches the symbol c, it calls the bitmap comparison function informing status 3. The function must return a similarity index ranging from 0 (no similarity) to 10 (identity). Now we must take care of the fourth parameter (mode). It informs if Clara is asking for a direct (mode == 1) comparison or indirect (mode == 0) comparison. This applies for asymmetric comparison methods. For instance, when using skeleton fitting, "direct" means that the pattern skeleton fits the symbol, and "indirect" means that the symbol skeleton fits the pattern. Clara will make both calls to avoid false positives.
In our example we'll on both cases return 10 if the test 5 * abs(nbp-m) <= (nbp+m) results true, where m is the number of black pixels of the pattern.
These are the steps to add a new button:
1. Create a new button macro after those already existing (bzoom, balpha, etc). Note that each button macro is defined as an unique integer (0, 1, 2, etc).
3. Add a new block to attend this button on mactions_b1 and, if desired, on mactions_b2 (just copy one existing block and adapt it). It's mandatory to attend help requests. On/off and multi-state buttons must circulate the acceptable values of the respective entry of the array "button" in order to change the current state, and set the redraw_button flag.
(Some) Major tasks
1. Vertical segmentation (partially done).
2. Heuristics to merge fragments.
3. Spelling-generated transliterations
4. Geometric detection of lines and words
5. Finish the documentation
6. Simplify the revision acts subsystem
1. Change sprintf to snprintf.
2. Fix assymetric behaviour of the function "joined".
3. Optimize bitmap copies to copy words, not bits, where possible (partially done).
4. Support Multiple OCR zones (partially done).
5. Make sure that the access to the data structures is blocked during OCR (all functions that change the data structures must check the value of the flag "ocring").
6. Use 64-bit integers for bitmap comparisons and support big-endian CPUs (partially done).
7. Clear memory buffers before freeing.
8. Allow the transliterations to refer multiple acts (partially done).
9. Rewrite composition of patterns for classification of linked symbols.
10. The flea stops but do not disappear when the window lost and regain focus.
11. Substitute various magic numbers by per-density and per-minimum-fontsize values.
12. The local binarizer is slower on deskewed pages (why?).
Clara OCR is free software. Its source code is distributed under the terms of the GNU GPL (General Public License), and is available at http://www.claraocr.org/. If you don't know what is the GPL, please read it and check the GPL FAQ at http://www.gnu.org/copyleft/gpl-faq.html. You should have received a copy of the GNU General Public License along with this software; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. The Free Software Foundation can be found at http://www.fsf.org.
Clara OCR was written by Ricardo Ueda Karpischek. Giulio Lunati wrote the internal preprocessor. Clara OCR includes bugfixes produced by other developers. The Changelog (http://www.claraocr.org/CHANGELOG) acknowledges all them (see below). Imre Simon contributed high-volume tests, discussions with experts, selection of bibliographic resources, propaganda and many ideas on how to make the software more useful.
Ricardo authored various free materials, some included (at least) in Conectiva, Debian, FreeBSD and SuSE (the verb conjugator "conjugue", the ispell dictionary br.ispell and the proxy axw3). He recently ported the EiC interpreter to the Psion 5 handheld and patched the Xt-based vncviewer to scale framebuffers and compute image diffs. Ricardo works as an independent developer and instructor. He received no financial aid to develop Clara OCR. He's not an employee of any company or organization.
Imre Simon promotes the usage and development of free technologies and information from his research, teaching and administrative labour at the University.
Roberto Hirata Junior and Marcelo Marcilio Silva contributed ideas on character isolation and recognition. Richard Stallman suggested improvements on how to generate HTML output. Marius Vollmer is helping to add Guile support. Jacques Le Marois helped on the announce process. We acknowledge Mike O'Donnell and Junior Barrera for their good criticism. We acknowledge Peter Lyman for his remarks about the Berkeley Digital Library, and Wanderley Antonio Cavassin, Janos Simon and Roberto Marcondes Cesar Junior for some web and bibliographic pointers. Bruno Barbieri Gnecco provided hints and explanations about GOCR (main author: Jorg Schulenburg). Luis Jose Cearra Zabala (author of OCRE) is gently supporting our tentatives of using portions of his code. Adriano Nagelschmidt Rodrigues and Carlos Juiti Watanabe carefully tried the tutorial before the first announce. Eduardo Marcel Macan packaged Clara OCR for Debian and suggested some improvements. Mandrakesoft is hosting claraocr.org. We acknowledge Conectiva and SuSE for providing copies of their outstanding distributions. Finally, we acknowledge the late Jose Hugo de Oliveira Bussab for his interest in our work.
The fonts used by the "view alphabet map" feature came from Roman Czyborra's "The ISO 8859 Alphabet Soup" page at http://czyborra.com/charsets/iso8859.html.
The names cited by the CHANGELOG and not cited before follow (small patches, bug reports, specfiles, suggestions, explanations, etc).
Brian G., Bruce Momjian, Charles Davant (server admin), Daniel Merigoux, De Clarke, Emile Snider (preprocessor, to be released), Erich Mueller, groggy, Harold van Oostrom, Ho Chak Hung, Jeroen Ruigrok, Laurent-jan, Nathalie Vielmas, Romeu Mantovani Jr (packager), Ron Young, R P Herrold, Sergei Andrievskii, Stuart Yeates, Terran Melconian, Thomas Klausner (packager), Tim McNerney, Tyler Akins.