To allow BAsCET to recognize the logical structure of bibliographic references, an appropriate Concept Network has to be built. This Concept Network has to contain knowledge about the logical structure of the bibliographic references in the database, and also about their physical structure. The model can be divided either according to the genericity of its nodes or according to their logical or physical aspect.
From the genericity point of view, the model contains two main parts: the generic part and the specific part (which contains the terms of the database). The figure below shows these two parts and their links (REFERENCE contains AUTHOR, SEP:AU-T, TITLE, etc.; AUTHOR is followed by SEP:AU-T; A:C.Y.Suen is an instance of A; W:ocr co-occurs with Y:1993). The generic part contains the field hierarchy and the separators between fields. The specific part contains only logical data, because the physical data is generic (for references of a given type, the field separators are always the same).
From the physical/logical point of view, nodes whose names begin with SEP are the separators; they are the physical nodes of the Concept Network. All other nodes are logical, that is to say, they correspond to information held in the database.
Figure 1: Concept Network's structure for the references
+--------------------REFERENCE--------------------+
| +-------+ | | |
v v v v v Generic
AUTHOR---->SEP:AU-T-->TITLE-->SEP:T-... --->...---->YEAR
| | |
v v |
A <-> SEP:A-A SEP:W-W <--> W |
^ ^ |
| | +----------+
+----------------+ +------+----+ | |
|~~~~~~~~~~~~~~~~|~~~~~~~~~~~~~~|~~~~~~~~~~~|~~~~~~~~~~~~~~|~~~~~~~~~~|~~~~~~~~~~~
+-+ | | | | |
v v v v v v
A:C.Y.Suen <-> A:S.N.Srihari W:ocr <-> W:document ... Y:1995 Y:1993
^ ^ ^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | | | |
| | +---------------------+ +------------+ | |
| | | | | |
| +------------------------------------------+ | Specific
+---------------------------+ +------------------------------------+
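The two ways of dividing the model described above (generic versus specific, physical versus logical) can be illustrated with a small data structure. This is only a minimal sketch: the class names `Node` and `ConceptNetwork`, the relation labels, and the set-of-triples representation are assumptions made for illustration and are not taken from BAsCET. The node names and links are the ones listed in the description of Figure 1.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    generic: bool          # True: generic part; False: specific part (database terms)
    occurrences: int = 0   # occurrence counter, filled in during the statistics step

    @property
    def physical(self) -> bool:
        # Nodes whose names begin with SEP are the separators, i.e. the physical nodes.
        return self.name.startswith("SEP")


@dataclass
class ConceptNetwork:
    nodes: dict = field(default_factory=dict)
    links: set = field(default_factory=set)   # (source, relation, target) triples

    def add_node(self, name: str, generic: bool) -> Node:
        return self.nodes.setdefault(name, Node(name, generic))

    def link(self, source: str, relation: str, target: str) -> None:
        self.links.add((source, relation, target))


net = ConceptNetwork()
for name in ("REFERENCE", "AUTHOR", "SEP:AU-T", "TITLE", "A"):
    net.add_node(name, generic=True)
for name in ("A:C.Y.Suen", "W:ocr", "Y:1993"):
    net.add_node(name, generic=False)

# Hypothetical relation labels for the link types named in the text.
net.link("REFERENCE", "contains", "AUTHOR")
net.link("REFERENCE", "contains", "SEP:AU-T")
net.link("REFERENCE", "contains", "TITLE")
net.link("AUTHOR", "followed_by", "SEP:AU-T")
net.link("A:C.Y.Suen", "instance_of", "A")
net.link("W:ocr", "co_occurs_with", "Y:1993")

physical = sorted(n.name for n in net.nodes.values() if n.physical)
specific = sorted(n.name for n in net.nodes.values() if not n.generic)
```

In this sketch every physical node is also generic, which mirrors the statement that the specific part contains only logical data.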
Building the Concept Network automatically requires a knowledge database and conversion tools. The bibliographic database, which is in BibTeX format, is converted into XML (using Dilib).
Figure 2 shows the format of a reference in the database. The first step is to recast it into XML using a Dilib tool. Figure 3 gives the result of this operation.
Figure 2: Example of BibTeX reference, from article type
@ARTICLE{joseph92a,
AUTHOR = {S. H. Joseph and T. P. Pridmore},
TITLE = {Knowledge-Directed Interpretation of Mechanical
Engineering Drawings},
JOURNAL = {IEEE Transactions on PAMI},
YEAR = {1992},
NUMBER = {9},
VOLUME = {14},
PAGES = {211--222},
MONTH = {September},
KEYWORDS = {segmentation, forms},
ABSTRACT = {The approach is based on item extraction}
}
Figure 3: BibTeX reference, converted to XML
<doc>
<ref>joseph92a</ref>
<author><a>S. H. Joseph</a><a>T. P. Pridmore</a></author>
<title><mot>Knowledge-Directed</mot><mot>Interpretation</mot><mot>of</mot>
<mot>Mechanical</mot><mot>Engineering</mot><mot>Drawings</mot>
</title>
<journal>IEEE Transactions on PAMI</journal>
<year>1992</year>
<number>9</number>
<volume>14</volume>
<pages>211--222</pages>
<month>September</month>
<keywords><k>segmentation</k><k>forms</k></keywords>
</doc>
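Once a reference is available in the XML form of Figure 3, the fields, the sub-fields and their instances can be read off directly. The following is a hedged sketch of that reading step using Python's standard `xml.etree.ElementTree`. It is not the Dilib tooling used in the original work; the function name `extract_structure_and_terms` and the choice to skip the `ref` key are assumptions. The sample is an abridged copy of Figure 3.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the record shown in Figure 3.
XML_REFERENCE = """<doc>
<ref>joseph92a</ref>
<author><a>S. H. Joseph</a><a>T. P. Pridmore</a></author>
<title><mot>Knowledge-Directed</mot><mot>Interpretation</mot><mot>of</mot>
<mot>Mechanical</mot><mot>Engineering</mot><mot>Drawings</mot>
</title>
<journal>IEEE Transactions on PAMI</journal>
<year>1992</year>
</doc>"""


def extract_structure_and_terms(xml_text):
    """Return (structure, terms) for one XML reference.

    structure maps each field tag to the sorted list of its sub-field tags;
    terms lists (leaf tag, text) pairs in document order.
    """
    doc = ET.fromstring(xml_text)
    structure = {}
    terms = []
    for fld in doc:
        if fld.tag == "ref":
            continue  # assumption: the citation key is not a field of the printed reference
        subfields = list(fld)
        if subfields:
            structure[fld.tag] = sorted({sub.tag for sub in subfields})
            terms.extend((sub.tag, sub.text.strip()) for sub in subfields)
        else:
            structure[fld.tag] = []
            terms.append((fld.tag, fld.text.strip()))
    return structure, terms


structure, terms = extract_structure_and_terms(XML_REFERENCE)
```

Here `structure` plays the role of the generic field hierarchy, while `terms` holds the field instances that would populate the specific part.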
An advantage of using a BibTeX database is that its physical version is easy to obtain, using LaTeX and dvips. A program was written that "subtracts" the logical version from the physical one and thereby extracts the separators automatically. This makes it possible to handle many bibliographic styles: one does not need to know the printing rules of every style, and an unknown style is handled in the same way as a known one.
Figure 4 illustrates this: a tool that translates the logical references into PostScript and into the XML format used is sufficient to generate a Concept Network suited to recognizing references that use the same bibliographic style, in the format of the database.
Figure 4: building of a Concept Network for bibliographic references
Generic
__----- XML-------> Structure extraction ---->+----------------+
/ (fields and sub-fields) | |
references | |
database ---PostScript---> Separators detection ----->| Statistic +---> Concept Network
(BibTeX) | (occurrences, |
\ | co-occurrences)|
+-------XML-------> Terms extraction --------->| |
(fields instances) +----------------+
Specific
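The separator detection step of Figure 4, "subtracting" the logical version of a reference from its physical version, can be sketched as follows. This is an illustration under several assumptions, not the program written by the authors: the physical version is assumed to be available as plain text already, the field values are assumed to appear verbatim and in a known order, and the rendered string, the function name `extract_separators` and the field abbreviations `J` and `Y` are hypothetical.

```python
def extract_separators(rendered, fields):
    """Return the text found between consecutive fields of a rendered reference.

    rendered: the printed (physical) form of the reference, as plain text.
    fields:   (name, value) pairs in printing order (the logical form).
    """
    separators = {}
    position = 0      # end of the previous field in the rendered text
    previous = None   # name of the previous field
    for name, value in fields:
        start = rendered.find(value, position)
        if start < 0:
            raise ValueError(f"field {name!r} not found in rendered text")
        if previous is not None:
            # Whatever lies between two logical fields is the separator.
            separators[f"SEP:{previous}-{name}"] = rendered[position:start]
        position = start + len(value)
        previous = name
    return separators


# Hypothetical rendering of the reference of Figure 2 in some bibliographic style.
rendered = (
    "S. H. Joseph and T. P. Pridmore. "
    "Knowledge-Directed Interpretation of Mechanical Engineering Drawings. "
    "IEEE Transactions on PAMI, 1992."
)
fields = [
    ("AU", "S. H. Joseph and T. P. Pridmore"),
    ("T", "Knowledge-Directed Interpretation of Mechanical Engineering Drawings"),
    ("J", "IEEE Transactions on PAMI"),
    ("Y", "1992"),
]
separators = extract_separators(rendered, fields)
```

Because the separators are obtained by comparison rather than from hand-written style rules, the same procedure applies unchanged to a style whose printing rules are not known in advance.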
The preceding sections describe a system that automatically builds a Concept Network from a BibTeX database and a bibliographic style. From the logical point of view, the Concept Network is divided into two parts: the generic part, which contains the field hierarchy, and the specific part, which contains the instances of the leaves of the field hierarchy and the links between these instances. The physical aspect of the network lies entirely in its generic part, because it has to hold for a whole bibliographic style: it consists only of "separator" nodes, each linking two fields at the same level of the field hierarchy and contained in a higher-level node. No node in the specific part carries information about the physical aspect of a reference (at least, not about its typography or punctuation).
Since the computation of the node attributes, and especially of the co-occurrence links between specific nodes, depends only on counting occurrences and co-occurrences, one can consider building a system that "learns" whenever it discovers something. That is to say, it could integrate a recognized reference (possibly validated by a human operator) into the Concept Network simply by incrementing counters on the links (co-occurrence counters) and on the nodes (occurrence counters). This requires storing the counters rather than the weights in the Concept Network, and some problems remain: how can one know that the recognized reference has not already been integrated into the network? It may already exist there in a slightly different form. In that case, it would be better not to skew the statistics by adding co-occurrences that already exist. This is a complex problem of record uniqueness in a database, and of distance between two records. Fortunately, once the number of references is large enough, the relative impact of adding duplicates diminishes. Indeed, all the weights that are likely to change are local, that is to say, they depend on the number of occurrences of the node from which their link starts. One can thus assert that the effect of adding duplicates to the Concept Network is negligible once the number of known references is already large.
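The incremental update described above can be sketched with plain counters. The text does not give the weight formula, so this sketch assumes that the weight of a link is the co-occurrence count divided by the occurrence count of the node the link starts from. The class name `TermStatistics`, the helper `duplicate_shift` and the toy references are likewise hypothetical.

```python
from collections import Counter
from itertools import permutations


class TermStatistics:
    """Stores counters, not weights, so that new references can be added later."""

    def __init__(self):
        self.occurrences = Counter()     # node -> occurrence count
        self.cooccurrences = Counter()   # (source, target) -> co-occurrence count

    def integrate(self, terms):
        """Add one recognized reference, given as the terms it contains."""
        unique = set(terms)
        self.occurrences.update(unique)
        self.cooccurrences.update(permutations(unique, 2))

    def weight(self, source, target):
        # Assumed local weight: relative to the occurrences of the source node only.
        n = self.occurrences[source]
        return self.cooccurrences[(source, target)] / n if n else 0.0


def duplicate_shift(occ, cooc):
    """Change in the assumed weight when one duplicate reference is added."""
    return (cooc + 1) / (occ + 1) - cooc / occ


# Toy references built from node names of Figure 1 (illustrative data only).
stats = TermStatistics()
stats.integrate(["A:C.Y.Suen", "W:ocr", "Y:1993"])
stats.integrate(["A:S.N.Srihari", "W:ocr", "W:document", "Y:1995"])
before = stats.weight("W:ocr", "Y:1993")

stats.integrate(["A:C.Y.Suen", "W:ocr", "Y:1993"])   # the same reference again
after = stats.weight("W:ocr", "Y:1993")
```

Under this assumed formula, `duplicate_shift(2, 1)` is far larger than `duplicate_shift(200, 100)`, which is the sense in which a duplicate matters less once the source node has already been seen many times.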