The logical part of the Concept Network (see logical structure of a bibliographic reference) is a translation of the information in the base. It has two parts: the generic part (the hierarchy of fields, see Building a Concept Network to represent bibliographic references) and the specific part (the instances of the fields found in the base).
The reference base used to build the model is in BibTeX format and contains references in the field of handwriting recognition and document analysis.
It contains 908 references (or records), which were corrected to bring them closer to "ideal" references, that is, references that match the format and are as coherent as possible. Indeed, despite the experience of the users of this base, some references were badly written: wrong type (article instead of inproceedings, techreport instead of phdthesis, etc.), wrong choice of fields (number instead of volume, note instead of a more specific field, content of the address field placed in the publisher field, etc.), and wrong syntax within fields (authors separated by a comma instead of and, "and al." instead of "and others", etc.).
The base contains French and English references. Table 1 shows that the main types (75%) are inproceedings and article.
Table 1: distribution of the reference types in the database
Type            Occurrences   Share
inproceedings           387     43%
article                 288     32%
book                     61      7%
techreport               51      6%
phdthesis                39      4%
misc                     30      3%
inbook                   23      3%
incollection             16      2%
manual                   10      1%
proceedings               2      0%
booklet                   1      0%
A second-level field is a sub-field added during the conversion from the BibTeX format to XML. Second-level sub-fields exist only implicitly in BibTeX. For example, the a field is a sub-field of the author field.
Since each author is separated from the next by " and ", the authors can be split apart syntactically. The same holds for the editor field, which gives e, and for publisher, which gives pub.
The keywords field contains keywords separated by commas. In XML, they are placed in the k sub-field. This field almost never appears in the physical version of a reference (and never when the plain bibliographic style is used), but keeping its instances in the Concept Network is useful: they link various terms together, which is significant information that helps to activate the right concept more quickly.
The word field is a sub-field of title: the DILIB conversion program (BibTeX2Sgml) splits this field into words, as it does for the fields booktitle, journal, type, chapter, month, and school.
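The splitting rules above can be sketched in a few lines of Python. This is only an illustration, not the BibTeX2Sgml program itself: the function names, the whitespace tokenization of titles, and the sample record are assumptions made for the example.

```python
import re


def split_authors(field: str) -> list[str]:
    """Split an author (or editor) field on the BibTeX " and " separator."""
    parts = re.split(r"\s+and\s+", field.strip())
    return [p.strip() for p in parts if p.strip()]


def split_keywords(field: str) -> list[str]:
    """Split a keywords field on commas."""
    return [k.strip() for k in field.split(",") if k.strip()]


def split_words(field: str) -> list[str]:
    """Split a title-like field on whitespace (assumed tokenization)."""
    return field.split()


# Hypothetical record, for illustration only.
record = {
    "author": "J. Doe and A. N. Other",
    "keywords": "ocr, handwriting, document analysis",
    "title": "A hypothetical title about document analysis",
}

# Second-level sub-fields, named as in the XML version (a, k, word).
second_level = {
    "a": split_authors(record["author"]),
    "k": split_keywords(record["keywords"]),
    "word": split_words(record["title"]),
}
print(second_level)
```

Requiring whitespace on both sides of "and" keeps a name such as "Anderson" in one piece.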
Table 2: distribution of the fields in the 908 references
Field          XML name       # references   # instances   2nd level
Address        address                 381           142
Author         a                       900         1,150   X
Booktitle      bw                      403           348   X
Category                                 6             4   X
Chapter        cword                    22            74   X
Editor         e                        78            60   X
Howpublished   howpublished             31            31
Institution    institution              55            37
Doc journal    jw                      287           135   X
Key            key                       6             6
Keywords       k                       555           405   X
Month          mw                      257            75   X
Note           note                     17            15
Number         number                  195            58
Organization   organization             22            16
Pages          pages                   574           564
Publisher      pub                     154           120   X
Ref            ref                     908           908
School         sw                       40            63
Series         series                    2             2
Title          word                    907         1,917   X
Type           tw                       40            39   X
Volume         volume                  308            82
Year           year                    906            44
Table 2 shows, for each field, the number of references that contain it, the number of distinct instances of the field (for the leaves of the hierarchy), and whether or not it belongs to the second level.
For the first-level fields, note that the number of instances is lower than or equal to the number of references. This is expected: if, for example, the instance "New York" of the address field appears in n references, it is counted only once as an instance.
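The two counts of Table 2 can be illustrated with a small Python sketch. The data layout (one dictionary per reference, mapping a leaf name to the list of its instances) and the sample references are assumptions made for the example.

```python
def field_statistics(references, field):
    """Return (# references containing the field, # distinct instances)."""
    containing = [ref for ref in references if ref.get(field)]
    distinct = {value for ref in containing for value in ref[field]}
    return len(containing), len(distinct)


# Hypothetical references, for illustration only.
references = [
    {"address": ["New York"], "a": ["J. Doe", "A. N. Other"]},
    {"address": ["New York"], "a": ["J. Doe", "B. Third"]},
    {"address": ["Paris"], "a": ["C. Fourth"]},
]

print(field_statistics(references, "address"))  # first-level field
print(field_statistics(references, "a"))        # second-level field
```

Here "New York" is counted once, so address has fewer instances than references, whereas a second-level field such as a can have more instances than references.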
The specific logical part contains the instances of the leaves of the field hierarchy. (Eliminating some instances, such as stop words, on the pretext that they a priori carry no exploitable information would be a bad choice.) Above all, it contains intra-field and inter-field links: there are links between terms of the same field (intra-field links) and also between terms belonging to different fields (inter-field links).
In this way, the network remembers the links that an author (for example) has with other authors, but also those he has with the words of titles (to look for words he often uses), with the names of the journals or conferences in which he is often published, and so on.
All these links are weighted according to the inclusion index formula: I_{i→j} = C_{ij} / C_i.
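As a rough sketch of this weighting, the Python code below computes the inclusion index on a toy base. It assumes that C_ij is the number of references in which terms i and j both appear and that C_i is the number of references containing i; the term prefixes (A:, W:, Y:) and the sample references are illustrative only.

```python
from collections import Counter
from itertools import permutations


def inclusion_weights(references):
    """Map each ordered pair (i, j) that co-occurs to C_ij / C_i."""
    occurrences = Counter()
    cooccurrences = Counter()
    for terms in references:
        terms = set(terms)
        occurrences.update(terms)
        # Ordered pairs, because the index is not symmetric.
        cooccurrences.update(permutations(sorted(terms), 2))
    return {(i, j): c / occurrences[i] for (i, j), c in cooccurrences.items()}


# Hypothetical references reduced to sets of terms.
references = [
    {"A:Suen", "W:ocr", "Y:1995"},
    {"A:Suen", "W:document", "Y:1993"},
    {"A:Srihari", "W:document", "Y:1993"},
]

weights = inclusion_weights(references)
print(weights[("A:Suen", "W:ocr")])  # Suen appears twice, once with "ocr"
print(weights[("W:ocr", "A:Suen")])  # "ocr" appears once, always with Suen
```

The index is directional: the link from a rare term to a frequent one is stronger than the reverse link. Pairs that never share a reference are simply absent from the result, which corresponds to a null weight.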
The specific network is a priori fully connected, which means that, for a model containing 6,295 nodes, the number of links is 2 x 6295^2, that is, about 79 million, which is enormous to manage. However, many of these links have a null weight (those between terms that never appear in the same reference). For example, the intra-field links of the year field are all null, because this field occurs only once in a reference. All links with a null weight are deleted from the model.
Nevertheless, there remain links so weak that they can be neglected. These too are deleted. Since the BAsCET parameters varied during the tuning of the application, the threshold was set experimentally. In practice, for 6,000 terms, only 96,000 links are kept.
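The order of magnitude and the pruning step can be sketched as follows. The threshold value used here is hypothetical, since the text only says that it was set experimentally, and the sample weights are invented for the example.

```python
def prune_links(weights, threshold):
    """Drop null links and links whose weight is below the threshold."""
    return {pair: w for pair, w in weights.items() if w > 0 and w >= threshold}


N_NODES = 6295
FULL_LINK_COUNT = 2 * N_NODES ** 2  # links of the fully connected network
print(FULL_LINK_COUNT)

# Hypothetical weights and threshold, for illustration only.
weights = {("a", "b"): 0.0, ("a", "c"): 0.02, ("b", "c"): 0.6}
kept = prune_links(weights, threshold=0.1)
print(kept)
```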
Figure 1: Specific structure of a Concept Network for references
   +----------+           +-----+             +------+
+->|A:C.Y.Suen|<=========>|W:ocr|<=====   ===>|Y:1995|<=++
|  +----------+           +-----+      \ /    +------+  ||
|                            ^          X               ||
|  +-------------+      +----------+   / \    +------+  ||
+->|A:S.N.Srihari|<====>|W:document|<==   ===>|Y:1993|  ||
   +-------------+      +----------+          +------+  ||
         ^                                              ||
         ||                                             ||
         ++=============================================++
Legend: <====> inter-field links    <----> intra-field links
As for stop words, they are nevertheless added to the model because, even though they carry no meaning, they can help to discriminate between fields: if the string "of" is found, one can be sure that it does not belong to the year field.
So that the terms appearing most often in the base (and thus the ones that can be relied on to obtain good results) are not deactivated too rapidly, their Conceptual Importance (CI) is higher than that of terms appearing less often.
Figure 2: Instance of a hierarchy leaf
DOC
|
contains
|
v
(lea)F
|
instantiated to
|
v
I(nstance)
Let F be the leaf of the hierarchy that is the father of I (its instance, see figure 2), CI(F) its conceptual importance, Occ(I) the number of occurrences of I in the base, and MaxOcc(I) the maximal number of occurrences of any instance of F in the base.
CI(I) = CI(F) + (100 - CI(F)) x Occ(I) / MaxOcc(I)
This formula places the conceptual importance of an instance between its father's conceptual importance and 100, the most frequent instances having the highest conceptual importance. Thus, terms that are sure, that is, those that are in the Concept Network and therefore known to the system, deactivate more slowly. The field to which they belong receives more activation, and thus deactivates more slowly than if its instance had been discovered without being confirmed by the presence in the Concept Network of a term known to belong to it.
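The formula can be checked numerically with a short Python sketch. The function name and the sample values (a father importance of 40 and a maximum of 10 occurrences) are hypothetical.

```python
def conceptual_importance(ci_father: float, occ: int, max_occ: int) -> float:
    """CI(I) = CI(F) + (100 - CI(F)) x Occ(I) / MaxOcc(I)."""
    if max_occ <= 0 or not 0 <= occ <= max_occ:
        raise ValueError("expected 0 <= occ <= max_occ and max_occ > 0")
    return ci_father + (100 - ci_father) * occ / max_occ


# Hypothetical leaf with CI(F) = 40 whose most frequent instance occurs 10 times.
rare = conceptual_importance(40, occ=1, max_occ=10)
frequent = conceptual_importance(40, occ=10, max_occ=10)
print(rare, frequent)
```

An instance seen once stays close to its father's importance, while the most frequent instance of the leaf reaches 100.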
The decay rate of each node depends on the number of its incoming links, and thus on the number of nodes that influence it (see activation propagation).
Indeed, the more influences a node receives, the more likely it is to be activated by one or more of its already activated influencing nodes. That is why the decay rate should be higher for nodes having many incoming links. But if the decay rate were a linear function of the number of incoming links, there would be a problem: not every influencing node is activated at a given moment. There are even nodes that are rarely activated and that would nonetheless greatly influence many nodes if they were. A node appearing rarely in the base would certainly have strongly weighted outgoing links towards all the terms that appeared in the same reference. Yet, if it appeared only once in the base, it would a priori have hardly any chance of being found in a reference to recognize.
Thus, we chose to use a logarithmic scale. Table 3 shows that the number of incoming links of the field instances varies between 1 and 36, with an average of 8.21 incoming links per node (for 5,895 inter-instance links). The more incoming links a node has, the higher its decay rate. The decay rate (DR) of a node depends on the number of its incoming links (IL), according to this formula:
DR = 100 - (100 x ln 3) / ln(3 + IL)
Table 3: Number of incoming links of the instances
Element        Min   Max
a                1    23
address          1    23
bw               1    27
category         3    13
cword            1    24
e                3    25
howpublished     2    11
institution      2    17
jw               1    36
k                1    36
key              3    10
word             1    36
mw               1    28
note             4    15
number           1    17
organization     1    19
pages            1    25
pub              1    26
ref              1    25
sw               3    18
tw               1    19
volume           1    18
year             1    18
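To give a feel for the logarithmic scale, the decay rate formula can be evaluated over the range of incoming links reported in Table 3. This is a minimal sketch; the function name is an assumption.

```python
import math


def decay_rate(incoming_links: int) -> float:
    """DR = 100 - (100 x ln 3) / ln(3 + IL), with IL >= 0."""
    if incoming_links < 0:
        raise ValueError("the number of incoming links cannot be negative")
    return 100 - (100 * math.log(3)) / math.log(3 + incoming_links)


# Minimum, two intermediate values, and maximum number of incoming links.
for il in (1, 6, 8, 36):
    print(il, round(decay_rate(il), 2))
```

With this formula a node with a single incoming link has a decay rate of about 21, a node with 6 incoming links has a rate of 50 (since ln 9 = 2 ln 3), and the most connected nodes (36 incoming links) reach about 70, so the rate grows much more slowly than the number of links.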
Agents are described in a later node (BAsCET Agents for bibliographic reference recognition). For now, the question is what kinds of agents are needed, to which nodes they will be assigned, and what a priori urgency value they will have. The a priori urgency value is the basis for computing the real urgency value of each agent in the coderack: it is multiplied by the activation value of the father node.
Here are the different agent types:
- instance seeker, which searches the Blackboard for all the instances of the specific node that launches it
- separator seeker, which searches the Blackboard for all the instances of the separator node that launches it
- field seeker, which searches the Blackboard for all the instances of the field node that launches it, based on the separators surrounding them
- zone seeker, which searches the Blackboard for all the instances of the field node that launches it, based on what this field can contain
- stop, which decides randomly, according to the temperature, whether the processing has to stop.
There are three types of nodes: fields, specific nodes, and separators. Here are the agent types and the a priori urgency values assigned to each of them:
- a field has three agent types:
  - zone seeker: urgency value = 50, an average value.
  - field seeker: urgency value = 55, so that its priority is slightly higher than that of the zone seeker; it is thus run more often, and the zone seeker, when run later, can correct what the field seeker may have missed.
  - stop: urgency value = 5, so that it is chosen only at the end of a step and no opportunity is missed.
- a specific node has only one agent type:
  - instance seeker: urgency value = 50, an average value lower than that of the field seeker, so that overall priority goes to the generic search for fields. There is no point in looking for known words if it turns out that no term, or very few, is present in the Blackboard.
  The specific nodes are the most numerous in the Concept Network. If each of them had a stop agent, this type of agent would be in a large majority in the coderack, and the behavior of the program would change: stop agents would have a higher probability of being run, so the system would stop much earlier. This premature stop is not desired, so specific nodes have no stop agent.
- a separator also has only one agent type:
  - separator seeker: urgency value = 60, the highest urgency value, because separators are what one can, and must, rely on to find the boundaries between fields. If few known terms are present, it is useless to rely on zone seekers to find fields; the only effective agent is the one that relies on the separators found. The separator seeker thus has to be run before the field seeker (which has an urgency value of 55).
An agent may have nothing to do before a certain step of the processing (the stop agent is not really appropriate in the first step, since one does not expect the problem to be solved in a single step). That is why a new value has been defined: the beginning step. It was set experimentally. The first thing the system looks for is separators, so the earliest beginning step is that of the separator seeker: 0. Then, so that activation values propagate early enough in the specific part of the Concept Network and activate the right fields, the next agents allowed to run are the instance seekers (1). Then come the field seeker (6), the zone seeker (8), and the stop agent (8).
Table 4 summarizes the characteristics of the agent types in the system.
Table 4: Characteristics of the different agent types
Agent name         Abbr.   Urgency   Beginning step   Father node
Stop               ST            5                8   FIELD
Field seeker       FS           55                6   FIELD
Zone seeker        ZS           50                8   FIELD
Instance seeker    IS           50                1   SPECIFIC
Separator seeker   SS           60                0   SEPARATOR
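Table 4 can be written down as a small data structure, together with the two rules stated in the text: an agent type is not run before its beginning step, and its real urgency is the a priori urgency multiplied by the activation of its father node. The class and function names are hypothetical, the activation value in the example is invented, and the sketch assumes an agent may run from its beginning step onward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentType:
    name: str
    abbreviation: str
    urgency: int         # a priori urgency value (Table 4)
    beginning_step: int  # earliest step at which the agent may run
    father_node: str     # type of node that launches the agent


AGENT_TYPES = [
    AgentType("Stop", "ST", 5, 8, "FIELD"),
    AgentType("Field seeker", "FS", 55, 6, "FIELD"),
    AgentType("Zone seeker", "ZS", 50, 8, "FIELD"),
    AgentType("Instance seeker", "IS", 50, 1, "SPECIFIC"),
    AgentType("Separator seeker", "SS", 60, 0, "SEPARATOR"),
]


def runnable_agents(step: int) -> list[AgentType]:
    """Agent types allowed to run at the given step (assumed: step >= beginning)."""
    return [a for a in AGENT_TYPES if step >= a.beginning_step]


def real_urgency(agent: AgentType, father_activation: float) -> float:
    """A priori urgency multiplied by the activation of the father node."""
    return agent.urgency * father_activation


for step in (0, 1, 6, 8):
    print(step, [a.abbreviation for a in runnable_agents(step)])

# Hypothetical activation value of 80 for a separator node.
print(real_urgency(AGENT_TYPES[4], 80))
```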
Disclaimer: as I don't speak English fluently, I welcome any suggestions for improving these writeups.