This writeup details BAsCET's application on bibliographic references recognition. See BAsCET for an overview.
The zone seeker agent has to detect a coherent zone in which almost all elements belong to the Father field. It first locates the nodes that are likely to become descriptions of that field, delimits the zone, and then tries to clean it, that is, to eliminate the useless or foreign nodes that are not homogeneous enough with the rest.
Like the FS (field seeker), the ZS (zone seeker) first looks for the instance of a field containing the field to find that is hierarchically nearest in the Concept Network. For example, when looking for a zone corresponding to the a field, it first searches the descriptions of author, if an instance of this field exists; otherwise it searches the descriptions of the doc (root) field.
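The lookup of the nearest containing instance can be pictured as a walk up the hierarchy. The sketch below is only an illustration: the PARENT table, the "firstname" field and the function name are hypothetical, and the root is assumed to always have an instance.

```python
# Hypothetical toy Concept Network: each field maps to the field containing it.
PARENT = {"firstname": "author", "author": "doc", "booktitle": "doc"}

def nearest_containing_field(field, instantiated):
    """Return the nearest ancestor of `field` that already has an instance.

    `instantiated` is the set of field names with an instance in the
    Blackboard. The root (a node without parent) is used as a fallback,
    assuming it always exists. Returns None if `field` is the root itself.
    """
    node = PARENT.get(field)
    while node is not None:
        if node in instantiated:
            return node
        if node not in PARENT:
            return node  # root reached
        node = PARENT[node]
    return None
```

With these assumptions, a search for "firstname" goes to "author" when an author instance exists, and falls back to "doc" otherwise.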
Among the descriptions of the containing field that was found, the agent selects those that could be of interest in the search for the Father field. These are the descriptions that, in the Concept Network, lie hierarchically below the Father node (or an already existing, possibly incomplete, instance of this node). These compatible descriptions are used to delimit the zone to search.
If no such description exists, the agent inhibits itself for one cycle.
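A minimal sketch of this selection step follows. The Description class, the PARENT table and the function names are hypothetical. The text does not say how the zone boundaries are derived from the compatible descriptions, so the sketch assumes the zone spans from the first to the last of them.

```python
from dataclasses import dataclass

@dataclass
class Description:
    begin: int
    end: int
    node: str  # Concept Network node of this object (hypothetical encoding)

# Hypothetical toy hierarchy: node -> node directly above it.
PARENT = {"bw": "booktitle", "booktitle": "doc", "publisher": "doc",
          "sep:month-year": "date", "date": "doc"}

def lies_below(node, father):
    """True if `father` is an ancestor of `node` in the toy hierarchy."""
    while node in PARENT:
        node = PARENT[node]
        if node == father:
            return True
    return False

def delimit_zone(descriptions, father):
    """Return (begin, end) of the zone, or None when the agent should
    inhibit itself for one cycle because nothing is compatible."""
    compatible = [d for d in descriptions
                  if d.node == father or lies_below(d.node, father)]
    if not compatible:
        return None
    # Assumption: the zone runs from the first to the last compatible object.
    return min(d.begin for d in compatible), max(d.end for d in compatible)

descriptions = [Description(166, 170, "bw"),
                Description(171, 171, "sep:month-year"),
                Description(172, 189, "booktitle"),
                Description(190, 190, "publisher"),
                Description(191, 191, "bw")]
```

Here delimit_zone(descriptions, "booktitle") covers positions 166 to 191, while a Father node with no compatible description yields None.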
Next, it has to compute a coherence score for every object inside the delimited zone. To avoid deleting relevant objects, it also computes a coherence score for the objects that would not, a priori, be compatible with the field being sought.
The agent first defines a preliminary score for each object, according to Algorithm 1. For each object i included in the zone just delimited, the preliminary coherence score depends on the scores of the similar objects (here, those subordinate to the same Concept Network node), on the length of these objects, and on their distance to i. Indeed, the more coherent a zone is, the more its objects tend to have the same type, that is, to be sub-fields or separators of the same field. The score also takes into account the happiness of these objects, because a badly recognized object can lie among the others, preventing the recognition of an interesting zone or allowing the recognition of a zone of the wrong type (another field). Finally, it takes into account the distance of the other objects: the nearer an object is to i, the more it counts, as one would expect of a coherence score.
Algorithm 1: Computing the preliminary coherence score of an object.
For Each object i inside the zone Do
    LeftScore = 0
    For Each object j to the left of i inside the zone Do
        If i and j are subordinate to the same node of the Concept Network Then
            Distance = begin(i) - end(j)
            LeftScore += happiness(j) x length(j) / Distance
        End If
    End For j
    RightScore = 0
    For Each object j to the right of i inside the zone Do
        If i and j are subordinate to the same node of the Concept Network Then
            Distance = begin(j) - end(i)
            RightScore += happiness(j) x length(j) / Distance
        End If
    End For j
    If i is a separator Then
        ScoreSep = 1
    Else
        ScoreSep = 0
    End If
    If i is isolated Then
        Isolated = 1
    Else
        Isolated = 0
    End If
    If i is partially isolated (one separator is missing) Then
        PartIsolated = 1
    Else
        PartIsolated = 0
    End If
    Score = LeftScore + RightScore + length(i) x (1 + Isolated + PartIsolated + ScoreSep x (Isolated + 0.5))
    If i is an instance of Father Then
        Score += LeftScore/2 + RightScore/2
    End If
End For i
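Algorithm 1 can be transcribed into Python roughly as follows. This is a sketch under stated assumptions, not the original implementation: the Obj class and its attribute names are hypothetical, positions are taken as inclusive (so an object at 171-171 has length 1), and objects are assumed not to overlap, so that Distance is never zero.

```python
from dataclasses import dataclass

@dataclass
class Obj:
    begin: int
    end: int
    node: str                        # Concept Network node the object is subordinate to
    happiness: float
    is_separator: bool = False
    isolated: bool = False
    part_isolated: bool = False      # one separator is missing
    is_father_instance: bool = False

    @property
    def length(self):
        # Assumption: begin and end are inclusive positions.
        return self.end - self.begin + 1

def preliminary_scores(zone):
    """Return the preliminary coherence score of every object in `zone`."""
    scores = []
    for i in zone:
        # Same-node objects to the left and right, weighted by happiness
        # and length, and damped by their distance to i.
        left = sum(j.happiness * j.length / (i.begin - j.end)
                   for j in zone if j.end < i.begin and j.node == i.node)
        right = sum(j.happiness * j.length / (j.begin - i.end)
                    for j in zone if j.begin > i.end and j.node == i.node)
        sep = 1 if i.is_separator else 0
        iso = 1 if i.isolated else 0
        part = 1 if i.part_isolated else 0
        score = left + right + i.length * (1 + iso + part + sep * (iso + 0.5))
        if i.is_father_instance:
            score += left / 2 + right / 2
        scores.append(score)
    return scores

zone = [Obj(0, 4, "x", 1.0),
        Obj(5, 5, "y", 1.0, is_separator=True),
        Obj(6, 9, "x", 0.5)]
print(preliminary_scores(zone))  # [6.0, 1.5, 6.5]
```

In this toy zone the two "x" objects reinforce each other, while the separator between them only receives its own length-based term.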
Next, the final score is computed from this preliminary coherence score. If the object to the left of the current object i has the same type (it is subordinate to the same field), its preliminary score is added to that of i. If not, its score is subtracted from that of i. Thus, when many objects of the same type lie side by side, they get a positive score.
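The combination with the left neighbour can be sketched as below. The function name and the list-based representation are hypothetical, and the text does not say what happens to the leftmost object, so the sketch assumes it keeps its preliminary score.

```python
def final_scores(types, prelim):
    """Combine each preliminary score with that of the left neighbour.

    `types[k]` is the field the k-th object is subordinate to and
    `prelim[k]` its preliminary score, both in left-to-right order.
    """
    final = []
    for k, score in enumerate(prelim):
        if k == 0:
            # Assumption: no left neighbour, the score is kept unchanged.
            final.append(score)
        elif types[k - 1] == types[k]:
            final.append(score + prelim[k - 1])
        else:
            final.append(score - prelim[k - 1])
    return final

print(final_scores(["a", "a", "b", "a"], [3, 2, 1, 4]))  # [3, 5, -1, 3]
```

The third object, of a different type than its left neighbour, ends up with a negative score, which is what later makes it a candidate for deletion.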
Then, the descriptions whose final score is negative are eliminated from the Blackboard (unless they are subordinate to the field to find). The agent counts the number of remaining candidates of the right type, and also the number of those that do not belong to the right field.
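The cleaning and counting step might look like the following sketch. The Candidate type and the scores in the example are hypothetical; only their signs follow the sample of Figure 1.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    content: str
    final_score: float
    belongs_to_father: bool  # subordinate to the field to find

def clean_zone(candidates):
    """Drop negative-score candidates that are foreign to the Father field.

    Returns the kept candidates, the number of right-type candidates and
    the number of remaining candidates of another field.
    """
    kept = [c for c in candidates
            if c.final_score >= 0 or c.belongs_to_father]
    right = sum(1 for c in kept if c.belongs_to_father)
    wrong = len(kept) - right
    return kept, right, wrong

# Hypothetical scores; only the signs mirror the sample.
candidates = [Candidate("Actes", 9.0, True),
              Candidate(" ", -2.0, False),
              Candidate("11eme Colloque GRE", 11.0, True),
              Candidate("T", -1.0, False),
              Candidate("S", 3.0, True)]
kept, right, wrong = clean_zone(candidates)
```

With these values the separator and the one-letter field are dropped, leaving three candidates of the right type and none of another field.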
If only candidates belonging to the Father node remain, the object is built: all the candidate descriptions are moved from the containing field to the newly built field, and the new object is in turn added as a description of the containing field.
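As a rough illustration of this building step, the sketch below represents Blackboard objects as plain dictionaries. This representation and the function name are assumptions made for the example only.

```python
def build_field(container, candidates, field_name):
    """Build a new field from `candidates` and attach it to `container`.

    The candidates are removed from the container's descriptions and
    become the descriptions of the new field, which is then added as a
    description of the container.
    """
    new_field = {"field": field_name, "descriptions": list(candidates)}
    # Remove the moved candidates by identity, keeping the other descriptions.
    container["descriptions"] = [
        d for d in container["descriptions"]
        if not any(d is c for c in candidates)
    ]
    container["descriptions"].append(new_field)
    return new_field

actes = {"content": "Actes"}
title = {"content": "11eme Colloque GRE"}
other = {"content": "Nice"}
doc = {"field": "doc", "descriptions": [actes, title, other]}
booktitle = build_field(doc, [actes, title], "booktitle")
```

After the call, doc holds the untouched description and the new booktitle field, and the two candidates are reachable only through that field.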
Figure 1: A zone detection sample.
Location Content              Father node      Score
-------- -------------------- ---------------- ------
166-170  "Actes"              field:bw            969
171-171  " "                  sep:month-year   -2,467
172-189  "11eme Colloque GRE" field:booktitle   1,108
190-190  "T"                  field:publisher  -1,895
191-191  "S"                  field:bw            397
Delete " " (sep:month-year: ,70)
Delete "T" (field:publisher,90)
BUILDING OF THE FIELD "field:booktitle"
Add "Actes" to the node "Actes 11eme Colloque GRETS" (field:booktitle)
Raise the descriptions of "11eme Colloque GRE" one step
Delete object "T" (field:pub)
Add "S" to the node "Actes 11eme Colloque GRETS" (field:booktitle)
Add a description "Actes 11eme Colloque GRETS" to
"<Times-Roman>B. Wrobel and O. Monga. Segmentation d'images naturelles :
cooperation entre un detection-contour et un detecteur-region. In
</Times-Roman><Times-Italic>Actes 11eme Colloque GRETSI</Times-Italic>
<Times-Roman>, Nice, June 1987.</Times-Roman>"
In the sample of Figure 1, the agent deletes two objects that do not belong in the delimited zone: a badly located separator (and thus badly recognized, or less happy than the objects surrounding it), and a very badly recognized field containing only one letter (too short to be significant for a field such as publisher). It builds a booktitle field extending from the first marked object that belongs to booktitle, a "bw" (a conference title word), to another one. One can see here that a booktitle field already existed, perhaps created by another ZS agent. This field is incorporated into the new one by raising its descriptions one step. The "field:pub:T" object, which was a description of the already deleted publisher field, is deleted as well.
Note that the built field is still incomplete (it lacks one letter). The separator seeker agent, relying only on existing objects, could not "know" that GRETSI was better than GRETS, unless a Concept Network node matching GRETSI had an instance in the Blackboard.
See also: Building Hierarchical Structures in the Blackboard, Building the logical part of a Concept Network representing bibliographic references, Building a Concept Network to represent bibliographic references, BAsCET's application on bibliographic references recognition, BAsCET.