This writeup details BAsCET's application on bibliographic references recognition. See BAsCET for an overview.

This agent searches the Blackboard for the string that forms a separator. The separator is represented by the name of the separator node (Father) that launched the agent into the Coderack. This Concept Network node is passed to the seeker as a parameter; its name has the structure sep:field1-field2:string. As shown in Figure 1, this means that the Father node has an incoming precedes-type link from field1, an outgoing precedes-type link to field2, and an incoming contains-type link from field3. field3 is the father node of the Blackboard objects in which the agent has to search for the string (string) that separates field1 and field2.


Figure 1: Separator seeker agent's parameter is the Father node.

                                    +--------------+
                                    |    field3    |
                                    |  field:name  |
                                    +--------------+
                                            | contains
                                            v
        +----------+           +------------------------+           +----------+
        |  field1  | precedes  |         Father         | precedes  |  field2  |
        |field:name|---------->|sep:field1-field2:string|---------->|field:name|
        +----------+           +------------------------+           +----------+
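
The naming convention described above can be illustrated with a minimal Python sketch that splits a Father node name into its three parts. The function and type names are hypothetical, and the sketch assumes that field names contain neither ':' nor '-', so that only the separator string itself may contain those characters; the source does not state this.

```python
from typing import NamedTuple


class SeparatorName(NamedTuple):
    """Parts of a node name of the form sep:field1-field2:string."""
    field1: str
    field2: str
    string: str


def parse_separator_name(name: str) -> SeparatorName:
    # Split on the first two colons only, so the separator string may
    # itself contain ':' (assumption: field names never do).
    parts = name.split(":", 2)
    if len(parts) != 3 or parts[0] != "sep":
        raise ValueError(f"not a separator node name: {name!r}")
    fields, string = parts[1], parts[2]
    # Split on the first hyphen only (assumption: field1 contains no '-').
    field1, sep, field2 = fields.partition("-")
    if not sep:
        raise ValueError(f"missing field1-field2 pair in: {name!r}")
    return SeparatorName(field1, field2, string)
```

For instance, parse_separator_name("sep:AUTHOR-TITLE:. ") yields the fields AUTHOR and TITLE and the separator string made of a period followed by a space.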

The agent therefore begins by looking up the field3 node in the Concept Network; call it the Container node. Next, it looks for an instance of this node in the Blackboard; call it the IContainer. If no such instance exists, the agent deactivates the Father node: the agent should not have been run, because the field supposed to contain the separator has not yet been found. In addition, no similar agent is allowed to run before the next cycle, since several similar agents may have been launched by the same node and be waiting in the Coderack.

The ToSearch string is the content of the IContainer object. The agent searches this string for ToFind, the string part of the Father node's name. The search is approximate and returns every match above a threshold of 90%. For each match it reports the matched string, its location (beginning and end in the ToSearch string), and its matching score (between 90 and 100%). This score is then refined using the separator location statistics collected during the automatic extraction of the separators, which is part of the Concept Network building process.
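
The approximate search can be sketched as follows. The source does not say which similarity measure the agent uses, so this sketch substitutes Python's difflib.SequenceMatcher ratio, computed on sliding windows of the same length as ToFind; the function name and the fixed window length are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def approximate_matches(to_search: str, to_find: str, threshold: float = 90.0):
    """Return (start, end, score) for each window scoring >= threshold.

    The score is a percentage; the similarity measure is an assumption,
    not the one used by the original agent.
    """
    n = len(to_find)
    if n == 0:
        return []
    matches = []
    for start in range(len(to_search) - n + 1):
        window = to_search[start:start + n]
        ratio = SequenceMatcher(None, window, to_find, autojunk=False).ratio()
        score = 100.0 * ratio
        if score >= threshold:
            matches.append((start, start + n, score))
    return matches
```

Searching "Smith, J. A title. 1999" for a period followed by a space returns two candidates, one after the initial and one after the title, which is the kind of ambiguity the location statistics are meant to resolve.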

The formula 10 x Location / Reference's length, rounded, gives the location of a separator within a reference; it divides the reference into 11 parts of equal size.

The amount added to the score, called AddToScore, depends on the number of occurrences of this separator at the location where the string was found. Let T be the total number of occurrences of this separator in the physical versions of all the references in the base, and let N be the number of those occurrences at the same location. The value the agent adds to the score of each found string is 30 x N / T.

In this way, the string's score is improved on a reliable statistical basis (provided the reference base is sufficiently large).

The final score of the string is 70 x Score / 100 + AddToScore.
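
The three formulas above can be combined in a short Python sketch. The function names are hypothetical, and two points are assumptions: Location is taken to be the start offset of the match, and rounding uses Python's built-in round, which rounds exact halves to the nearest even integer and may differ from the original implementation.

```python
def location_bin(location: int, reference_length: int) -> int:
    """10 x Location / Reference's length, rounded: a value from 0 to 10."""
    return round(10 * location / reference_length)


def add_to_score(n_at_location: int, total: int) -> float:
    """AddToScore = 30 x N / T (defined as 0 when T is 0, an assumption)."""
    if total == 0:
        return 0.0
    return 30 * n_at_location / total


def final_score(score: float, n_at_location: int, total: int) -> float:
    """Final score = 70 x Score / 100 + AddToScore."""
    return 70 * score / 100 + add_to_score(n_at_location, total)
```

With made-up statistics, a perfect match (Score = 100) at a location holding 45 of the separator's 50 known occurrences gets 70 + 27 = 97, while a 90% match at a location where the separator was never observed gets 63. The final score therefore lies between 63 and 100.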

Every string found leads to the creation of a separator-type object in the Blackboard. For the first-level field separators, which appear only once (in the case of this database), this may look like a waste of resources, since one could keep only the best candidate. Doing so raised problems, however. For example, when the agent looks for the separator "AUTHOR-TITLE:. " and there are several authors in the AUTHOR field, it finds several strings made of a period followed by a space, because the authors' initials have the same form. In the case of a non-canonical reference (a reference that does not follow the separator location statistics), the wrong string would be chosen and the authors' field cut short. The system does not rely on separators alone to determine the location of the fields, though: it also uses the zone seeker. Once a single author has been found, that is enough to obtain a much better result.

Finally, when the agent has found at least one separator, it is inhibited for four cycles. It could be inhibited until the end of processing, but because the Blackboard changes constantly, it is better to relaunch the agent to check that the objects found have not been deleted and whether a new field needs to be searched. The agent also reactivates the field seekers of the field(s) preceding and following the found separators, in case they had been inhibited.

When no separator was found, the Father node is deactivated, and its agent is inhibited for two cycles.
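
The three possible outcomes of a run can be summarised in a small Python sketch. The Outcome type and the function name are hypothetical, and the value of one cycle in the first case is only a way of modelling "no similar agent before the next cycle"; the source gives no number for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    deactivate_father: bool
    inhibit_cycles: int
    reactivate_field_seekers: bool


def seeker_outcome(container_found: bool, match_count: int) -> Outcome:
    if not container_found:
        # No IContainer in the Blackboard: the agent should not have run.
        # One cycle is an assumption standing in for "until the next cycle".
        return Outcome(deactivate_father=True, inhibit_cycles=1,
                       reactivate_field_seekers=False)
    if match_count > 0:
        # At least one separator found: inhibit for four cycles and wake
        # the seekers of the preceding and following fields.
        return Outcome(deactivate_father=False, inhibit_cycles=4,
                       reactivate_field_seekers=True)
    # Container present but no separator found.
    return Outcome(deactivate_father=True, inhibit_cycles=2,
                   reactivate_field_seekers=False)
```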


See also: Building Hierarchical Structures in the Blackboard, Building the logical part of a Concept Network representing bibliographic references, Building a Concept Network to represent bibliographic references, BAsCET's application on bibliographic references recognition, BAsCET.