Search functionality (inclusive of ranking) is handled by the embedded IB sub-system. IB is a development of BSn's NONMONOTONIC Lab in Munich.
Mainstream search engines are about finding any information: "a list of all documents containing a specific word or phrase". Because of this, search engines paradoxically return both too much information (i.e., long lists of links) and too little information (i.e., links to content, not content itself). IB, by contrast, is about exploiting document structure, both implicit (XML and other markup) and explicit (visual groupings such as paragraphs), to zero in on relevant sections of documents, not just links to documents.
IB intelligently provides information search and retrieval services for text, data (a large number of standard types such as numerical, ranges, dates etc.), images/video/audio, geographic information, network objects and databases. It exploits document structure, both implicit (XML and other markup) and explicit (visual groupings such as paragraphs), to zero in on and retrieve relevant information.
IB Key Features and Benefits:
The default mode is to index all the words and all the structure of documents. It provides powerful and fast search without prior knowledge about the content yet enables arbitrarily complex questions across all the content and from different perspectives. Not bound by the constraints of "records" as the unit of information, one can immediately derive value from content with the flexibility to enhance content and the application incrementally over time without "breaking anything".
IB was designed from the ground up to address three key goals: universal SGML/XML (and other document formats) hierarchical/context search, distributed objects (transparent integrated views to other sources of information such as relational DBs, search services and object brokers) and to provide optimal support for features (current and future) of the ISO 23950 (ANSI/NISO Z39.50) Information Retrieval Protocol services standard to allow for standard interoperable interfaces.
[More on engine design concept and motivations]
Despite its speed, power and flexibility, IB is tiny. It runs well on everything from low power embedded processors (as in our BLAU "information router" based on AMD's Geode) to supercomputer clusters and from traditional Unix environments (Solaris, Tru64, HP-UX, BSD) to Linux to even Microsoft's Windows (XP/Vista/Windows-7).
IB is embedded (as dynamically linked shared libs) into language interpreters such as Python, TCL, Perl, Ruby, PHP, Java and others via SWIG integration. In this application we have an interface (Drupal module) in PHP talking to (remote) HTTP-based servers. The search server was written in IB-embedded Python. On the search page there is a link request: which shows the server in action.
Let's look a bit closer, by way of example, at some XML fragments (from the SGML/XML markup of Shakespeare's works by Jon Bosak):
<SPEECH>
<SPEAKER>LADY MACBETH</SPEAKER>
<LINE>Out, damned spot! out, I say!--One: two: why,</LINE>
<LINE>then, 'tis time to do't.--Hell is murky!--Fie, my</LINE>
<LINE>lord, fie! a soldier, and afeard? What need we</LINE>
<LINE>fear who knows it, when none can call our power to</LINE>
<LINE>account?--Yet who would have thought the old man</LINE>
<LINE>to have had so much blood in him.</LINE>
</SPEECH>
First off, we have an idea of "nearness": being in the same leaf. The words "out" and "spot" are in the same node (with path ...\SPEECH\LINE ). Its named SPEECH ancestor is the above speech, the only speech in all of Shakespeare's works to have the words "out" and "spot" in the same LINE. The SPEAKER descendant of that SPEECH is "LADY MACBETH".
The word "spot" is said within the works, by contrast, in many other speeches by speakers in addition to Lady Macbeth: SALISBURY in `The Life and Death of King John', BRUTUS as well as ANTONY in `The Tragedy of Julius Caesar', MISTRESS QUICKLY in `The Merry Wives of Windsor', VALERIA in `The Tragedy of Coriolanus', ROSALIND in `As You Like It' and MARK ANTONY in `The Tragedy of Antony and Cleopatra'.
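The idea of "same leaf, then named ancestor, then a descendant of that ancestor" can be sketched with Python's standard XML library. This is only an illustration of the concept on a small fragment, not IB's implementation or API; the function name `speakers_with_line_containing` and the crude word splitting are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Small fragment in the style of Jon Bosak's markup (abbreviated).
SCENE_XML = """
<SCENE>
  <SPEECH>
    <SPEAKER>LADY MACBETH</SPEAKER>
    <LINE>Yet here's a spot.</LINE>
  </SPEECH>
  <SPEECH>
    <SPEAKER>LADY MACBETH</SPEAKER>
    <LINE>Out, damned spot! out, I say!--One: two: why,</LINE>
    <LINE>then, 'tis time to do't.--Hell is murky!--Fie, my</LINE>
  </SPEECH>
</SCENE>
"""

def words(text):
    """Very rough tokenizer: lower-cased runs of letters and apostrophes."""
    return set(re.findall(r"[a-z']+", text.lower()))

def speakers_with_line_containing(root, *terms):
    """Return (speaker, line) pairs where ALL terms occur in the same LINE leaf."""
    wanted = {t.lower() for t in terms}
    hits = []
    for speech in root.iter("SPEECH"):          # the named ancestor
        for line in speech.findall("LINE"):     # the leaf holding the words
            if wanted <= words(line.text or ""):
                hits.append((speech.findtext("SPEAKER"), line.text))
    return hits

root = ET.fromstring(SCENE_XML)
same_line = speakers_with_line_containing(root, "out", "spot")
any_spot = speakers_with_line_containing(root, "spot")
print(same_line)
print(len(any_spot))
```

Asking for "spot" alone matches both speeches, while requiring "out" and "spot" in the same LINE narrows the result to the single speech.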
<Videos>
<Video ASIN="B0000DJ7G9">
<Title Screenplay="William Shakespeare" Alt="Othello">The Tragedy of Othello: The Moor of Venice</Title>
<Director>Orson Welles</Director>
<Length>90 minutes</Length>
<Format>DVD</Format>
<Certification>U</Certification>
</Video>
</Videos>
In the above, the value of the Screenplay attribute of Title is William Shakespeare. In IB one can search for the content of the tag (TITLE), the content of an attribute (mapped by default to the name TITLE@SCREENPLAY in this example) but also for information like films with Venice in the title whose screenplay was by Shakespeare. Here we appeal to an implicit direct parent (an abstract virtual container that contains both the attribute and the field data).
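A rough way to picture the "virtual container" is to treat an element's text and each of its attributes as sibling fields of one abstract parent. The sketch below does this with Python's standard library; only the TITLE@SCREENPLAY naming comes from the text above, while `virtual_fields`, `matches` and `search` are hypothetical helper names and not part of IB.

```python
import xml.etree.ElementTree as ET

VIDEOS_XML = """
<Videos>
  <Video ASIN="B0000DJ7G9">
    <Title Screenplay="William Shakespeare" Alt="Othello">The Tragedy of Othello: The Moor of Venice</Title>
    <Director>Orson Welles</Director>
    <Length>90 minutes</Length>
    <Format>DVD</Format>
    <Certification>U</Certification>
  </Video>
</Videos>
"""

def virtual_fields(elem):
    """Map an element to its fields: TAG for the text, TAG@ATTR per attribute."""
    tag = elem.tag.upper()
    fields = {tag: (elem.text or "").strip()}
    for name, value in elem.attrib.items():
        fields[f"{tag}@{name.upper()}"] = value
    return fields

def matches(elem, conditions):
    """True if every (field, term) condition holds within this one container."""
    fields = virtual_fields(elem)
    return all(term.lower() in fields.get(name, "").lower()
               for name, term in conditions.items())

def search(root, conditions):
    """Return the ASIN of each Video with a child satisfying all conditions."""
    return [video.get("ASIN")
            for video in root.iter("Video")
            if any(matches(child, conditions) for child in video)]

root = ET.fromstring(VIDEOS_XML)
found = search(root, {"TITLE": "Venice", "TITLE@SCREENPLAY": "Shakespeare"})
print(found)
```

Because both conditions are checked against the same container, a title containing "Venice" only counts if that very title also carries the matching Screenplay attribute.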
XML does not give us any well-defined order among siblings (the different LINEs). The XML 1.0 well-formedness definition specifically states that attributes are unordered and says nothing about elements. Document order (how the elements are marked up) and the order in which a conforming XML parser might decide to report the child elements of SPEECH need not be the same. Most systems reading XML from disk with popular parsers typically deliver it in the same order, but the standard does not specify that they must, and for good reason.
See also: Presentation Methods and Elements.
<SPEECH>
<SPEAKER>LADY MACBETH</SPEAKER>
<LINE>Yet here's a spot.</LINE>
</SPEECH>
These "spot"s are in "PLAY\ACT\SCENE\SPEECH\LINE"
In XML we not only have a parent/child ancestry of nodes but also, within nodes, a linear ordered relationship. One letter follows the next and one word follows the other. In the above example "Yet" precedes "here's", "a" follows after, and the line finishes with "spot". We have order and at least a qualitative (intuitive) notion of distance.
One could then specify an inclusion (within the same unnamed or named field or path), an order and even a character (octet) metric.
We have not attempted to implement a word metric as the concept of a word is more complicated than commonly held. Is edz@bsn.com a single word? Two words? Maybe even three? What about a URL? Hyphenation as in "auto-mobile"? Two words? On the other hand, what does such a distance mean?
Our metric of distance is defined by the file offsets (octets) of the record as it is stored on the file system. This too is less than clear cut, as the rendering of Überzeugung and &Uuml;berzeugung is equivalent but their lengths are different.
zum Thema Präsentieren und Überzeugen
The file system offset between the words "Thema" and "und" depends upon how "Präsentieren" was encoded: as UTF-7, as UCS-2, as Latin-1, with entities such as &uml; and &Uuml;, or in the numeric &#nnn; form. Does one, instead, treat the rendered level? This too is misleading since different output devices might have quite different layouts. What about columns? And the tags?
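The effect is easy to reproduce. The following sketch measures the octet distance between "Thema" and "und" for the same phrase under a few encodings; the helper names are made up for the illustration, and the entity form (&auml;, &Uuml;) is one assumed way the text might have been stored.

```python
TEXT = "zum Thema Präsentieren und Überzeugen"

def octet_distance(data: bytes, first: bytes, second: bytes) -> int:
    """Octets from the start of `first` to the start of `second` in `data`."""
    return data.index(second) - data.index(first)

def distance_in(codec: str, text: str = TEXT) -> int:
    """Encode text and both search words with the same codec, then measure."""
    return octet_distance(text.encode(codec),
                          "Thema".encode(codec),
                          "und".encode(codec))

# The same phrase with its non-ASCII letters written as entities.
ENTITY_TEXT = TEXT.replace("ä", "&auml;").replace("Ü", "&Uuml;")

distances = {
    "latin-1": distance_in("latin-1"),
    "utf-8": distance_in("utf-8"),
    "utf-16-le": distance_in("utf-16-le"),
    "entities": distance_in("ascii", ENTITY_TEXT),
}
for name, value in distances.items():
    print(f"{name:10} {value} octets")
```

The four results are 19, 20, 38 and 24 octets respectively, although every variant renders as the same words on screen.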
Tags are, of course, even worse with words since we have started to associate a semantics for distance. Look at an XML fragment like
<name>Edward C. Zimmermann</name><company>NONMONOTONIC Lab</company>
What is the word distance between "Lab" and "Edward", keeping in mind that the XML markup is equivalent whether NAME is specified before or after COMPANY?
<company>NONMONOTONIC Lab</company><name>Edward C. Zimmermann</name>
These have equivalent content, and the word distance? That's right: it's not well defined!
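To make the point concrete, here is a naive word-distance measure applied to both orderings. It strips the tags, splits on whitespace and counts positions, which is exactly the kind of simplistic definition the discussion above warns about; `word_distance` is a hypothetical helper and nothing IB provides.

```python
import re

def word_distance(fragment: str, first: str, second: str) -> int:
    """Naive metric: positions apart after replacing tags with spaces."""
    words = re.sub(r"<[^>]+>", " ", fragment).split()
    return abs(words.index(first) - words.index(second))

name_first = "<name>Edward C. Zimmermann</name><company>NONMONOTONIC Lab</company>"
company_first = "<company>NONMONOTONIC Lab</company><name>Edward C. Zimmermann</name>"

d_name_first = word_distance(name_first, "Edward", "Lab")
d_company_first = word_distance(company_first, "Edward", "Lab")
print(d_name_first, d_company_first)
```

The same content yields a distance of 4 in one ordering and 1 in the other.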
Order and inclusion do make sense, even some rough guide to distance but words? We are, of course, open to convincing examples!
GetNodeOffsetCount(GPTYPE HitGp, STRING FieldNameorPath= "", FC *Fc= NULL)
where HitGp is the address of a hit, FieldNameorPath is the name (or path) of the container we are interested in, and Fc (optional) receives its calculated start/end coordinates.
This is useful to be able to determine which page (paragraph or line) a hit is on but may also be used in an abstract sense.
Example:
<ingredients>
<item>Chocolate</item>
<item>Flour</item>
<item>Butter</item>
<item>eggs</item>
</ingredients>
The order is the order in the mark-up. Chocolate might be the 25th item in the document but it's the first item in this particular instance of INGREDIENTS. This is possible since we can check the count of the instance of INGREDIENTS and, looking at the previous instance, also count the offset of ITEM.
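The difference between "position in the document" and "position in this instance of the container" can be sketched as follows. This is plain Python over a made-up two-recipe document (the first recipe is invented for the example) and is not how GetNodeOffsetCount is implemented; `item_positions` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the second <ingredients> is the example above.
DOC = """
<recipes>
  <recipe>
    <ingredients>
      <item>Sugar</item>
      <item>Milk</item>
    </ingredients>
  </recipe>
  <recipe>
    <ingredients>
      <item>Chocolate</item>
      <item>Flour</item>
      <item>Butter</item>
      <item>eggs</item>
    </ingredients>
  </recipe>
</recipes>
"""

def item_positions(root, container="ingredients", field="item"):
    """Map item text -> (container instance, offset in instance, count in document)."""
    positions = {}
    doc_count = 0
    for instance_no, node in enumerate(root.iter(container), start=1):
        for offset, child in enumerate(node.findall(field), start=1):
            doc_count += 1
            positions[child.text] = (instance_no, offset, doc_count)
    return positions

positions = item_positions(ET.fromstring(DOC))
print(positions["Chocolate"])
```

Here Chocolate is the third item in the document but the first item of the second INGREDIENTS instance.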
In conventional search/retrieval systems the fundamental unit of recall is the "record" defined prior to indexing. If, for example, one indexed the complete works of Shakespeare on a per-play basis then searching for terms would only return the unit of recall play (or an element thereof, for example, its TITLE).
In IB, by contrast, one can define in a search query the unit of recall as an ancestor/descendant of the found (matched) elements. One can easily request (in the Shakespeare, index-per-play, example) the names of the acts (and the text of the acts) that match a query, just as one can ask for the speech (and its speaker) or even the lines as the unit of recall.
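As a minimal sketch of choosing the unit of recall at query time, the code below finds matching LINE leaves and then walks up to whichever ancestor the caller asks for. The play is heavily abbreviated, the matching is a plain substring test, and `recall` is a hypothetical function that illustrates the idea rather than IB's query interface.

```python
import xml.etree.ElementTree as ET

# Abbreviated play; only the two lines quoted earlier are included.
PLAY_XML = """
<PLAY>
  <TITLE>The Tragedy of Macbeth</TITLE>
  <ACT>
    <TITLE>ACT V</TITLE>
    <SCENE>
      <TITLE>SCENE I</TITLE>
      <SPEECH>
        <SPEAKER>LADY MACBETH</SPEAKER>
        <LINE>Yet here's a spot.</LINE>
      </SPEECH>
      <SPEECH>
        <SPEAKER>LADY MACBETH</SPEAKER>
        <LINE>Out, damned spot! out, I say!--One: two: why,</LINE>
      </SPEECH>
    </SCENE>
  </ACT>
</PLAY>
"""

def recall(root, word, unit):
    """Return the distinct `unit` ancestors (or the LINEs) of lines containing word."""
    parent = {child: p for p in root.iter() for child in p}
    results = []
    for leaf in root.iter("LINE"):
        if word.lower() in (leaf.text or "").lower():
            node = leaf
            while node is not None and node.tag != unit:
                node = parent.get(node)     # climb towards the root
            if node is not None and node not in results:
                results.append(node)
    return results

root = ET.fromstring(PLAY_XML)
lines = [n.text for n in recall(root, "spot", "LINE")]
speakers = [n.findtext("SPEAKER") for n in recall(root, "spot", "SPEECH")]
acts = [n.findtext("TITLE") for n in recall(root, "spot", "ACT")]
print(lines, speakers, acts)
```

The same index and the same query term give two lines, two speeches or one act, depending only on the unit requested.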
This functionality is mission critical to, among other fields, the search of literature and legal texts.
In IBU News it's used to distinguish between stories (<item>s in RSS) and the feeds to which they belong. Indexing on a per-feed basis (and not per item) also allows us to search for inter-relationships between different items in the same feed and to ask questions like "What news feed had both an article on Putin and one on Ahmadinejad, and what were the stories?". Sometimes the story is told in the context of multiple articles in a feed and not just in one.
The most primitive set is the set of records that contain a single term (word or phrase). IB also supports fielded data with extremely powerful and flexible methods. In its most basic form (when performing a search) you may specify a field or a field path. If no field is specified it means ANYWHERE in the record. Fields are specified with a / as in
field/term
Membership of a field too can be viewed as an operation acting upon the set of records that contain a term.
Terms may contain wildcards. IB supports the so-called "Glob Expression Syntax":
| Syntax | Semantics |
|---|---|
| * | Match zero or more characters. term*, for example, is equivalent to a right-truncated search. |
| ? | Match one character. |
| [...] | Match any of the characters (set) enclosed by the brackets. The characters '*' and '?' are interpreted in the set as normal characters and not as wildcards. |
| [!...] | Any character NOT in the set is matched. |
| [.-.] | A '-' between two characters denotes a range. The set [A-C], for example, would match any character between A and C: namely 'A', 'B' or 'C'. |
| {.,.}..{.,.} | Match, for example, {1,2}{a,b} to 1a 1b 2a 2b. The term "L{e,i}banon" would match "Lebanon" as well as "Libanon". |
| \ | The character '\' is an escape. When used with wildcards or other special characters it means that the character should match itself and not have its special semantics. \*, for instance, matches '*'. |
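
Most of this syntax can be tried out with Python's standard `fnmatch` module, which understands `*`, `?`, `[...]`, `[!...]` and ranges. It has no `{.,.}` alternation, so the sketch below adds a small brace expander in front of it. This approximates the table and is not IB's matcher; in particular `fnmatch` does not treat `\` as an escape (it uses `[*]` instead), and `expand_braces` and `glob_match` are hypothetical names.

```python
import fnmatch
import re

def expand_braces(pattern):
    """Expand {a,b} alternatives into a list of brace-free patterns."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        expanded.extend(expand_braces(head + option + tail))
    return expanded

def glob_match(pattern, term):
    """True if the term matches any brace expansion of the glob pattern."""
    return any(fnmatch.fnmatchcase(term, p) for p in expand_braces(pattern))

print(expand_braces("{1,2}{a,b}"))
print(glob_match("L{e,i}banon", "Libanon"))
print(glob_match("term*", "terminal"))
print(glob_match("[!A-C]at", "Bat"))
```

The first call lists the four expansions from the table (1a, 1b, 2a, 2b); the remaining three print True, True and False.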

Fields may also be combined with a term using relational operators:

| Operator | Semantics |
|---|---|
| < <= | Objects like dates, numbers etc. have an order and <, resp. <=, applies as one might expect. For general "string" fields it's interpreted as equivalent to "*" (right truncation, as in term*). |
| = | For general "string" fields it's interpreted just as "/", viz. a member of the field. |
| > >= | As </<= above. For general "string" types the semantics are left truncation (as in *term). |
Out of the box, IB supports queries expressed in both infix and RPN notations.
See: IB Infix Notation.
In RPN the operands precede the operator, removing the need for parentheses. For example, the expression 3 * ( 4 + 7) would be written as 3 4 7 + *.
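The arithmetic example can be checked with a few lines of Python. The sketch below is a generic stack evaluator for arithmetic RPN only. It says nothing about IB's own RPN query operators, though query expressions in RPN can be evaluated with the same stack discipline.

```python
import operator

OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

def eval_rpn(expression: str) -> float:
    """Evaluate a whitespace-separated arithmetic RPN expression."""
    stack = []
    for token in expression.split():
        if token in OPS:
            right = stack.pop()      # operands come off in reverse order
            left = stack.pop()
            stack.append(OPS[token](left, right))
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed RPN expression")
    return stack[0]

result = eval_rpn("3 4 7 + *")
print(result)
```

The operator `+` first combines 4 and 7, and `*` then multiplies the result by 3, giving 33 with no parentheses needed.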
See: IB RPN Notation.