This site is built around services of the extremely powerful XML-savvy OO search engine IB from our NONMONOTONIC Lab, hooked into the open source Drupal CMS/Blog system with a PostgreSQL RDBMS backend. It runs for the most part on inexpensive PCs with the open source operating system FreeBSD.
Main Hardware and Software
| Component | Software |
|---|---|
| Webserver(s) | AOLServer, Apache, Nanottp (internal development) |
| CMS | Drupal |
| Browser Technology | XML, AJAX, CSS, Javascript, Cookies |
| RDBMS | PostgreSQL |
| Search Engine/XML-database | BSn's NONMONOTONIC Lab's IB 3.x search and retrieval engine |
| Search Server | internal development using IB (as a SWIG generated dynamically loadable module/extension for Python) |
| Script language | Python |
| RSS Spider | internal development (Python, C) |

| Hardware | Operating System |
|---|---|
| Sun Microsystems SunFire Server (back-end) | Solaris 10 x64 |
| OEM AMD-based 64-bit "PC"s (front-end) | FreeBSD-64 1) |
General Site Flow
- Feeds (RSS, RSS/RDF, Atom and CAP) are harvested by one or more machines via the "RSS Spider" sub-system.
- An RDBMS back-end system drives events (when to check for updates).
- Feeds are collected, parsed and converted into a canonical internal (RSS like) format for indexing.
- Since contextual queries are built around the revealed data structure, and RSS 2.0 currently has the widest acceptance, we've chosen to make the common "unified format" as close to RSS 2.0 as possible.
- The format is only "exposed" to the user via the structure for search.
- Data is also harvested from the feed for the Spider database (update frequency, syntactical data quality, SPAM etc.).
- These (RSS-like) records are indexed by the IB-based search sub-system.
- This is typically done in blocks of multiple appends and/or record updates.
- At scheduled times (adaptive) garbage is collected.
- Stories which get comments are exported via RSS as persistent content to an RDBMS back-end.
- Search is via a remote protocol and distributed. A mini-protocol server (nano_http) speaking a tiny sub-set of HTTP (written in Python with IB embedded) sits on a port and handles requests.
- Rendering is via Drupal themes. Search (to the remote mini-protocol) is via a Drupal module (PHP).
- Result pages are also provided as fully compliant RSS 2.0.
- These results are suitable for use as "live bookmarks", RSS for a news reader or as a feed to other systems.
- Style sheets are provided to also allow the XML to be rendered by modern browsers.
- All links to the remote information (records or stories) are via the IBU URL redirector.
- The IBU redirector collects statistics on remote story access.
- The redirector statistics are used to model popular stories in IBU news as well as to track the popularity of information sources.
- Item popularity is used to:
- Derive, track and rank items as "Popular Stories"
- Support editors in defining new canned queries for story routing and aggregation.
- Track the "market demand" for articles from given feeds as a measure of, among other things, information provider popularity.
- Provide inputs into other "popularity" heuristics (including clustering).
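
The conversion of harvested feeds into a canonical RSS 2.0-like record, described at the start of the flow above, might look roughly like the following. This is only a sketch using the Python standard library and is not the site's actual spider code: the function name `atom_entry_to_rss_item` and the field mapping are assumptions made for illustration, and only Atom entries with a UTC timestamp in the plain `Z` form get a `pubDate`.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def atom_entry_to_rss_item(entry):
    """Map one Atom <entry> element to an RSS 2.0-style <item> (hypothetical helper)."""
    item = ET.Element("item")

    # Atom <title> maps directly to RSS <title>.
    ET.SubElement(item, "title").text = entry.findtext(ATOM_NS + "title", default="")

    # Atom links are attributes; take the first rel="alternate" link
    # (a missing rel attribute means "alternate").
    link = ""
    for node in entry.findall(ATOM_NS + "link"):
        if node.get("rel", "alternate") == "alternate":
            link = node.get("href", "")
            break
    ET.SubElement(item, "link").text = link

    # Atom <summary>, or failing that <content>, becomes RSS <description>.
    summary = entry.findtext(ATOM_NS + "summary")
    if summary is None:
        summary = entry.findtext(ATOM_NS + "content", default="")
    ET.SubElement(item, "description").text = summary

    # Atom <id> is kept as a non-permalink <guid>.
    guid = ET.SubElement(item, "guid", isPermaLink="false")
    guid.text = entry.findtext(ATOM_NS + "id", default="")

    # Atom uses RFC 3339 dates, RSS 2.0 uses RFC 822 dates.
    # Assumption: only the simple "YYYY-MM-DDTHH:MM:SSZ" form is handled here.
    updated = entry.findtext(ATOM_NS + "updated")
    if updated:
        try:
            stamp = datetime.strptime(updated, "%Y-%m-%dT%H:%M:%SZ")
            stamp = stamp.replace(tzinfo=timezone.utc)
            ET.SubElement(item, "pubDate").text = format_datetime(stamp, usegmt=True)
        except ValueError:
            pass  # unrecognised date form: leave pubDate out rather than guess
    return item
```

Given an element parsed with `ET.fromstring(...)`, `ET.tostring(atom_entry_to_rss_item(entry))` yields the `<item>` that would then be handed on for indexing.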
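
To give an idea of what "a tiny sub-set of HTTP" can mean for the mini-protocol search server mentioned above, here is a minimal sketch of the request and response handling only. It is not the nano_http source: the `/search` path, the `q` and `max` parameters, and both function names are hypothetical, and the call into IB is left out entirely.

```python
from urllib.parse import urlsplit, parse_qs


def parse_search_request(request_line):
    """Parse a line such as 'GET /search?q=drupal&max=10 HTTP/1.0'.

    Returns (path, params). Anything outside the tiny supported subset
    (GET with a well-formed request line) raises ValueError.
    """
    parts = request_line.strip().split()
    if len(parts) != 3 or parts[0] != "GET" or not parts[2].startswith("HTTP/"):
        raise ValueError("unsupported request: %r" % request_line)
    url = urlsplit(parts[1])
    # parse_qs returns a list per key; keep only the first value.
    params = {key: values[0] for key, values in parse_qs(url.query).items()}
    return url.path, params


def build_response(body, content_type="text/xml"):
    """Wrap a result document (e.g. RSS 2.0 text) in a minimal HTTP/1.0 reply."""
    payload = body.encode("utf-8")
    header = (
        "HTTP/1.0 200 OK\r\n"
        "Content-Type: %s; charset=utf-8\r\n"
        "Content-Length: %d\r\n"
        "\r\n" % (content_type, len(payload))
    )
    return header.encode("ascii") + payload
```

A real server would read the request line from a socket, pass `params` to the embedded search engine, and write the bytes from `build_response` back to the client.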
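
The way redirector statistics can feed a "Popular Stories" ranking and a per-source popularity measure can be sketched as follows. This is an illustration under simplifying assumptions, not the IBU redirector itself: the click log is taken to be a plain sequence of target URLs, the host name stands in for the information source, and `rank_popularity` is a hypothetical name.

```python
from collections import Counter
from urllib.parse import urlsplit


def rank_popularity(click_log, top_n=10):
    """Count redirects per story URL and per source host.

    click_log: iterable of target URLs, one entry per recorded redirect.
    Returns (top stories, top sources), each a list of (key, hits) pairs.
    """
    story_hits = Counter()
    source_hits = Counter()
    for url in click_log:
        story_hits[url] += 1
        # Assumption: the host part of the URL identifies the feed provider.
        source_hits[urlsplit(url).netloc] += 1
    return story_hits.most_common(top_n), source_hits.most_common(top_n)
```

Raw hit counts like these would only be one input; weighting by age or clustering related stories, as the list above suggests, would sit on top of them.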
References
- Raining Content: IB for Drupal
- RSS 2.0 Specification: http://cyber.law.harvard.edu/rss/rss.html
- CAP: Common Alerting Protocol: http://www.incident.com/cap.
- GeoRSS: Geographically Encoded Objects for RSS feeds: http://georss.org.
- Atom: The Atom Syndication Format (RFC4287): http://www.ietf.org/rfc/rfc4287.txt
1) The developer and provider of this service, BSn, is a long term founding sponsor of the Munich Berkeley Unix Interest Group "BIM".
Submitted by Edward C. Zimmermann on Sun, 2006-11-12 04:50.