Weblogs: Intelligent Agents

Progress and RDF Ideas

Thursday, September 12, 2002

I'm starting to get some readable content out of my script that parses websites. CNN pages seem to be clean enough to read now, and Ananova is not far behind. I've given up on treating the <big> element as a block-level container (even though Ananova uses it that way). It just wasn't right to pretend that an inline element is a block-level one, and it would probably have caused niggling problems later anyway. The alternative way of getting a header was even easier to implement: grab the content of the <title> element, which tends to have some useful stuff in it.
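As a rough illustration of the <title> approach, here is a minimal sketch in Python using the standard library's html.parser. It is not the actual script; the class name TitleGrabber and the function page_header are made up for this example.

```python
from html.parser import HTMLParser


class TitleGrabber(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace so the header reads as a single line.
        return " ".join("".join(self._parts).split())


def page_header(html):
    """Return the page's <title> text, or an empty string if there is none."""
    parser = TitleGrabber()
    parser.feed(html)
    parser.close()
    return parser.title
```

Calling page_header() on a fetched page gives a usable header without having to guess which inline element the site uses for its headline.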

I've also applied a small link-weighting function to paragraphs: if a paragraph contains no non-linked words, it is assumed to be just a link list or menu and is filtered out of the page.
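The link-weighting idea could be sketched roughly like this in Python. This is an assumption about how such a filter might look, not the real script: the names ParagraphLinkCounter and filter_paragraphs are invented, paragraphs are taken to be <p> elements, and a "word" is assumed to be any token with at least one letter or digit, so that separators such as "|" in a menu don't count as non-linked words.

```python
from html.parser import HTMLParser


def _is_word(token):
    # Assumption: a token only counts as a word if it has a letter or digit.
    return any(ch.isalnum() for ch in token)


class ParagraphLinkCounter(HTMLParser):
    """Counts linked and non-linked words per <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None
        self._link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends when the next one starts
            self._current = {"tokens": [], "linked": 0, "unlinked": 0}
        elif tag == "a" and self._current is not None:
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
        elif tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        if self._current is None:
            return
        tokens = data.split()
        key = "linked" if self._link_depth else "unlinked"
        self._current[key] += sum(1 for t in tokens if _is_word(t))
        self._current["tokens"].extend(tokens)

    def _flush(self):
        if self._current is not None:
            self.paragraphs.append(self._current)
        self._current = None
        self._link_depth = 0

    def close(self):
        super().close()
        self._flush()


def filter_paragraphs(html):
    """Return the text of paragraphs that have at least one non-linked word."""
    parser = ParagraphLinkCounter()
    parser.feed(html)
    parser.close()
    return [" ".join(p["tokens"]) for p in parser.paragraphs if p["unlinked"] > 0]
```

A paragraph made up entirely of links (a menu bar, say) ends up with an unlinked count of zero and is dropped, while ordinary body text with the odd inline link survives.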

I've managed to get my head around RDF, and it's gotten me thinking about building some useful functions into an intelligent agent. To put it simply, RDF is a specification that codifies relationships between resources (or objects) in a machine-readable, machine-parseable format. Using a set of relationships, it is possible to encapsulate knowledge about a particular subject. This set of facts could be used to create dynamic links between documents, imbuing the current document with innate knowledge. Add to this an RDF rule structure that codifies what the current reader is or isn't interested in, and it would be very feasible to "personalise" news item selection down to those topics that a visitor wants to see. I see this as being a sort of weighting function, so the more "subjects" a visitor has expressed an interest in, the more personalised their list of related resources becomes.
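To make the weighting idea a bit more concrete, here is a minimal sketch in plain Python. It doesn't use real RDF syntax or an RDF library; it just models facts as (subject, predicate, object) triples in the RDF spirit. Every identifier in it (the news: and reader: names, the pref: predicates, the rank_items function) is made up for illustration, and dropping any item that touches a disliked subject is one assumed way of combining the two rule sets.

```python
# "Factual rules": what each news item is about. All identifiers are hypothetical.
FACTS = [
    ("news:item1", "dc:subject", "topic:football"),
    ("news:item1", "dc:subject", "topic:uk"),
    ("news:item2", "dc:subject", "topic:technology"),
    ("news:item3", "dc:subject", "topic:technology"),
    ("news:item3", "dc:subject", "topic:rdf"),
]

# "Personal preference rules": what one reader does and doesn't want to see.
PREFERENCES = [
    ("reader:anna", "pref:interestedIn", "topic:technology"),
    ("reader:anna", "pref:interestedIn", "topic:rdf"),
    ("reader:anna", "pref:notInterestedIn", "topic:football"),
]


def objects(triples, subject, predicate):
    """Return the set of objects for a given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def rank_items(facts, preferences, reader):
    """Score each item by how many of the reader's interests it matches."""
    likes = objects(preferences, reader, "pref:interestedIn")
    dislikes = objects(preferences, reader, "pref:notInterestedIn")
    items = sorted({s for s, p, o in facts if p == "dc:subject"})
    scored = []
    for item in items:
        subjects = objects(facts, item, "dc:subject")
        if subjects & dislikes:
            continue  # assumption: one disliked subject masks the item out
        score = len(subjects & likes)
        if score > 0:
            scored.append((item, score))
    # Highest score first; ties are broken by item name for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored
```

With the sample data, rank_items(FACTS, PREFERENCES, "reader:anna") puts news:item3 ahead of news:item2 and leaves news:item1 out altogether. The more interests a reader records, the more the scores spread out, which is the weighting effect described above.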

The framework for developing such an application isn't all that difficult. The crunch point is probably masking the "factual rules" with the "personal preference rules" to produce a unique knowledge representation of news items (or any information, for that matter).

Of course, building such relationships into dynamic links on a content page would be much easier when the content is clean to start with. So I have some purpose and motivation to keep going on the above parsing script.
