Re: What route for the XHTML importer?

From: Hubert Figuiere
Date: Tue May 18 2004 - 01:03:18 CEST


On Fri, 2004-05-14 at 14:31 +1000, Martin Sevior wrote:

> As it currently stands the XHTML importer is very fragile and
> very strict. If an HTML file doesn't exactly fit the XHTML spec we barf on
> it.
> Now as you all know there are a lot of broken HTML files that render
> just fine in IE, Mozilla and many other browsers.
> So my question is:
> Should we attempt to import broken HTML files or just barf on them and
> say "Illegal document"?
> I would MUCH rather attempt to import them as well as possible.

Here is my 0.02 CAD opinion:

We should not limit ourselves to XHTML. Why? Simply because Joe Average wants to
import "HTML documents" made by crappy software, and there are simply too
many of them out there. What about HTML? We should be as permissive as we can.

Things we can assume will fail:
-HTML markup generated by scripts

Things we must eat:
-mixed-case tags
-unclosed tags
-inconsistent tags
-tags in the wrong context
-some extensions
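
To make the list above concrete, here is a rough sketch of what "eating" such markup could look like. It is not the AbiWord importer and not a proposal for its design: it uses Python's standard html.parser module purely as a stand-in for a tolerant tokenizer, and the names SoupToXhtml and to_xhtml are made up for this illustration. It assumes only a few recovery rules (lower-case everything, self-close void elements, close anything left open, drop stray end tags).

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (partial list, for the sketch).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class SoupToXhtml(HTMLParser):
    """Hypothetical helper: rewrite sloppy HTML as well-formed markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments
        self.stack = []  # names of the elements currently open

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already lower-cased the tag and attribute names.
        # A new <p> or <li> directly inside an open one closes the old one.
        if tag in ("p", "li") and self.stack and self.stack[-1] == tag:
            self._close_through(tag)
        # dict() drops duplicate attributes; bare attributes get name="name".
        attr_text = "".join(
            ' %s="%s"' % (name, escape(name if value is None else value))
            for name, value in dict(attrs).items()
        )
        if tag in VOID:
            self.out.append("<%s%s/>" % (tag, attr_text))
        else:
            self.out.append("<%s%s>" % (tag, attr_text))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # A stray end tag (nothing open with that name) is silently dropped.
        if tag in self.stack:
            self._close_through(tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def _close_through(self, tag):
        # Close every open element up to and including the named one.
        while self.stack:
            top = self.stack.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def result(self):
        self.close()  # flush any buffered text
        while self.stack:
            self.out.append("</%s>" % self.stack.pop())
        return "".join(self.out)


def to_xhtml(text):
    parser = SoupToXhtml()
    parser.feed(text)
    return parser.result()


print(to_xhtml("<P>Hello <B>bold <I>both</B> text<BR>next<p>second</div>"))
```

With these rules the sample input comes out as
<p>Hello <b>bold <i>both</i></b> text<br/>next</p><p>second</p>
so the existing strict XHTML path would have something it can digest. Real-world tag soup needs far more recovery rules than this (tables, nested lists, tags in the wrong context), which is exactly why borrowing a proven parser is attractive.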

Parsing HTML is a lot of work. I'd much prefer that we "steal" some
code from another Free Software project, something from
Lynx, links, w3m, khtml (or Apple's incarnation), gtkhtml, etc. Even
Mozilla, though I'm afraid it would bring in too much.

So in short: don't barf too soon on a document.

For the test bed, just use wget <sigh>

