Pete’s Guide: XHTML Page with Unicode BOM test







Pete’s Guide

XHTML Page with Unicode Byte Order Mark/Signature and Line Separator Test

This page has been saved (and posted to the server) as UTF-8, with the Unicode Byte Order Mark/Signature character
(U+FEFF) at the beginning of the file. Text editors and other programs need this signature to identify a file
as Unicode without any doubt or hacks.
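
As a rough illustration of how a program might check for that signature: in UTF-8, U+FEFF is encoded as the three bytes EF BB BF. The sketch below uses Python's standard `codecs.BOM_UTF8` constant; the helper names `has_utf8_bom` and `strip_utf8_bom` are made up for this example and do not come from any particular editor or browser.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 encoding of U+FEFF."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 signature, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical sample: a tiny document saved as UTF-8 with a signature.
sample = "\ufeff<html/>".encode("utf-8")

print(has_utf8_bom(sample))    # the signature is detected
print(strip_utf8_bom(sample))  # the markup alone, without the three BOM bytes
```

Python's built-in `utf-8-sig` codec does the same stripping while decoding, so `sample.decode("utf-8-sig")` yields the text without the signature character.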





It also uses the Unicode Line Separator character (LSEP, U+2028) instead of the Unix-style LF or DOS-style CR/LF
end-of-line conventions. This is a perfectly valid Unicode text file, but this method of line breaking is somewhat
ambiguous in HTML, XHTML, and XML documents. The LSEP character is not defined as whitespace, so it is not supposed to
appear anywhere that arbitrary text is not allowed. But the choice of a line-separator delimiter is an all-or-nothing
affair. You can’t use it just inside markup tags and not anywhere else in a file.
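
To make the whitespace point concrete, here is a small sketch. XML 1.0 defines whitespace as only space, tab, carriage return, and line feed, so U+2028 falls outside that set. The names `is_xml_whitespace` and `to_lsep` are hypothetical helpers written for this illustration, and `to_lsep` shows the all-or-nothing conversion: every line break in the text becomes LSEP, whether it sits inside a tag or not.

```python
LSEP = "\u2028"  # Unicode Line Separator

# The four characters XML 1.0 accepts as whitespace.
XML_WHITESPACE = {"\u0020", "\u0009", "\u000D", "\u000A"}


def is_xml_whitespace(ch: str) -> bool:
    """Return True if ch is one of the XML 1.0 whitespace characters."""
    return ch in XML_WHITESPACE


def to_lsep(text: str) -> str:
    """Replace every CR/LF, CR, or LF line break with U+2028."""
    return text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", LSEP)


print(is_xml_whitespace("\n"))  # LF counts as XML whitespace
print(is_xml_whitespace(LSEP))  # LSEP does not
print(LSEP.encode("utf-8"))     # the bytes a UTF-8 file would contain
```

Note that a line break inside a tag, as in `<p\nclass="x">`, gets converted along with the rest, which is exactly where a parser expecting whitespace would object.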





Most browsers, older and recent alike, display the LSEP character as either an open box or a question mark, plus a line
break, thus ruining the presentation of the page, since there will be one of these for each line break in the source
file. Some very recent browsers, including IE 6.0b, do not display any unwanted characters, but do treat the LSEP
character (if it appears outside of markup tags) as a line break that should be applied to the rendered page. I’m not
sure this is the right way to treat the character, since it interferes with switching away from the ambiguous CR/LF and
LF conventions for the end-of-line indication, but it is better than how the older browsers treat it.





And I, for one, believe that amending XML to treat LSEP as whitespace would be a good and necessary thing. The
lack of a standard for line breaks across platforms can often make it difficult to translate and deal with documents
originating on another platform. Disallowing the use of this character in XHTML and XML would only perpetuate
this ugly state of affairs, which must not be allowed to continue.
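
The cross-platform problem is easy to demonstrate. The sketch below, with a hypothetical helper named `normalize_newlines`, collapses the CR/LF, CR, LF, and LSEP conventions into a single chosen line break. It is only one possible approach, and it assumes those four are the only conventions present in the input.

```python
import re

# Match CR/LF first so it is treated as one break, then lone CR, LF, or LSEP.
_ANY_NEWLINE = re.compile("\r\n|[\r\n\u2028]")


def normalize_newlines(text: str, newline: str = "\n") -> str:
    """Rewrite every recognized line break in text as the given newline."""
    return _ANY_NEWLINE.sub(newline, text)


# Hypothetical sample mixing DOS, Unix, old Mac, and LSEP line breaks.
mixed = "one\r\ntwo\nthree\rfour\u2028five"

print(normalize_newlines(mixed).split("\n"))
print(normalize_newlines(mixed, "\u2028").count("\u2028"))
```

Passing `"\u2028"` as the second argument goes the other way and produces a file that uses LSEP throughout.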





Unfortunately, many Web browsers also have a problem with the byte order mark, and display an empty box at the beginning
of the page, or do much worse things. This character is important and it is not going to go away, so if you are a
browser developer, you had better fix this problem, pronto.




