Extracting text from XHTML using XSL(T)


Barbra
02-04-2005, 03:29 PM
Hi!!
I am new to XML/XSL(T) technologies. I would like to know whether it is possible to extract the text contained in an XHTML file by means of an XSLT transformation.
Thank you in advance!

mpjbrennan
02-05-2005, 09:33 AM
If you mean controlling the display of text client-side, the answer is yes. Serve the file with an .xml suffix and write templates for all the HTML elements it contains. But why would you want to do this?

patrick

Barbra
02-06-2005, 03:26 PM
Thank you patrick! :)

Have you ever seen an XSL(T) transformation that does this?
I mean, I have a program that takes plain text as input and produces a certain output. The problem is that my texts are encoded in XHTML files, and I have to extract the text! I am not a strong programmer (and programs such as detagger do not completely fit my requirements)... Is XSLT easier to keep up to date?
Thank you again

mpjbrennan
02-06-2005, 08:01 PM
Barbra,

Just to show it can be done:
Here are an HTML file and an XSL file. Save three copies of the HTML file, with the extensions .html, .xhtml and .xml. Save the XSL file as test_xsl.xml. Then view the three versions of the HTML file in your browser. The .html version will ignore the reference to the XSL file and revert to default styling, whereas the other two will use the styling in the XSL file.

However, if you want to extract text from an XHTML file, why not open it with your word processor and then save it as a text document? This will strip out all the tags with the exception of the <?xml version=...> and the <?xml-stylesheet....> processing instructions. Much easier than messing about with XSLT!

Patrick

HTML file
---------------
<?xml version="1.0" encoding="ISO-8859-1"?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<?xml-stylesheet type="text/xsl" href="test_xsl.xml"?>
<html>

<head>
<title>Henry Joseph Brennan</title>

</head>



<body>
<h1>Henry Joseph Brennan</h1>

<p>Henry Joseph (Harry) Brennan was born in Jarrow on 21st July 1899. He was baptised at St. Bede's Church on 23rd July, his godparents being his uncle Harry Kelly and aunt Alice Daly. At the age of 13 he was enrolled in Mount St. Mary's College, a Catholic Boarding School at Spinkhill near Sheffield. On Thursday, 11th April 1918 he was killed at Steenwerck in Belgium during the first phase of the <a href="lys.html">Battles of the Lys</a> (April 7th - 25th 1918). The following excerpt is taken from "The Mountaineer" the college magazine, commenting on Harry's death:</p>

</body>

</html>

XSL file
-------------
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xforms="http://localhost/xsl/xforms">
<xsl:output method="html" />

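<xsl:template match="/">
<html>
<head>
<!-- keep the title in the head rather than emitting it inside the body -->
<xsl:apply-templates select="html/head/title" />
</head>
<body>
<xsl:apply-templates select="html/body" />
</body>
</html>
</xsl:template>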

<xsl:template match="title">
<title>

<xsl:value-of select="."/>

</title>
</xsl:template>

<xsl:template match="h1">

<p style="font-size:18pt; color:blue">

<xsl:value-of select="."/>

</p>

</xsl:template>

<xsl:template match="p">

<p style="color:red">

<xsl:apply-templates />
</p>

</xsl:template>

<xsl:template match="a">

<a style="color:blue">
<xsl:attribute name="href"><xsl:value-of select="@href"/></xsl:attribute>
<xsl:value-of select="."/>
</a>

</xsl:template>


</xsl:stylesheet>
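
Text-only variant
-------------
The stylesheet above restyles the page rather than extracting its text. For the plain-text extraction asked about at the start of the thread, a variant that uses method="text" may be closer to what is needed. What follows is a minimal sketch under a few assumptions: the file name extract_text.xsl and the x prefix are illustrative choices, only h1 and p are treated as block elements, and the page may or may not declare the XHTML namespace (the sample above does not, so both forms are matched).

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- extract_text.xsl (hypothetical file name): writes the text of an XHTML page with no markup -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://www.w3.org/1999/xhtml">

  <!-- method="text" makes the processor emit character data only: no tags, no XML declaration -->
  <xsl:output method="text"/>

  <!-- Drop the whitespace-only text nodes that merely indent the markup between block elements -->
  <xsl:strip-space elements="html body x:html x:body"/>

  <!-- Start at the body so that the <title> in <head> is not copied into the output.
       Both the no-namespace form and the XHTML-namespace form are selected. -->
  <xsl:template match="/">
    <xsl:apply-templates select="html/body | x:html/x:body"/>
  </xsl:template>

  <!-- Headings and paragraphs: write their text (including text inside nested
       elements such as <a>) on one line, followed by a blank line -->
  <xsl:template match="h1 | p | x:h1 | x:p">
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#10;&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Applied to the sample page, this should give the heading on one line, a blank line, and then the paragraph on a single line with the link text inline. Elements that are not listed (div, li, td and so on) fall through to XSLT's built-in rules, which copy their text without adding line breaks, so further block elements can be added to the match pattern as required. With a command-line processor such as xsltproc, something like `xsltproc --novalid extract_text.xsl page.xhtml > page.txt` should write the result to a file; page.xhtml and page.txt are placeholder names, and --novalid tells xsltproc not to load the DTD.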