PHP DOM function to parse HTML data source as if csv.



MattyUK
02-11-2010, 04:06 PM
I am trying to understand a PHP DOM function I found to parse an HTML data source. It is close to what I need, but I need to understand it before I can adjust it.

I have the following as a data source format I can't change.

<html><body><table><tr><th>header1</th><th>header2</th><th>header3</th><th>header4</th><th>header4</th><th>header5</th><th>header6</th><th>header7</th><th>header8</th></tr><tr><td>value1</td><td>value2</td><td>value3</td><td>value4</td><td>value5</td><td>value6</td><td>value7</td><td>value8</td><td>value9</td></tr><tr><td>value10</td><td>value11</td><td>value12</td><td>value13</td><td>value14</td><td>value15</td><td>value16</td><td>value17</td><td>value18</td></tr><tr><td>value19</td><td>value20</td><td>value21</td><td>value22</td><td>value23</td><td>value24</td><td>value25</td><td>value26</td><td>value27</td></tr></table></body></html>

Awful, I know. Yes, it is given to me as a single line. It is basically an HTML table wrapped in html and body tags.

I need a two dimensional array as if it had been read as a csv file. So I need this:

Array
(
[0] => Array
(
[0] => header1
[1] => header2
[2] => header3
[3] => header4
[4] => header5
[5] => header6
[6] => header7
[7] => header8
[8] => header9
)

[1] => Array
(
[0] => value1
[1] => value2
[2] => value3
[3] => value4
[4] => value5
[5] => value6
[6] => value7
[7] => value8
[8] => value9
)

[2] => Array
(
[0] => value10
[1] => value11
[2] => value12
[3] => value13
[4] => value14
[5] => value15
[6] => value16
[7] => value17
[8] => value18
)

[3] => Array
(
[0] => value19
[1] => value20
[2] => value21
[3] => value22
[4] => value23
[5] => value24
[6] => value25
[7] => value26
[8] => value27
)
)

Now I hail from PHP4 days so I thought regex might be a way forward. A quick bounce around google and I had the message DON'T USE REGEX TO PARSE HTML thrown at me more than a few times.

In the course of looking I found this function on:
http://www.phpro.org/examples/Get-Text-Between-Tags.html



/*
 * @get text between tags
 * @param string $tag The tag name
 * @param string $html The XML or XHTML string
 * @param int $strict Whether to use strict mode
 * @return array
 */
function getTextBetweenTags($tag, $html, $strict=0)
{
    /*** a new dom object ***/
    $dom = new domDocument;

    /*** load the html into the object ***/
    if($strict==1)
    {
        $dom->loadXML($html);
    }
    else
    {
        $dom->loadHTML($html);
    }

    /*** discard white space ***/
    $dom->preserveWhiteSpace = false;

    /*** the tag by its tag name ***/
    $content = $dom->getElementsByTagname($tag);

    /*** the array to return ***/
    $out = array();
    foreach ($content as $item)
    {
        /*** add node value to the out array ***/
        $out[] = $item->nodeValue;
    }
    /*** return the results ***/
    return $out;
}

Now I understand some of it but am not familiar with the PHP DOM. Yes, I've read his tutorial and much of the manual. I haven't yet reached that magic function or worked example that makes it all fall into place in my head.

So using this to call the function above:


$sHtml='<html><body><table><tr><th>header1</th><th>header2</th><th>header3</th><th>header4</th><th>header4</th><th>header5</th><th>header6</th><th>header7</th><th>header8</th></tr><tr><td>value1</td><td>value2</td><td>value3</td><td>value4</td><td>value5</td><td>value6</td><td>value7</td><td>value8</td><td>value9</td></tr><tr><td>value10</td><td>value11</td><td>value12</td><td>value13</td><td>value14</td><td>value15</td><td>value16</td><td>value17</td><td>value18</td></tr><tr><td>value19</td><td>value20</td><td>value21</td><td>value22</td><td>value23</td><td>value24</td><td>value25</td><td>value26</td><td>value27</td></tr></table></body></html>';
print '<pre>';print_r(getTextBetweenTags("td",$sHtml,"0"));print '</pre>';


and I get:


Array
(
[0] => value1
[1] => value2
[2] => value3
[3] => value4
[4] => value5
[5] => value6
[6] => value7
[7] => value8
[8] => value9
[9] => value10
[10] => value11
[11] => value12
[12] => value13
[13] => value14
[14] => value15
[15] => value16
[16] => value17
[17] => value18
[18] => value19
[19] => value20
[20] => value21
[21] => value22
[22] => value23
[23] => value24
[24] => value25
[25] => value26
[26] => value27
)

I could chunk the array down into what I need (only discovered that yesterday, thank you forum), but I suspect it will be much better to alter the function above to give the result I need.
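For comparison, the chunking approach mentioned above might look like the sketch below. It is only an illustration: it assumes every row has exactly nine cells, and the variable names ($flat, $columns, $rows) are made up for the example.

```php
<?php
// Flat list of cell values, like the one a call such as
// getTextBetweenTags("td", ...) returns.
$flat = array();
for ($i = 1; $i <= 27; $i++) {
    $flat[] = 'value' . $i;
}

// Assumption: the table always has 9 columns.
$columns = 9;

// array_chunk() splits the flat list into rows of $columns cells each.
$rows = array_chunk($flat, $columns);

print_r($rows[0]); // the first data row
```

The weakness of this approach is the hard-coded column count: if the source ever gains or loses a column, the rows silently shift.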

If well explained I also think it'd be a good introduction and practical example of the PHP DOM scripting that is new to me.

Would anyone mind showing me how to adjust the function above to be what I need and perhaps explain it a little as they go?

Thanks

Matt

ahallicks
02-11-2010, 04:19 PM
You could change $out[] = $item->nodeValue; to $out[$tag] = $item->nodeValue;

Which would basically give you a multi-dimensional array like:



Array
(
[th] => Array
(
[0] => header1
[1] => header2
[2] => header3
[3] => header4
)
[td] => Array
(
[0] => value1
[1] => value2
[2] => value3
[3] => value4
)
)

etc

PHP Dom, in my humble opinion, is brilliant! But it does take a little getting used to. The easiest way to start learning is to get your head around the actual DOM, that is, the Document Object Model. Think of it as a tree of nodes: some have children, some have parents and some have siblings. Using PHP DOM you can reference various elements by using these 'ancestor' references.

The basic steps are: you create a DOMDocument with $dom = new DOMDocument();
Then you load a string or file (XML, HTML, etc.) into this DOMDocument and use the DOM functions (such as getElementsByTagName) to fetch items from it.

You can then use these elements as you see fit. Look into SimpleXML too, as that is also a powerful parser for PHP.
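The basic steps described above can be sketched roughly as follows. The sample HTML string is invented for the illustration; the calls themselves (loadHTML, getElementsByTagName, parentNode, nodeName, nodeValue) are standard PHP DOM.

```php
<?php
// Step 1: create the document object.
$dom = new DOMDocument();

// Step 2: load a string of HTML into it
// (loadXML() would be the choice for strict XML).
$dom->loadHTML('<html><body><table><tr><th>header1</th><td>value1</td></tr></table></body></html>');

// Step 3: fetch nodes with a DOM method and read them.
$cells = $dom->getElementsByTagName('td');
foreach ($cells as $cell) {
    // parentNode walks one step up the tree, from the td to its tr.
    echo $cell->parentNode->nodeName, ' > ', $cell->nodeName, ': ', $cell->nodeValue, "\n";
}
```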

MattyUK
02-11-2010, 06:03 PM
Hi ahallicks

Firstly thank you for your reply. It's good to hear the DOM is worth learning. I've dabbled in javascript DOM before but never in PHP until now.

Wouldn't your adjustment give a single dimension associative array?

The $tag variable would be the "td" supplied as the function parameter, so this creates a single-dimension associative array whose one element is overwritten on every pass of the loop and ends up containing only the last value.


/*
* @get text between tags
* @param string $tag The tag name
* @param string $html The XML or XHTML string
* @param int $strict Whether to use strict mode
* @return array
*/
function getTextBetweenTags($tag, $html, $strict=0)
{
/*** a new dom object ***/
$dom = new domDocument;

/*** load the html into the object ***/
if($strict==1)
{
$dom->loadXML($html);
}
else
{
$dom->loadHTML($html);
}

/*** discard white space ***/
$dom->preserveWhiteSpace = false;

/*** the tag by its tag name ***/
$content = $dom->getElementsByTagname($tag);

/*** the array to return ***/
$out = array();
foreach ($content as $item)
{
/*** add node value to the out array ***/
$out[$tag] = $item->nodeValue;
}
/*** return the results ***/
return $out;
}

$sHtml='<html><body><table><tr><th>header1</th><th>header2</th><th>header3</th><th>header4</th><th>header4</th><th>header5</th><th>header6</th><th>header7</th><th>header8</th></tr><tr><td>value1</td><td>value2</td><td>value3</td><td>value4</td><td>value5</td><td>value6</td><td>value7</td><td>value8</td><td>value9</td></tr><tr><td>value10</td><td>value11</td><td>value12</td><td>value13</td><td>value14</td><td>value15</td><td>value16</td><td>value17</td><td>value18</td></tr><tr><td>value19</td><td>value20</td><td>value21</td><td>value22</td><td>value23</td><td>value24</td><td>value25</td><td>value26</td><td>value27</td></tr></table></body></html>';

print '<pre>';print_r(getTextBetweenTags("td",$sHtml,"0"));print '</pre>';


Gives output:

Array
(
[td] => value27
)



Thank you for the idea however. I guess one approach might be to get the code to step through any child tags. Then we could pass it the tr tag knowing it'll get the values from all td tags it contained. I'm not sure of the array insert code that would be needed. Simply haven't thought about it yet. Might try a few mock ups now.

Thanks.

Dormilich
02-11-2010, 06:10 PM
Firstly thank you for your reply. It's good to hear the DOM is worth learning. I've dabbled in javascript DOM before but never in PHP until now.

PHP DOM is just as easy as JavaScript DOM, because DOM is a language independent API. you only have to fit it in the actual language’s syntax. (and there could hardly be a greater difference)

ahallicks
02-11-2010, 06:20 PM
You could change $out[] = $item->nodeValue; to $out[$tag][] = $item->nodeValue;

MattyUK
02-11-2010, 11:06 PM
Thank you both.

ahallicks, that modification would just make an associative array again. Two dimensions, yes, but not grouped by row.

The problem, I think, is the $tag variable, since it is simply the string passed to tell the function which tag in the DOM to target. Because every value returned comes from the same tag type, each one will overwrite the previous one or, in the second case, they will all be lumped together under the one associative key.

Check out what I mean below:

/*
* @get text between tags
* @param string $tag The tag name
* @param string $html The XML or XHTML string
* @param int $strict Whether to use strict mode
* @return array
*/
function getTextBetweenTags($tag, $html, $strict=0)
{
/*** a new dom object ***/
$dom = new domDocument;

/*** load the html into the object ***/
if($strict==1)
{
$dom->loadXML($html);
}
else
{
$dom->loadHTML($html);
}

/*** discard white space ***/
$dom->preserveWhiteSpace = false;

/*** the tag by its tag name ***/
$content = $dom->getElementsByTagname($tag);

/*** the array to return ***/
$out = array();
foreach ($content as $item)
{
/*** add node value to the out array ***/
$out[$tag][] = $item->nodeValue;
}
/*** return the results ***/
return $out;
}

$sHtml='<html><body><table><tr><th>header1</th><th>header2</th><th>header3</th><th>header4</th><th>header4</th><th>header5</th><th>header6</th><th>header7</th><th>header8</th></tr><tr><td>value1</td><td>value2</td><td>value3</td><td>value4</td><td>value5</td><td>value6</td><td>value7</td><td>value8</td><td>value9</td></tr><tr><td>value10</td><td>value11</td><td>value12</td><td>value13</td><td>value14</td><td>value15</td><td>value16</td><td>value17</td><td>value18</td></tr><tr><td>value19</td><td>value20</td><td>value21</td><td>value22</td><td>value23</td><td>value24</td><td>value25</td><td>value26</td><td>value27</td></tr></table></body></html>';

print '<pre>';print_r(getTextBetweenTags("td",$sHtml,"0"));print '</pre>';

Gives:

Array
(
[td] => Array
(
[0] => value1
[1] => value2
[2] => value3
[3] => value4
[4] => value5
[5] => value6
[6] => value7
[7] => value8
[8] => value9
[9] => value10
[10] => value11
[11] => value12
[12] => value13
[13] => value14
[14] => value15
[15] => value16
[16] => value17
[17] => value18
[18] => value19
[19] => value20
[20] => value21
[21] => value22
[22] => value23
[23] => value24
[24] => value25
[25] => value26
[26] => value27
)

)

I guess you may have meant the actual tag name being interrogated at that moment. In which case you may have meant:

$out[][$item->nodeName] = $item->nodeValue;

But that produces:

Array
(
[0] => Array
(
[td] => value1
)

[1] => Array
(
[td] => value2
)

[2] => Array
(
[td] => value3
)

[3] => Array
(
[td] => value4
)
...
and so on

I think this is because the call to getElementsByTagname($tag); essentially says get all the TDs, irrespective of their relationship as children of any specific TR tag.

Even if we somehow changed it to reference the parent node name perhaps...

$out[][$item->parentNode->nodeName] = $item->nodeValue;

The next 'row' would still not be grouped with its own cells, since every entry would be using the same associative array key.

I don't know the dom commands/functions to start at a TR collect all child TD elements then move onto the next TR.

Dormilich, Then I may not have been using the dom after all. I just recall accessing elements under the old document.all. structure and changing attributes, reading values etc.

I can picture the dom structure but don't know how to move around it well enough in PHP. Hence asking here. How would we foreach each TD child element of every TR element?

Thank you again for the replies.

Dormilich
02-12-2010, 07:48 AM
I don't know the dom commands/functions to start at a TR collect all child TD elements then move onto the next TR.

1. get all TRs
2. loop over them
3. in the loop, get all child TDs
4. loop

that is, a nested loop.

ex. (simplified)

$tr = $dom->getElementsByTagName("tr");
$l = $tr->length;
for ($i = 0; $i < $l; $i++)
{
    // item($i) returns the i-th node of the list
    $td = $tr->item($i)->getElementsByTagName("td");
    $m = $td->length;
    for ($j = 0; $j < $m; $j++)
    {
        // further code, e.g. read $td->item($j)->nodeValue
    }
}

MattyUK
02-12-2010, 01:10 PM
Thank you. I saw the dom functions firstChild, next sibling but couldn't quite see how to look at only siblings of a particular type. I guess I was over-thinking it. This approach should work. Thank you. I'll rework the code based on a nested loop approach.

Thanks again Dormilich. I guess a night's sleep helped me see your point more clearly too.

Dormilich
02-12-2010, 01:14 PM
Thank you. I saw the dom functions firstChild, next sibling but couldn't quite see how to look at only siblings of a particular type.
Node->nodeType, Node->nodeName, Node->localName, Element->tagName. would work too, but you need more check statements.
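As a rough illustration of the "more check statements" route, the sketch below walks a row's children with firstChild and nextSibling and filters them by nodeType and nodeName. The sample HTML and the variable names are assumptions made for the example.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><table><tr><!-- note --><th>header1</th> <th>header2</th></tr></table></body></html>');

$row = $dom->getElementsByTagName('tr')->item(0);

$cells = array();
// Walk the row's children one sibling at a time.
for ($node = $row->firstChild; $node !== null; $node = $node->nextSibling) {
    // Skip comments and whitespace text nodes: keep element nodes only.
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        continue;
    }
    // Of the elements, keep only the two cell types.
    if ($node->nodeName === 'th' || $node->nodeName === 'td') {
        $cells[] = $node->nodeValue;
    }
}
print_r($cells);
```

Compared with getElementsByTagName, this only looks at direct children of the row, at the cost of the extra checks.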

MattyUK
02-13-2010, 04:55 PM
Hi Dormilich

Thanks for your help. I slapped myself upside the head and looked at it again from a nested viewpoint. I have this so far.



function getHtmlTableText($html){
    //Source foundation: http://www.phpro.org/examples/Get-Text-Between-Tags.html

    $dom = new domDocument;
    //How should I sanitize the $html input? Do I need to, since I'm using the DOM?
    //Load html into dom object
    $dom->loadHTML($html);
    //discard white space
    $dom->preserveWhiteSpace = false;
    //get the rows
    $rows = $dom->getElementsByTagname('tr');
    //initialize the output array
    $rArr = array();
    //row count int var
    $rCount = 0;
    //loop the rows
    foreach ($rows as $row)
    {
        //How to cleanly accommodate header cells? Don't want to replicate the entire loop

        //get the cells in the row
        $cells = $row->getElementsByTagname('td');
        //try OR
        //$cells = $row->getElementsByTagname('th'||'td');//Bad.
        //Concat?
        //$cells = $row->getElementsByTagname('th');
        //$cells .= $row->getElementsByTagname('td');//Bad
        //Addition?
        //$cells = $cells + $row->getElementsByTagname('td');//Bad
        //How do you join two domNode objects together, or use getElementsByTagname with multiple tags (OR)?
        //loop the cells
        foreach ($cells as $cell)
        {
            //add to output array
            $rArr[$rCount][] = $cell->nodeValue;
        }//from: foreach ($cells as $cell){
        //increment row count
        $rCount++;
    }//from: foreach ($rows as $row){

    //Return output
    return $rArr;

}//from: function getHtmlTableText($html){


$sHtml='<html><body><table><tr><th>header1</th><th>header2</th><th>header3</th><th>header4</th><th>header4</th><th>header5</th><th>header6</th><th>header7</th><th>header8</th></tr><tr><td>value1</td><td>value2</td><td>value3</td><td>value4</td><td>value5</td><td>value6</td><td>value7</td><td>value8</td><td>value9</td></tr><tr><td>value10</td><td>value11</td><td>value12</td><td>value13</td><td>value14</td><td>value15</td><td>value16</td><td>value17</td><td>value18</td></tr><tr><td>value19</td><td>value20</td><td>value21</td><td>value22</td><td>value23</td><td>value24</td><td>value25</td><td>value26</td><td>value27</td></tr></table></body></html>';

print '<pre>';print_r(getHtmlTableText($sHtml,"0"));print '</pre>';


I put the remaining questions inside the code, but this is working so far. Thanks.

I set aside stepping through the nodes with counters and went with a foreach over all the nodes of a particular type. My thinking was that they may change the format down the line and introduce comments or malformed markup in the source.

Does that make sense or am I off on a wrong track again?


Node->nodeType, Node->nodeName, Node->localName, Element->tagName. would work too, but you need more check statements.

I'll look those up. Thank you.

My goal now is to answer the questions in the code above. Particularly how to handle the header row as cleanly as possible.


Thanks for everyone's help so far.

Dormilich
02-13-2010, 05:12 PM
How should I sanitize the $html input? Do I need to, since I'm using the DOM?

something like validating against a HTML DTD?


Anyway of saying tag A or B? th or td. Or if there is a way of appending two domNode objects together. $headers.$cells?

you can of course merge arrays (thanks to PHP not sticking to the DOM output data types (that wouldn’t work in JavaScript)). you could also say: get all TDs and if there are none get all THs.

$headers.$cells wouldn’t work anyway (string concatenation on arrays!)
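One way the "merge arrays" idea might look in PHP is sketched below: each node list is first copied into a plain array with iterator_to_array(), and the arrays are then merged. The sample row is invented for the illustration. A DOMXPath query is shown as an alternative that asks for "th or td" in one go; that is a different technique from the one mentioned in the post.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><table><tr><th>name</th><td>value1</td><td>value2</td></tr></table></body></html>');
$row = $dom->getElementsByTagName('tr')->item(0);

// Option 1: convert each node list to a plain array, then merge.
// Note: all th cells come first, wherever they sit in the row.
$cells = array_merge(
    iterator_to_array($row->getElementsByTagName('th'), false),
    iterator_to_array($row->getElementsByTagName('td'), false)
);

// Option 2: one XPath query for th or td children of the row,
// returned in document order.
$xpath = new DOMXPath($dom);
$cellsInOrder = $xpath->query('th|td', $row);

$texts = array();
foreach ($cells as $cell) {
    $texts[] = $cell->nodeValue;
}
print_r($texts);
```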

MattyUK
02-13-2010, 07:31 PM
Thanks.


something like validating against a HTML DTD?
Well, the $html is essentially user input. I know I need to check it for what is expected, but I'm not sure of the best approach in this case. Validate against a DTD! Can you give me a pointer to get started on that approach, if there is no better way of thwarting malicious code in the source?


$headers.$cells wouldn’t work anyway (string concatenation on arrays!)

I thought objects were returned, not arrays. I haven't a single clue how to concat objects, so I gave everything I could think of a go. I'm not even sure how to examine an object fully to discover more about it. var_dump isn't helping all that much. I was just eager not to introduce more loops if possible.

I tested they were objects with this code.


...
//get the cells in the row
$cells = $row->getElementsByTagname('td');
if(is_object($cells)){return '$cells is an object';}
if(is_array($cells)){return '$cells is an array';}
...


Anyway thanks to your help I now have the following code. I'd appreciate your feedback or improvements on it:


function getHtmlTableText($html){
//Source foundation: http://www.phpro.org/examples/Get-Text-Between-Tags.html
//Thanks to Dormilich for help.
$dom = new domDocument;

//How should I sanitize the $html input? Do I need to? Won't the DOM parsing simply fail if it is badly formatted/encoded?

//Load html into dom object
$dom->loadHTML($html);
//discard white space
$dom->preserveWhiteSpace = false;
//get the rows
$rows = $dom->getElementsByTagname('tr');
//initialize the output array
$rArr = array();
//row count int var
$rCount = 0;
//loop the rows
foreach ($rows as $row)
{
//get the cells in the row if they are td or th
if(strtolower($row->firstChild->nodeName)=='th'||strtolower($row->firstChild->nodeName)=='td')
{
$cells = $row->getElementsByTagname($row->firstChild->nodeName);
}
else
{
//If both td and th fail then what on earth are we reading??
//Better run away.
return false;
}
//$cells = $row->getElementsByTagname('td');
//if(is_object($cells)){return '$cells is an object';}
//if(is_array($cells)){return '$cells is an array';}

//loop the cells
foreach ($cells as $cell)
{
//add to output array
$rArr[$rCount][] = $cell->nodeValue;
}
//increment row count
$rCount++;
}//from: foreach ($rows as $row){
//Return output
return $rArr;

}//from: function getHtmlTableText($tag,$html){



$sHtml='<html><body><table><tr><th>header1</th><th>header2</th><th>header3</th><th>header4</th><th>header4</th><th>header5</th><th>header6</th><th>header7</th><th>header8</th></tr><tr><td>value1</td><td>value2</td><td>value3</td><td>value4</td><td>value5</td><td>value6</td><td>value7</td><td>value8</td><td>value9</td></tr><tr><td>value10</td><td>value11</td><td>value12</td><td>value13</td><td>value14</td><td>value15</td><td>value16</td><td>value17</td><td>value18</td></tr><tr><td>value19</td><td>value20</td><td>value21</td><td>value22</td><td>value23</td><td>value24</td><td>value25</td><td>value26</td><td>value27</td></tr></table></body></html>';

print '<pre>';print_r(getHtmlTableText($sHtml,"0"));print '</pre>';

Dormilich
02-13-2010, 08:01 PM
Well, the $html is essentially user input. I know I need to check it for what is expected, but I'm not sure of the best approach in this case. Validate against a DTD! Can you give me a pointer to get started on that approach, if there is no better way of thwarting malicious code in the source?

prepending a DTD can be done before loading into the DOMDocument.

that’s a DTD => <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

validating is done through DOMDocument->validate(); although you should be prepared that most users probably don’t know that there is an HTML standard at all, so the validation will often fail.
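Short of full DTD validation, a lighter check is to let the parser load the markup and look at what it complained about. The sketch below is not the validate() approach from the post; it swaps in libxml's error collection (libxml_use_internal_errors, libxml_get_errors), and the helper name loadHtmlCollectingErrors is made up for the example. It reports malformed markup only and does nothing about malicious but well-formed content.

```php
<?php
// Hypothetical helper name; not from the thread.
function loadHtmlCollectingErrors($html)
{
    // Collect parser messages instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);

    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return array($loaded ? $dom : null, $errors);
}

// Deliberately broken closing tag in the sample input.
list($dom, $errors) = loadHtmlCollectingErrors('<table><tr><td>value1</td></tr></tabel>');
foreach ($errors as $error) {
    // Each entry is a LibXMLError with a message and a line number.
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
```

The caller can then decide whether any reported error is serious enough to reject the input.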


I thought objects were returned, not arrays. I haven't a single clue how to concat objects, so I gave everything I could think of a go. I'm not even sure how to examine an object fully to discover more about it. var_dump isn't helping all that much. I was just eager not to introduce more loops if possible.

hm, makes sense after all. well, currently I see no way to merge those 2 objects.

the objects returned conform to the DOM, that is, every method or property is listed in the DOM (resp. in the PHP manual)



Anyway thanks to your help I now have the following code. I'd appreciate your feedback or improvements on it:

although it doesn’t seem to matter, always write the method and property names correctly cased (i.e. getElementsByTagName, not getElementsbyTagname), other languages (e.g. JavaScript) will throw an error there.

(comments inside)


function getHtmlTableText($html){
//Source foundation: http://www.phpro.org/examples/Get-Text-Between-Tags.html
//Thanks to Dormilich for help.
$dom = new domDocument;

//How should I sanitize the $html input? Do I need to? Won't the DOM parsing simply fail if it is badly formatted/encoded?
# there is W3C’s "Tidy" … I haven’t used it with PHP (actually I don’t use it, because I know what valid HTML looks like)

//Load html into dom object
$dom->loadHTML($html);
//discard white space
$dom->preserveWhiteSpace = false;
//get the rows
$rows = $dom->getElementsByTagname('tr');
//initialize the output array
$rArr = array();
//row count int var
$rCount = 0;
//loop the rows
foreach ($rows as $row)
{
//get the cells in the row if they are td or th
# see below
if(strtolower($row->firstChild->nodeName)=='th'||strtolower($row->firstChild->nodeName)=='td')
# if you’re unlucky, the first child is neither TD or TH, but the second one is
{
$cells = $row->getElementsByTagname($row->firstChild->nodeName);
}
else
{
//If both td and th fail then what on earth are we reading??
# invalid code ;)
//Better run away.
# not necessary, $cells will be empty and thus the loop not executed
return false;
}
//$cells = $row->getElementsByTagname('td');

//loop the cells
foreach ($cells as $cell)
{
//add to output array
$rArr[$rCount][] = $cell->nodeValue;
}
//increment row count
$rCount++;
}//from: foreach ($rows as $row){
//Return output
return $rArr;

}

$cells = $tr->getElementsByTagName("td");
if (0 == $cells->length)
{
$cells = $tr->getElementsByTagName("th");
}

note: text should be retrieved using the CharacterData->data or Text->wholeText (that one comes from DOM Level 3) properties.

PS. just to have it mentioned, the sample HTML is not valid (see http://validator.w3.org)

MattyUK
02-13-2010, 08:53 PM
Good call/catch thank you again.

This is the end result of the efforts:

function getHtmlTableText($html){
//Source foundation: http://www.phpro.org/examples/Get-Text-Between-Tags.html
//Thanks to Dormilich
$dom = new domDocument;

/*Later on. Sanitize HTML. Dormilich: that’s a DTD => <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
validating is done through DOMDocument->validate();
Source isn't valid html but is what is provided.
Not sure if this will protect against malicious but valid code.
Scripts etc so could use htmlspecialchars and or strip_tags.
*/

//Load html into dom object
$dom->loadHTML($html);
//discard white space
$dom->preserveWhiteSpace = false;
//get the rows
$rows = $dom->getElementsByTagname('tr');
//initialize the output array
$rArr = array();
//row count int var
$rCount = 0;
//loop the rows
foreach($rows as $row)
{
//get the cells in the row if they are th or td this approach doesn't rely on firstChild requirement
$cells = $row->getElementsByTagName('th');
if(0 == $cells->length)
{
$cells = $row->getElementsByTagName('td');
}//from: if(0 == $cells->length)

//loop the cells
foreach ($cells as $cell)
{
//add to output array
//Note: Look up CharacterData->data or Text->wholeText??? rather than nodeValue
//$rArr[$rCount][] = htmlspecialchars($cell->nodeValue);
$rArr[$rCount][] = strip_tags($cell->nodeValue);
}//from: foreach ($cells as $cell)
//increment row count
$rCount++;
}//from: foreach ($rows as $row)
//Return output
return $rArr;
}//from: function getHtmlTableText($tag,$html,$strict=0){



$sHtml='<html><body><table><tr><th>header1</th><th>header2</th><th>header3</th><th>header4</th><th>header4</th><th>header5</th><th>header6</th><th>header7</th><th>header8</th></tr><tr><td>value1</td><td>value2</td><td>value3</td><td>value4</td><td>value5</td><td>value6</td><td>value7</td><td>value8</td><td>value9</td></tr><tr><td>value10</td><td>value11</td><td>value12</td><td>value13</td><td>value14</td><td>value15</td><td>value16</td><td>value17</td><td>value18</td></tr><tr><td>value19</td><td>value20</td><td>value21</td><td>value22</td><td>value23</td><td>value24</td><td>value25</td><td>value26</td><td>value27</td></tr></table></body></html>';

print '<pre>';print_r(getHtmlTableText($sHtml,"0"));print '</pre>';


I feel as if I learned a lot from your help. Thank you.

Dormilich
02-13-2010, 08:57 PM
$cell->nodeValue
// should be
$cell->firstChild->data
// or
$cell->firstChild->wholeText
// because that makes sure, you actually get text.

// the DOM-3 property textContent returns the text of the element and of all its descendants
$cell->textContent

ah, and the text of an element does not contain tags (strip_tags() is not required), because they are child elements. on the other hand you could loop through all child elements and return the text data …
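The difference between these ways of reading a cell can be seen in a small sketch. The sample cell with a nested bold tag is an assumption chosen for the illustration, as are the variable names.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><table><tr><th>hea<b>d</b>er1 and more</th></tr></table></body></html>');
$cell = $dom->getElementsByTagName('th')->item(0);

// First child only: the text node in front of <b>.
$first = $cell->firstChild->data;

// Direct text children only: the text inside <b> is skipped.
$direct = '';
foreach ($cell->childNodes as $child) {
    if ($child instanceof DOMText) {
        $direct .= $child->data;
    }
}

// nodeValue of an element joins the text of all its descendants.
$all = $cell->nodeValue;

echo $first, "\n", $direct, "\n", $all, "\n";
```

Which of the three is wanted depends on whether text inside nested inline tags should count as part of the cell.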

if I were mean, I’d explain how to do that with SimpleXML or XSLT-deserialisation … but I don’t want to ruin the DOM learning experience.


I feel as if I learned a lot from your help. Thank you.
I don’t mind getting a reputation for that. *gg*

MattyUK
02-13-2010, 09:08 PM
AHH, ok that makes more sense to me now.

Thank you for not being mean. There is the deep end learning experience then there is the ocean drowning experience. All depends on how much energy you have to swim and how far from solid ground you are.

I do appreciate your help. Thank you.

The finished (perhaps) item:


function getHtmlTableText($html){
//Source foundation: http://www.phpro.org/examples/Get-Text-Between-Tags.html
//Thanks to Dormilich
$dom = new domDocument;

//Load html into dom object
$dom->loadHTML($html);
//discard white space
$dom->preserveWhiteSpace = false;
//get the rows
$rows = $dom->getElementsByTagname('tr');
//initialize the output array
$rArr = array();
//row count int var
$rCount = 0;
//loop the rows
foreach($rows as $row)
{
//get the cells in the row if they are th or td. thanks Dormilich
$cells = $row->getElementsByTagName('td');
if(0 == $cells->length)
{
$cells = $row->getElementsByTagName('th');
}//from: if(0 == $cells->length)

//loop the cells
foreach ($cells as $cell)
{
//add to output array
//$rArr[$rCount][] = strip_tags($cell->nodeValue);
//Better avoid child elements nested within the cell. thanks Dormilich
$rArr[$rCount][] = $cell->firstChild->wholeText;
}//from: foreach ($cells as $cell){
//increment row count
$rCount++;
}//from: foreach ($rows as $row){
//Return output
return $rArr;
}//from: function getHtmlTableText($tag,$html,$strict=0){

Dormilich
02-13-2010, 09:26 PM
//loop the cells
foreach ($cells as $cell)
{
//add to output array
//$rArr[$rCount][] = strip_tags($cell->nodeValue);
//Better avoid child elements nested within the cell. thanks Dormilich
$rArr[$rCount][] = $cell->firstChild->wholeText;
}//from: foreach ($cells as $cell){

if you have the need to loop through the child nodes, like in

<td> some text<br>and some more, <a href="…">with a link</a> and even more text …</td>
then you’d need to loop through them (including a filter)

foreach ($cells->childNodes as $child)
{
// test the node type
if ($child instanceof DOMText) // no comments, text only
{
$rArr[$rCount][] = $child->data; // $child->wholeText;
}
}

MattyUK
02-13-2010, 10:26 PM
Humm ok so I need to look up more now.

I changed the source test data to:

$sHtml='<html><body><table><tr><th>hea<b>d</b>er1 and more</th><th>header2</th><th>header3</th>
...

and got:

[0] => Array
(
[0] => hea
[1] => header2
[2] => header3
...

So now I see what you mean.

Looking at your new code, I think it selects the child nodes within the cell (one nesting level deep) and, if they are text, adds each one as a new array element on the same dimension of the output.

If I'm about right, then your code should be placed within the existing foreach loop, with $cells changed to $cell. So something like this in total.


function getHtmlTableText($html){
//Source foundation: http://www.phpro.org/examples/Get-Text-Between-Tags.html
//Thanks to Dormilich
$dom = new domDocument;

//Load html into dom object
$dom->loadHTML($html);
//discard white space
$dom->preserveWhiteSpace = false;
//get the rows
$rows = $dom->getElementsByTagname('tr');
//initialize the output array
$rArr = array();
//row count int var
$rCount = 0;
//loop the rows
foreach($rows as $row)
{
//get the cells in the row if they are th or td. thanks Dormilich
$cells = $row->getElementsByTagName('td');
if(0 == $cells->length)
{
$cells = $row->getElementsByTagName('th');
}//from: if(0 == $cells->length)

//loop the cells

foreach ($cells as $cell)
{
//add to output array
//$rArr[$rCount][] = strip_tags($cell->nodeValue);
//Better avoid child elements nested within the cell. thanks Dormilich

///*
foreach ($cell->childNodes as $child)
{
// test the node type
if ($child instanceof DOMText) // no comments, text only
{
$rArr[$rCount][] = $child->data; // $child->wholeText;
}
}
//*/

//$rArr[$rCount][] = $cell->firstChild->wholeText;

}//from: foreach ($cells as $cell){

//increment row count
$rCount++;
}//from: foreach ($rows as $row){
//Return output
return $rArr;
}//from: function getHtmlTableText($tag,$html,$strict=0){



$sHtml='<html><body><table><tr><th>hea<b>d</b>er1 and more</th><th>header2</th><th>header3</th><th>header4</th><th>header4</th><th>header5</th><th>header6</th><th>header7</th><th>header8</th></tr><tr><td>value1</td><td>value2</td><td>value3</td><td>value4</td><td>value5</td><td>value6</td><td>value7</td><td>value8</td><td>value9</td></tr><tr><td>value10</td><td>value11</td><td>value12</td><td>value13</td><td>value14</td><td>value15</td><td>value16</td><td>value17</td><td>value18</td></tr><tr><td>value19</td><td>value20</td><td>value21</td><td>value22</td><td>value23</td><td>value24</td><td>value25</td><td>value26</td><td>value27</td></tr></table></body></html>';

echo '<pre>';print_r(getHtmlTableText($sHtml,"0"));echo '</pre>';


Trouble is I'm wrong, I think. The output is:

[0] => Array
(
[0] => hea
[1] => er1 and more
[2] => header2
[3] => header3
...

Whereas I'd ideally desire:

[0] => Array
(
[0] => header1 and more
[1] => header2
[2] => header3
[3] => header4
...
or even:

[0] => Array
(
[0] => heaer1 and more
[1] => header2
[2] => header3
[3] => header4
...

In truth I haven't considered what is best. I just thought about making the function as robust as possible to a badly formed source or a source containing malicious tags. I don't want to support inline html but I'd rather not lose the data if they throw in bold tags and such like.

Any insight would be very much appreciated. Thank you for the DOM learning experience. I'm still at snorkel depth thankfully.

Dormilich
02-13-2010, 11:17 PM
Trouble is I'm wrong, I think. The output is:

[0] => Array
(
[0] => hea
[1] => er1 and more
[2] => header2
[3] => header3
...

Nope, that’s correct. remember that we took only the text child nodes, which leaves <b>d</b> out (since that is an element child, not a text child).
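To make that visible, here is a minimal sketch that lists the direct children of the first header cell. describeChildren() is a made-up helper name, not something from the function above, and the HTML is cut down to the one cell that matters.

```php
<?php
// Hypothetical helper: describe each direct child of a node.
function describeChildren(DOMNode $node)
{
    $out = array();
    foreach ($node->childNodes as $child) {
        // get_class() reports DOMText, DOMElement, ...
        // textContent is available on every node type
        $out[] = get_class($child) . ': "' . $child->textContent . '"';
    }
    return $out;
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><table><tr><th>hea<b>d</b>er1 and more</th></tr></table></body></html>');

// first <th> of the document
$th = $dom->getElementsByTagName('th')->item(0);

$children = describeChildren($th);
print_r($children);
```

The cell has three children: a text node, the b element, and another text node. A loop that keeps only DOMText children therefore stores two separate strings and never sees the "d".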


Whereas I'd ideally desire:

[0] => Array
(
[0] => header1 and more
[1] => header2
[2] => header3
[3] => header4
...

then you probably have to make a self calling function, like


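// sketch: collect the text of a node and of all its descendants
function getTextFromElement(DOMNode $node)
{
    $text = '';
    foreach ($node->childNodes as $child)
    {
        if ($child instanceof DOMText)
        {
            // plain text: append it
            $text .= $child->data;
        }
        elseif ($child instanceof DOMElement)
        {
            // nested element: recurse and append whatever text it holds
            $text .= getTextFromElement($child);
        }
    }
    return $text;
}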

In truth I haven't considered what is best. I just thought about making the function as robust as possible to a badly formed source or a source containing malicious tags. I don't want to support inline html but I'd rather not lose the data if they throw in bold tags and such like.

this could turn into quite an expanding project. considering what the browser vendors have to cope with.

_______________

informational section

that’s where I prefer JavaScript … guess you might find some familiar structure though

// printing out an element’s source code
Element.prototype.getString = function ()
{
var str, atb, l, i, tag, cn;
// tag name
tag = this.nodeName.toLowerCase();
// opening tag
str = "<" + tag;
// attributes
atb = this.attributes;
for (i = 0, l = atb.length; i < l; i++) {
if (i in atb) {
str += " " + atb[i].name + '="' + atb[i].value + '"';
}
}
cn = this.childNodes;
// if empty element
if (0 === cn.length) {
return str + "/>";
} else {
str += ">";
}
// loop through child nodes (text/cdata/elements)
for (i = 0, l = cn.length; i < l; i++) {
// text or cdata
if (cn[i] instanceof Text) {
if (String.prototype.trim) {
str += cn[i].data.trim();
} else {
str += cn[i].data;
}
}
// elements (recursive)
else if (cn[i] instanceof Element) {
str += cn[i].getString();
}
}
// closing tag
return str + "</" + tag + ">";
};

MattyUK
02-13-2010, 11:38 PM
this could turn into quite an expanding project. considering what the browser vendors have to cope with.

lol. Yup, I've already put more time in than I thought it might take. I don't think I would have got this far without your help. Browser makers certainly have my respect.

Previously I just accepted that nodeValue shouldn't be used. But since it gives me the desired result without a loop through the children, can I ask why it is best avoided? I'm assuming I could apply htmlspecialchars or strip_tags to the data before it hits the array for additional safety.


function getHtmlTableText($html){
//Source foundation: http://www.phpro.org/examples/Get-Text-Between-Tags.html
//Thanks to Dormilich
$dom = new DOMDocument;

//discard white space (only considered if set before loading)
$dom->preserveWhiteSpace = false;
//Load html into dom object
$dom->loadHTML($html);
//get the rows
$rows = $dom->getElementsByTagName('tr');
//initialize the output array
$rArr = array();
//row count int var
$rCount = 0;
//loop the rows
foreach($rows as $row)
{
//get the cells in the row if they are th or td. thanks Dormilich
$cells = $row->getElementsByTagName('td');
if(0 == $cells->length)
{
$cells = $row->getElementsByTagName('th');
}//from: if(0 == $cells->length)

//loop the cells

foreach ($cells as $cell)
{
//add to output array
$rArr[$rCount][] = $cell->nodeValue;
//$rArr[$rCount][] = htmlspecialchars(strip_tags($cell->nodeValue), ENT_QUOTES, 'UTF-8');

}//from: foreach ($cells as $cell){

//increment row count
$rCount++;
}//from: foreach ($rows as $row){
//Return output
return $rArr;
}//from: function getHtmlTableText($html){


$sHtml='<html><body><table><tr><th>hea<b>d</b>er1 and <br/> more</th><th>header2</th><th>header3</th><th>header4</th><th>header4</th><th>header5</th><th>header6</th><th>header7</th><th>header8</th></tr><tr><td>value1</td><td>value2</td><td>value3</td><td>value4</td><td>value5</td><td>value6</td><td>value7</td><td>value8</td><td>value9</td></tr><tr><td>value10</td><td>value11</td><td>value12</td><td>value13</td><td>value14</td><td>value15</td><td>value16</td><td>value17</td><td>value18</td></tr><tr><td>value19</td><td>value20</td><td>value21</td><td>value22</td><td>value23</td><td>value24</td><td>value25</td><td>value26</td><td>value27</td></tr></table></body></html>';

echo '<pre>';print_r(getHtmlTableText($sHtml));echo '</pre>';

Outputs:

Array
(
[0] => Array
(
[0] => header1 and more
[1] => header2
[2] => header3
[3] => header4
[4] => header4
...


For you to suggest nodeValue shouldn't be used to get the data probably means I'm missing some nasty 'gotcha' down the line.

Thanks.

Dormilich
02-14-2010, 12:10 AM
the point is, I can’t say exactly what the value of an Element really is (you could say it’s its text, but what about nested elements then?). the nodeValue of a text node, on the other hand, is obvious and free from any further tags. but then you can just as well use data to make it clear and safe (only character data nodes such as text nodes have a data property, elements don’t!).

EDIT: looked up the nodeValue (http://www.w3.org/TR/DOM-Level-3-Core/core.html#ID-F68D080) of Element, it turned out DOM defines it as … null.

the problem with the inline elements is that while we see formatted text, the DOM sees a collection of nodes, which do not have to make lexical sense on their own.

IMO, the only reason why Element has a nodeValue is that it inherits it from its parent Node, which in turn kind of makes sense …
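A note for the PHP side, as a sketch rather than a verdict: as far as I can tell, PHP's DOM extension does not follow the spec here and returns the text content for the nodeValue of a DOMElement, which would explain why the function above gets the whole cell text. The variable names below are made up for the comparison.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><table><tr><th>hea<b>d</b>er1 and more</th></tr></table></body></html>');

// first <th> of the document
$th = $dom->getElementsByTagName('th')->item(0);

// value of the element as PHP reports it
$viaNodeValue = $th->nodeValue;
// text of the element and all of its descendants
$viaTextContent = $th->textContent;
// data of the first child only (a DOMText in this markup)
$firstTextOnly = $th->firstChild->data;

var_dump($viaNodeValue === $viaTextContent);
echo $viaTextContent, "\n";
echo $firstTextOnly, "\n";
```

Since other DOM implementations may follow the spec and give null for an element's nodeValue, textContent is the spelling that says what is meant.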

MattyUK
02-14-2010, 05:54 AM
Well thank you again. This has been an education in DOM. I'm pretty sure I wasn't on the right track with my previous reading on the matter. This function and your help have made a difference. I just need to build on it now. I guess I'll use nodeValue for now until I can navigate a little better or want to tackle the recursive function.

I wish there was a way to better examine an object to discover the possible values/attributes/methods etc.

I understand the DOM doesn't have to make lexical sense, but I need a way of visualizing a 'structure' before it makes all that much sense. More reading on the horizon I guess.

Any books you can recommend? The manual isn't cutting it for me.

Dormilich
02-14-2010, 08:39 AM
I understand the DOM doesn't have to make lexical sense, but I need a way of visualizing a 'structure' before it makes all that much sense. More reading on the horizon I guess.

Any books you can recommend? The manual isn't cutting it for me.

problem is, the only PHP book I ever touched is "PHP 5 Power Programming", and that was only the OOP chapter … other than that, online articles and lots of programming experience*.

everything I learned about DOM I got from working with JavaScript, and the DOM specs @ W3C (DOM-2 (http://www.w3.org/TR/DOM-Level-2-Core/core.html), DOM-3 (http://www.w3.org/TR/DOM-Level-3-Core/core.html))

* - and another one that helps: “thinking like a parser” (logic required)

Dormilich
02-14-2010, 08:47 AM
$cell->nodeValue
// should be
$cell->firstChild->data
// or
$cell->firstChild->wholeText
// (EDIT) or
$cell->textContent

found it! textContent is defined, I only looked at the wrong interface. use that instead of nodeValue.

textContent

This attribute returns the text content of this node and its descendants
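Putting it together, a minimal sketch of the table function built on textContent. tableToArray() is a made-up name, and the libxml error handling and whitespace collapsing are my assumptions about what a robust version might want, not part of the function discussed above.

```php
<?php
// Sketch: turn every <tr> into a row of cell texts, csv style.
function tableToArray($html)
{
    $dom = new DOMDocument();

    // assumption: the source may be malformed, so keep parser warnings quiet
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = array();
    foreach ($dom->getElementsByTagName('tr') as $row) {
        $cells = array();
        foreach ($row->childNodes as $cell) {
            // only direct th/td children count as cells
            if ($cell instanceof DOMElement
                && in_array($cell->nodeName, array('th', 'td'), true)) {
                // textContent includes the text of nested elements such as <b>;
                // runs of whitespace are collapsed to a single space
                $cells[] = trim(preg_replace('/\s+/', ' ', $cell->textContent));
            }
        }
        $rows[] = $cells;
    }
    return $rows;
}

$html = '<html><body><table>'
      . '<tr><th>hea<b>d</b>er1 and <br/> more</th><th>header2</th></tr>'
      . '<tr><td>value1</td><td>value2</td></tr>'
      . '</table></body></html>';

$result = tableToArray($html);
print_r($result);
```

Walking the direct children of each row handles th and td in one pass, so a row that mixes both keeps its cell order. Anything echoed back into a page would still want htmlspecialchars at output time.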

MattyUK
02-14-2010, 02:12 PM
Humm interesting, thank you.

I need to start the DOM3 specification at the beginning. I've only ever dropped in midway to use it for reference and got confused about how to read the document. I got more confused after that, since I'm not entirely sure how to access the properties it talks about. Whenever I think something should be there, it isn't.


$rArr[$rCount][] = $cell->textContent;

Works very well. Thank you again.

How did you determine textContent applied?

From my perspective we had a DOMNode 'object'. How could we determine more about it, like its type and which properties apply? Is there a function to examine the DOM object to learn more? We can var_dump or print_r a multi-dimensional array to discover more about it, so how do we go about it with DOMNode objects?

I think if there is a way it'd help the current 1) trial, 2) error, 3) run to the php manual and get confused about which bit applies, then 4) cry for help approach.

Dormilich
02-14-2010, 11:06 PM
start with the DOM-2, it’s relatively well arranged (DOM-3 is much more crowded)


How did you determine textContent applied?

I (coincidentally) checked the right interface. and I know it from JavaScript DOM.


From my perspective we had a DOMNode 'object'. How could we determine more about it, like its type and which properties apply? Is there a function to examine the DOM object to learn more? We can var_dump or print_r a multi-dimensional array to discover more about it, so how do we go about it with DOMNode objects?

I don’t know of any function. you probably have to write your own. maybe a foreach will do …
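A rough sketch of what such a home-made inspector could look like. describeNode() and outlineTree() are hypothetical names; the sketch relies only on get_class(), get_class_methods() and the standard DOMNode properties nodeType, nodeName and childNodes.

```php
<?php
// Hypothetical helper: summarise one DOM node for debugging.
function describeNode(DOMNode $node)
{
    return array(
        'class'    => get_class($node),  // e.g. DOMElement, DOMText
        'nodeType' => $node->nodeType,   // e.g. XML_ELEMENT_NODE, XML_TEXT_NODE
        'nodeName' => $node->nodeName,   // tag name, or #text for text nodes
        'children' => $node->childNodes ? $node->childNodes->length : 0,
    );
}

// Hypothetical helper: indented outline of a node and its descendants.
function outlineTree(DOMNode $node, $depth = 0)
{
    $line = str_repeat('  ', $depth) . get_class($node) . ' ' . $node->nodeName;
    if ($node instanceof DOMText) {
        $line .= ' "' . $node->data . '"';
    }
    $out = $line . "\n";
    // childNodes may be empty (or null) for nodes that cannot have children
    if ($node->childNodes) {
        foreach ($node->childNodes as $child) {
            $out .= outlineTree($child, $depth + 1);
        }
    }
    return $out;
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><table><tr><th>hea<b>d</b>er1</th></tr></table></body></html>');
$th = $dom->getElementsByTagName('th')->item(0);

$info = describeNode($th);
print_r($info);
echo outlineTree($th);

// the class name says which manual page to read; the methods can be listed too
print_r(get_class_methods($th));
```

The class name is the useful part: once you know the node is a DOMElement or a DOMText, the matching manual page lists the properties that apply, including the ones inherited from DOMNode.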

MattyUK
02-15-2010, 04:42 PM
Cool. Well thank you. I've learned a good deal and the function works well for me. DOM-2 it is.


