Now I hail from the PHP4 days, so I thought regex might be a way forward. A quick bounce around Google and I had the message DON'T USE REGEX TO PARSE HTML thrown at me more than a few times.
/*
 * @get text between tags
 * @param string $tag The tag name
 * @param string $html The XML or XHTML string
 * @param int $strict Whether to use strict mode
 * @return array
 */
function getTextBetweenTags($tag, $html, $strict=0)
{
    /*** a new dom object ***/
    $dom = new domDocument;
    /*** load the html into the object ***/
    if($strict==1)
    {
        $dom->loadXML($html);
    }
    else
    {
        $dom->loadHTML($html);
    }
    /*** discard white space ***/
    $dom->preserveWhiteSpace = false;
    /*** the tag by its tag name ***/
    $content = $dom->getElementsByTagname($tag);
    /*** the array to return ***/
    $out = array();
    foreach ($content as $item)
    {
        /*** add node value to the out array ***/
        $out[] = $item->nodeValue;
    }
    /*** return the results ***/
    return $out;
}
Now I understand some of it but am not familiar with the PHP DOM. Yes, I've read his tutorial and much of the manual. I haven't yet reached that magic function or explained example that helps it fall into place in my head.
I could chunk the array down into what I need (only discovered that yesterday, thank you forum), but I suspect it will be much better to alter the function above to give the result I need.
If well explained I also think it'd be a good introduction and practical example of the PHP DOM scripting that is new to me.
Would anyone mind showing me how to adjust the function above to be what I need and perhaps explain it a little as they go?
PHP DOM, in my humble opinion, is brilliant! But it does take a little getting used to. The easiest way to start learning is to get your head around the actual DOM, that is, the Document Object Model. Think of it as a tree of nodes: some have children, some have parents and some have siblings. Using PHP DOM you can reference various elements by using these 'ancestor' references.
The basic steps are: you create a DOMDocument with $dom = new DOMDocument();
Then you load a string or a file (XML, HTML, etc.) into this DOMDocument and use the DOM functions (such as getElementsByTagName) to fetch items from the DOMDocument.
You can then use these elements as you see fit. Look into SimpleXML too, as that is also a powerful parser for PHP.
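As a rough illustration of those three steps (create, load, fetch), here is a minimal sketch. The sample table markup and the variable names are made up for the example and are not from the thread.

```php
<?php
// Hypothetical sample markup, only for illustration.
$html = '<table>'
      . '<tr><td>Alice</td><td>30</td></tr>'
      . '<tr><td>Bob</td><td>25</td></tr>'
      . '</table>';

// Step 1: create the document object.
$dom = new DOMDocument();

// Step 2: load a string into it. loadHTML() tolerates sloppy markup,
// whereas loadXML() insists on well-formed XML.
$dom->loadHTML($html);

// Step 3: fetch items with a DOM function and read their text.
$cellTexts = array();
foreach ($dom->getElementsByTagName('td') as $td) {
    $cellTexts[] = $td->nodeValue;
}

print_r($cellTexts);
```

Note that this gives one flat list of cell values, which is exactly the limitation discussed further down the thread.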
Firstly thank you for your reply. It's good to hear the DOM is worth learning. I've dabbled in javascript DOM before but never in PHP until now.
Wouldn't your adjustment give a single dimension associative array?
The $tag variable would be the "td" supplied as the function parameter, so every value is written to the same key. That creates a single dimension associative array that is overwritten on each pass of the loop, so it only contains the last value.
PHP Code:
/*
 * @get text between tags
 * @param string $tag The tag name
 * @param string $html The XML or XHTML string
 * @param int $strict Whether to use strict mode
 * @return array
 */
function getTextBetweenTags($tag, $html, $strict=0)
{
    /*** a new dom object ***/
    $dom = new domDocument;
    /*** load the html into the object ***/
    if($strict==1)
    {
        $dom->loadXML($html);
    }
    else
    {
        $dom->loadHTML($html);
    }
    /*** discard white space ***/
    $dom->preserveWhiteSpace = false;
    /*** the tag by its tag name ***/
    $content = $dom->getElementsByTagname($tag);
    /*** the array to return ***/
    $out = array();
    foreach ($content as $item)
    {
        /*** add node value to the out array ***/
        $out[$tag] = $item->nodeValue;
    }
    /*** return the results ***/
    return $out;
}
Thank you for the idea however. I guess one approach might be to get the code to step through any child tags. Then we could pass it the tr tag knowing it'll get the values from all td tags it contained. I'm not sure of the array insert code that would be needed. Simply haven't thought about it yet. Might try a few mock ups now.
Quote:
Firstly thank you for your reply. It's good to hear the DOM is worth learning. I've dabbled in javascript DOM before but never in PHP until now.
PHP DOM is just as easy as JavaScript DOM, because DOM is a language independent API. you only have to fit it in the actual language’s syntax. (and there could hardly be a greater difference)
__________________
please post your code wrapped in [CODE] [/CODE] tags
ahallicks, that modification would just make an associative array again. 2 dimensions yes, but not grouped via the row.
The problem, I think, is the $tag variable, since it is simply the string passed to tell the function which tag in the DOM to target. Because each of the values returned comes from the same tag type, it will overwrite the previous one or, in the second case, be lumped together under the single associative array key it creates.
Check out what I mean below:
PHP Code:
/*
 * @get text between tags
 * @param string $tag The tag name
 * @param string $html The XML or XHTML string
 * @param int $strict Whether to use strict mode
 * @return array
 */
function getTextBetweenTags($tag, $html, $strict=0)
{
    /*** a new dom object ***/
    $dom = new domDocument;
    /*** load the html into the object ***/
    if($strict==1)
    {
        $dom->loadXML($html);
    }
    else
    {
        $dom->loadHTML($html);
    }
    /*** discard white space ***/
    $dom->preserveWhiteSpace = false;
    /*** the tag by its tag name ***/
    $content = $dom->getElementsByTagname($tag);
    /*** the array to return ***/
    $out = array();
    foreach ($content as $item)
    {
        /*** add node value to the out array ***/
        $out[$tag][] = $item->nodeValue;
    }
    /*** return the results ***/
    return $out;
}
I think this is because the call to getElementsByTagname($tag); essentially says get all the TD's, irrespective of their relationship as children of any specific TR tag.
Even if we somehow changed it to reference the parent node name perhaps...
The next 'row' would still overwrite the previous ones values since they are using the same associative array key.
I don't know the dom commands/functions to start at a TR collect all child TD elements then move onto the next TR.
Dormilich, Then I may not have been using the dom after all. I just recall accessing elements under the old document.all. structure and changing attributes, reading values etc.
I can picture the dom structure but don't know how to move around it well enough in PHP. Hence asking here. How would we foreach each TD child element of every TR element?
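One way that question could be approached, as a minimal sketch: getElementsByTagName() can be called on an element as well as on the document, so calling it on each TR returns only the TDs inside that row. The sample markup and the names $flat and $byRow are assumptions made for the example.

```php
<?php
// Hypothetical two-row table, only for illustration.
$html = '<table>'
      . '<tr><td>a1</td><td>a2</td></tr>'
      . '<tr><td>b1</td><td>b2</td></tr>'
      . '</table>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Called on the document: every TD in one flat list, row grouping lost.
$flat = array();
foreach ($dom->getElementsByTagName('td') as $td) {
    $flat[] = $td->nodeValue;
}

// Called on each TR element: only the TDs that belong to that row.
$byRow = array();
foreach ($dom->getElementsByTagName('tr') as $tr) {
    $cells = array();
    foreach ($tr->getElementsByTagName('td') as $td) {
        $cells[] = $td->nodeValue;
    }
    $byRow[] = $cells;
}

print_r($byRow);
```

Here $flat holds four values in a row, while $byRow holds one inner array per TR.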
Thank you. I saw the dom functions firstChild, next sibling but couldn't quite see how to look at only siblings of a particular type. I guess I was over-thinking it. This approach should work. Thank you. I'll rework the code based on a nested loop approach.
Thanks again Dormilich. I guess a night's sleep helped me see your point more clearly too.
Thanks for your help. I slapped myself upside the head and looked at it again from a nested viewpoint. I have this so far.
PHP Code:
function getHtmlTableText($html){
    //Source foundation: http://www.phpro.org/examples/Get-Text-Between-Tags.html
    $dom = new domDocument;
    //How should I sanitize the $html input? Do I need to since using dom?
    //Load html into dom object
    $dom->loadHTML($html);
    //discard white space
    $dom->preserveWhiteSpace = false;
    //get the rows
    $rows = $dom->getElementsByTagname('tr');
    //initialize the output array
    $rArr = array();
    //row count int var
    $rCount = 0;
    //loop the rows
    foreach ($rows as $row)
    {
        //How to cleanly accommodate header cells? Don't want to replicate the entire loop
        //get the cells in the row
        $cells = $row->getElementsByTagname('td');
        //try OR
        //$cells = $row->getElementsByTagname('th'||'td');//Bad.
        //Concat?
        //$cells = $row->getElementsByTagname('th');
        //$cells .= $row->getElementsByTagname('td');//Bad
        //Addition?
        //$cells = $cells + $row->getElementsByTagname('td');//Bad
        //How do you join two domNode objects together? or use getElementsByTagname with OR multiple tags
        //loop the cells
        foreach ($cells as $cell)
        {
            //add to output array
            $rArr[$rCount][] = $cell->nodeValue;
        }//from: foreach ($cells as $cell){
        //increment row count
        $rCount++;
    }//from: foreach ($rows as $row){
    //Return output
    return $rArr;
}//from: function getHtmlTableText($tag,$html,$strict=0){
I put the remaining questions inside the code, but this is working so far. Thanks.
I set aside the idea of stepping through the nodes with increments and went with a foreach after getting all nodes of a particular type. My thinking was that they may change the format down the line and introduce comment tags, or the source may be malformed.
Does that make sense or am I off on a wrong track again?
Quote:
Node->nodeType, Node->nodeName, Node->localName, Element->tagName. would work too, but you need more check statements.
I'll look those up. Thank you.
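For reference, a minimal sketch of what walking a row's child nodes with those checks might look like. The sample row (with a comment in it) is made up for the example; XML_ELEMENT_NODE is one of the DOM extension's predefined constants.

```php
<?php
// Hypothetical row mixing a header cell, a data cell, and a comment.
$html = '<table><tr> <th>Name</th> <td>Alice</td> <!-- note --> </tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$row = $dom->getElementsByTagName('tr')->item(0);

$cells = array();
foreach ($row->childNodes as $child) {
    // Skip anything that is not an element (whitespace text, comments).
    if ($child->nodeType !== XML_ELEMENT_NODE) {
        continue;
    }
    // Keep header and data cells alike, in document order.
    $name = strtolower($child->nodeName);
    if ($name === 'th' || $name === 'td') {
        $cells[] = trim($child->nodeValue);
    }
}

print_r($cells);
```

This is the "more check statements" trade-off from the quote: it is longer than a getElementsByTagName() call, but it copes with TH and TD in the same row.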
My goal now is to answer the questions in the code above. Particularly how to handle the header row as cleanly as possible.
Thanks for everyone's help so far.
Last edited by MattyUK; 02-13-2010 at 04:08 PM..
Reason: typos. code header question attempts added
Quote:
How should I sanitize the $html input? Do I need to since using dom?
something like validating against a HTML DTD?
Quote:
Anyway of saying tag A or B? th or td. Or if there is a way of appending two domNode objects together. $headers.$cells?
you can of course merge arrays (thanks to PHP not sticking to the DOM output data types (that wouldn’t work in JavaScript)). you could also say: get all TDs and if there are none get all THs.
$headers.$cells wouldn’t work anyway (string concatenation on arrays!)
Well the $html is essentially user input. I know I need to check it for what is expected but not sure of the best approach in this case. Validate against a DTD! Can you give me a pointer to starting on that approach if there is no better way of thwarting malicious code in the source.
Quote:
$headers.$cells wouldn’t work anyway (string concatenation on arrays!)
I thought objects were returned, not arrays. I haven't a single clue how to concat objects, so I gave everything I could think of a go. I'm not even sure how to examine an object fully to discover more about it. var_dump isn't helping all that much. I was just eager not to introduce more loops if possible.
I tested they were objects with this code.
PHP Code:
...
//get the cells in the row
$cells = $row->getElementsByTagname('td');
if(is_object($cells)){return '$cells is an object';}
if(is_array($cells)){return '$cells is an array';}
...
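A few standard PHP calls can say more about such an object than is_object() does. The following is only a sketch with made-up sample markup; it relies on get_class(), instanceof, get_class_methods() and the length property and item() method of DOMNodeList.

```php
<?php
// Hypothetical one-row table, only for illustration.
$dom = new DOMDocument();
$dom->loadHTML('<table><tr><td>a</td><td>b</td></tr></table>');
$cells = $dom->getElementsByTagName('td');

// Which class is it?
$class = get_class($cells);

// Is it the node list type described in the manual?
$isNodeList = $cells instanceof DOMNodeList;

// A node list is not an array, but it knows its size
// and can hand out single nodes by index.
$count = $cells->length;
$first = $cells->item(0)->nodeValue;

// Which methods does it offer?
$methods = get_class_methods($cells);

echo $class, ' with ', $count, ' items, first is "', $first, '"', "\n";
print_r($methods);
```

Knowing the class name makes it easy to look up the matching page in the PHP manual.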
Anyway thanks to your help I now have the following code. I'd appreciate your feedback or improvements on it:
PHP Code:
function getHtmlTableText($html){
    //Source foundation: http://www.phpro.org/examples/Get-Text-Between-Tags.html
    //Thanks to Dormilich for help.
    $dom = new domDocument;
    //How should I sanitize the $html input? Do I need to? Won't the dom parsing simply fail if it is badly formatted/encoded.
    //Load html into dom object
    $dom->loadHTML($html);
    //discard white space
    $dom->preserveWhiteSpace = false;
    //get the rows
    $rows = $dom->getElementsByTagname('tr');
    //initialize the output array
    $rArr = array();
    //row count int var
    $rCount = 0;
    //loop the rows
    foreach ($rows as $row)
    {
        //get the cells in the row if they are td or th
        if(strtolower($row->firstChild->nodeName)=='th'||strtolower($row->firstChild->nodeName)=='td')
        {
            $cells = $row->getElementsByTagname($row->firstChild->nodeName);
        }
        else
        {
            //If both td and th fail then what on earth are we reading??
            //Better run away.
            return false;
        }
        //$cells = $row->getElementsByTagname('td');
        //if(is_object($cells)){return '$cells is an object';}
        //if(is_array($cells)){return '$cells is an array';}
        //loop the cells
        foreach ($cells as $cell)
        {
            //add to output array
            $rArr[$rCount][] = $cell->nodeValue;
        }
        //increment row count
        $rCount++;
    }//from: foreach ($rows as $row){
    //Return output
    return $rArr;
}
Quote:
Well the $html is essentially user input. I know I need to check it for what is expected but not sure of the best approach in this case. Validate against a DTD! Can you give me a pointer to starting on that approach if there is no better way of thwarting malicious code in the source.
prepending a DTD can be done before loading into the DOMDocument.
that’s a DTD => <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
validating is done through DOMDocument->validate(); although you should be prepared that most users probably don’t know that there is an HTML standard at all, and therefore the validation fails.
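To show the prepend-then-validate idea without depending on a download of the W3C DTD, here is a minimal sketch that swaps in a tiny inline DTD and loadXML(). The DTD, the helper name isValidTable() and the sample markup are all assumptions made for the example; it is not the HTML 4.01 DTD from the post, and passing validation says nothing about whether the content is safe to output.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical minimal DTD: a table of rows holding only th or only td cells.
$dtd = '<!DOCTYPE table ['
     . '<!ELEMENT table (tr+)>'
     . '<!ELEMENT tr (th+|td+)>'
     . '<!ELEMENT th (#PCDATA)>'
     . '<!ELEMENT td (#PCDATA)>'
     . ']>';

// Hypothetical helper: prepend the DTD, load, then validate against it.
function isValidTable($dtd, $markup)
{
    $dom = new DOMDocument();
    if (!$dom->loadXML($dtd . $markup)) {
        return false; // not even well-formed
    }
    return $dom->validate();
}

$good = isValidTable($dtd, '<table><tr><td>a</td></tr></table>');
$bad  = isValidTable($dtd, '<table><tr><script>x</script></tr></table>');

var_dump($good, $bad);
```

The second call fails because the DTD declares no script element, which is the kind of "unexpected input" check the validation approach offers.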
Quote:
I thought objects were returned, not arrays. I haven't a single clue how to concat objects, so I gave everything I could think of a go. I'm not even sure how to examine an object fully to discover more about it. var_dump isn't helping all that much. I was just eager not to introduce more loops if possible.
hm, makes sense after all. well, currently I see no way to merge those 2 objects.
the objects returned conform to the DOM, that is, every method or property is listed in the DOM (resp. in the PHP manual)
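For what it's worth, two possible ways of getting TH and TD cells together, offered as a sketch rather than as the thread's solution: copying both node lists into arrays with iterator_to_array() and merging them, or asking DOMXPath for the union th|td relative to the row. The sample row and variable names are made up for the example.

```php
<?php
// Hypothetical row with one header cell and two data cells.
$dom = new DOMDocument();
$dom->loadHTML('<table><tr><th>Name</th><td>Alice</td><td>30</td></tr></table>');
$row = $dom->getElementsByTagName('tr')->item(0);

// Option 1: turn each node list into a plain array, then merge the arrays.
// All TH nodes come first, followed by all TD nodes.
$merged = array_merge(
    iterator_to_array($row->getElementsByTagName('th'), false),
    iterator_to_array($row->getElementsByTagName('td'), false)
);
$mergedTexts = array();
foreach ($merged as $cell) {
    $mergedTexts[] = $cell->nodeValue;
}

// Option 2: an XPath union evaluated relative to the row.
// It matches direct TH or TD children in document order.
$xpath = new DOMXPath($dom);
$xpathTexts = array();
foreach ($xpath->query('th|td', $row) as $cell) {
    $xpathTexts[] = $cell->nodeValue;
}

print_r($mergedTexts);
print_r($xpathTexts);
```

The XPath variant avoids the second lookup and keeps the cells in the order they appear in the row.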
Quote:
Anyway thanks to your help I now have the following code. I'd appreciate your feedback or improvements on it:
although it doesn’t seem to matter, always write the method and property names correctly cased (i.e. getElementsByTagName, not getElementsbyTagname), other languages (e.g. JavaScript) will throw an error there.
(comments inside)
Quote:
PHP Code:
function getHtmlTableText($html){
    //Source foundation: http://www.phpro.org/examples/Get-Text-Between-Tags.html
    //Thanks to Dormilich for help.
    $dom = new domDocument;
    //How should I sanitize the $html input? Do I need too? Won't the dom parsing simply fail if it is badly formatted/encoded.
    # there is W3C’s "Tidy" … I haven’t used it with PHP (actually I don’t use it, because I know how valid HTML looks like)
    //Load html into dom object
    $dom->loadHTML($html);
    //discard white space
    $dom->preserveWhiteSpace = false;
    //get the rows
    $rows = $dom->getElementsByTagname('tr');
    //initialize the output array
    $rArr = array();
    //row count int var
    $rCount = 0;
    //loop the rows
    foreach ($rows as $row)
    {
        //get the cells in the row if they are td or th
        # see below
        if(strtolower($row->firstChild->nodeName)=='th'||strtolower($row->firstChild->nodeName)=='td')
        # if you’re unlucky, the first child is neither TD or TH, but the second one is
        {
            $cells = $row->getElementsByTagname($row->firstChild->nodeName);
        }
        else
        {
            //If both td and th fail then what on earth are we reading??
            # invalid code ;)
            //Better run away.
            # not necessary, $cells will be empty and thus the loop not executed
            return false;
        }
        //$cells = $row->getElementsByTagname('td');
        //loop the cells
        foreach ($cells as $cell)
        {
            //add to output array
            $rArr[$rCount][] = $cell->nodeValue;
        }
        //increment row count
        $rCount++;
    }//from: foreach ($rows as $row){
    //Return output
    return $rArr;
}
function getHtmlTableText($html){
    //Source foundation: http://www.phpro.org/examples/Get-Text-Between-Tags.html
    //Thanks to Dormilich
    $dom = new domDocument;
    /*Later on. Sanitize HTML.
    Dormilich: that’s a DTD => <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    validating is done through DOMDocument->validate();
    Source isn't valid html but is what is provided.
    Not sure if this will protect against malicious but valid code. Scripts etc so could use htmlspecialchars and or strip_tags.
    */
    //Load html into dom object
    $dom->loadHTML($html);
    //discard white space
    $dom->preserveWhiteSpace = false;
    //get the rows
    $rows = $dom->getElementsByTagname('tr');
    //initialize the output array
    $rArr = array();
    //row count int var
    $rCount = 0;
    //loop the rows
    foreach($rows as $row)
    {
        //get the cells in the row if they are th or td
        //this approach doesn't rely on firstChild requirement
        $cells = $row->getElementsByTagName('th');
        if(0 == $cells->length)
        {
            $cells = $row->getElementsByTagName('td');
        }//from: if(0 == $cells->length)
        //loop the cells
        foreach ($cells as $cell)
        {
            //add to output array
            //Note: Look up CharacterData->data or Text->wholeText??? rather than nodeValue
            //$rArr[$rCount][] = htmlspecialchars($cell->nodeValue);
            $rArr[$rCount][] = strip_tags($cell->nodeValue);
        }//from: foreach ($cells as $cell)
        //increment row count
        $rCount++;
    }//from: foreach ($rows as $row)
    //Return output
    return $rArr;
}//from: function getHtmlTableText($tag,$html,$strict=0){
$cell->nodeValue
// should be
$cell->firstChild->data
// or
$cell->firstChild->wholeText
// because that makes sure, you actually get text.
// DOM-3 also defines textContent, which PHP's DOMNode exposes as a property
$cell->textContent
ah, and the text of an element does not contain tags (strip_tags() is not required), because they are child elements. on the other hand you could loop through all child elements and return the text data …
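A small sketch of both points, using a made-up cell that contains a nested element: nodeValue already comes back without tags, and looping over the child nodes lets you pick up only the text that sits directly in the cell. The markup and variable names are assumptions for the example.

```php
<?php
// Hypothetical cell with a nested <b> element.
$dom = new DOMDocument();
$dom->loadHTML('<table><tr><td>Hello <b>bold</b> world</td></tr></table>');
$cell = $dom->getElementsByTagName('td')->item(0);

// nodeValue is the concatenated text of the cell and its descendants.
// The <b> tag itself is a child element, so no markup appears here.
$all = $cell->nodeValue;

// Looping through the child nodes and keeping only text nodes
// returns the text that belongs directly to the cell.
$direct = '';
foreach ($cell->childNodes as $child) {
    if ($child->nodeType === XML_TEXT_NODE) {
        $direct .= $child->data;
    }
}

var_dump($all, $direct);
```

Since $all contains no tags to begin with, running strip_tags() over it leaves it unchanged.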
if I were mean, I’d explain how to do that with SimpleXML or XSLT-deserialisation … but I didn’t want to ruin the DOM learning experience.
Quote:
Originally Posted by MattyUK
I feel as if I learned a lot from your help. Thank you.
I don’t mind getting a reputation for that. *gg*
Last edited by Dormilich; 02-13-2010 at 08:11 PM..