Any function/script/snippet to output only textual content from a URL?


student
01-10-2007, 06:35 PM
Hello,

You may have seen "search engine spider simulator" tools online that show only the textual content of a web page, stripping all HTML code, JavaScript, images, etc.

Is there any function/script/snippet to output only the textual content from a given URL?

Thank you.

Fumigator
01-10-2007, 06:38 PM
There is the function strip_tags() (http://us3.php.net/manual/en/function.strip-tags.php) which you may find useful.
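
For illustration, here is a minimal sketch of what strip_tags() does to a small piece of markup. The sample string is made up for this example and is not taken from any real page. Note that strip_tags() removes the tags themselves but keeps whatever text sits between them, so the source text of an inline script survives.

```php
<?php
// Hypothetical sample markup, invented for this illustration.
$html = '<p>Hello <b>world</b></p><script>alert("hi");</script>';

// strip_tags() deletes the tags but keeps the text between them,
// including the JavaScript source inside <script>...</script>.
$text = strip_tags($html);

echo $text, "\n";
```

So strip_tags() on its own is enough for markup-only pages, but script and style blocks need to be removed separately before calling it.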

student
01-10-2007, 06:59 PM
But I think it doesn't strip JavaScript. I would like to output only the textual content from the input website URL.
Can anybody please tell me how to do this?
Thank you.

Fumigator
01-10-2007, 07:17 PM
Yeah, it's not the only step, but it is a step: strip_tags() removes the tags but leaves the code between them. To strip JavaScript you'd want to use preg_replace() to replace everything between <script and /script> with ''. You'd probably want to do the same for everything between <head> and </head>.

student
01-10-2007, 07:40 PM
Thanks for the information.
I am new to PHP.
Can you please show me how to use this function with regard to <script and /script>?

Thank you.

Fumigator
01-10-2007, 08:07 PM
The syntax is preg_replace($pattern, $replacement, $string). I'm no good with regular expressions, so I'm not going to attempt to write the pattern, but the code would look something like:


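// Placeholder pattern. PCRE patterns need delimiters, here the two slashes.
$pattern = '/insert your regular expression here/';
// Replacing each match with an empty string deletes it.
$replacement = '';
// Stand-in for the HTML of the page you are parsing.
$string = '<script>code from the page you are parsing</script>';

// $rawstring receives a copy of $string with every match removed.
$rawstring = preg_replace($pattern, $replacement, $string);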


If you are not familiar with regular expressions, Google is a great place to start.
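
As a sketch of how the skeleton above might be filled in, the following uses one possible pair of patterns for the <script> and <head> blocks. The patterns and the sample input are assumptions made for illustration. Regular expressions cope with ordinary pages but can be fooled by unusual markup, so treat this as a starting point rather than a complete HTML parser.

```php
<?php
// Hypothetical sample input standing in for a fetched page.
$string = '<html><head><title>Demo</title><script>var a = 1;</script></head>'
        . '<body><p>Visible text</p>'
        . '<SCRIPT type="text/javascript">alert("x");</SCRIPT></body></html>';

// Using # as the delimiter avoids escaping the slash in the closing tags.
//   \b     stops <head from also matching tags such as <header>
//   [^>]*  skips any attributes inside the opening tag
//   .*?    matches as little as possible, up to the nearest closing tag
//   s      lets the dot match newlines; i ignores upper/lower case
$patterns = array(
    '#<script\b[^>]*>.*?</script\s*>#is',
    '#<head\b[^>]*>.*?</head\s*>#is',
);
$replacement = '';

// preg_replace() accepts an array of patterns and applies them in order.
$rawstring = preg_replace($patterns, $replacement, $string);

echo $rawstring, "\n";
```

What remains can then be passed to strip_tags() to remove the rest of the markup.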

student
01-11-2007, 04:22 AM
Thank you, Fumigator.
I have tried a lot, without much success, to get rid of JavaScript and special characters.

Can anybody please show me the exact PHP code to strip all JavaScript, special characters, symbols, etc. from a file/string/input?
Thank you very much.
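
One way the steps discussed in this thread could be combined is sketched below. html_to_text() is a hypothetical helper name, not a built-in PHP function. The set of characters kept in step 4 and the sample input are assumptions chosen for illustration, and the sketch assumes the page is UTF-8 encoded.

```php
<?php
/**
 * Reduce an HTML document to plain text.
 * Hypothetical helper; assumes $html is UTF-8 encoded.
 */
function html_to_text($html)
{
    // 1. Drop whole blocks whose contents are never visible text.
    $html = preg_replace('#<(script|style|head)\b[^>]*>.*?</\1\s*>#is', ' ', $html);

    // 2. Remove the remaining tags. A space is put before each "<" first,
    //    so that words from neighbouring elements do not run together.
    $text = strip_tags(str_replace('<', ' <', $html));

    // 3. Turn entities such as &amp; or &eacute; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // 4. Optional: keep only letters, digits, whitespace and basic
    //    punctuation; everything else (symbols) becomes a space.
    $text = preg_replace('/[^\p{L}\p{N}\s.,;:!?\'"()-]+/u', ' ', $text);

    // 5. Collapse runs of whitespace and trim the ends.
    return trim(preg_replace('/\s+/u', ' ', $text));
}

// Hypothetical sample input.
$sample = '<html><head><title>T</title></head><body><h1>Caf&eacute; &amp; Bar</h1>'
        . '<script>alert(1);</script><p>Open 9-5 &copy; 2007</p></body></html>';

echo html_to_text($sample), "\n";

// For a live page (needs allow_url_fopen enabled; the URL is a placeholder):
// $html = file_get_contents('http://www.example.com/');
// if ($html !== false) { echo html_to_text($html), "\n"; }
```

Step 4 is the part most likely to need adjusting: leave it out to keep symbols such as & or ©, or widen the character class to allow more punctuation.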