View Full Version : cleaning text


Kiwi
12-09-2002, 07:23 PM
Another random PHP request. I'm looking for references on checking that files are valid before loading them onto my server. The files are generally the contents of HTML <body> tags (excluding the actual <body> component) -- with some other limitations.

Basically I want to be able to strip out any 'dangerous' code (including all PHP). The files will have some HTML tags left in them, but not all of them (the head element and all script tags are not allowed).

Can anyone point me to tools for cleaning text in PHP? My guess is it will involve some tricky regex, but there might be a better idea that I don't know about.

Thanks.

Spookster
12-10-2002, 02:03 AM
http://www.php.net/manual/en/function.strip-tags.php
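
A minimal sketch of how strip_tags from the link above could be applied, assuming a whitelist of <p> and <b> (the sample string and the whitelist are made up for illustration). The second argument lists the tags to keep. Note that only the tags themselves are removed: the text between <script> and </script> is left behind, and attributes on the allowed tags are not touched.

```php
<?php
// Hypothetical input: allowed markup mixed with a script element.
$dirty = '<p>Hello <b>world</b></p><script>alert(1)</script>';

// Keep only <p> and <b>; every other tag is removed.
$clean = strip_tags($dirty, '<p><b>');

echo $clean, "\n";
// prints: <p>Hello <b>world</b></p>alert(1)
```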

Kiwi
12-10-2002, 09:43 AM
Thanks. That's useful for the second half of the problem (cleaning the remaining HTML). But there's also the first half: I also want to be able to strip all of the content between some of the tags (including between <? ?> tags).

If I use a simple regex, like /<\?.*\?>/, it takes out everything between the first and last instance of the tags. Instead, I want it to take out the string between each instance of <? and the next instance of ?>.

Is this possible without char by char parsing?

<edit> An example might make that clearer. If I submit the following string to the regex I described:

<tr><td>Template</td>
<td><select name="template" value='0001'><?php template_select_list(0); ?></select></td></tr>
<tr><td>Type</td>
<td><select name="type" value='0001'><?php type_select_list(0); ?></select></td></tr>

Then it returns:

<tr><td>Template</td>
<td><select name="template" value='0001'></select></td></tr>

I want something that will return:

<tr><td>Template</td>
<td><select name="template" value='0001'></select></td></tr>
<tr><td>Type</td>
<td><select name="type" value='0001'></select></td></tr>
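
One way to get that result without char-by-char parsing is a lazy (non-greedy) quantifier. The sketch below is an illustration, not tested against real uploads: it assumes the input is held in a variable called $html (a made-up name), uses .*? so each match ends at the nearest closing delimiter, and adds the s modifier so that a PHP block spanning several lines is still matched.

```php
<?php
// The example string from the post, stored in a nowdoc so nothing is interpolated.
$html = <<<'HTML'
<tr><td>Template</td>
<td><select name="template" value='0001'><?php template_select_list(0); ?></select></td></tr>
<tr><td>Type</td>
<td><select name="type" value='0001'><?php type_select_list(0); ?></select></td></tr>
HTML;

// .*? is lazy: each match stops at the first closing delimiter it reaches.
// The s modifier lets . match newlines, so multi-line PHP blocks are removed too.
$clean = preg_replace('/<\?.*?\?>/s', '', $html);

echo $clean, "\n";
```

Both <select> elements come back empty and both table rows survive.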

Kiwi
12-10-2002, 05:03 PM
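I wrote a couple of functions that do the job. I think they're useful.

function kill_tags($text, $hunt) {
    // Split at each opening "<tag"; chunk 0 is the text before the first match.
    $a_text = preg_split("/<($hunt)/", $text);
    $n = count($a_text);
    for ($i = 1; $i < $n; $i++) {
        // Drop everything up to and including the closing "tag>", keep the rest.
        $a_temp = preg_split("/($hunt)>/", $a_text[$i], 2);
        $a_text[$i] = isset($a_temp[1]) ? $a_temp[1] : "";
    }
    return implode("", $a_text); // return a string, not the array of chunks
}
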
kill_tags will delete the start and the finish of the tag, and anything in between. You need to pass the string to be cleaned and a regex for the tags you want killed, excluding the < and > (eg "script|\?|style" will get rid of all script, style and PHP sections).

function my_strip_tags($text, $hunt) {
    // Split at the start of each opening or closing tag; chunk 0 precedes the first match.
    $a_text = preg_split("/<\/?($hunt)/", $text);
    $n = count($a_text);
    for ($i = 1; $i < $n; $i++) {
        $a_temp = preg_split("/>/", $a_text[$i]); // find the end of the tag
        $a_temp[0] = ""; // discard the rest of the tag (its attributes)
        $a_text[$i] = preg_replace("/^>/", "", implode(">", $a_temp));
    }
    return implode("", $a_text); // return a string, not the array of chunks
}

my_strip_tags is a complement to the existing PHP strip_tags function: it strips only the tags you specify, including all of their attributes. It works with both paired tags (eg <head> ... </head>) and non-paired tags (eg <link ... >). The parameters are the same as for kill_tags.

I'm sure these can be improved -- if anyone has any more elegant solutions, they would be much appreciated.

Oh, the point of doing this is to allow you to upload clean files to use in a CMS. Doing this means that you can trust the files that you upload.
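
One possibly more compact approach is to let preg_replace do the work in a single pass instead of splitting and rejoining. This is only a sketch under assumptions: remove_sections and remove_tags are made-up names, $names is expected to be a regex alternation of plain element names such as "script|style", and regex-based cleaning like this does not catch everything (event-handler attributes on allowed tags, for instance).

```php
<?php
// Hypothetical helper: delete whole sections (the tags and everything between them).
function remove_sections($text, $names) {
    // Opening tag, shortest possible body, then the matching closing tag (\1).
    $text = preg_replace('/<(' . $names . ')\b[^>]*>.*?<\/\1\s*>/is', '', $text);
    // PHP blocks, including the short open tag form.
    return preg_replace('/<\?.*?\?>/s', '', $text);
}

// Hypothetical helper: delete only the named tags (with their attributes),
// keeping whatever sits between an opening and closing pair.
function remove_tags($text, $names) {
    return preg_replace('/<\/?(?:' . $names . ')\b[^>]*>/i', '', $text);
}

$in = '<head><title>t</title></head><p>Hi</p>'
    . '<script type="text/javascript">alert(1)</script>'
    . '<?php evil(); ?><link rel="x">';

$step1 = remove_sections($in, 'script|style');
$step2 = remove_tags($step1, 'head|link');

echo $step2, "\n";
// prints: <title>t</title><p>Hi</p>
```

The \b after the name keeps "head" from also matching <header>, and the i modifier covers upper-case tags such as <SCRIPT>.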

At least, that's the theory.

<edit> There are a few bugs in that code. If anyone is interested, drop me a line and I'll give you the fixed versions.</edit>