Stripping Unwanted Characters

06-04-2007, 07:48 PM

I know there must be a simple way of doing this but I just can't find it. I am developing a system that allows users to upload Word docs and download them. The upload script uses file_get_contents() to read the file into a string and store it in a database for full-text searching. The users never actually see the text - it's just used to ascertain the relevance of the Word doc before they download it.

The problem is that before and after the main text I'm getting random characters. Any ideas how I can strip these away? Example:

╨╧рб▒с����������������>��■  ���������������W����������Y������■   ����V���                                                                                                                                                                                                                                                                                                                                                                                                                                                    ье┴� @ ��°┐�������������6!���bjbj0ж0ж������������������

06-04-2007, 08:35 PM
Word documents aren't plain text files. file_get_contents() just hands you the raw bytes of the file, binary formatting data included, so it's no use on its own unless you are able to reliably parse out the control codes (all those crazy characters you want to strip).

There are utilities out there that can create Word documents from PHP... you might try googling to see if you can find something that does the reverse.
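
As a rough stopgap for search indexing only, the idea of dropping the control codes can be sketched like this. It's a crude heuristic, not a Word parser: it keeps runs of printable ASCII and throws away everything else, so it will miss any text the file stores as two-byte characters and can let stray fragments through. The function name extract_printable_runs() and the minimum run length of 4 are made up for the example.

```php
<?php
// Hypothetical helper: keep only runs of printable ASCII that are at
// least $minLength bytes long, joined by single spaces.
// This is a heuristic, NOT a parser for the Word file format.
function extract_printable_runs(string $raw, int $minLength = 4): string
{
    // \x20-\x7E is printable ASCII; tab, CR and LF may also be part of a run.
    $pattern = '/[\x20-\x7E\t\r\n]{' . $minLength . ',}/';
    preg_match_all($pattern, $raw, $matches);

    // Join the runs, then collapse any whitespace into single spaces.
    $text = implode(' ', $matches[0]);
    $text = preg_replace('/\s+/', ' ', $text);
    return trim($text);
}

// Example with a made-up binary-looking string.
$raw = "\xD0\xCF\x11\xE0\x00\x00Quarterly sales report\x00\x01\x02bj\xFF";
echo extract_printable_runs($raw), "\n"; // prints: Quarterly sales report
```

Short runs such as the "bj" above are dropped on purpose, since binary data is full of two- and three-letter accidents. Raising the minimum length gives cleaner text at the cost of losing short words that sit between binary bytes.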

06-05-2007, 01:52 PM
Thanks. I'll have a look.

I was thinking that there might be a generic function for stripping out characters, like trim(), but it doesn't seem to work.

06-05-2007, 01:55 PM
trim() only removes whitespace (and NUL bytes) from the start and end of a string.
Are you trying to add an image or something without the appropriate tags? That's what was causing it to print those strange chars whilst I was adding a captcha.
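
For what it's worth, the difference between trim() and a generic strip can be sketched as follows. With its default settings trim() only touches the two ends of a string, so binary junk survives it. A preg_replace() with a negated character class removes unwanted bytes anywhere in the string. The helper name strip_non_printable() is made up for the example, and it only keeps plain ASCII, so accented characters would be dropped too.

```php
<?php
// Hypothetical helper: drop every byte that is not printable ASCII,
// a tab, or a line break. Non-ASCII text (accents etc.) is lost too.
function strip_non_printable(string $s): string
{
    return preg_replace('/[^\x20-\x7E\t\r\n]/', '', $s);
}

$junk = "\x01\xFF  Hello, world \x02\xFE";

// trim() stops at the first byte that is not in its strip list,
// so the control bytes at both ends are left in place.
var_dump(trim($junk) === $junk); // bool(true)

// Strip the junk first, then trim the leftover spaces.
var_dump(trim(strip_non_printable($junk))); // string(12) "Hello, world"
```

Note that this only cleans up a string. It does not turn a binary file format into readable text.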

06-06-2007, 11:39 AM

I'm using file_get_contents() to read a Word doc into a string so that I can store it in the db and do a full-text search on it.