  1. #1
    New Coder
    Join Date
    Mar 2013
    Posts
    12
    Thanks
    1
    Thanked 0 Times in 0 Posts

    How to write Cyrillic characters correctly using the saveHTMLFile function

    I have a file with text like this:

    Windows 8: Beautiful and Fas

    Если поставят в 7-ми дюймовый планшет, так пусть и в какой нить смартфон поставят. Преинтереснейшая штука получится !

    The Linux file command reports the following about this file:
    Code:
    $ file comments
    comments: UTF-8 Unicode text
    The file was created on Linux with the Kate editor.
    I read this file with the file function, then load the HTML file with the loadHTMLFile method:
    PHP Code:
    $dom = new DOMDocument();
    $dom->loadHTMLFile('comments.html');
    $dom->validateOnParse = true;
    $dom->preserveWhiteSpace = false;
    Source of comments.html file:
    Code:
    <!DOCTYPE html>
    <html>
    <head>
    <title></title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    </head>
    <body>
            <div id="allComments">
                testka
                <p class="comment">
                    Comments
                </p>
           </div>
    But when I write the changes back to comments.html, after appending the text from the comments file to the DOM tree, I get this:
    Code:
    <html>
    <head>
    <title></title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    </head>
    <body>
    …
    <p class="comment"><img src="avatars/avatar95.jpg">VadimWindows 8: Beautiful and Fas<br>Current date: 21/05/13 07:54 am</p><p class="comment"><img src="avatars/avatar98.jpg">АндрейЕсли поставят в 7-ми 
    дюймовый 
    …
    symbols within HTML code. But when I load comments.html into a browser, the text on the page looks as it should, and it also looks right when I output the contents of the file with echo in Konsole (KDE).
    Last edited by badcoder2013; 05-21-2013 at 10:56 PM.

  • #2
    New Coder
    I mean these symbols:
    Attached thumbnail: How to write cyrillic symbols in a correct way using saveHTMLFile function-output.jpg

  • #3
    God Emperor Fou-Lu's Avatar
    Join Date
    Sep 2002
    Location
    Saskatoon, Saskatchewan
    Posts
    16,979
    Thanks
    4
    Thanked 2,659 Times in 2,628 Posts
    Quote (originally posted by badcoder2013):
    But when I am writing changes to a comments.html file, after I appended text from comments file to a DOM tree, I am getting this:
    How are you appending the text to the file? Manually, by editing comments.html in a basic editor like vi; through DOMDocument and PHP; or with an XML editor that inserts it into the tree?
    DOMDocument should be using UTF-8 encoding as well. Make sure you construct the object like this: new DOMDocument('1.0', 'utf-8'); I don't write in different charsets, so I don't know whether that will "fix" it or not.
    Those are the numeric entities for the Unicode characters, as described in HTML entities. If it's DOMDocument that is adding them, I'm not sure whether using the utf-8 encoding will correct that or whether it will still convert them. It may be happier if you construct a CDATA block instead of a PCDATA block for the data. Again, I'm not certain.
    I can run a test on that when I get home to see what it does with the proper encoding. I can see it choking completely on PCDATA, but possibly working with CDATA.
    PHP Code:
    header('HTTP/1.1 420 Enhance Your Calm'); 

  • #4
    New Coder
    OK, I'll try.
    Here is the code:
    PHP Code:
    # read from a file
    $comments = file('comments.txt', FILE_SKIP_EMPTY_LINES | FILE_IGNORE_NEW_LINES);
    PHP Code:
    $dom = new DOMDocument();
    $dom->loadHTMLFile('comments.html');
    $dom->validateOnParse = true;
    $dom->preserveWhiteSpace = false;

    # $finder is not defined in the code shown; presumably a DOMXPath created for $dom
    $allComments = $finder->query("//*[@id='allComments']/p[@class = 'comment']");
    $uniqueComments = array_diff($comments, $allComments);
    if (count($uniqueComments) === 0) {
        # display html page
        show_html();
    }
    # find id to append to it more comments, commentators names, avatars etc
    $commentsToWhichToAttach = $dom->getElementById('allComments');

    # skipped

    for ($index = 0; $index < count($uniqueComments); $index++) {
        # skipped
        $p = $dom->createElement('p');
        $p->setAttribute('class', 'comment');
        $commentToDom = $dom->createTextNode($uniqueComments[$index]);
        $p->appendChild($commentToDom);
        $commentsToWhichToAttach->appendChild($p);
        # skipped
    }
    $dom->saveHTMLFile('comments.html');

  • #5
    God Emperor Fou-Lu's Avatar
    I see what you're doing here now.
    Does this occur with all items within the p[@class='comment'] blocks regardless of whether they were there before, or only with the newly added items? That is, if you have an HTML file with the actual characters saved in it (probably in a Unicode non-BOM format), and you open it, add another comment from comments.txt to it using DOMDocument as shown here, and then save the HTML, are both the original blocks and the added ones converted to their entities, or just the newly added ones?

    Also, is there a reason why you are modifying an HTML page instead of just using a db or flatfile and PHP to construct it when required?

  • #6
    New Coder
    Well, I have some good news, but I do not know how to use this together with saveHTMLFile. I wrote the following code:
    PHP Code:
    $str = "the entities for the unicode output as described in HTML entities";
    $a = html_entity_decode($str, 4, "UTF-8"); # at first I tried to call the function this way:
    # html_entity_decode($str, ENT_HTML5 /* or any other flag as described in the docs */, "UTF-8") but I got this warning every time:
    # PHP Warning:  html_entity_decode() expects parameter 2 to be long, string given in
    file_put_contents('asdasd.html', $a);
    and I get "normal" letters.
    If you want to try it, I have attached a txt file.
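A possible explanation for the warning above, offered as a sketch rather than a diagnosis: the ENT_HTML5 flag was added in PHP 5.4, so on an older build the bare name is not a defined constant and ends up being passed as a string, which matches the "string given" message. The snippet below guards the flag with defined() and falls back to ENT_QUOTES. The sample string in $str is made up for illustration and is not the contents of the attached file.

```php
<?php
// Hypothetical sample input: numeric character references for the word "Если".
$str = '&#1045;&#1089;&#1083;&#1080;';

// ENT_HTML5 only exists on PHP 5.4 and later, so check before using it.
$flags = defined('ENT_HTML5') ? (ENT_QUOTES | ENT_HTML5) : ENT_QUOTES;

// Decode the references into UTF-8 encoded characters.
$decoded = html_entity_decode($str, $flags, 'UTF-8');

echo $decoded, "\n";
```

Using the named constants instead of a bare number such as 4 also keeps the call readable, since the numeric values of the flags are easy to mix up.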
    Last edited by badcoder2013; 05-21-2013 at 09:15 PM.

  • #7
    God Emperor Fou-Lu's Avatar
    Yes, a conversion is doable, but that's not my question.
    I'm curious whether the problem originates from the use of file() or from DOMDocument. If I had to guess right off the bat, it's a file() issue. It's listed as being binary safe, which means that in the context of a string it doesn't quite know what to do with the data, so it just renders it as best it can (which is in ISO-8859-1 in PHP). On the other hand, DOMDocument is UTF-8 native, but I don't know whether passing the encoding when constructing the DOMDocument is a requirement. The only way I can think of to test that is to use ->saveHTML() and print the result directly as a comparison. The logic would be that the DOMDocument is fine, as is the file, and when the document is saved to a string and printed it retains all encoding information; when it is saved to an HTML file without a Unicode default set, it then converts to the ANSI equivalent.
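One way to check the file() guess in isolation, as a rough sketch: write a known UTF-8 line to a temporary file, read it back with file(), and compare the bytes. The temporary file and the sample line are stand-ins chosen for illustration, and preg_match() with the /u modifier is used here only as a quick UTF-8 validity check.

```php
<?php
// Stand-in for one line of the comments file (this source is assumed to be saved as UTF-8).
$line = 'Если поставят в 7-ми дюймовый планшет';

// Write it to a temporary file, then read it back the way the thread does.
$path = tempnam(sys_get_temp_dir(), 'cmt');
file_put_contents($path, $line . "\n");
$lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
unlink($path);

// file() hands back the stored bytes; compare them with what was written.
$sameBytes = ($lines[0] === $line);

// An empty pattern with the /u modifier matches only if the subject is valid UTF-8.
$validUtf8 = (preg_match('//u', $lines[0]) === 1);

var_dump($sameBytes, $validUtf8);
```

If both values come back true, the text survived file() byte for byte, which would point the search at the DOM save step instead.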

    Or, as mentioned, a DBMS would make this entire thing trivial. Simply send the proper UTF-8 charset headers, fetch from the db (stored in the utf8 charset), and print. All done. Alternatively, you can read from a file (Unicode is fine, and even a BOM is fine so long as it is discarded when reading; you could even use packed binary, fetch the chars out and convert the multibyte sequences) and simply display the text just as you would with a db. That, IMO, is the easiest and most efficient route, whether you construct the HTML manually or use DOMDocument to construct it.
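The flat-file route described above could look roughly like this. It is a sketch under assumptions: renderComments() is a hypothetical helper, and the comments are passed in as an array where the real script would read them with file().

```php
<?php
// Hypothetical helper: build the comment markup from plain text lines.
function renderComments(array $comments)
{
    $out = "<div id=\"allComments\">\n";
    foreach ($comments as $comment) {
        // Escape markup characters only; UTF-8 text passes through unchanged.
        $out .= '<p class="comment">'
              . htmlspecialchars($comment, ENT_QUOTES, 'UTF-8')
              . "</p>\n";
    }
    return $out . "</div>\n";
}

// In the real script these lines would come from file('comments.txt', ...).
$comments = array('Windows 8: Beautiful and Fas', 'Преинтереснейшая штука получится !');

// Tell the browser the page is UTF-8 (skipped on the command line).
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

$page = renderComments($comments);
echo $page;
```

Because the page is rebuilt from the source text on every request, nothing is ever round-tripped through a saved HTML file, so the entity question does not come up.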

  • #8
    New Coder
    I just tested the file_put_contents('comments3.html', $dom->saveHTML()); construction. After the file was saved, I used cat to see what is in it
    Code:
    $ cat comments3.html
    and the text in it is "normal", which is not the case when using saveHTMLFile. I tested with html_entity_decode and without it; the end result is the same. If I understood correctly what you wrote, the file function is to blame. But which function should I use to avoid this behavior? file_get_contents?
    Here is complete code:
    PHP Code:
    $a = file('/srv/www/htdocs/comments/comments', FILE_SKIP_EMPTY_LINES | FILE_IGNORE_NEW_LINES);
    $dom = new DOMDocument('1.0', 'utf-8');
    $dom->loadHTMLFile('/srv/www/htdocs/comments/comments.html');
    $dom->validateOnParse = true;
    $dom->preserveWhiteSpace = false;
    $commentsToWhichToAttach = $dom->getElementById('allComments');
    foreach ($a as $b) {
        $p = $dom->createElement('p');
        $p->setAttribute('class', 'comment');
        $commentToDom = $dom->createTextNode($b);
        # $a = $dom->createTextNode(html_entity_decode($commentToDom->wholeText, 4, "UTF-8"));
        # $p->appendChild($a);
        $p->appendChild($commentToDom);
        $commentsToWhichToAttach->appendChild($p);
    }
    # $z = $dom->saveHTML(); exit;
    file_put_contents('comments3.html', $dom->saveHTML());
    exit;

  • #9
    New Coder
    Quote (originally posted by Fou-Lu):
    I see what you're doing here now.
    Does this occur with all items within the p[@class='comment'] blocks regardless of if they were there before, or only with the newly added items? That is, if you have an HTML file with the actual characters saved in it (probably in a unicode non-bom format), and you open it up and add another comment from comments.txt to it using the DOMDocument as shown here, then save the HTML, do both the original blocks within the code and the added become converted to their entities, or just the newly added ones?

    Also, is there a reason why you are modifying an HTML page instead of just using a db or flatfile and PHP to construct it when required?
    About the p[@class='comment'] blocks: I have not checked.
    What I need is to add new comments, with avatars and nicknames, to an already existing HTML page. I need to add only the unique comments that are not yet present on the page. This page will be seen by users. The avatars are located in a folder and the nicknames are in a text file.
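The "only unique comments" step can be sketched as follows. Assumptions: the markup in $html is an inline stand-in for comments.html, $fromFile stands in for the lines read from the comments file, and each existing paragraph holds nothing but the comment text (the real page also carries avatars and nicknames, which would need separate handling).

```php
<?php
// Inline stand-in for comments.html.
$html = '<!DOCTYPE html><html><head><title></title>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
      . '</head><body><div id="allComments">'
      . '<p class="comment">Windows 8: Beautiful and Fas</p>'
      . '</div></body></html>';

// Stand-in for the lines read from the comments file.
$fromFile = array('Windows 8: Beautiful and Fas', 'Преинтереснейшая штука получится !');

$dom = new DOMDocument();
$dom->loadHTML($html);

// Collect the text of the comments already on the page as plain strings.
$finder = new DOMXPath($dom);
$existing = array();
foreach ($finder->query("//*[@id='allComments']/p[@class='comment']") as $node) {
    $existing[] = trim($node->nodeValue);
}

// array_diff() compares strings, so it needs arrays of text, not a node list.
$unique = array_values(array_diff($fromFile, $existing));

print_r($unique);
```

array_values() re-indexes the result from zero, which matters if the new comments are later walked with a counting for loop.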

  • #10
    God Emperor Fou-Lu's Avatar
    I'm going to have to think about this one. Just to verify:
    PHP Code:
    file_put_contents('commentsFromPut.html', $dom->saveHTML());
    generates the correct characters, whilst:
    PHP Code:
    $dom->saveHTMLFile('commentsFromSave.html');
    generates the entity replacements, correct?

  • #11
    New Coder
    Yes, this code generates the correct characters (at least I can read the text):
    PHP Code:
    file_put_contents('commentsFromPut.html', $dom->saveHTML());
    For the file created by the code above, the file command reports:
    Code:
    HTML document, UTF-8 Unicode text, with very long lines
    In the second case,
    PHP Code:
    $dom->saveHTMLFile('commentsFromSave.html');
    I get HTML entities that represent the Unicode symbols.
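The comparison discussed here can be reproduced in a single script. This is a sketch, not a statement of what either method will print: the markup is an inline stand-in for comments.html, hasNumericEntities() is a hypothetical helper, and the result may differ between PHP and libxml versions.

```php
<?php
// Hypothetical helper: does the markup contain numeric references such as &#1045; ?
function hasNumericEntities($markup)
{
    return preg_match('/&#\d+;/', $markup) === 1;
}

// Inline stand-in for comments.html with one Cyrillic comment.
$html = '<!DOCTYPE html><html><head><title></title>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
      . '</head><body><div id="allComments">'
      . '<p class="comment">Если</p>'
      . '</div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Path 1: serialize to a string.
$fromString = $dom->saveHTML();

// Path 2: let DOMDocument write the file itself, then read it back.
$path = tempnam(sys_get_temp_dir(), 'dom');
$dom->saveHTMLFile($path);
$fromFile = file_get_contents($path);
unlink($path);

printf("saveHTML():     %s\n", hasNumericEntities($fromString) ? 'numeric entities' : 'raw characters');
printf("saveHTMLFile(): %s\n", hasNumericEntities($fromFile) ? 'numeric entities' : 'raw characters');
```

Running both paths against the same in-memory document removes file() and the comments file from the picture, so any difference in the two lines of output comes from the save methods themselves.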
    Last edited by badcoder2013; 05-21-2013 at 10:35 PM.

  • #12
    New Coder
    Does this occur with all items within the p[@class='comment'] blocks regardless of if they were there before, or only with the newly added items? That is, if you have an HTML file with the actual characters saved in it (probably in a unicode non-bom format), and you open it up and add another comment from comments.txt to it using the DOMDocument as shown here, then save the HTML, do both the original blocks within the code and the added become converted to their entities, or just the newly added ones?
    I just tried. All Cyrillic text, both what was already in the file and what was added later, is converted to entities.

  • #13
    God Emperor Fou-Lu's Avatar
    Quote (originally posted by badcoder2013):
    I just tried. All cyrillic text, that was already in the file and added later, becomes converted to their entities.
    This, together with the different results from saveHTML() and saveHTMLFile(), indicates a problem within DOMDocument itself. I'll have to take a closer look at the saveHTML and saveHTMLFile methods to see what they are doing differently.

  • #14
    New Coder
    Maybe it would be worthwhile to describe this behavior in the documentation on php.net or elsewhere, if you work out what is wrong?
    I do not know whether this matters, but:
    Code:
    $ php --version
    PHP 5.3.17 (cli) 
    Copyright (c) 1997-2012 The PHP Group
    Zend Engine v2.3.0, Copyright (c) 1998-2012 Zend Technologies
    OpenSUSE 12.3

