  1. #1
    New Coder
    Join Date
    Nov 2012
    Location
    France
    Posts
    78
    Thanks
    20
    Thanked 0 Times in 0 Posts

    (How to) Copy/Manipulate all parsed output/displayed text for use as real text?

    This appears to be a major challenge for javascript programmers.

    Apparently, there are no universal references to nodes further down the DOM list than the root node. You have to traverse the DOM, using a tree-navigating algorithm, until you find what you're looking for.

    YET, we can simply drag our mouse across a web page of text, highlight that text, copy it, and post it to a text file........ so easy we don't think about it.

    Therefore:
    Is it possible to NOT traverse the DOM, but instead....

    .... grab everything from the 1st to the last part of the display....
    .... and paste it as text ....

    In effect, a function to mimic the 'highlight, copy, & paste, commands': Where everything on a page is copied, but only the text gets pasted to text?

    Notes:
    • For this application all the display pages would be the same, other than the changing text (arriving from google's translation server).
    • I understand that all browsers may be slightly different.
    • Clearly a universal solution would be ideal, but one for Chrome would be a good start.


    I am confident that there will be a simple way to do this, however I have yet to discover a programmer who knows that method (and for me, I lack the in depth experience needed, to generate a solution to this apparently fundamental problem).

    I found you via:
    http://www.javascriptkit.com/domref/

    There is a direct link to Coding Forums

  • #2
    The fat guy next door VIPStephan's Avatar
    Join Date
    Jan 2006
    Location
    Halle (Saale), Germany
    Posts
    8,620
    Thanks
    6
    Thanked 1,001 Times in 974 Posts
    Quote Originally Posted by Ace..... View Post
    This appears to be a major challenge for javascript programmers.

    Apparently, there are no universal references to nodes further down the DOM list than the root node. You have to traverse the DOM, using a tree-navigating algorithm, until you find what you're looking for.

    YET, we can simply drag our mouse across a web page of text, highlight that text, copy it, and post it to a text file........ so easy we don't think about it.
    I admit that I wouldn’t know a solution either and sorry for kind of hijacking the thread but do you have any idea what’s happening in the programs’ background while you’re just highlighting, copying, and pasting the text without thinking about it? How do you come to the conclusion that it would be any simpler than a JavaScript DOM traversal?

    Quote Originally Posted by Ace..... View Post
    I am confident that there will be a simple way to do this, however I have yet to discover a programmer who knows that method (and for me, I lack the in depth experience needed, to generate a solution to this apparently fundamental problem).
    Well, depending on how many programmers you’ve got to know so far it seems to me that apparently there isn’t such a simple solution for this after all if it’s so hard to find one who knows any that meets your expectations. So, what makes you so confident that there is a JS solution other than traversing the whole document tree?

    Again, sorry to get into a fundamental debate but I want to find the reasoning for your clear requirements of a non-traversal solution.

  • #3
    Senior Coder Logic Ali's Avatar
    Join Date
    Sep 2010
    Location
    London
    Posts
    1,028
    Thanks
    0
    Thanked 207 Times in 202 Posts
    I just threw this together to try to retrieve all visible text. It could be substantially refined, but seems to work if run as the last item in the document.
    Presumably you have the server-side code to retrieve data from another domain.

    Code:
    <script type='text/javascript'>
    
    var e = document.getElementsByTagName('*'),
        t = '',
        tagElem,
        nodes,
        cn;
        
    for( var i = 0; i < e.length; i++ )
    {
      tagElem = e[ i ];
      nodes = tagElem.childNodes;
      
      if( !/SCRIPT/i.test( tagElem.nodeName ) )
        for( var j = 0; j < nodes.length; j++ )
          if( ( cn = nodes[ j ] ).nodeType == 3 )
            t += ' ' + cn.textContent;   
    }        
        
    alert(t)
    
    </script>
    Alternatively, there may be a solution using document.execCommand.
    Last edited by Logic Ali; 11-03-2012 at 07:19 PM.

  • #4
    New Coder
    Many thanks for the response:

    Quote Originally Posted by VIPStephan View Post
    do you have any idea what’s happening in the programs’ background while you’re just highlighting, copying, and pasting the text without thinking about it?
    No..... I truly do not know what is happening, however....
    .... The operation of highlighting is ancient (in computing terms), hence, in response to your next question:

    Quote Originally Posted by VIPStephan View Post
    How do you come to the conclusion that it would be any simpler than a JavaScript DOM traversal?
    I presume (rightly or wrongly - please advise) that there exists, and has existed from an early stage of browser development, a simple command that allows 'displayed text' ie. unwritten text - meaning (say) innerHTML; to be referenced/grabbed, and used as genuine text.

    Example:

    pg 1 contains text in a text area.
    I reference that text thru its path:

    Code:
    var pg_1 = document.getElementById("textarea_id").value;
    The text exists as characters viewable by viewing source.
    I now store that text:

    Code:
    localStorage.pg_1_text=pg_1;
    I then display this text on another page entirely pg_2:

    Code:
    document.getElementById("result").innerHTML="<pre>" + localStorage.pg_1_text + "</pre>";
    View source shows only the script, YET you can highlight the text, copy and paste.

    This tells me that something very simple is happening (but I could be wrong).
    Simple, because this highlighting, seems to be fundamental to human-pc interaction, and therefore was included as part of the basic structure of browsers, from the beginning (and since the beginning of the mouse, at least).

    From a debate perspective:
    Would the act of highlighting 'displayed text' be prone to all the divergent possibilities of programming, when ultimately, the actual display of text seems to be 'fundamental'..... all else can change, but display of text (and the highlighting of it) remains constant.

    Logically, therefore, I assume that this places 'displayed text' as a primary function.

    Therefore, I'm asking, whether it is possible to access this (supposed) primary function, rather than traversing the DOM - it is just a thought process that can be discounted.

    But....

    Discounted because, for example, this operation has been blocked due to security constraints OR Javascript cannot access this primary function.

    I'm just explaining my thinking here - don't think I'm being a ttwwaatt.

    Quote Originally Posted by VIPStephan View Post
    Well, depending on how many programmers you’ve got to know so far it seems to me that apparently there isn’t such a simple solution for this after all if it’s so hard to find one who knows any that meets your expectations. So, what makes you so confident that there is a JS solution other than traversing the whole document tree?
    This is a fair point you make.

    My honest answers are:
    • In my experience of life, often (but not every time) - by re-stating the problem, referencing similar scenarios; solutions manifest themselves - hence why I talk about ancient highlighting of text - I never see it mentioned in these topics.
    • Google staff have found a standard way to reference/grab displayed text - send it to their servers, translate it, and send it back.
    • AND they can grab the entire page, translate it and send it back (with formatting/styles/functions/everything).

    This makes me think that 'Joe Public' top programmers can do the same.

    That is my reasoning for the possibility of a non-traversal solution.


    Thanks again for the response.
    The grilling is fair.

    Like I said...... I'm not a ttwwaatt.
    I have some ideas/leads to follow.

    This was the start of the thread.
    Maybe we can do something.

    @Logic Ali Thanks also for the response.
    It's late(ish) - we've gotta eat - maybe somebody else interested in this thread can take your script to pieces and give a view on it?

    I'll also post my script leads later.

    Thanks again to everybody interested in solving this problem (that so few appear to have solved)


  • #5
    Senior Coder rnd me's Avatar
    Join Date
    Jun 2007
    Location
    Urbana
    Posts
    4,292
    Thanks
    10
    Thanked 583 Times in 564 Posts
    have you played with String(getSelection()) ?

    also, for any given dom element, element.textContent will produce the same text as a clipboard copy of that element when highlighted.
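A minimal sketch of the getSelection() idea, assuming a browser-like window object is passed in. The function name `visibleTextViaSelection` is hypothetical; `getSelection`, `removeAllRanges`, `selectAllChildren`, and `toString` are standard Selection methods. It discards whatever the user had selected, so treat it as an illustration rather than a finished utility.

```javascript
// Select everything under <body>, read it back as a string, then clear
// the selection. `win` is expected to be a browser window object.
function visibleTextViaSelection(win) {
  var selection = win.getSelection();
  selection.removeAllRanges();                    // drop any existing selection
  selection.selectAllChildren(win.document.body); // select all of body's children
  var text = selection.toString();                // same as String(getSelection())
  selection.removeAllRanges();                    // leave the page unselected
  return text;
}
```

In a page this would be called as `visibleTextViaSelection(window)`. Note that element.textContent is not an exact match for a clipboard copy: it includes the contents of hidden elements and of script and style descendants, and it inserts no line breaks between block elements.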
    my site (updated 13/9/26)
    BROWSER STATS [% share] (2014/5/28) IE7:0.1, IE8:5.3, IE11:8.4, IE9:3.2, IE10:3.2, FF:18.2, CH:46, SF:7.9, NON-MOUSE:32%

  • #6
    Senior Coder rnd me's Avatar
    Quote Originally Posted by Logic Ali View Post
    I just threw this together to try to retrieve all visible text. It could be substantially refined, but seems to work if run as the last item in the document.
    Presumably you have the server-side code to retrieve data from another domain.

    Code:
    <script type='text/javascript'>
    
    var e = document.getElementsByTagName('*'),
        t = '',
        tagElem,
        nodes,
        cn;
        
    for( var i = 0; i < e.length; i++ )
    {
      tagElem = e[ i ];
      nodes = tagElem.childNodes;
      
      if( !/SCRIPT/i.test( tagElem.nodeName ) )
        for( var j = 0; j < nodes.length; j++ )
          if( ( cn = nodes[ j ] ).nodeType == 3 )
            t += ' ' + cn.textContent;   
    }        
        
    alert(t)
    
    </script>
    that will dredge up <script>, <iframe>, <noscript>, and <style> tag text, not cool.

    if you want visible text, at least start in document.body instead of the HTML element...

    first loop through and run element.parentNode.removeChild(element) on every script and style tag before you grab the text.

  • #7
    Senior Coder Logic Ali's Avatar
    Quote Originally Posted by rnd me View Post
    that will dredge up <script>, <iframe>, <noscript>, and <style> tag text, not cool.

    Code:
     if( !/SCRIPT/i.test( tagElem.nodeName ) )
    I used this to suppress script tags, with the option of adding any others as required.
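As a sketch of "adding any others as required": the unanchored pattern /SCRIPT/i also happens to match NOSCRIPT, so text directly inside a noscript element is already skipped by that test, while style and iframe content is not. An anchored pattern makes the list explicit. The name `isSkippedTag` and the choice of tags are assumptions for illustration.

```javascript
// Tags whose text should not count as displayed text (illustrative list).
var SKIPPED_TAGS = /^(SCRIPT|STYLE|NOSCRIPT|IFRAME)$/i;

// True when text directly inside an element with this nodeName should be ignored.
function isSkippedTag(nodeName) {
  return SKIPPED_TAGS.test(nodeName);
}
```

In Logic Ali's loop the condition would then read `if( !isSkippedTag( tagElem.nodeName ) )`.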

  • #8
    Master Coder felgall's Avatar
    Join Date
    Sep 2005
    Location
    Sydney, Australia
    Posts
    6,603
    Thanks
    0
    Thanked 645 Times in 635 Posts
    The code I used the last time I wrote a script that needed to access all the text in the web page was:

    Code:
    var node, txtnodes;
    nodewalk = function(node, str) {
    if (typeof str != 'array') str = [];
    for (var i = 0; i < node.length; i++) {
      if (node[i].hasChildNodes() && 'SCRIPT' !== node[i].nodeName)
        str = nodewalk(node[i].childNodes,str);
      if (3 === node[i].nodeType) 
        str.push(node[i]);
    return str;
    }
    txtnodes = nodewalk(document.getElementsByTagName('body')[0]);
    If you don't want to include the alternate text for anyone whose browser doesn't support iframes then you'd add && 'IFRAME' !== node[i].nodeName after the test for scripts. Styles don't go in the body so that wouldn't be a problem and <noscript> has been dead since the DOM was implemented to replace it so that shouldn't be a problem either (but if you do still have antiquated code that uses it you can skip its content the same way as for script and iframe).

    I haven't come across a browser where the DOM doesn't return the nodeNames in uppercase for a page served as HTML and so have never bothered using a regular expression to make it insensitive to case. If the page were XHTML then the nodeNames would be lowercase but then you'd need to replace other parts of the code as well.
    Stephen
    Learn Modern JavaScript - http://javascriptexample.net/
    Helping others to solve their computer problem at http://www.felgall.com/

    Don't forget to start your JavaScript code with "use strict"; which makes it easier to find errors in your code.

  • #9
    New Coder
    Quote Originally Posted by Logic Ali View Post
    I just threw this together to try to retrieve all visible text. It could be substantially refined, but seems to work if run as the last item in the document.
    Presumably you have the server-side code to retrieve data from another domain.

    Code:
    <script type='text/javascript'>
    
    var e = document.getElementsByTagName('*'),
        t = '',
        tagElem,
        nodes,
        cn;
        
    for( var i = 0; i < e.length; i++ )
    {
      tagElem = e[ i ];
      nodes = tagElem.childNodes;
      
      if( !/SCRIPT/i.test( tagElem.nodeName ) )
        for( var j = 0; j < nodes.length; j++ )
          if( ( cn = nodes[ j ] ).nodeType == 3 )
            t += ' ' + cn.textContent;   
    }        
      alert(t)
    </script>
    Alternatively, there may be a solution using document.execCommand.
    I've just run a test using Logic Ali's code, involving 3 pages, using onclick to launch the functions:
    It definitely works, though we lose the line breaks.
    I ran it with output to <pre> (local_store_3.html) and without <pre>.

    Interestingly with <pre> the text is written low down the page - don't know why.

    With <pre> removed, the text is written at the top beneath the typed text.

    While this has to be seen as a success, from a reader's perspective line breaks are critical.

    When you highlight, copy, and paste a web page as text, the pasted text does contain the line breaks (or perhaps it recognises paragraph tags).

    I'll now try felgall's code.

    local_store_1.html
    Code:
    <!DOCTYPE html>
    <html>
    <head>
    <script>
    function store_pg_1()
      {var pg_1 = document.getElementById("styled").value;
      localStorage.pg_1_text=pg_1;}
    </script>
    </head>
    <body OnLoad="document.myform.styled.focus();">
    <div id="result">
    <form name="myform">
    <textarea  name="styled" id="styled" onclick="store_pg_1()"> </textarea>
    <br><br>
    <input type="text" name="txt3" id = "Nstyled"  value="input text3" onclick="store_pg_1()"><br>
    </form>
    </div>
    </body>
    </html>
    local_store_2.html
    Code:
    <!DOCTYPE html>
    <html>
    <head>
    <script>
    function store_pg_2()
      {
    var e = document.getElementsByTagName('*'),
        t = '',
        tagElem,
        nodes,
        cn;
    for( var i = 0; i < e.length; i++ )
    {
      tagElem = e[ i ];
      nodes = tagElem.childNodes;
      if( !/SCRIPT/i.test( tagElem.nodeName ) )
        for( var j = 0; j < nodes.length; j++ )
          if( ( cn = nodes[ j ] ).nodeType == 3 )
            t += ' ' + cn.textContent;
    }
      localStorage.pg_2_text=t;
      }
    </script>
    </head>
    <body onclick="store_pg_2()">
    <div id="stuff">
    <p>Actual typed text<br>line break, actual typed text</p>
    </div>
    <div id="result">
    <script>
    if(typeof(Storage)!=="undefined")
      {document.getElementById("result").innerHTML="<pre>" + localStorage.pg_1_text + "</pre>";}
    else
      {document.getElementById("result").innerHTML="Sorry, your browser does not support web storage...";}
    </script>
    </div>
    </body>
    </html>
    local_store_3.html
    Code:
    <!DOCTYPE html>
    <html>
    <body>
    <div id="stuff">
    <p>More typed text here<br>break more text here also</p>
    </div>
    <div id="result">
    <script>
    if(typeof(Storage)!=="undefined")
      {document.getElementById("result").innerHTML="<pre>" + localStorage.pg_2_text + "</pre>";}
    else
      {document.getElementById("result").innerHTML="Sorry, your browser does not support web storage...";}
    </script>
    </div>
    </body>
    </html>

  • #10
    Senior Coder rnd me's Avatar
    you can replace "</p>" with "</p>\n" and "<br>" with "\n" to get the line breaks back.
    maybe do the same for "</div>" or whatever blocks your content uses.

    a little text transformation can go a long way.

    if you are using text, you want <pre>, just " str".trim() it to remove leading whitespace.

    i prefer "white-space: pre-wrap" or "pre-line", since i hate scrollbars...
    Last edited by rnd me; 11-05-2012 at 03:44 PM.
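The text transformation described above can be sketched as a plain string function. The name `markupToText` is hypothetical, and the regex-based tag stripping is deliberately crude: it does not decode entities such as &amp;amp; and it assumes simple, well-formed markup.

```javascript
// Turn a fragment of HTML markup into plain text, keeping line breaks.
function markupToText(html) {
  return html
    .replace(/<br\s*\/?>/gi, '\n')         // <br> or <br/> becomes a newline
    .replace(/<\/(p|div)>/gi, '</$1>\n')   // newline after each closing block tag
    .replace(/<[^>]*>/g, '')               // drop the remaining tags
    .trim();                               // remove leading/trailing whitespace
}
```

The input could be, for example, `document.getElementById("stuff").innerHTML`; the result can then be written into a <pre> or into an element styled with white-space: pre-wrap.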

  • #11
    Senior Coder rnd me's Avatar
    Quote Originally Posted by felgall View Post
    Styles don't go in the body so that wouldn't be a problem and <noscript> has been dead since the DOM was implemented to replace it
    two things:
    1. styles DO go in the body, that's what the scoped attrib is for. i will buy and ship you a great little book on html5, just PM me, but you can and should review the spec that's getting used by all major browsers. just sayin...

    2. how can the dom replace noscript in a browser without javascript?

  • #12
    New Coder
    Quote Originally Posted by felgall View Post
    The code I used the last time I wrote a script that needed to access all the text in the web page was:

    Code:
    var node, txtnodes;
    nodewalk = function(node, str) {
    if (typeof str != 'array') str = [];
    for (var i = 0; i < node.length; i++) {
      if (node[i].hasChildNodes() && 'SCRIPT' !== node[i].nodeName)
        str = nodewalk(node[i].childNodes,str);
      if (3 === node[i].nodeType) 
        str.push(node[i]);
    return str;
    }
    txtnodes = nodewalk(document.getElementsByTagName('body')[0]);
    If you don't want to include the alternate text for anyone whose browser doesn't support iframes then you'd add && 'IFRAME' !== node[i].nodeName after the test for scripts. Styles don't go in the body so that wouldn't be a problem and <noscript> has been dead since the DOM was implemented to replace it so that shouldn't be a problem either (but if you do still have antiquated code that uses it you can skip its content the same way as for script and iframe).

    I haven't come across a browser where the DOM doesn't return the nodeNames in uppercase for a page served as HTML and so have never bothered using a regular expression to make it insensitive to case. If the page were XHTML then the nodeNames would be lowercase but then you'd need to replace other parts of the code as well.
    For some reason, I failed to get this code to work.
    It is missing a '}'

    Here is the <head> of page 2:

    Code:
    <head>
    <script>
    function store_pg_2()
      {
    var node, txtnodes; nodewalk = function(node, str) 
        { if (typeof str != 'array') str = []; for (var i = 0; i < node.length; i++) 
         { if (node[i].hasChildNodes() && 'SCRIPT' !== node[i].nodeName) str = nodewalk(node[i].childNodes,str); if (3 === node[i].nodeType) str.push(node[i]); 
    return str; 
        } 
    txtnodes = nodewalk(document.getElementsByTagName('body')[0]);
      localStorage.pg_2_text=txtnodes;
      }
    </script>
    </head>
    I figure the '}' should follow 'str.push(node[i]);'.

    But when placed there, no text is written into page 3.

    It doesn't say 'undefined' like when the bracket is placed elsewhere.

    Anybody any thoughts?
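A possible explanation, offered as a sketch rather than as felgall's intended code. Besides the missing '}', three things seem to stop the snippet from working. `typeof` never returns 'array' (arrays report 'object'), so str is reset to an empty array on every call. nodewalk is handed the body element itself, which has no length, so the loop never runs. And assigning an array to localStorage stores its string form, which for an empty array is the empty string; that would match the blank page 3. The version below takes a single node, collects text into one shared array, and leaves the joining to the caller. The names `collectText` and `SKIP` are assumptions.

```javascript
// Element names whose contents are not displayed text (illustrative list).
var SKIP = { SCRIPT: true, STYLE: true, NOSCRIPT: true, IFRAME: true };

// Recursively collect the values of all text nodes below `node`.
// Works on DOM nodes and on plain objects with the same properties.
function collectText(node, out) {
  out = out || [];                       // create the array once, on the outermost call
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    var child = children[i];
    if (child.nodeType === 3) {          // 3 === text node
      out.push(child.nodeValue);
    } else if (child.nodeType === 1 && !SKIP[child.nodeName.toUpperCase()]) {
      collectText(child, out);           // descend into ordinary elements
    }
  }
  return out;
}
```

In store_pg_2() this could be used as `localStorage.pg_2_text = collectText(document.body).join(' ');` so that a string, not an array of nodes, is stored.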

  • #13
    New Coder
    Quote Originally Posted by rnd me View Post
    you can replace "</p>" with "</p>\n" and "<br>" with "\n" to get the line breaks back.
    maybe do the same for "</div>" or whatever blocks your content uses.

    a little text transformation can go a long way.

    if you are using text, you want <pre>, just " str".trim() it to remove leading whitespace.

    i prefer "white-space: pre-wrap" or "pre-line", since i hate scrollbars...
    At the moment, the 3 page test, is to create some displayed text in page 2. using innerHTML, and confirm it can be made to appear in page 3 correctly.

    The actual principle being:
    To have the 3 pages (as iframes) in a container page.

    I type text into the page 1 textarea (purely using enter to create line breaks).

    This text appears in page 2. where google-translate will read it, translate it, and send it back as 'displayed text'

    I then want to transfer that displayed text to page 3. maintaining the line breaks.

    Question
    Do I somehow swap the line breaks for <br> before it gets written as innerHTML (in page 2) ?

    Ie. Do NOT write the innerHTML into <pre></pre>.
    Instead, the displayed text would be line separated by <br>.
    It would then, presumably be returned by google with the <BR>'s intact.

    Or is there a better solution?
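One way to sketch the <br> idea: escape the text first, so that any < or & typed into the textarea is shown literally, then turn each line break into <br>. The helper names `escapeHtml` and `textToHtml` are hypothetical, and whether the translation service returns the <br> tags intact is not something this sketch can confirm.

```javascript
// Escape the characters that have special meaning in HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Convert plain text (as read from textarea.value) to HTML with <br> line breaks.
function textToHtml(text) {
  return escapeHtml(text).replace(/\r\n|\r|\n/g, '<br>');
}
```

On page 2 this might be used as `document.getElementById("result").innerHTML = textToHtml(localStorage.pg_1_text);` with no <pre> wrapper.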

  • #14
    New Coder
    Quote Originally Posted by Ace..... View Post
    Question
    Do I somehow swap the line breaks for <br> before it gets written as innerHTML (in page 2) ?
    Actually <br> doesn't work.
    I've just tested it.

    So.....
    ... using Logic Ali's script:

    It reads all the displayed text on page 2, and passes it to the variable 't'.
    This gets stored in localStorage under the name pg_2_text.

    I then use the code on page 3:

    Code:
    document.getElementById("result").innerHTML= "<pre>" + localStorage.pg_2_text + "</pre>";
    This works for everything typed into the page 1 textarea. Ie. line breaks are passed thru page 2, to page 3.

    This actually was/is the objective.

    What I need to do next, is test this with google translate.

    Apologies to everybody for the confusion.
    In mitigation: It's so easy to lose track of tests, what's been changed, and were the tests consistent in the first place?

    I think.... before any further mods are effected (if any are required), a test with google translate should be effected.

    So, apart from my human failings....... this is looking to be an awesomely powerful script.

    Fingers crossed that everything comes back from the google servers, as is needed.

  • #15
    Senior Coder rnd me's Avatar
    if this is something that's going to be saved, it's better to use <textarea>.value so people can cut and paste and to prevent the browser from wrongly fixing certain unicode chars.