  1. #1
    New to the CF scene
    Join Date
    May 2009
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Removing unwanted tags with RegEx

    I am open to other suggestions as a solution for this, but here is an account of what I have been trying to do.

    Hey everyone - I'm new here so first of all "hello" to everyone - and a big thanks ahead of time for taking the time to read this...

    Rest assured, where I can help here, I will be sure to.

    I am an intermediate JavaScript user and am currently trying to write an RTE for our CMS. So far, things are turning out well, but after a lot of editing, our code blocks tend to get quite messy. What I'm trying to do first is write a "cleaner" script that will get rid of unwanted and unused tags in our code.

    So here is an example of our first task - after many edits, we have managed to save this ugly bit of code (NOTE: it displays properly, but gosh, is it a mess to look at):

    HTML Code:
    <h1><h2><p></p></h2></h1><p></p><p>This is a test paragraph</p><h2></h2><p><h1></h1></p>
    I will admit, I've gone a bit overboard in the demonstration of this mess.

    In any event, as you can probably guess, we want to get rid of all the empty tags here so that our output (after the script runs) would only be:

    HTML Code:
    <p>This is a test paragraph</p>
    First we tried this:
    Code:
    var html = [our content];
    html = html.replace(/<(\S+).*><\/\1>/, "");
    This seemed to get rid of only the first occurrence, giving us:
    HTML Code:
    <p></p></h2></h1><p></p><p>This is a test paragraph</p><h2></h2><p><h1></h1></p>
    So we thought, "gee whiz! we should run a global replace on it!" So thus:
    Code:
    var html = [our content];
    html = html.replace(/<(\S+).*><\/\1>/g, "");
    it seemed to work pretty well and output this code:
    HTML Code:
    <p>This is a test paragraph</p>
    However, then we started adding some unordered lists and headers, paragraphs, etc back in and it all ended up getting axed as well ( no good! ).
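
    A likely culprit is the greedy `.*` in the pattern: it can stretch from the first opening tag all the way to the last matching closing tag and swallow everything in between. A minimal sketch, assuming a made-up sample string (`sample` is not part of the original content):

```javascript
// Hypothetical sample: two empty paragraphs around a list we want to keep.
var sample = "<p></p><ul><li>One</li></ul><p></p>";

// The pattern from the post. The capture settles on "p", and the greedy .*
// then runs up to the LAST "></p>" in the string.
var greedy = /<(\S+).*><\/\1>/g;

var greedyResult = sample.replace(greedy, "");

// The list is gone along with the empty paragraphs.
console.log(JSON.stringify(greedyResult));
```

    Here a single match covers the whole string, so the list is deleted together with the empty paragraphs.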

    So then we figured, instead of replacing them all at once and deleting every <h1></h1> element just because there was one empty one, we could run a loop to replace them one at a time. So we wrote this script, figuring it would work:
    Code:
    var html = clean = [our content];
    while(clean.indexOf(/<(\S+).*><\/\1>/) >= 0) {
         clean = clean.replace(/<(\S+).*><\/\1>/, "\n");
    }
    but it didn't.

    Any suggestions?
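
    One reason the loop above never runs: `indexOf` takes a string, not a regular expression, so the regex is converted to its source text and the search returns -1. A rough sketch of the same loop idea, assuming a stricter pattern that only matches an opening tag directly followed by its own closing tag (the names `EMPTY_TAG` and `removeEmptyTags` are made up for illustration, and this is not a real HTML parser):

```javascript
// Matches <tag ...>, optional whitespace, then </tag> with the same tag name.
// Assumption: attribute values never contain ">".
var EMPTY_TAG = /<([a-z][a-z0-9]*)\b[^>]*>\s*<\/\1>/gi;

function removeEmptyTags(html) {
    var previous;
    // Removing an inner empty tag can leave its parent empty,
    // so repeat until a pass changes nothing.
    do {
        previous = html;
        html = html.replace(EMPTY_TAG, "");
    } while (html !== previous);
    return html;
}

var messy = "<h1><h2><p></p></h2></h1><p></p><p>This is a test paragraph</p><h2></h2><p><h1></h1></p>";
console.log(removeEmptyTags(messy));
```

    Because `[^>]*` cannot cross a tag boundary, tags that contain text or non-empty children are left alone. Elements that are meant to be empty (an empty table cell, for example) would also be removed, so this may need a whitelist.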

  • #2
    Regular Coder
    Join Date
    Feb 2009
    Location
    Wheeling, IL
    Posts
    358
    Thanks
    5
    Thanked 62 Times in 60 Posts
    There is a Firefox extension that can do the same thing. True, it doesn't change your web pages (you need to do that yourself), but it will do the cleaning. It is also very useful because it serves as an HTML validator. Get it here:

    http://www.w3.org/People/Raggett/tidy/

    Plus, I'm really not sure how you plan to actually edit stuff with JS, considering it is purely a client-side language and as such has no access to the server files, so any changes you make with it aren't really permanent changes, but temporary stuff. True, going through a website and clearing everything out manually might be a bit of an annoyance, but eh... it's better in the long run and it will definitely save you some head-bashing when you go and look at your code at a later time.

  • #3
    Senior Coder
    Join Date
    Oct 2008
    Location
    Long Beach
    Posts
    1,196
    Thanks
    36
    Thanked 164 Times in 164 Posts
    Here's my quick response. This will remove all html tags (including the one you want to remain):
    Code:
    var test_html = "<p></p></h2></h1><p></p><p>This is a test paragraph</p><h2></h2><p><h1></h1></p>";
    
    function stripXHTMLTags(str)
    	{
    	return str.replace(/<[^>]*>/g,"")
    	}
    
    alert(stripXHTMLTags(test_html));
    If I have any spare time today I'll look into removing only the empty ones.
    Feel free to e-mail me if I forget to respond ;)
    ohsosexybrit@gmail.com

  • #4
    Senior Coder
    Join Date
    Oct 2008
    Location
    Long Beach
    Posts
    1,196
    Thanks
    36
    Thanked 164 Times in 164 Posts
    Modified my above one, tell me if it meets your expectations:
    Code:
    var test_html = "<p> </p></h2></h1><p></p><p>This is a test paragraph</p><h2></h2><p><h1></h1></p>";
    
    function stripEmptyXHTMLTags(str)
    	{
    	//var reg_exp = new RegExp("<[^>]*>","g"); //removes all markup tags
    	var reg_exp = new RegExp("<[^>]*><[^>]*>","g");
    	return str.replace(reg_exp,"")
    	}
    
    alert(stripEmptyXHTMLTags(test_html));
    EDIT: it doesn't work for multiple tags with content

    e.g.:
    Code:
    var test_html = "<p> </p></h2></h1><p>hey</p><p>This is a test paragraph</p><h2></h2><p><h1></h1></p>";
    still trying...
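
    The reason for the failure, as far as I can tell: `<[^>]*><[^>]*>` matches any two tags that sit next to each other, so it also eats the `</p><p>` boundary between two paragraphs that both have content. A small sketch with a made-up input (`twoParagraphs` is only for illustration):

```javascript
// Same pattern as above, written as a regex literal.
var anyTwoTags = /<[^>]*><[^>]*>/g;

var twoParagraphs = "<p>hey</p><p>This is a test paragraph</p>";

// "</p><p>" is two adjacent tags, so it is removed and the paragraphs merge.
var merged = twoParagraphs.replace(anyTwoTags, "");
console.log(merged);
```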

    EDIT2:
    Here's a sloppy way of doing it.
    Code:
    var test_html = "<p></p><h2></h2><h1>hey</h1><p></p><p>This is a test paragraph</p><h2></h2><p><h2></h2></p>";
    
    function outerHTML(elm)
    	{
    	if (elm.outerHTML) return elm.outerHTML;
    	// fallback: serialize a clone so the original node stays in its parent
    	// (moving the node itself would shrink the live "children" list below)
    	var temp_parent = document.createElement("div");
    	temp_parent.appendChild(elm.cloneNode(true));
    	return temp_parent.innerHTML;
    	}
    
    function stripEmptyXHTMLTags(str)
    	{
    	var temp = document.createElement("div");
    	temp.innerHTML = str;
    	var children = temp.getElementsByTagName("*");
    	var new_str = "";
    	for (var i=0;i<children.length;i++)
    		{
    		// keep only elements whose content is not empty or all whitespace
    		if (children[i].innerHTML.replace(/\s+/g,"")) new_str += outerHTML(children[i]);
    		}
    	return new_str;
    	}
    
    alert(stripEmptyXHTMLTags(test_html));
    Last edited by itsallkizza; 05-27-2009 at 04:05 PM.

  • #5
    New to the CF scene
    Join Date
    May 2009
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Part I SOLVED

    Using your suggestions, I created the following expression which works great for removing all the empty tags:

    /* Remove empty tags */
    theHTML = theHTML.replace(/(<[^\/]>|<[^\/][^>]*[^\/]>)\s*<\/[^>]*>/g, "");


    However, there is a problem which this doesn't solve:

    <h1><p>Test test test</p></h1> - when this is encountered, the text gets all the styles of the <p> tag, but any properties the <p> styles leave undefined are picked up from the heading tag as well, such as font-variant. I could solve it by defining all these styles in my CSS, but that still leaves the sloppy nested code.

    So, now I'm trying to come up with a regEx for removing situations like this, but without removing things like this:

    <p><strong>test test test</strong></p>

    Here's what I've got so far - don't know if I can capture multiple patterns in one expression though - This is fairly long, so I'll break it up below:

    /* Replace redundant tags */
    theHTML = theHTML.replace(/<([^\/]>|<[^\/][^>]*[^\/]).*>(<[^\/]>|<[^\/][^>]\S+[^\/]>*<\/[^>]*>)<\/\1>/g, "$2");


    <([^\/]>|<[^\/][^>]*[^\/]).*>
    This should, as in the first example grab any opening tag and capture the tagName inside (but not attributes).

    <\/\1.*>
    The ending bit, should merely be (using the capture) the closing tag, so in our problem these two would capture the <h1></h1> and not <h1></p>

    (<[^\/]>|<[^\/][^>]\S+[^\/]>*<\/[^>]*>)
    The middle bit should capture any tag sequence such as <p></p> and capture it as well - then I want to replace all the match with only this info, removing the nested tags...

    This doesn't seem to be working, however - any suggestions?

    I know this will remove the <p><strong></strong></p> that I do not want to remove, but it shouldn't be difficult to change the first bit to something like this to remove only unwanted tags:

    <(h1|h2|h3|h4|h5|h6)[^\/]>|<[^\/][^>]*[^\/]).*>
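
    One possible way to approach the unwrapping step, sketched under the assumption that only a heading wrapped directly around `<p>` content should be unwrapped (the names `HEADING_AROUND_P` and `unwrapHeadingParagraphs` are hypothetical, and a regex like this will not cope with headings nested inside other headings):

```javascript
// $1 = heading name (h1..h6), $2 = the <p>...</p> content inside it.
// [\s\S]*? is lazy, so it stops at the first </p> that is followed by
// the closing tag of the same heading.
var HEADING_AROUND_P = /<(h[1-6])\b[^>]*>\s*(<p\b[^>]*>[\s\S]*?<\/p>)\s*<\/\1>/gi;

function unwrapHeadingParagraphs(html) {
    // Replace the whole heading with just the paragraph content it wrapped.
    return html.replace(HEADING_AROUND_P, "$2");
}

console.log(unwrapHeadingParagraphs("<h1><p>Test test test</p></h1>"));
console.log(unwrapHeadingParagraphs("<p><strong>test test test</strong></p>"));
```

    Restricting the outer tag to `h1` through `h6` is what keeps `<p><strong>...</strong></p>` untouched; a heading that holds plain text does not match either, since the pattern requires a `<p>` right after the opening heading tag.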

  • #6
    Senior Coder
    Join Date
    Oct 2008
    Location
    Long Beach
    Posts
    1,196
    Thanks
    36
    Thanked 164 Times in 164 Posts
    It occurred to me that perhaps it would be more efficient to simply use something like Firefox's Firebug plugin to do this work for you. I guess I didn't read your initial post thoroughly enough to notice this is for development purposes, which means you don't need client-side script to do the work for you. Firebug (and I'm sure many other plugins) will help clean up invalid code; you can then see all the empty tags by validating your page.

    Anyway, I believe I found a solution for all valid html code (as in, notice the removal of nested header tags in the test_html):
    Code:
    var test_html = "<h2><p>hey</p></h2></div><p></p><p>This is a test paragraph</p><h2></h2><p><h1></h1></p>";
    
    /* JKD 09-06-2007 */
    HTMLElement.prototype.__defineGetter__("outerHTML", function() {
        var span = document.createElement("span");
        span.appendChild(this.cloneNode(true));
        return span.innerHTML;
    });
    
    HTMLElement.prototype.__defineSetter__("outerHTML", function(html){
        var range = document.createRange();
        this.innerHTML = html;
        range.selectNodeContents(this);
        var frag = range.extractContents();
        this.parentNode.insertBefore(frag, this);
        this.parentNode.removeChild(this);
    });
    /* end JKD */
    
    function numChildren(str)
    	{
    	var temp = document.createElement("div");
    	temp.innerHTML = str;
    	return temp.getElementsByTagName("*").length;
    	}
    
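    function stripEmptyXHTMLTags(str)
    	{
    	// let the browser parse the string into a detached element
    	var temp = document.createElement("div");
    	temp.innerHTML = str;
    	var children = temp.getElementsByTagName("*");
    	var new_str = "";
    	// strips tags and whitespace characters, leaving only real text content
    	var tag_reg_exp = new RegExp("<[^>]*>|\t|\n|\v|\0|\f|\r| ","g");
    	for (var i=0;i<children.length;i++)
    		{
    		// anything left after stripping means the element has content
    		if (children[i].innerHTML.replace(tag_reg_exp,""))
    			{
    			new_str += children[i].outerHTML;
    			// skip descendants: they are already part of this outerHTML
    			i += children[i].getElementsByTagName("*").length;
    			}
    		}
    	return new_str;
    	}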
    
    alert(stripEmptyXHTMLTags(test_html));
    Last edited by itsallkizza; 05-28-2009 at 04:15 PM.

