View Full Version : Exporting body of HTML pages - sitewide

05-22-2007, 11:13 PM
I need to export only the contents of the main content cell from each page throughout my site. This is for a translation agency.

Does anyone know of a speedy way I can export only this specific section and save each file as HTML?

I'm thinking of crawling with WinHTTrack site copier, and then somehow I'd need to strip out all the code above and below the main content cell.

Thanks for any ideas...

05-23-2007, 05:31 PM
Do all of these pages already exist, or do you have the ability to
put some sort of code before and after each section?

I'm thinking if you were to flag the sections, you could use PHP to
extract those easily, sort of like an RSS Feeder.


<div id='english'>
This is the content in English.
</div>

The PHP script could find all of the text between the <div> tags
that have id='english' and do whatever you want with the content.

This is what an RSS Feeder does when it parses an HTML page and
creates the XML for the RSS Reader.
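
As a rough sketch of that idea (the URL here is just a placeholder, and the pattern assumes the flagged <div> has no other <div> nested inside it):


// Sketch: extract the flagged section from one page.
// The URL is a placeholder; use one of your real pages.
$html = file_get_contents("http://www.example.com/page.html");

// A simple pattern like this assumes no nested <div> inside the flagged one
if (preg_match("/<div id='english'>(.*?)<\/div>/s", $html, $m)) {
    $english = trim($m[1]);   // just the flagged content
    echo $english;
}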

05-23-2007, 07:15 PM
Yes, I have an HTML comment before and after this section, like:

<!-- start main section body -->
<!-- end main section body -->

Would this work? Do you know of example code that would be of use to me?

Thanks for your suggestion! It sounds like it could work...

05-23-2007, 07:47 PM
Here is an example page I copied from the internet and
added the two comment lines in the middle of it (view the HTML source):
http://www.catpin.com/lorem.html
You'll see the <!-- start ... --> and <!-- end ... --> parts.


Here is the PHP script source that extracts the part you want:


// Get the page you want to parse
$url = "http://www.catpin.com/lorem.html";
$data = implode("", file($url));

// Get all content between <body> and </body>
// (the /s modifier lets .*? match across newlines; [^>]* allows attributes on the <body> tag)
preg_match_all("/<body[^>]*>(.*?)<\/body>/s", $data, $matches);

// Loop through each <body> section found
foreach ($matches[0] as $match) {

    // Get the content between your comment lines
    preg_match("/<!-- start main section body -->(.*?)<!-- end main section body -->/s", $match, $temp);
    $content = $temp[1];
    $content = strip_tags($content);
    $content = trim($content);

    // Print the content found
    echo $content;

    // you would save the content instead of printing it,
    // or do whatever you want with it.
}



There is an extra "grab content" step because you may want to extract more than
just the part between your two comment lines ... for example, you may also want to
grab the text between <title> and </title>, and this structure allows for that. If each page
only has one section between one set of comment lines, you could do it with fewer
lines of code ... but I took this from pieces of an RSS feeder, so it is what it is.
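
For example, to also grab the page title inside the same loop (a sketch, assuming each page has exactly one <title> tag):


// Also grab the page title from the full page source
// (sketch only; assumes one <title> tag per page)
if (preg_match("/<title>(.*?)<\/title>/s", $data, $t)) {
    $title = trim($t[1]);
    echo $title;
}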


05-23-2007, 09:01 PM
Thanks mlseim!

I removed the strip_tags() line because I want to preserve the HTML for this content.

Now for the clincher: how can I get this to go through my entire site, parsing all of the pages and saving the results as HTML files?

05-24-2007, 12:40 AM
First, tackle the save-as-HTML part.
Instead of printing, determine the path
and filename you'll be giving it (or them),
open the file, and write the HTML you want
along with $content.
Look for fopen()/fwrite() tutorials.

You didn't mention anything about whether the $content gets
written into one file, multiple files, or various directories ...
so you'll have to figure that part out.
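
A bare-bones sketch of that write step (the exported/ directory and the filename scheme are assumptions, and the directory has to exist before you write to it):


// Sketch: write the extracted content to its own HTML file.
// The exported/ directory and naming scheme are assumptions.
$outfile = "exported/" . basename($url, ".html") . "-body.html";
$fp = fopen($outfile, "w");
fwrite($fp, $content);
fclose($fp);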

Then, tackle the loop where PHP looks for all files ending
with .html and goes through them one by one. Look for
examples of glob() or opendir()/readdir().
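
Something along these lines, assuming the .html files sit in one local directory (glob() would need to become a recursive directory walk if they're spread across subdirectories):


// Sketch: run the extraction over every .html file in one local directory.
foreach (glob("site/*.html") as $file) {
    $data = implode("", file($file));

    // Same comment-marker regex as before
    if (preg_match("/<!-- start main section body -->(.*?)<!-- end main section body -->/s", $data, $temp)) {
        $content = trim($temp[1]);

        // Save under the same filename in exported/ (directory must exist)
        $fp = fopen("exported/" . basename($file), "w");
        fwrite($fp, $content);
        fclose($fp);
    }
}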

Those sketches are only rough starting points.
This would be a good time for you to learn PHP.

The intent of the forum is to help with existing code ...
I gave you a good start -- to get you going.