pdf to xml or to string in c#



bnewman
06-23-2009, 09:31 PM
I need to extract data from pdf files. I'm using .NET.
I've been poring over the web to find a way to do this. This is a case where the web is working against me. Putting data into pdf is easy and there are about a gazillion people posting how to do that. That makes it really hard to find out how to do the opposite: get data out of pdf.
Ideally, I'd like to convert pdf into xml. Failing that, I'd like to read the text out of it into a string or stream.
I'd love to do it without using a COM component or some buggy open source product (I'm not anti-open source, but we all know there's a lot of half-baked open source software out there).
Is it possible?

Brandoe85
06-23-2009, 09:57 PM
Your best bet is to find something that can spit it out into some type of format for you, and you can work from there to decipher it. I did a quick google on "c# parse pdf" and found this example:
http://naspinski.net/post/ParsingReading-a-PDF-file-with-C-and-AspNet-to-text.aspx

Looks like it uses some type of library to get it into text format.

Mike_O
06-24-2009, 04:41 AM
Hey bnewman,

I hear you on the open source stuff, since you do run a higher risk of bugs. In this case, though, I do believe a third-party component is the way to go. Look into the following components:

iTextSharp
activePDF Server
PDF4NET
PDFlib + PDI
TallPDF.NET 3.0

Now, some of these you actually have to pay for, but I think if you just use one of the free components (iTextSharp is free I think), you should be fine. Just do some good testing, that's all.

Mike
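
For anyone landing on this thread later, here is a rough sketch of the iTextSharp route, covering both things asked for above: pulling the text out into strings, and wrapping it in simple XML. It assumes iTextSharp 5.x, where PdfReader and PdfTextExtractor.GetTextFromPage are available; older 4.x builds do not have that extractor. The class name PdfTextDump, the element names "document" and "page", and the file names in the usage line are made up for illustration. Treat it as a starting point, not a finished converter: it only gets you plain text per page, not the layout or the structure of the PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;
using iTextSharp.text.pdf;          // PdfReader (assumes iTextSharp 5.x)
using iTextSharp.text.pdf.parser;   // PdfTextExtractor (assumes iTextSharp 5.x)

// Hypothetical helper class, named for illustration only.
public static class PdfTextDump
{
    // Returns the extracted text of each page, in page order.
    public static List<string> ReadPages(string path)
    {
        var pages = new List<string>();
        var reader = new PdfReader(path);
        try
        {
            // iTextSharp page numbers start at 1, not 0.
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                pages.Add(PdfTextExtractor.GetTextFromPage(reader, i));
            }
        }
        finally
        {
            reader.Close();
        }
        return pages;
    }

    // Wraps page texts in a minimal, made-up XML layout:
    // <document><page number="1">...</page>...</document>
    public static XDocument ToXml(IEnumerable<string> pages)
    {
        var root = new XElement("document");
        int number = 1;
        foreach (string text in pages)
        {
            root.Add(new XElement("page", new XAttribute("number", number), text));
            number++;
        }
        return new XDocument(root);
    }
}
```

Usage would look something like PdfTextDump.ToXml(PdfTextDump.ReadPages("input.pdf")).Save("output.xml"). If you only want a string, join the list from ReadPages instead. Scanned PDFs that contain only images will come back empty, since there is no text in them to extract.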


