pdf to xml or to string in c#

06-23-2009, 10:31 PM
I need to extract data from pdf files. I'm using .NET.
I've been poring over the web to find a way to do this. This is a case where the web is working against me. Putting data into pdf is easy and there are about a gazillion people posting how to do that. That makes it really hard to find how to do the opposite - get data out of pdf.
Ideally, I'd like to convert pdf into xml. Failing that, I'd like to read the text out of it into a string or stream.
I'd love to do it without using a COM component or some buggy open source product (I'm not anti-open source, but we all know there's a lot of half-baked open source software out there).
Is it possible?

06-23-2009, 10:57 PM
Your best bet is to find something that can spit it out into some type of format for you, and you can work from there to decipher it. I did a quick Google search for "c# parse pdf" and found a few examples:

Looks like it uses some type of library to get it into text format.

06-24-2009, 05:41 AM
Hey bnewman,

I hear you on the open source stuff, since you do run a higher risk of bugs. In this case, though, I do believe that's the way to go. Look into the following components:

activePDF Server
PDFlib + PDI
TallPDF.NET 3.0

Now, some of these you actually have to pay for, but I think if you just use one of the free components (iTextSharp is free, I think), you should be fine. Just do some good testing, that's all.
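
For what it's worth, here is a rough sketch of what the iTextSharp route could look like, covering both things asked for: text into strings, then strings into xml. It assumes an iTextSharp 5.x assembly is referenced (the PdfTextExtractor class is not in older builds, as far as I know). The PdfTextDump class name and the document/page xml layout are made up for illustration, so treat it as a starting point rather than a tested solution.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;
using iTextSharp.text.pdf;          // assumption: iTextSharp 5.x is referenced
using iTextSharp.text.pdf.parser;   // PdfTextExtractor lives here in 5.x

static class PdfTextDump
{
    // Reads the text of every page into a list of strings.
    // iTextSharp page numbers are 1-based.
    public static List<string> ReadPages(string path)
    {
        var pages = new List<string>();
        var reader = new PdfReader(path);
        try
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
                pages.Add(PdfTextExtractor.GetTextFromPage(reader, i));
        }
        finally
        {
            reader.Close();
        }
        return pages;
    }

    // Wraps the page text in a simple xml layout (hypothetical element names).
    public static XDocument ToXml(IEnumerable<string> pages)
    {
        var root = new XElement("document");
        int number = 1;
        foreach (string text in pages)
            root.Add(new XElement("page", new XAttribute("number", number++), text));
        return new XDocument(root);
    }

    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: PdfTextDump <file.pdf>");
            return;
        }
        // Console.WriteLine calls ToString(), which serializes the xml.
        Console.WriteLine(ToXml(ReadPages(args[0])));
    }
}
```

Keep in mind this only gives you the raw text of each page in roughly reading order. A pdf doesn't store tables or fields as structure, so anything beyond one element per page is parsing you'd have to do yourself on the strings. Scanned (image-only) pdfs will come back empty, and text containing control characters may need cleaning before the xml will serialize.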