Convert PDF to Text (Using Apache PDFBox)


In this post we will see how to convert a PDF to text, that is, how to extract the text from a PDF file.
We will be using a Java library called Apache PDFBox, which is one of the projects of the Apache Software Foundation.

Apache PDFBox is really powerful and handy. Whether you are a full-fledged Java programmer or an ordinary end user, you can use it through the APIs it provides or from the command line. We will see both ways to use it. Here I am using Apache PDFBox 2.0.


Download the library
Download the library from the Apache PDFBox website (pdfbox-app-2.0.3.jar).

Using Java APIs:

First, add the jar file pdfbox-app-2.0.3.jar to your project. Below is sample code to extract text from a PDF:
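The following is a minimal sketch of such an extraction, assuming PDFBox 2.0.x is on the classpath. The class name PdfToText, the helper extractText, and the default file name sample.pdf are placeholders chosen for illustration; PDDocument.load and PDFTextStripper.getText are the PDFBox 2.0 calls doing the actual work.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfToText {

    /** Loads the given PDF and returns all of its text as one string. */
    public static String extractText(File pdfFile) throws IOException {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(pdfFile)) {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(document);
        }
    }

    public static void main(String[] args) throws IOException {
        // "sample.pdf" is only a placeholder default; pass a real path as the first argument
        File pdfFile = new File(args.length > 0 ? args[0] : "sample.pdf");
        System.out.println(extractText(pdfFile));
    }
}
```

If only part of the document is needed, PDFTextStripper also offers setStartPage and setEndPage, which can be called before getText.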

Apart from the above, Apache PDFBox provides many more APIs; check them out in the API section of the PDFBox documentation.

Using the Command Line:

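Syntax for Text Extraction:

    java -jar pdfbox-app-2.0.3.jar ExtractText [OPTIONS] <inputfile> [output-text-file]

More command-line parameters are listed in the command-line tools section of the PDFBox documentation.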

With either method, you can loop over multiple files, or parallelize the work for bulk processing.
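As a rough sketch of bulk processing through the Java API, the example below converts every PDF in a directory using a parallel stream. The class name BulkPdfToText, the methods convert and convertAll, and the choice to write each result next to its source as a .txt file are assumptions made for illustration, not something PDFBox prescribes. Each file gets its own PDDocument and PDFTextStripper, so no PDFBox objects are shared between threads.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class BulkPdfToText {

    /** Extracts one PDF and writes its text to a sibling file ending in ".txt". */
    static Path convert(Path pdf) {
        String name = pdf.getFileName().toString();
        // Replace the 4-character ".pdf" extension with ".txt"
        Path txt = pdf.resolveSibling(name.substring(0, name.length() - 4) + ".txt");
        try (PDDocument document = PDDocument.load(pdf.toFile())) {
            String text = new PDFTextStripper().getText(document);
            Files.write(txt, text.getBytes(StandardCharsets.UTF_8));
            return txt;
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to convert " + pdf, e);
        }
    }

    /** Converts every *.pdf directly inside the directory, several files at a time. */
    public static List<Path> convertAll(Path directory) throws IOException {
        List<Path> pdfs;
        try (Stream<Path> files = Files.list(directory)) {
            pdfs = files
                .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".pdf"))
                .collect(Collectors.toList());
        }
        // parallelStream spreads the files over the common fork-join pool
        return pdfs.parallelStream()
            .map(BulkPdfToText::convert)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory when no argument is given
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        convertAll(dir).forEach(System.out::println);
    }
}
```

A parallel stream keeps the sketch short; for very large batches, an ExecutorService with an explicit thread count gives more control over memory use, since every open PDDocument holds its file's data.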
