The Apache Tika™ toolkit detects and extracts metadata and text content from many kinds of documents, from PPT to CSV to PDF, using existing parser libraries. Tika unifies these parsers under a single interface, so you can parse over a thousand different file types the same way. Tika is useful for search engine indexing, content analysis, translation, and much more. You can find the latest release on the download page.
To build Tika from source, you will need Java 6 and Maven 2. You can also skip the build and jump straight into extraction using the Tika app jar file, found on the download page. Another option is to use one of the wrappers that make Tika available from other programming languages, such as Julia or Python. Please see the Getting Started page for more information. The Parser and Detector pages describe the main interfaces of Tika and how they work.
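The idea of detecting a document's type and then handing it to a matching parser behind one interface can be illustrated with a small, self-contained sketch. This is not Tika's code or API: the class `UnifiedParsingSketch`, the `SimpleParser` interface, the methods `detectType` and `extractText`, and the crude detection rules are all hypothetical names and simplifications chosen for illustration. Tika's real Parser and Detector interfaces are described on their own pages.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class UnifiedParsingSketch {

    /** Hypothetical stand-in for a format-specific parser (not Tika's Parser). */
    interface SimpleParser {
        String parse(byte[] data);
    }

    /** One registry maps each media type to the parser that handles it. */
    private static final Map<String, SimpleParser> PARSERS = new HashMap<>();
    static {
        // Plain text: decode and trim.
        PARSERS.put("text/plain",
                data -> new String(data, StandardCharsets.UTF_8).trim());
        // CSV: decode, trim, and turn field separators into spaces.
        PARSERS.put("text/csv",
                data -> new String(data, StandardCharsets.UTF_8).trim().replace(',', ' '));
    }

    /** Hypothetical detector: a deliberately crude guess from the content. */
    static String detectType(byte[] data) {
        String text = new String(data, StandardCharsets.ISO_8859_1);
        if (text.startsWith("%PDF-")) {
            return "application/pdf"; // PDF files begin with this marker
        }
        int newline = text.indexOf('\n');
        String firstLine = newline < 0 ? text : text.substring(0, newline);
        // Assumption for this sketch only: a comma in the first line means CSV.
        return firstLine.contains(",") ? "text/csv" : "text/plain";
    }

    /** Single entry point: detect the type, then delegate to the matching parser. */
    static String extractText(byte[] data) {
        String type = detectType(data);
        SimpleParser parser = PARSERS.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return parser.parse(data);
    }

    public static void main(String[] args) {
        byte[] csv = "name,age\nAda,36\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(detectType(csv));
        System.out.println(extractText(csv));
    }
}
```

The caller only ever touches `extractText`; supporting another format means registering one more parser, which is the benefit of putting many parser libraries behind a single interface. In this sketch no PDF parser is registered, so PDF input is detected but rejected with an exception.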