Apache Tika 2.3 Released

The maintainers of the Apache Tika project, the open-source, Java-based content detection and analysis framework, recently announced the release of Tika 2.3.0.

This release comes with several security upgrades in dependencies, including an upgrade to log4j2 (version 2.17.1). It also includes a non-trivial upgrade to Apache POI 5.2.0 (TIKA-3164). "Users will observe significantly more logging from the POI parsers," wrote Tim Allison, a long-time project committer, on the project mailing list page. The release contents have been pushed to the main Apache release site and to the Maven Central sync, Allison added.

The Apache Tika toolkit was designed to detect and extract metadata and structured text content from more than 1,400 different file types. Data is stored in thousands of formats, from text documents and Excel spreadsheets to JPEG images and multimedia files. Consequently, search engines and content management systems need additional support to extract data efficiently from these document types. Apache Tika provides that support through a generic API for parsing different file formats, delegating the actual parsing to existing specialized parser libraries for each document type.
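The idea of one generic entry point that detects a format and then hands off to a format-specific parser can be pictured with a toy sketch. This is not Tika's code or API: the `SIGNATURES` table, the `detect` and `parse` functions, and the stub parser are all hypothetical names invented for illustration, and real detection relies on far richer rules than a few leading bytes.

```python
from typing import Callable, Dict, Tuple

# Hypothetical magic-byte table; a real detector uses many more signals.
SIGNATURES: Dict[bytes, str] = {
    b"%PDF-": "application/pdf",
    b"\xff\xd8\xff": "image/jpeg",
    b"PK\x03\x04": "application/zip",
}


def detect(data: bytes) -> str:
    """Guess a media type from the leading bytes, defaulting to plain text."""
    for magic, media_type in SIGNATURES.items():
        if data.startswith(magic):
            return media_type
    return "text/plain"


def parse_text(data: bytes) -> str:
    """Decode plain text, replacing any bytes that are not valid UTF-8."""
    return data.decode("utf-8", errors="replace")


def parse_binary_stub(data: bytes) -> str:
    """Stand-in for a specialized parser library; extracts nothing."""
    return ""


# Registry mapping each media type to the parser that handles it.
PARSERS: Dict[str, Callable[[bytes], str]] = {"text/plain": parse_text}


def parse(data: bytes) -> Tuple[str, str]:
    """Single entry point: detect the type, then delegate to its parser."""
    media_type = detect(data)
    parser = PARSERS.get(media_type, parse_binary_stub)
    return media_type, parser(data)
```

A caller only ever touches `parse`, so supporting a new format in this sketch would mean registering another signature and parser rather than changing calling code.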

Tika is widely used in search engines, document analysis solutions, digital asset management tools, and content analysis components. Although it is written in Java, Tika is also commonly called from other languages. Tika-Python, for example, is a Python binding to the Apache Tika™ REST services, which allows Tika to be called natively in Python.
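For readers curious what calling the REST services looks like, here is a minimal sketch that uses only the Python standard library rather than Tika-Python itself. It assumes a Tika server is already running locally on its default port, 9998, and that the `/tika` endpoint accepts a document via PUT and returns plain text when asked for `text/plain`. The function names `build_tika_request` and `extract_text` are hypothetical helpers, not part of any Tika package.

```python
import urllib.request

# Assumption: a Tika server is listening locally on its default port.
DEFAULT_SERVER = "http://localhost:9998"


def build_tika_request(
    document: bytes, server_url: str = DEFAULT_SERVER
) -> urllib.request.Request:
    """Build (but do not send) a PUT request asking for extracted text."""
    return urllib.request.Request(
        url=server_url.rstrip("/") + "/tika",
        data=document,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(document: bytes, server_url: str = DEFAULT_SERVER) -> str:
    """Send the document to the server and return the extracted text."""
    request = build_tika_request(document, server_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server running, `extract_text(open("report.pdf", "rb").read())` would return the document's text, where `report.pdf` is a placeholder file name.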

The project, now more than 16 years old, is stewarded by the Apache Software Foundation (ASF). It was formerly a subproject of Apache Lucene, a Java library designed to provide indexing and search features, as well as spellchecking, hit highlighting, and advanced analysis/tokenization capabilities.

Apache Tika is available on the download page. It is also available in binary form or through Maven 2 from the Central Repository.

About the Author

John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at [email protected].