2. Using Tika and Tesseract as an API exposed by Solr via ExtractingRequestHandler

Eric Pugh edited this page Nov 6, 2019 · 1 revision

Don't want to deploy a separate Tika server, but need Tika-server-like capabilities and already have Solr? This is the solution for you!

First, we figured out the magic incantation for configuring Tika from inside Solr: a parseContext.config parameter that points to a file in a specific XML format:

<entries>
  <entry class="org.apache.tika.parser.pdf.PDFParserConfig" impl="org.apache.tika.parser.pdf.PDFParserConfig">
    <property name="extractInlineImages" value="true"/>
    <property name="ocrStrategy" value="OCR_AND_TEXT_EXTRACTION"/>
  </entry>
  <entry class="org.apache.tika.parser.ocr.TesseractOCRConfig" impl="org.apache.tika.parser.ocr.TesseractOCRConfig">
    <property name="outputType" value="HOCR"/>
    <property name="language" value="eng"/>
    <property name="pageSegMode" value="1"/>
  </entry>
</entries>

You might be tempted to think that this is the same file format as a tika-config.xml, and you'd be wrong ;-). While visually very similar, this file is loaded by ParseContextConfig, which is part of the Solr extraction contrib module. So yes, there are many different ways to specify configuration settings for PDF extraction and Tesseract OCR!

We then tweaked the default /update/extract request handler to refer to parseContext.xml. We want any field that isn't already defined in the Solr schema to be prefixed with attr_, which triggers dynamic field generation. So if the field from Tika is Creator, it becomes a text field in Solr called attr_creator.

<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler" >
  <str name="parseContext.config">parseContext.xml</str>
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">attr_</str>
    <str name="multipartUploadLimitInKB">20480</str> <!-- Limit to 20 MB PDF -->
  </lst>
</requestHandler>
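
To make the renaming concrete, here is a rough Python sketch of what lowernames and uprefix do to a Tika field name. The function solr_field_name and the schema_fields set are made up for illustration. The sketch only checks exact field names (a real Solr schema lookup also considers dynamic field rules), so treat it as an approximation of the behavior, not Solr's actual code.

```python
def solr_field_name(tika_name, schema_fields, uprefix="attr_", lowernames=True):
    """Approximate how a Tika metadata name ends up as a Solr field name.

    schema_fields is a hypothetical set of field names already defined
    in the schema; it stands in for the real schema lookup.
    """
    name = tika_name
    if lowernames:
        # Lowercase letters and digits; everything else becomes an underscore.
        name = "".join(ch.lower() if ch.isalnum() else "_" for ch in name)
    if name not in schema_fields:
        # Unknown fields get the prefix so they can match a dynamic field.
        name = uprefix + name
    return name


if __name__ == "__main__":
    known = {"id", "title"}  # assumed schema fields, for illustration only
    print(solr_field_name("Creator", known))       # attr_creator
    print(solr_field_name("Content-Type", known))  # attr_content_type
    print(solr_field_name("title", known))         # title
```

Fields that already exist in the schema keep their (lowercased) name, which is why only the unknown ones pick up attr_.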

Because PDFs can be big, we also needed to bump the upload size limits on the requestDispatcher:

<requestDispatcher handleSelect="true" >
  <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="20480" formdataUploadLimitInKB="20480" />
</requestDispatcher>

You can now hit Solr via curl:

curl 'http://localhost:8983/solr/documents/update/extract?literal.id=doc2&commit=true&extractOnly=true' -F "myfile=@files/alvarez20140715a.pdf"

Solr hands back the Tika-processed content in a structure that is relatively easy to process!
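
If you would rather consume that response from code, here is a minimal Python sketch. It builds the same extract URL and splits a JSON response into extracted content and metadata. The helper names build_extract_url and split_extract_response are hypothetical, wt=json is added so the response is JSON, and the sketch assumes the metadata arrives under a key ending in _metadata next to the content key. Check that against what your Solr version actually returns. commit=true is left out because nothing is indexed when extractOnly=true.

```python
import json
from urllib.parse import urlencode


def build_extract_url(base_url, collection, doc_id):
    """Build an extract-only URL for the /update/extract handler."""
    params = {"literal.id": doc_id, "extractOnly": "true", "wt": "json"}
    return f"{base_url}/solr/{collection}/update/extract?{urlencode(params)}"


def split_extract_response(body):
    """Split a JSON extractOnly response into (content, metadata) dicts.

    Assumption: each uploaded stream yields one key holding the extracted
    content and a sibling key with the suffix "_metadata".
    """
    suffix = "_metadata"
    content, metadata = {}, {}
    for key, value in json.loads(body).items():
        if key == "responseHeader":
            continue  # status and timing only, not document data
        if key.endswith(suffix):
            metadata[key[: -len(suffix)]] = value
        else:
            content[key] = value
    return content, metadata


if __name__ == "__main__":
    print(build_extract_url("http://localhost:8983", "documents", "doc2"))
```

The request itself still has to be sent as a multipart upload, as the curl command does. The sketch leaves that part out so it does not depend on any particular HTTP library.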