Store binary data in Solr and serve it up like an object store!

Eric Pugh edited this page Nov 6, 2019 · 1 revision

I'm sure this will incite some hate mail, but Solr makes a really powerful caching layer for single documents. I've served 3,000 queries per second looking up large blobs of text from a single node that hosted hundreds of GB of data.

So that got me curious: how would I store binary objects in Solr and serve them up? In the demo we use a simple static file server started from the init script, http-server ../files --cors -p 8443, but that isn't super clean. We already store the text in Solr, so couldn't we store the PDF too?

I started by adding the binary field type to schema.xml.

Then I tweaked the create-solr-docs.ps1 PowerShell script to encode the raw binary PDF file as Base64: $file_binary_base64 = [Convert]::ToBase64String([System.IO.File]::ReadAllBytes($file))

Look in ocr/docs_for_solr to see examples of the resulting JSON-formatted Solr docs.
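
For readers who don't use PowerShell, here is a rough Node.js sketch of the same encoding step. It is not the script from the repo: the toSolrDoc helper and the id field are assumptions for illustration, and only the file_binary_base64 field name comes from the demo.

```javascript
// Hypothetical helper (not part of the demo): build a Solr JSON doc
// that carries the raw PDF bytes as a Base64 string.
// Assumes the schema has an "id" field and a binary field named
// file_binary_base64.
function toSolrDoc(id, pdfBytes) {
  return {
    id: id,
    // Buffer is Node.js-only; it turns the raw bytes into Base64 text.
    file_binary_base64: Buffer.from(pdfBytes).toString("base64"),
  };
}

// Example: the four bytes "%PDF" that open every PDF file.
const exampleDoc = toSolrDoc("doc1", new Uint8Array([0x25, 0x50, 0x44, 0x46]));

// Solr's JSON update handler accepts an array of docs like this one.
const exampleBody = JSON.stringify([exampleDoc]);
```

In a real run the bytes would come from reading the PDF off disk, for example with fs.readFileSync.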

These documents post right into Solr just fine.

The much more interesting part is how we get the binary file back out of Solr! We want it to come out as one big chunk of Base64-encoded data, and conveniently enough, PDF.js can load binary data directly via pdfjsLib.getDocument({data: pdfData}).

So... I needed a query to Solr that wouldn't wrap the field I wanted, file_binary_base64, in any other formatting. This is where the wt=csv response writer comes to the rescue. Combine it with csv.header=false and fl=file_binary_base64 and we get just the Base64 data back from Solr. Run it through the JavaScript atob() method, and we are good to go!
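
The query and decode steps above can be sketched roughly like this. It is a minimal sketch, not the code from the demo page: the helper names, the Solr base URL, the collection name, and the id lookup are all assumptions, and it expects the query to match exactly one document so that the response body is nothing but Base64.

```javascript
// Hypothetical helper: build a select URL that returns only the raw
// Base64 field. solrBase, collection and the id field are assumptions.
function buildBinaryUrl(solrBase, collection, docId) {
  const params = new URLSearchParams({
    q: `id:"${docId}"`,        // assumes docId contains no double quotes
    fl: "file_binary_base64",  // return only the binary field
    wt: "csv",                 // CSV writer adds no JSON/XML wrapping
    "csv.header": "false",     // drop the header row
  });
  return `${solrBase}/${collection}/select?${params.toString()}`;
}

// Decode the Base64 text into bytes that PDF.js can consume.
function base64ToBytes(base64Text) {
  // trim() drops the trailing newline the CSV writer emits.
  const binary = atob(base64Text.trim());
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Browser-side usage; pdfjsLib is passed in so nothing runs at load time.
async function loadPdfFromSolr(pdfjsLib, url) {
  const response = await fetch(url);
  const pdfData = base64ToBytes(await response.text());
  return pdfjsLib.getDocument({ data: pdfData }).promise;
}
```

Converting to a Uint8Array instead of handing PDF.js the string returned by atob() is a design choice in this sketch; a typed array is unambiguous binary data.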

Check out the demo file load_pdf_from_solr_directly.html to see this in action. There are some alert() calls to show you what's going on.