Acrobat to the Rescue: Searching Unsearchable Productions

In a perverse irony, lawyers often ‘brag’ about how little they know about information technology; but in situations where admitting confusion could help them, they clam up.  Abraham Lincoln said, “Better to remain silent and be thought a fool than to speak out and remove all doubt.”  But with respect to problems in electronic discovery, it’s foolish to stay silent.

Sadly, many requesting parties are flummoxed by what’s produced to them.  Rather than confess their confusion, they suffer in silence, opening or printing TIFF images one page at a time with nary a clue how to search what they’ve received.  And when a production arrives broken—lacking some essential element required for completeness or functionality—the silent majority often don’t know what they’re missing.  Instead, they laboriously flail away at the evidence, hoping to turn up something useful.  It’s a painful and unnecessary ordeal.

Case in point: a client received a production of about 5,000 documents, mostly e-mail messages, all produced as Adobe Portable Document Format files, or PDFs.  Though the documents derived from inherently searchable electronic originals, all the PDFs were created without a searchable text layer, and no extracted text or any fielded data were furnished in accompanying load files.  Ouch!

E-discovery denizens reading this will grasp the deviousness of the production.  It ruthlessly destroys any ability to search or sort the documents electronically and runs afoul of the Federal mandate stating, “If the responding party ordinarily maintains the information it is producing in a way that makes it searchable by electronic means, the information should not be produced in a form that removes or significantly degrades this feature.”  Comments to Rule 34(b) of the Federal Rules of Civil Procedure.

Innocent mistake?  Hardly.  The producing party is a Fortune 50 corporation with a storied history of discovery abuse.  It’s not their first rodeo. 

The producing party surely knows that it will have to supply a replacement production, if sought; but, it also knows that most requesting parties won’t raise a ruckus for fear that an objection will prompt a humiliating, “apparently you don’t understand how to use what we gave you.”  With the lack of e-discovery competence extant, most opponents will let it pass unaware.  Ignorance is bliss, more so when you can take advantage of the ignorant.

But stripping out searchability and holding back load files is advantageous even when sprung on a savvy opponent like my client.  It buys time.  Depositions must be put off and discovery deadlines or trial dates moved.  Opponents squander resources fiddling with the broken production, drafting motions and hiring experts.  It’s a tactic that rarely engenders sanctions or cost-shifting because few judges are going to punish a producing party who agrees to promptly supplement and supply the missing data.  Every dog gets one bite…per lawsuit.

So, if you’ve received a production like the brain-dead PDFs mentioned, how do you muddle through and deny your opponent the benefit of such delaying tactics?  There’s no pat answer, but I’ll describe the quick-and-dirty approach I took to assist a lawyer who, on the eve of depositions, said, “I’ve just got to go forward with what I’ve got.”

If you’re stuck with unsearchable document images, there are three things you can do to add electronic searchability:

  1. You can obtain the native source document or a near-native counterpart;
  2. You can obtain extracted text and the requisite load file data that pairs the text with the images; or,
  3. You can run optical character recognition (OCR) against the images to extract text.
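For readers who have never opened a load file, here is a rough illustration of what the second option supplies. This is a minimal sketch in Python, assuming an Opticon-style image load file whose comma-separated fields are page ID, volume, image path, a document-break flag ("Y" on the first page of each document), folder break, box break and page count. The sample lines, the Bates-style IDs and the function name are invented for illustration; nothing here comes from the production described in this post.

```python
import csv
from collections import OrderedDict


def group_pages_by_document(opt_lines):
    """Group page images into documents from Opticon-style load file lines.

    Assumed field order (hypothetical sample, not the actual production):
    page ID, volume, image path, document break, folder break, box break, page count.
    """
    documents = OrderedDict()
    current = None
    for row in csv.reader(opt_lines):
        if not row:
            continue  # skip blank lines
        page_id, image_path, doc_break = row[0], row[2], row[3]
        # A "Y" in the document-break field marks the first page of a new document.
        if doc_break.strip().upper() == "Y" or current is None:
            current = page_id
            documents[current] = []
        documents[current].append(image_path)
    return documents


# Invented sample: a two-page document followed by a one-page document.
sample = [
    "ABC0000001,VOL001,IMAGES/001/ABC0000001.tif,Y,,,2",
    "ABC0000002,VOL001,IMAGES/001/ABC0000002.tif,,,,",
    "ABC0000003,VOL001,IMAGES/001/ABC0000003.tif,Y,,,1",
]
docs = group_pages_by_document(sample)
for first_page, images in docs.items():
    print(first_page, len(images), "page(s)")
```

The point of the sketch is only that the load file tells a review tool where one document ends and the next begins; extracted text is typically keyed to the same identifiers, which is what lets a tool pair text with images.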

The third option is the only one you can undertake without obtaining further production from the other side, so it was the only option here.[1]

For the most part, the PDFs produced held clean text.  That is, because they derived from electronic originals, there were few handwritten annotations, skewed scans, funky fonts or other characteristics to confound OCR.  OCR is error-prone at its best; but, it performs abysmally on anything but clean text images.

Once they had the extracted text of the documents in an electronic format, my clients would need a means to pair the extracted text with the correct page image and to search the text.  If the mechanism employed indexed the text so as to speed search and supported Boolean and proximity searching, even better.
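To make "indexed," "Boolean" and "proximity" concrete, here is a minimal sketch in Python of the kind of mechanism I had in mind. It is not how Acrobat works internally; it simply assumes you already hold OCR text keyed by a document or page identifier (the identifiers, sample text and function names below are hypothetical) and builds a positional index that answers AND queries and within-N-words queries.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each term to {page_id: [word positions]} for the given page texts."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for position, term in enumerate(tokenize(text)):
            index[term][page_id].append(position)
    return index


def search_and(index, *terms):
    """Boolean AND: pages containing every one of the terms."""
    page_sets = [set(index.get(term.lower(), {})) for term in terms]
    return set.intersection(*page_sets) if page_sets else set()


def search_near(index, first, second, distance):
    """Proximity: pages where the two terms fall within `distance` words of each other."""
    hits = set()
    for page_id in search_and(index, first, second):
        first_positions = index[first.lower()][page_id]
        second_positions = index[second.lower()][page_id]
        if any(abs(i - j) <= distance for i in first_positions for j in second_positions):
            hits.add(page_id)
    return hits


# Invented sample text standing in for OCR output.
pages = {
    "DOC-0001.pdf": "Please recall the brake assembly before the audit.",
    "DOC-0002.pdf": "The audit found no brake defects. A recall is not needed at this time.",
    "DOC-0003.pdf": "Lunch on Friday?",
}
index = build_index(pages)
both = search_and(index, "brake", "recall")
near = search_near(index, "brake", "recall", 2)
print(sorted(both), sorted(near))
```

Because the index is built once, each query consults a lookup table instead of rereading every document, which is why an indexed collection searches so much faster than one examined file by file.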

So, I turned to Adobe Acrobat.  The old version 9 Pro edition of Acrobat on my machine is up-to-date enough to create Acrobat Portfolios, run OCR against the contents and even optimize the index for speedier search.  It also supports Boolean and proximity searching in a simple-to-use interface that includes a preview mechanism and a basic way to annotate notable documents.

While you need Adobe Acrobat versions 9, 10 or 11 to create a portfolio, the recipient of the portfolio just needs the ubiquitous, free Acrobat Reader application to open, view and search it.  A PDF Portfolio supports a simple browser-style viewer format in Acrobat Reader, so the documents are very quick to peruse.

Here, I need to reiterate the key difference between Adobe Acrobat products that just seems to stymie so many.  Adobe gives away a program called Adobe Reader.  It reads PDF formats, but it doesn’t create them.  Repeat: it doesn’t create PDFs or Portfolios.  It just reads them.  It’s called “Reader.”  Why? Because IT DOESN’T CREATE PDFs.  It’s free, so enjoy what it does, which is read PDFs.  Only.

Adobe sells products called Acrobat (so named because you have to perform gymnastics to get people to understand that the Reader product just reads PDFs).  The Acrobat products create PDFs, including Portfolios, from Version 9 forward.  This is how Adobe makes money: free reader, $350 writer.

But like most law offices, you already have a copy of the Adobe Acrobat program.  The writer, not the…oh, never mind.

To create the searchable Portfolio from almost 5,000 non-searchable PDFs comprising 1.7GB of data, I began by copying the PDFs I wanted to make searchable into a separate folder.  Next, I ran Adobe Acrobat and selected “Create PDF Portfolio” from the File menu.  The Edit PDF Portfolio window seen below opened.

[Screenshot: the Edit PDF Portfolio window]

I then selected “Add Existing Folder” from the bottom of the window and pointed the program to the folder I’d filled with unsearchable PDFs.   Acrobat began assembling the Portfolio from the files.  It took only a few minutes to ‘bind’ the documents into a virtual notebook; however, what I had wasn’t yet searchable.

The next step was to run optical character recognition against all the documents in the Portfolio.  Adobe Acrobat has a built-in basic OCR capability.  From the Document menu, I selected OCR Text Recognition>Recognize Text in Multiple Files Using OCR.  The dialog box that appeared allowed me to Add Files > Add Open Files.  As I’d not yet saved it, the portfolio in progress was called “Portfolio1.pdf” by default.  I selected it and my Output Options; then, I left for dinner because it would take hours for Acrobat to extract text from an estimated 30-40 thousand page images using optical character recognition.

Before you vendors reading this add, “Our tool would be better for this,” please remember that the goal here was fast and cheap.  Your wares cost more than free and carry a steeper learning curve than an application law firms already have and use.  Adobe Acrobat doesn’t deliver the benefits of applications purpose-built for e-discovery; but, it’s the butter knife that serves as a decent screwdriver in a pinch.

When the OCR engine completed its work, all of the documents in the collection were now text searchable…sort of.  Text in uncommon typefaces or unclear to the OCR engine was rendered incorrectly or not at all.  Gray scale content remained largely unsearchable.  What emerged was far more utile than what was produced, but fell short of what should be exchanged in e-discovery.
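If you can export the recognized text, one rough way to find the documents OCR handled badly is to flag those that came back with little or no usable text, so someone can eyeball them. The following Python sketch assumes the OCR text is already in hand, keyed by file name; the file names, sample text, function name and five-word threshold are all hypothetical choices for illustration, not anything Acrobat provides.

```python
def flag_sparse_pages(pages, min_words=5):
    """Return the IDs of pages whose OCR text has fewer than `min_words` word-like tokens.

    A token counts as word-like if it contains at least one letter, so stray
    punctuation and digits left behind by a failed OCR pass are ignored.
    """
    flagged = []
    for page_id, text in pages.items():
        words = [w for w in text.split() if any(ch.isalpha() for ch in w)]
        if len(words) < min_words:
            flagged.append(page_id)
    return sorted(flagged)


# Invented sample: one clean page, one nearly empty page, one page of OCR debris.
ocr_text = {
    "DOC-0101.pdf": "Per our call, the revised schedule is attached for your review.",
    "DOC-0102.pdf": "",
    "DOC-0103.pdf": "~~ 1| .. 7 ;; ll",
}
needs_review = flag_sparse_pages(ocr_text)
print(needs_review)
```

A word count is a blunt instrument. It catches pages that produced nothing, but it won't catch pages where the words are present and wrong, so treat the flagged list as a starting point for manual review and not as a measure of OCR accuracy.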

Searching was slow because each PDF in the portfolio had to be searched one-by-one.  To speed search, the next step was to generate an index for the contents of the portfolio.  From the Advanced menu, I selected Document Processing, set my parameters and generated an index.[2]  I let this run for a few hours more until completion.

Now, I had something I could give my client to enable his team to run text and proximity searches against the collection,[3] even if the only tool they had to use was a free copy of Acrobat Reader.   It’s even feasible for reviewers using Acrobat to add tags in each document’s description field (or in a custom field added by the reviewer) and sort by those fields and tags.  A lagniappe of the process is that, by consolidating the PDFs into a Portfolio, they’re compressed and stored more efficiently.  Even with the added text, the searchable Portfolio is one-third the collective size of the documents it holds.

My client can now prepare for depositions.  Acrobat rode to the rescue; yet, the Portfolio workaround detailed here is far from optimum.  It’s triage: quick, low cost and preferable to having no review platform and no ability to search the production, but no substitute for a proper production.


[1] I suppose you could have typists recreate all of the text in the documents manually; but, I shudder to think what that would cost.

[2]  In Acrobat 10 and 11, look for this option in the Tools menu.

[3] For Boolean and proximity searches, use the Advanced Search dialogue box.  If you have trouble getting the Advanced Search box to appear (as I did with Acrobat 9), try this: Open Acrobat, then open the Advanced Search dialogue box and only then open the Portfolio file. The window stays open and supports the advanced search options.

This was originally posted by Craig Ball on July 21, 2013 at https://ballinyourcourt.wordpress.com/2013/07/21/acrobat-to-the-rescue-searching-unsearchable-productions/#more-1297
