By Jason Wilson
In my recent interview with Peter Jackson, the chief scientist for Thomson Reuters, we had the following exchange:
JW: I was curious about the current state of the art in legal search. In 2007, you raised the issue that extraction technology required information to be explicitly stated in the text; it couldn’t be implied. You used an example in which a debtor moves to convert from Chapter 7 to Chapter 13, a creditor files a complaint to oppose it, and the judge decides the case by “finding for the plaintiff,” which really means the conversion was denied, because the plaintiff is the creditor and the defendant is the debtor. Are we any closer to achieving the inferential capability necessary to extract this kind of data?
PJ: I think it is a very hard problem. I think that in theory you could sit down and build a very specially crafted solution for that particular kind of inference. It’s very hard to see how you do that in a way that would be scalable or would apply to similar kinds of reasoning problems. With the right amount of duct tape you can solve all of these narrow problems. You can always come up with some sort of algorithm or device or whatever, but to come up with a more general solution that you could apply to different kinds of situations, even just within a case, is much more difficult. For example, when we worked on Litigation Monitor, we wrote a program that went through the front matter in a case and figured out who the plaintiffs were, who the defendants were, which attorneys and law firms appeared, whom they were representing, and what the case was about. But that was a very specially crafted piece of code. There’s nothing in that code that would help you solve an analogous inference problem, either with a different kind of document in a different kind of format or with some other kind of reasoning problem like the one you described about bankruptcy.
Since then, I’ve been thinking more about the problem of organizing and searching the legal web (a topic of my upcoming Slaw.ca piece), and the difficulties we face as editors (curators). In particular, how do we make specifically relevant information generated by legal professionals findable when Google only gets you so far?
I actually owe this post to Mike Cane, who reminded me that dumb information can be turned into smart information in his post last year titled “Dumb eBooks Must Die, Smart eBooks Must Live.” It was after reading that post that I realized the difficulty of making dumb information “smart.” As far as the legal web is concerned, most blog posts, white papers, firm synopses, etc. are “dumb” information. And Jackson’s response to my question merely solidified Cane’s observations. Namely, the inference problem is really one of hidden data, which can only be revealed by human editors at this point.
Let’s take a look at a random passage from a recent opinion of the U.S. Supreme Court, City of Ontario v. Quon.
And it is worth noting that during his internal affairs investigation, McMahon redacted all messages Quon sent while off duty, a measure which reduced the intrusiveness of any further review of the transcripts.
Some of the “smart” information associated with this single passage could include the following:
<citation-parties-01=City of Ontario>
<citation-court=United States Supreme Court>
<citation-district court-judge=Stephen G. Larson>
<citation-court of appeals-district=Ninth>
<factual-police=Ontario Police Department>
<factual-police=Special Weapons and Tactics Team>
<factual-company=Arch Wireless Operating Company>
<technology-03=backup communications system>
<technology=short message service>
<communication=short message service>
When did you stop reading? The first five lines or so? If so, I encourage you to go back and read the rest and then think about what other implied data is missing. Cane describes this as “exploded data,” that is, hidden text that makes a dumb sentence into a smart one. But, as you soon realize, it is a human who is the bomb maker. It is a person who is making most of these connections, not an algorithm.
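To make the idea concrete, here is a minimal sketch of how a human editor’s annotations like those above might be stored and queried as simple facet–value pairs. This is purely illustrative: the function name, the data structure, and the handful of tags included are my own invention, not any vendor’s actual schema.

```python
# Hypothetical sketch: storing editor-supplied "smart" annotations
# for a passage and retrieving them by facet. All names are invented.

passage = (
    "And it is worth noting that during his internal affairs "
    "investigation, McMahon redacted all messages Quon sent while off duty..."
)

# Each annotation pairs a facet (category) with a value a human editor
# inferred from the opinion -- the hidden data an algorithm would miss.
annotations = [
    ("citation-parties", "City of Ontario"),
    ("citation-court", "United States Supreme Court"),
    ("factual-police", "Ontario Police Department"),
    ("factual-police", "Special Weapons and Tactics Team"),
    ("factual-company", "Arch Wireless Operating Company"),
    ("technology", "short message service"),
]

def find_by_facet(annotations, facet):
    """Return every value tagged with the given facet."""
    return [value for f, value in annotations if f == facet]

print(find_by_facet(annotations, "factual-police"))
# ['Ontario Police Department', 'Special Weapons and Tactics Team']
```

Even this toy version shows the point: the retrieval step is trivial; it is supplying the annotations in the first place that requires a human.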
In talking about eBooks, Cane makes a very astute observation:
An eBook becomes a local terminal connected to a growing and living cloud of associated information, with meanings and implications no publisher or writer can currently imagine.
Take a second and think about this perspective in the context of a single practitioner’s blog post and how much exploded data is included within it. Without human editors at this point, what are you really discovering? Better yet, what are you missing? I ask this because you (the legal researcher) think that everything you know about a subject is found on WestlawNext, Lexis Advance, Loislaw, Fastcase, CaseMaker, or (gasp!) a printed treatise or handbook. But it isn’t. The Scott Greenfields of this world teach us that a single sentence is full of exploded data, especially that pesky “practical knowledge” kind.
My thought at this point is that the legal web is in an infancy that we can’t even fathom yet. There is a cloud of associated information that our current computer-assisted legal research vendors cannot give us based on their algorithms, especially when they remain in walled gardens that don’t account for the vast and valuable information being created by users. The question is whether we will step up to organize this sea of data, or wait until a program can do it for us. If the latter, what does it say about the future of legal research and the practice of law?
[Image (CC) by hey mr glen]
UPDATE & CLARIFICATION
Greg Lambert over at 3 Geeks has a nice follow-up on the concept of exploded data. In his post, he quotes the last two sentences of my post and notes that he doesn’t see the problem as a man-versus-machine issue. After reading it, I realized that my post indeed leaves that impression. I want to clarify that I don’t see this as a man-versus-machine problem. My point was more about the wait. Some man–machine combination is going to be necessary to curate the legal web, but my line of thinking was “well, do we use the tools available to us now, roll up our sleeves, and start pounding away at the keys,” or “do we sit back until someone creates a really sophisticated algorithm that pulls all of this material together for us, adds in the metadata, and then we review it”? I think if it is the latter, then we are going to miss an enormous amount of valuable research data, particularly when you consider Simon Fodden’s observation (in the comments below) that the associated data opens up pathways to all sorts of disciplines.