Exploded data, the legal web, and what we’re missing.

by Jason Wilson on July 25, 2010


In my recent interview with Peter Jackson, the chief scientist for Thomson Reuters, we had the following exchange:

JW: I was curious about the current state of the art in legal search. In 2007, you raised the issue that extraction technology required information to be explicitly stated in the text; it couldn’t be implied. You used an example in which a debtor moves to convert from Chapter 7 to Chapter 13, a creditor files a complaint to oppose it, and the judge decides the case by “finding for the plaintiff,” which really means the conversion was denied because the plaintiff is the creditor and the defendant is the debtor. Are we any closer to achieving the inferential capability necessary to extract this kind of data?

PJ: I think it is a very hard problem. I think that in theory you could sit down and build a very specially crafted solution for that particular kind of inference. It’s very hard to see how you do that in a way that would be scalable or would apply to similar kinds of reasoning problems. With the right amount of duct tape you can solve all of these narrow problems. You can always come up with some sort of algorithm or device or whatever, but to come up with a more general solution that you could apply to different kinds of situations, even just within a case, is much more difficult. For example, when we worked on Litigation Monitor, we wrote a program that went through the front matter in a case and figured out who the plaintiffs were, who the defendants were, which attorneys and law firms appeared, whom they were representing, and what the case was about. But that was a very specially crafted piece of code. There’s nothing in that code that would help you solve an analogous inference problem either with a different kind of document with a different kind of format or some other kind of reasoning problem like the one you described about bankruptcy.
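To get a feel for just how narrow that kind of “duct tape” really is, here is a toy sketch, in Python, of the sort of front-matter parsing Jackson describes. To be clear, this is my own illustration and bears no relation to the actual Litigation Monitor code; it handles exactly one rigid caption format, which is precisely the point.

import re

# Toy caption parser -- my own illustration, not Litigation Monitor.
# It assumes one rigid format ("PLAINTIFFS v. DEFENDANTS"), which is
# exactly why this kind of specially crafted code does not generalize.
CAPTION_RE = re.compile(
    r"^(?P<plaintiffs>.+?)\s+v\.\s+(?P<defendants>.+?)\s*$",
    re.IGNORECASE,
)

def parse_caption(front_matter):
    """Pull party names from the first caption-looking line of a case."""
    for line in front_matter.splitlines():
        match = CAPTION_RE.match(line.strip())
        if match:
            return {
                "plaintiffs": match.group("plaintiffs").strip(),
                "defendants": match.group("defendants").strip(),
            }
    return {}

print(parse_caption("CITY OF ONTARIO, CALIFORNIA, et al. v. QUON et al."))
# {'plaintiffs': 'CITY OF ONTARIO, CALIFORNIA, et al.', 'defendants': 'QUON et al.'}

Change the caption style, or ask it who the creditor is in a bankruptcy conversion, and it falls apart. Nothing in it transfers to a different document layout, let alone to the inference problem above.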

Since then, I’ve been thinking more about the problem of organizing and searching the legal web (a topic of my upcoming Slaw.ca piece), and the difficulties we face as editors (curators). In particular, how do we make the specifically relevant information generated by legal professionals findable, especially once you realize that Google only gets you so far?

I actually owe this post to Mike Cane, who reminded me that dumb information can be turned into smart information in his post last year titled “Dumb eBooks Must Die, Smart eBooks Must Live.” It was after reading that post that I realized how difficult it is to make dumb information “smart.” As far as the legal web is concerned, most blog posts, white papers, firm synopses, etc. are “dumb” information. And Jackson’s response to my question only solidified Cane’s observations. Namely, the inference problem is really one of hidden data, which at this point can only be revealed by human editors.

Let’s take a look at a random passage from a recent opinion of the U.S. Supreme Court, City of Ontario v. Quon.

And it is worth noting that during his internal affairs investigation, McMahon redacted all messages Quon sent while off duty, a measure which reduced the intrusiveness of any further review of the transcripts.

Some of the “smart” information associated with this single passage could include the following:

<citation-parties-01=City of Ontario>
<citation-parties-02=Jeff Quon>
<citation-court=United States Supreme Court>
<citation-date-month=June>
<citation-date-day=17>
<citation-date-year=2010>
<citation-judge-majority=Kennedy>
<citation-judge-concurring=Stevens>
<citation-judge-concurring-in-part=Scalia>
<citation-district-court-state=California>
<citation-district-court-district=Central>
<citation-district-court-judge=Stephen G. Larson>
<citation-court-of-appeals-circuit=Ninth>
<factual-police=internal affairs>
<factual-police=Ontario Police Department>
<factual-police=OPD>
<factual-police=Special Weapons and Tactics Team>
<factual-police=SWAT>
<factual-person-01=Patrick McMahon>
<factual-person-02=Jeff Quon>
<factual-person-sex=male>
<factual-person-identity=man>
<factual-person-role=sergeant>
<factual-person-occupation-01=sergeant>
<factual-person-occupation-02=police officer>
<factual-person-label-01=respondent>
<factual-person-label-02=witnesses>
<factual-location-galaxy=Milky Way>
<factual-location-planet=Earth>
<factual-location-country=United States>
<factual-location-city=Ontario>
<factual-company=Arch Wireless Operating Company>
<factual-date-year=2001>
<factual-date-year=2002>
<action-01=deleted>
<action-02=redacted>
<action-03=review>
<action-04=pager issued>
<action-effect=reduced>
<action-effect=lessened>
<action-effect=not intrusive>
<technology-01=pager>
<technology-02=beeper>
<technology-03=backup communications system>
<technology=short message service>
<technology=electronic mail>
<technology=SMS>
<technology=numeric message>
<technology=text message>
<communication=short message service>
<communication=SMS>
<communication=electronic mail>
<communication=numeric message>
<communication=alphanumeric message>
<communication=text message>
<communication-nature-01=sexual>
<communication-nature-02=explicit>
<communication-nature-03=private>
<communication-nature-04=off-duty>
<factual-documents=transcript>
<factual-documents=internal affairs>
<factual-documents=personnel records>
<government-action-police=internal affairs>
<government-action-police=review>
<government-action-police=transcripts>
<government-action-police=investigation>

When did you stop reading? The first five lines or so? If so, I encourage you to go back and read the rest and then think about what other implied data is missing. Cane describes this as “exploded data,” that is, hidden text that makes a dumb sentence into a smart one. But, as you soon realize, it is a human who is the bomb maker. It is a person who is making most of these connections, not an algorithm.
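To make the point concrete, here is a minimal sketch, again in Python, of how an editor’s hand-built tags might be stored alongside the passage they annotate so that a search can reach the implied data. The schema and field names are my own invention, loosely following the list above; no research vendor, to my knowledge, stores data this way.

# A minimal sketch of storing human-curated "exploded data" next to the
# passage it annotates. Schema and field names are invented for illustration.
passage = (
    "And it is worth noting that during his internal affairs investigation, "
    "McMahon redacted all messages Quon sent while off duty, a measure which "
    "reduced the intrusiveness of any further review of the transcripts."
)

annotations = {
    "citation": {
        "parties": ["City of Ontario", "Jeff Quon"],
        "court": "United States Supreme Court",
        "decided": "2010-06-17",
    },
    "people": [
        {"name": "Patrick McMahon", "role": "internal affairs investigator"},
        {"name": "Jeff Quon", "role": "sergeant"},
    ],
    "actions": ["redacted", "review"],
    "technology": ["pager", "SMS", "text message"],
}

def find_passages(index, term):
    """Return every passage whose human-supplied tags mention the term."""
    term = term.lower()
    return [text for text, tags in index if term in str(tags).lower()]

index = [(passage, annotations)]
print(find_passages(index, "sergeant"))  # hits, though "sergeant" never appears in the text

A search for “sergeant” finds the passage even though the word never appears in the sentence itself. That is the value of the hidden, human-supplied layer, and it is also why, for now, someone has to sit down and supply it.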

In talking about eBooks, Cane makes a very astute observation:

An eBook becomes a local terminal connected to a growing and living cloud of associated information, with meanings and implications no publisher or writer can currently imagine.

Take a second and think about this perspective in the context of a single practitioner’s blog post and how much exploded data is included within it. Without human editors at this point, what are you really discovering? Better yet, what are you missing? I ask this because you (the legal researcher) think that everything you know about a subject is found on WestlawNext, Lexis Advance, Loislaw, Fastcase, Casemaker, or (gasp!) a printed treatise or handbook. But it isn’t. The Scott Greenfields of this world teach us that a single sentence is full of exploded data, especially that pesky “practical knowledge” kind.

My thought at this point is that the legal web is in an infancy that we can’t even fathom yet. There is a cloud of associated information that our current computer-assisted legal research vendors cannot give us through their algorithms, especially when they remain in walled gardens that don’t account for the vast and valuable information being created by users. The question is whether we will step up to organize this sea of data, or wait until a program can do it for us. If the latter, what does it say about the future of legal research and the practice of law?

[Image (CC) by hey mr glen]

UPDATE & CLARIFICATION

Greg Lambert over at 3 Geeks has a nice follow-up on the concept of exploded data. In his post, he quotes the last two sentences of my post and notes that he doesn’t see the problem as a man-versus-machine issue. After reading it, I realized that my post indeed leaves that impression. I want to clarify that I don’t see this as a man-versus-machine problem. My point was more about the wait. Some man-machine combination is going to be necessary to curate the legal web, but my line of thinking was “well, do we use the tools available to us now, roll up our sleeves, and start pounding away at the keys,” or “do we sit back until someone creates a really sophisticated algorithm that pulls all of this material together for us, adds in the metadata, and then we review it”? I think if it is the latter, then we are going to miss an enormous amount of valuable research data, particularly when you consider Simon Fodden’s observation (in the comments below) that the associated data opens up pathways to all sorts of disciplines.


Simon Fodden July 26, 2010 at 8:49 pm

Really good post, Jason. You're talking about the sort of thing that I came at another way when I tried to describe "tomorrow's texts" in a Slaw post [http://bit.ly/9xrr1q]. I came at it timidly and with very small suggestions.

The sort of exploded implications you're identifying put me in mind of the business of "judicial notice," where a judge can take notice of a commonly accepted fact; at the margins, this can be contentious; but across 99.9% of the field, it's wholly tacit—"unexploded," to use your notion. Which is good. We don't want our cases judged by a person from Mars, who has to learn what a table is, a man, a party, etc. These things "go without saying," otherwise everything grinds to a halt.

I think it's interesting to see how much of this unspoken knowledge might have to become explicit in order to have our documents written or read by a machine from Mars.

I'm only musing.

I do think, though, that we have to go there. But I think we should go generally (regardless of what publishers do with their reports and monographs) through the legal system. If we redefine what a good legal document is—whether it's a contract, a memo, a judgment, or even a solicitor's letter—our new expectations, supported by word processing and other applications, might elicit the sort of meta-data from all document creators that would be useful for research, archiving, statistics, accounting, etc. After all, courts and lawyers had to learn at some point to use a certain style of cause, to adopt a certain citation system, to number paragraphs, and so forth.

So I see a combo of technology (law office stuff, court admin stuff) and people (document creators and editors) doing the work.

But we need to prioritize, to choose which six pieces of metadata from your sixty we're going to expect from our profession, for example, in the near future (50 years?).

jasnwilsn July 26, 2010 at 10:01 pm

Simon,

Thanks for the thoughts. I think one of the notions you've hit on is really something I liked about Mike Cane's observation that eBooks are like terminals plugging into something more vast. When you redefine a good legal document, as you put it, that document can become a terminal that leads you to a multitude of disciplines. I agree that now we must prioritize, and if I were to begin curating the legal web, I would build a taxonomy with limited associated terminology. Eventually, though, I would want a machine to come along and help me grow those associations.
