Limber up for the Big Data Marathon

19 March 2013 by Adrienn Toth

The Data Craze for Sports Fanatics and Lawyers

One of my colleagues has just run the Reading Half Marathon and I am expecting any minute to see his race stats published on Facebook. Well done Rob Jones, a GPS time of 2:21:19. Budding athletes and intrepid cyclists are downloading various apps to their phones (like Endomondo Sports Tracker), relying on the information they gather to track distance travelled, time taken and energy expended, and using this not only to show off subtly on social networking sites but also to plot and plan their race strategies. Of course, a positive spin-off is that the rest of us, having shared their pain and gain, feel inspired to do something similar, and before you know it the data craze has turned into a sports craze and a new way of doing things. This phenomenon highlights how data can be transformed into intelligence, can inform decision making and strategy, and can possibly even have an unintended impact. It got me thinking again about the influence that big data and predictive analytics are having on business and on the legal profession, and how edisclosure fits into the picture.

Big data in business

Initially it was only big companies like telecommunications companies, banks and government agencies that could afford to store and analyse big data.  Thanks to advancements in hardware and databases you no longer need supercomputers to carry out complex analytics across large data sets.  Many businesses are finding that for a reasonable investment they can collect data and make it relevant to their business; by measuring consumer behaviour and using pattern detection they can respond to customer needs and market conditions and make data-driven decisions.   Supermarkets, healthcare providers, gaming companies, insurance companies and even florists are jumping on the bandwagon and tapping into the intelligence running through the big data stream and finding ways to monetise the data they hold.

But (and it’s a big but) what about law firms? 

Can lawyers, who have tended to shy away from technological innovation, really harness big data to predict case outcomes and legal costs? We know that big data can be exploited to predict the outbreak of diseases, but can it be used to predict the outcome of a litigation case? According to an interesting article by Mike Wheatley on SiliconANGLE, databases of legal history are being built up and algorithms are being developed to help predict case outcomes. Apparently, companies are also developing mobile apps that predict the average legal cost of different types of cases in the US.

As we enter a new era of cost management in the UK and the need to stick to case budgets becomes more important, we will need all the help we can get to estimate costs and to gauge what impact variables like the number of witnesses or the extent of disclosure might have, not only on costs but also on the outcome of a case. Of course, the data that needs to be collected, analysed and correlated to make sensible predictions includes not just the key features and facts of the case itself but also the results recorded in subsequent court decisions. When it comes to costs, law firms and edisclosure providers all hold a lot of valuable billing data that could be analysed to assist with cost estimating. This might all be feasible but has not yet been done.

On the edisclosure front, data analytics has been used for some time. We have had email analytic tools that can be used to visualise who has been communicating with whom, when and about what. Similarly, Technology Assisted Review (TAR), also known as Computer Assisted Review or Predictive Coding, analyses decisions made by humans on a sub-set of documents and then looks for similar patterns in a much larger document universe to predict which documents are relevant to a case and should be given top priority. At this stage most of us know about TAR and some are testing the water. Here are some tips on analytics from the sports scene:

Sports analytics and the CIO: Five lessons from the sports data craze

Collect the right data to start with, both qualitatively and quantitatively. In edisclosure this means targeting the right sources of data and is an area where experts can help. Is it better to present a raw, unfiltered set of data (to teach the system in a balanced way) or a set of results based on a carefully crafted search, or would the latter be somewhat prejudicial? Until there are better statistics and more guidelines from real cases, the ultimate decision is likely to be a strategic one.

Start with statistically significant data.  This refers to the selection of your seed set of documents that will be reviewed by humans and used to train the prediction software.   You cannot expect the software to achieve peak performance on 1,000 documents.

Remember that the ability to contextualise data is important.  There are incalculable factors that come into play with prediction and this is where human quality control is vital.
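For readers who like to see the moving parts, here is a minimal sketch of the idea behind TAR: learn from a seed set that humans have already coded, then rank the unreviewed documents by how relevant they look. It is an illustration only and not how any particular edisclosure product works. The scoring method (a simple naive Bayes word count), the function names and the tiny sample documents are all assumptions made up for this example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; real tools use far richer features."""
    return text.lower().split()

def train(seed_set):
    """Count words per label from human-coded (text, is_relevant) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_relevant in seed_set:
        doc_counts[is_relevant] += 1
        word_counts[is_relevant].update(tokenize(text))
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocabulary

def relevance_score(text, model):
    """Log-odds that a document is relevant; above 0 leans relevant.

    Assumes the seed set contains at least one document of each label.
    """
    word_counts, doc_counts, vocabulary = model
    log_odds = math.log(doc_counts[True] / doc_counts[False])
    totals = {label: sum(word_counts[label].values()) for label in (True, False)}
    for word in tokenize(text):
        if word not in vocabulary:
            continue  # words never seen in the seed set carry no evidence
        # Add-one smoothing keeps unseen word/label pairs from scoring zero.
        p_rel = (word_counts[True][word] + 1) / (totals[True] + len(vocabulary))
        p_not = (word_counts[False][word] + 1) / (totals[False] + len(vocabulary))
        log_odds += math.log(p_rel / p_not)
    return log_odds

def prioritise(documents, model):
    """Order unreviewed documents from most to least likely relevant."""
    return sorted(documents, key=lambda d: relevance_score(d, model), reverse=True)

# Hypothetical seed set: four documents already coded by human reviewers.
seed_set = [
    ("contract breach payment overdue invoice", True),
    ("invoice dispute contract termination notice", True),
    ("lunch menu friday canteen", False),
    ("office party friday drinks", False),
]
# Hypothetical unreviewed documents from the wider document universe.
unreviewed = [
    "canteen closed friday",
    "overdue invoice under the contract",
    "party drinks menu",
]

model = train(seed_set)
ranked = prioritise(unreviewed, model)
for doc in ranked:
    print(f"{relevance_score(doc, model):+.2f}  {doc}")
```

With a seed set this small the scores mean very little, which is exactly the point of the second tip, and a human still needs to check what the ranking throws up, which is the point of the third.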

Perhaps, as we use these predictive tools more in legal cases and share our practical experiences and results, their use will become widespread and a status symbol, just as Nike+ is.