On September 19, 2022, the Civil Justice Data Commons gathered over thirty civil justice experts to discuss the tricky problem of how to clean and cluster civil justice data. The CJDC Clustering & Classifying Methods Convening included 13 presentations and collaborative discussions on common pain points, potential solutions, and key lessons learned. Much of the code used in these clustering methods can be found in the CJDC Knowledge Base, which launches on November 30.

The first step of clustering is cleaning, i.e., the processing done on data to ensure it can be clustered effectively. The most basic methods of cleaning, such as correcting spelling errors in common words, were used by nearly all attendees. The source of the data can affect the cleaning that needs to be done, with data derived from web scraping often requiring more effort. Some teams employed hands-on cleaning, such as the Oklahoma Policy Institute, which introduced specific cleaning rules for the major players in its local courts. Natural Language Processing (“NLP”) can help identify likely matches between differently spelled words; those matches can then be confirmed to produce a normalized dataset, an approach used by organizations such as the Legal Services Corporation.
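As a rough illustration of this kind of cleaning, the sketch below normalizes case and whitespace and snaps near-miss party names onto a canonical spelling using Python's standard difflib module. The canonical list and similarity threshold are illustrative assumptions, not any attendee's actual cleaning rules.

```python
import difflib

# Canonical spellings of frequently appearing party names
# (illustrative examples, not drawn from any attendee's pipeline).
CANONICAL_PARTIES = [
    "Midland Funding LLC",
    "Portfolio Recovery Associates",
    "Capital One Bank",
]

def clean_party_name(raw: str) -> str:
    """Normalize whitespace and case, then snap near-misses to a canonical spelling."""
    name = " ".join(raw.strip().split()).title()
    match = difflib.get_close_matches(name, CANONICAL_PARTIES, n=1, cutoff=0.8)
    return match[0] if match else name

print(clean_party_name("  midland fundng llc "))  # -> "Midland Funding LLC"
```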

Once data have been cleaned, attendees reported using a wide variety of clustering methods.

Machine learning methods such as Probabilistic Soft Logic (CJDC-University of Southern California), Fuzzy Logic (University of Guelph), and Text-Based Offense Classification (CJARS), along with tools like Fuzzywuzzy and spaCy (several attendees), were the most common. Professor Rebecca Johnson of Georgetown has similarly been able to fill large gaps in education data using Topic Modeling, a form of unsupervised clustering.
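To give a flavor of the fuzzy-matching tools mentioned above, the sketch below uses the fuzzywuzzy library to map messy docket labels onto a small set of canonical case types. The labels and categories are hypothetical, not drawn from any attendee's data.

```python
# A minimal sketch of fuzzy string matching with fuzzywuzzy,
# assuming a fixed list of canonical case types.
from fuzzywuzzy import fuzz, process

CASE_TYPES = ["Eviction", "Debt Collection", "Small Claims", "Foreclosure"]

raw_entries = ["EVICTION - NONPAYMENT", "debt collectn", "small clms"]
for entry in raw_entries:
    # extractOne returns the best-scoring canonical label and its similarity score.
    label, score = process.extractOne(entry, CASE_TYPES, scorer=fuzz.token_set_ratio)
    print(f"{entry!r} -> {label} (score {score})")
```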

N-Grams can be a useful tool in machine learning as well as in other clustering methods: they break a string of text into smaller chunks of length N and compare those chunks to others of the same size, rather than treating long text strings as a whole. The CJDC-University of Southern California joint project and the Legal Services Corporation use N-Grams to account for typos and alternate spellings in data, while Philadelphia Legal Assistance uses them to speed up processing.
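The sketch below shows the basic idea with character trigrams (N=3) and a Jaccard similarity over the resulting sets; the strings compared are illustrative, not taken from any attendee's dataset.

```python
# A minimal sketch of character n-gram comparison, assuming trigrams (N=3).
def char_ngrams(text: str, n: int = 3) -> set:
    """Break a string into overlapping character chunks of length n."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two strings' n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# A typo still yields high overlap, unlike an exact string comparison.
print(ngram_similarity("forcible entry and detainer",
                       "forceable entry and detainer"))
```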

Another category of clustering focuses on matching and cross-referencing with outside data sources, which is especially effective with geographic information, as in the work of January Advisors and the Princeton Eviction Lab. In an even more novel use, Claire Johnson Raba of the University of Illinois Chicago School of Law has been bringing in data from court summons documents to accurately recover addresses that are missing from some court records.
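As a simplified sketch of this cross-referencing, the example below uses pandas to fill a missing address in a court record from a second, summons-derived table and then builds a normalized address key for matching. The column names and records are hypothetical, not the actual schemas used by these projects.

```python
# A minimal sketch of cross-referencing court records against an external
# address file with pandas; all column names and records are placeholders.
import pandas as pd

def normalize_address(addr: str) -> str:
    """Uppercase, drop periods, and collapse whitespace so addresses can be matched."""
    return " ".join(addr.upper().replace(".", "").split())

court = pd.DataFrame({
    "case_id": ["22-0101", "22-0102"],
    "address": ["123 Main St.", None],   # one record is missing its address
})
summons = pd.DataFrame({
    "case_id": ["22-0102"],
    "address": ["456 Oak Ave."],
})

# Fill gaps in court records from the summons data, then normalize for matching.
merged = court.merge(summons, on="case_id", how="left", suffixes=("", "_summons"))
merged["address"] = merged["address"].fillna(merged["address_summons"])
merged["address_key"] = merged["address"].map(normalize_address)
print(merged[["case_id", "address", "address_key"]])
```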

The key takeaways from this convening were the need for data standardization, the opportunity presented by novel data sources, and the necessity of preserving privacy. Standardization (such as common data elements and a litigation event ontology) would greatly ease researchers' collaboration with courts and with each other. Accessing data from novel sources, such as court PDFs via Optical Character Recognition and geographic information from external datasets, would both introduce new data and fill gaps in existing data. Finally, how sensitive data are managed will be an important frontier going forward; the ability to tokenize and anonymize court data will open more opportunities for working with those data and will ensure analysis can be done ethically when dealing with vulnerable populations.
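As one hedged example of what tokenization might look like in practice, the sketch below replaces a direct identifier with a keyed hash (HMAC-SHA256) so records can still be linked across datasets without exposing the underlying name. The key and names are placeholders, and a real deployment would also need secure key management and a broader de-identification review.

```python
# A minimal sketch of tokenizing identifiers before analysis, assuming a
# keyed hash (HMAC-SHA256); the secret key and names are placeholders only.
import hmac
import hashlib

SECRET_KEY = b"replace-with-a-securely-stored-key"

def tokenize(identifier: str) -> str:
    """Replace a direct identifier with a stable, non-reversible token."""
    normalized = identifier.strip().lower()
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

# The same party name always yields the same token, so records can still be
# linked across datasets without exposing the underlying name.
print(tokenize("Jane Q. Public") == tokenize("  jane q. public "))  # True
```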