{"id":1571,"date":"2022-11-11T17:22:17","date_gmt":"2022-11-11T17:22:17","guid":{"rendered":"https:\/\/www.law.georgetown.edu\/tech-institute\/?page_id=1571"},"modified":"2026-01-28T19:18:45","modified_gmt":"2026-01-28T19:18:45","slug":"convening-of-civil-justice-experts-on-data-clustering","status":"publish","type":"page","link":"https:\/\/www.law.georgetown.edu\/tech-institute\/programs-and-initiatives\/georgetown-justice-lab\/civil-justice-data-commons\/cjdc-blog\/convening-of-civil-justice-experts-on-data-clustering\/","title":{"rendered":"Convening of Civil Justice Experts on Data Clustering"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-1574 alignright\" src=\"https:\/\/www.law.georgetown.edu\/tech-institute\/wp-content\/uploads\/sites\/42\/2022\/11\/CJDC-Clustering-Conveening-Pic-300x210.jpg\" alt=\"A group of around 20 people at a table\" width=\"300\" height=\"210\" srcset=\"https:\/\/www.law.georgetown.edu\/tech-institute\/wp-content\/uploads\/sites\/42\/2022\/11\/CJDC-Clustering-Conveening-Pic-300x210.jpg 300w, https:\/\/www.law.georgetown.edu\/tech-institute\/wp-content\/uploads\/sites\/42\/2022\/11\/CJDC-Clustering-Conveening-Pic-768x538.jpg 768w, https:\/\/www.law.georgetown.edu\/tech-institute\/wp-content\/uploads\/sites\/42\/2022\/11\/CJDC-Clustering-Conveening-Pic-500x350.jpg 500w, https:\/\/www.law.georgetown.edu\/tech-institute\/wp-content\/uploads\/sites\/42\/2022\/11\/CJDC-Clustering-Conveening-Pic-740x518.jpg 740w, https:\/\/www.law.georgetown.edu\/tech-institute\/wp-content\/uploads\/sites\/42\/2022\/11\/CJDC-Clustering-Conveening-Pic-980x686.jpg 980w, https:\/\/www.law.georgetown.edu\/tech-institute\/wp-content\/uploads\/sites\/42\/2022\/11\/CJDC-Clustering-Conveening-Pic.jpg 1000w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/>On September 19, 2022, the Civil Justice Data Commons gathered over thirty civil justice experts to discuss the tricky problem of how to clean and cluster civil 
justice data. The CJDC Clustering &amp; Classifying Methods Convening included 13 presentations and collaborative discussions on common pain points, potential solutions, and key lessons learned. Much of the code used in these clustering methods can be found in the CJDC Knowledge Base, which launches on November 30.<\/p>\n<p><span style=\"font-weight: 400\">The first step of clustering is cleaning, i.e., the processing done on data to ensure it can be clustered effectively. The most basic methods of cleaning, such as correcting spelling errors in common words, were used by nearly all attendees. The source of data can affect the cleaning that needs to be done, with data derived from web scraping often requiring more effort. Some teams employed hands-on cleaning, such as the <\/span><a href=\"https:\/\/openjustice.okpolicy.org\/\"><b>Oklahoma Policy Institute<\/b><\/a><span style=\"font-weight: 400\">, which introduced specific cleaning rules for the major players in its local courts. <a href=\"https:\/\/www.ibm.com\/cloud\/learn\/natural-language-processing\">Natural Language Processing<\/a> (\u201cNLP\u201d) can help with cleaning by identifying likely matches between words that are spelled differently; these matches can then be confirmed and standardized to produce a normalized dataset. NLP is used by organizations such as the <\/span><a href=\"https:\/\/www.lsc.gov\/\"><b>Legal Services Corporation<\/b><\/a><span style=\"font-weight: 400\">.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Once data have been cleaned, attendees reported on a wide variety of clustering methods.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Machine Learning methods like <a href=\"https:\/\/psl.linqs.org\/\">Probabilistic Soft Logic<\/a> (<\/span><b>CJDC-University of Southern California<\/b><span style=\"font-weight: 400\">), Fuzzy Logic (<\/span><b>University of Guelph<\/b><span style=\"font-weight: 400\">), and Text-Based Offense Classification (<\/span><b><a 
href=\"https:\/\/cjars.isr.umich.edu\/toc-tool\/\">CJARS<\/a>) <\/b><span style=\"font-weight: 400\">or the tools <a href=\"https:\/\/pypi.org\/project\/fuzzywuzzy\/\">Fuzzywuzzy<\/a> and <a href=\"https:\/\/spacy.io\/\">spaCy<\/a> (<\/span><b>several attendees<\/b><span style=\"font-weight: 400\">) were most common. <\/span><a href=\"https:\/\/www.rebeccajohnson.io\/\"><b>Professor<\/b> <b>Rebecca Johnson <\/b><\/a><span style=\"font-weight: 400\">of Georgetown has similarly been able to fill large gaps in education data by using Topic Modeling, a form of unsupervised clustering.\u00a0<\/span><\/p>\n<p><i><span style=\"font-weight: 400\">N-Grams<\/span><\/i><span style=\"font-weight: 400\"> can be a useful tool in machine learning as well as other methods of clustering, by breaking a string of text into smaller chunks of length <\/span><i><span style=\"font-weight: 400\">N<\/span><\/i><span style=\"font-weight: 400\"> and comparing them to other chunks of the same size, instead of looking at long text strings as a whole. 
The <\/span><b>CJDC-University of Southern California<\/b><span style=\"font-weight: 400\"> joint project and the <\/span><b>Legal Services Corporation<\/b><span style=\"font-weight: 400\"> use <\/span><i><span style=\"font-weight: 400\">N-Grams<\/span><\/i><span style=\"font-weight: 400\"> to account for typos or alternate spellings in data, while <\/span><a href=\"https:\/\/philalegal.org\/\"><b>Philadelphia Legal Assistance<\/b><\/a><span style=\"font-weight: 400\"> uses them to speed up processing.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400\">Another category of clustering focuses on matching and cross-referencing with outside data sources, which is especially effective with geographic information, as in the work of <\/span><a href=\"https:\/\/www.januaryadvisors.com\/\"><b>January Advisors<\/b><\/a><span style=\"font-weight: 400\"> and the <\/span><a href=\"https:\/\/evictionlab.org\/\"><b>Princeton Eviction Lab<\/b><\/a><span style=\"font-weight: 400\">. In an even more novel use, <\/span><a href=\"https:\/\/www.law.uci.edu\/faculty\/fellows\/johnson-raba\/\"><b>Claire Johnson Raba<\/b><\/a><span style=\"font-weight: 400\"> of the University of Illinois Chicago School of Law has been bringing in data from court summons documents to accurately find addresses that are missing in some court records.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400\">The key takeaways from this convening were the need for data standardization, the opportunity presented by novel data sources, and the necessity of preserving privacy. Standardization (such as common data elements and a litigation event ontology) would greatly ease researchers' collaboration with courts and with each other. Accessing data from novel sources, such as court PDFs read with Optical Character Recognition and geographic data drawn from external sources, would both introduce new data and fill gaps in existing data. 
Finally, how sensitive data are managed will be an important frontier going forward; the ability to tokenize and anonymize court data will open more opportunities for working with those data and will also ensure analysis is able to be done ethically when dealing with vulnerable populations.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>On September 19, 2022 the Civil Justice Data Commons gathered over thirty civil justice experts to discuss the tricky problem of how to clean and cluster civil justice data. The [&hellip;]<\/p>\n","protected":false},"author":1654,"featured_media":0,"parent":567,"menu_order":8,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_acf_changed":false,"_price":"","_stock":"","_tribe_ticket_header":"","_tribe_default_ticket_provider":"","_tribe_ticket_capacity":"0","_ticket_start_date":"","_ticket_end_date":"","_tribe_ticket_show_description":"","_tribe_ticket_show_not_going":false,"_tribe_ticket_use_global_stock":"","_tribe_ticket_global_stock_level":"","_global_stock_mode":"","_global_stock_cap":"","_tribe_rsvp_for_event":"","_tribe_ticket_going_count":"","_tribe_ticket_not_going_count":"","_tribe_tickets_list":"[]","_tribe_ticket_has_attendee_info_fields":false,"footnotes":"","_tec_slr_enabled":"","_tec_slr_layout":""},"class_list":["post-1571","page","type-page","status-publish","hentry"],"acf":[],"ticketed":false,"_links":{"self":[{"href":"https:\/\/www.law.georgetown.edu\/tech-institute\/wp-json\/wp\/v2\/pages\/1571","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.law.georgetown.edu\/tech-institute\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.law.georgetown.edu\/tech-institute\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.law.georgetown.edu\/tech-institute\/wp-json\/wp\/v2\/users\/1654"}],"replies":[{"embeddable":true,"href":"https:\/\/www.law.georgetown.edu\/tech-institute\/wp-json\/wp\/v2\/comments?post=1571"}],"version-hist
ory":[{"count":73,"href":"https:\/\/www.law.georgetown.edu\/tech-institute\/wp-json\/wp\/v2\/pages\/1571\/revisions"}],"predecessor-version":[{"id":9197,"href":"https:\/\/www.law.georgetown.edu\/tech-institute\/wp-json\/wp\/v2\/pages\/1571\/revisions\/9197"}],"up":[{"embeddable":true,"href":"https:\/\/www.law.georgetown.edu\/tech-institute\/wp-json\/wp\/v2\/pages\/567"}],"wp:attachment":[{"href":"https:\/\/www.law.georgetown.edu\/tech-institute\/wp-json\/wp\/v2\/media?parent=1571"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}