CJDC's Case Search Scraping

Although most data in the Civil Justice Data Commons comes from partners such as courts and other civil society organizations, in some places data is also scraped directly. One such place is Franklin County, Ohio: after speaking with the court, we agreed that the most efficient way for them to provide data was for us to scrape their public case search site ourselves.

This Python scraping code is now publicly available on our GitHub, so it can serve as a model for your own scraping if you have permission from a court to scrape their site!

The scraper is written in Python and makes use of the Selenium and BeautifulSoup libraries.
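To give a flavor of how this works, here is a minimal sketch of the Selenium-plus-BeautifulSoup pattern: drive a browser to the search page, submit a query, and parse the rendered results. The URL, form field names, and table markup below are hypothetical placeholders; the real selectors live in the repo.

```python
# Minimal sketch of the Selenium + BeautifulSoup pattern the scraper uses.
# The URL, field names, and table markup are hypothetical placeholders --
# the real selectors are in the CJDC repo.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    # Load the (hypothetical) public case search page.
    driver.get("https://example-court.gov/case-search")

    # Fill in a search field and submit the form.
    driver.find_element(By.NAME, "case_number").send_keys("2023-CV-000123")
    driver.find_element(By.NAME, "submit").click()

    # Hand the rendered HTML to BeautifulSoup for parsing.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for row in soup.select("table.results tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:
            print(cells)
finally:
    driver.quit()
```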

Mine Data from PDF Forms Using Suffolk LIT Lab's FormFyxer

Suffolk Law School's Legal Innovation & Technology Lab has created FormFyxer, an open-source Python tool for mining data from court PDFs. The tool is part of their Document Assembly Line project, which endeavors to help courts transition their paper workflows into modern web apps.
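FormFyxer layers NLP-driven analysis on top of the basics, but the underlying task, pulling fields out of a fillable court PDF, can be illustrated with a short sketch using the pypdf library (not FormFyxer's own API):

```python
# Not FormFyxer itself -- a minimal illustration of the underlying task
# (reading field names and values from a fillable court PDF) using pypdf.
# FormFyxer adds NLP-driven field renaming and analysis on top of this.
from pypdf import PdfReader

reader = PdfReader("complaint_form.pdf")  # hypothetical input file
fields = reader.get_fields() or {}

for name, field in fields.items():
    # Each field dictionary holds its current value under "/V"
    # (None if the field was left blank).
    print(f"{name}: {field.get('/V')}")
```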

Address Cleaning with the CJDC

The Civil Justice Data Commons' GitHub features projects we have developed to help researchers dive into civil justice data. One such project is our "Address Janitor," a Python tool that uses a modified Damerau-Levenshtein distance and geographic databases to clean addresses.
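To illustrate the core idea (not the Address Janitor's actual code), here is a minimal sketch that matches a messy street name to the closest entry in a reference list using Damerau-Levenshtein distance, via the jellyfish library; the street list is a made-up stand-in for a real geographic database:

```python
# Not the Address Janitor's actual code -- a minimal sketch of its core
# idea: match a messy street name to the closest entry in a reference
# list using Damerau-Levenshtein distance (here via the jellyfish library).
import jellyfish

# Hypothetical reference list; in practice this would come from a
# geographic database of known street names.
KNOWN_STREETS = ["MAIN ST", "HIGH ST", "MAPLE AVE", "BROAD ST"]

def clean_street(raw: str, max_distance: int = 2) -> str | None:
    """Return the closest known street name, or None if nothing is close."""
    raw = raw.strip().upper()
    best = min(KNOWN_STREETS,
               key=lambda s: jellyfish.damerau_levenshtein_distance(raw, s))
    if jellyfish.damerau_levenshtein_distance(raw, best) <= max_distance:
        return best
    return None

print(clean_street("MIAN ST"))   # transposition -> "MAIN ST"
print(clean_street("ELM BLVD"))  # too far from anything -> None
```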

Classifying Evictions with Open Justice Oklahoma

Open Justice Oklahoma (OJO), a program of the Oklahoma Policy Institute, aims to provide a complete understanding of how their state's justice system operates.

Their GitHub includes a variety of repos for opening the "black box" of civil justice, including the R code for clustering and classifying eviction data that they presented at the CJDC Clustering Convening ("ojo-at-gu").
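OJO's code is in R; for consistency with the other examples here, the sketch below shows the general shape of the approach in Python (TF-IDF over docket text, then k-means clustering). It is not OJO's pipeline, and the sample docket entries and cluster count are invented for illustration:

```python
# OJO's repo is in R; this is a minimal Python sketch of the general
# idea of clustering case text, not their actual pipeline. The sample
# docket descriptions and cluster count are made up for illustration.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docket_text = [
    "forcible entry and detainer, nonpayment of rent",
    "eviction for nonpayment, judgment for plaintiff",
    "small claims, breach of contract",
    "debt collection, credit card default",
]

# Turn free-text docket entries into TF-IDF vectors, then cluster them.
X = TfidfVectorizer().fit_transform(docket_text)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for text, label in zip(docket_text, labels):
    print(label, text)
```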

Criminal Records Processing with CJARS

The Criminal Justice Administrative Records System (CJARS) project out of the University of Michigan is creating a nationally integrated repository of data following individuals through the criminal justice system.

Their GitHub features the Stata and Python code for their data processing infrastructure, including localization, standardization, entity resolution, and harmonization. This includes the text-based offense classification system they presented at the CJDC Clustering Convening.
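Their actual system lives in their repo, but the general idea of text-based offense classification, mapping free-text charge descriptions to standardized categories with a supervised model, can be sketched in a few lines of Python; the training examples and category labels below are made up:

```python
# Not CJARS's actual classifier -- a minimal sketch of text-based
# offense classification: map free-text charge descriptions to
# standardized categories with a simple supervised model.
# Training examples and category labels here are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_text = ["theft of property", "petty larceny", "assault 2nd degree",
              "battery on officer", "dui first offense", "driving impaired"]
train_labels = ["property", "property", "violent",
                "violent", "dui", "dui"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_text, train_labels)

print(model.predict(["grand larceny auto", "aggravated assault"]))
```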

Uncover Federal Court Data with SCALES

The Systematic Content Analysis of Litigation EventS (SCALES) project was born out of the Northwestern Open Access to Court Records Initiative (NOACRI) and aims to increase the transparency of federal court records.

Their GitHub contains tools for doing so, including Python code to help scrape PACER (the federal court records system) and annotate those records.
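Scraping PACER itself requires an account and credentials, so as a small taste of the annotation side of the problem (not SCALES's actual tooling), here is a sketch that tags free-text docket entries with the litigation events they describe, using made-up regex patterns:

```python
# Not SCALES's tooling -- a minimal sketch of the annotation side of
# the problem: tag free-text docket entries with the litigation events
# they describe. The event patterns here are made-up examples.
import re

EVENT_PATTERNS = {
    "complaint": re.compile(r"\bcomplaint\b", re.IGNORECASE),
    "motion_to_dismiss": re.compile(r"\bmotion to dismiss\b", re.IGNORECASE),
    "summary_judgment": re.compile(r"\bsummary judgment\b", re.IGNORECASE),
}

def annotate(entry: str) -> list[str]:
    """Return the event labels whose patterns match a docket entry."""
    return [label for label, pat in EVENT_PATTERNS.items() if pat.search(entry)]

print(annotate("MOTION to Dismiss for Failure to State a Claim"))
# -> ['motion_to_dismiss']
```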

Use AI to Systematize Court Language with SCALES

SCALES also uses machine learning to automate the interpretation of court docket data, and their models are shared on their Hugging Face page.
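If you want to try models like these, the transformers library's pipeline API is the usual way to load a classifier from the Hub. The model id below is a placeholder; the real model names are listed on SCALES's Hugging Face page:

```python
# A minimal sketch of loading a Hub-hosted classifier with the
# transformers pipeline API. The model id below is a placeholder --
# the actual model names are listed on SCALES's Hugging Face page.
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="some-org/some-docket-model")  # placeholder id
print(classifier("ORDER granting motion for summary judgment"))
```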