Data & Code
In addition to the data in the Civil Justice Data Commons, researchers may need to access data from other sources, and once they have that data, they need code to analyze it. The resources below offer valuable examples of both.
Although most data in the Civil Justice Data Commons comes from partners such as courts and other civil society organizations, in some places we also scrape data directly. One such area is Franklin County, Ohio, where, after speaking with the court, we agreed that the most efficient way for them to provide data was for us to scrape their public site ourselves.
The scraper, written in Python using the Selenium and BeautifulSoup libraries, is now publicly available on our GitHub, so it can serve as a model for your own scraping if you have permission from a court to scrape their site!
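The general pattern, Selenium to drive the browser and BeautifulSoup to parse the rendered HTML, looks roughly like the sketch below. The URL, element IDs, and table layout are hypothetical placeholders, not the actual Franklin County site or our scraper's code.

```python
# A minimal sketch of the Selenium + BeautifulSoup pattern. The URL,
# element IDs, and table layout are hypothetical, not the actual
# Franklin County court site or the CJDC scraper.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    # Load a (hypothetical) public case-search page and submit a query.
    driver.get("https://example-court.gov/case-search")
    driver.find_element(By.ID, "case-number").send_keys("2023 CV 000123")
    driver.find_element(By.ID, "search-button").click()

    # Wait for the results table to render before parsing.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "results"))
    )

    # Hand the rendered HTML to BeautifulSoup for parsing.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for row in soup.select("#results tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if cells:
            print(cells)
finally:
    driver.quit()
```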
Suffolk Law School's Legal Innovation & Technology Lab has created "Form Fyxer," an open source Python tool for mining data from court PDFs. The tool is part of their Document Assembly Line project, which endeavors to help courts transition their paper workflows into modern web apps.
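Form Fyxer itself offers a much richer feature set, but the core task, extracting the text layer of a court PDF so it can be mined, can be sketched with a generic library such as pdfminer.six. This is not Form Fyxer's API; the file name and regex are hypothetical.

```python
# A generic sketch of mining text from a court PDF using pdfminer.six.
# This is not Form Fyxer's API; the file name and pattern are made up.
import re
from pdfminer.high_level import extract_text

# Pull the raw text layer out of the PDF.
text = extract_text("sample_court_form.pdf")

# Example: find case-number-like strings (pattern is illustrative only).
case_numbers = re.findall(r"\b\d{4}\s?CV\s?\d{6}\b", text)
print(case_numbers)
```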
The Civil Justice Data Commons' GitHub features projects we have developed to help researchers dive into civil justice data. One such project is our "Address Janitor," a Python tool that uses a modified Damerau-Levenshtein distance and geographic databases to clean addresses.
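To illustrate the underlying idea (not the Address Janitor's actual code), the sketch below matches messy addresses against a small reference list using the optimal-string-alignment variant of Damerau-Levenshtein distance; the reference addresses are made up.

```python
# A sketch of fuzzy address matching with Damerau-Levenshtein distance
# (optimal string alignment variant). This illustrates the idea behind
# the Address Janitor; it is not the tool's actual code, and the
# reference addresses below are invented.

def dl_distance(a: str, b: str) -> int:
    """Optimal-string-alignment Damerau-Levenshtein distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

def clean_address(raw: str, reference: list[str]) -> str:
    """Return the reference address closest to the raw input."""
    return min(reference, key=lambda ref: dl_distance(raw.upper().strip(), ref))

reference = ["600 NEW JERSEY AVE NW", "120 MAIN ST", "45 ELM STREET"]
print(clean_address("600 New Jersy Ave NW", reference))
# -> "600 NEW JERSEY AVE NW"
```

In practice a tool like this would also validate the winning match against a geographic database; the distance metric alone only finds the nearest candidate, not whether that candidate is a real address.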
Open Justice Oklahoma (OJO), a program of the Oklahoma Policy Institute, aims to provide a complete understanding of how Oklahoma's justice system operates.
Their GitHub includes a variety of repos for opening the "black box" of civil justice, including the R code for clustering and classifying eviction data that they presented at the CJDC Clustering Convening ("ojo-at-gu").
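OJO's code is in R; the sketch below shows the same general approach in Python: vectorize short case descriptions with TF-IDF and group them with k-means. The example records and cluster count are made up, not OJO's data or method details.

```python
# A rough Python sketch of clustering case descriptions: vectorize the
# text with TF-IDF, then group with k-means. OJO's actual code is in R;
# the example records here are invented.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "forcible entry and detainer nonpayment of rent",
    "eviction for nonpayment of rent",
    "holdover tenant lease expiration",
    "lease expired tenant refused to vacate",
]

# Turn each description into a TF-IDF vector.
vectors = TfidfVectorizer().fit_transform(descriptions)

# Cluster into two groups (nonpayment vs. holdover, in this toy data).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for desc, label in zip(descriptions, labels):
    print(label, desc)
```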
The Criminal Justice Administrative Records System (CJARS) project out of the University of Michigan is creating a nationally integrated repository of data following individuals through the criminal justice system.
Their GitHub features Stata and Python code for their data processing infrastructure, covering localization, standardization, entity resolution, and harmonization. This includes the text-based offense classification system they presented at the CJDC Clustering Convening.
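CJARS's actual classifier is far more sophisticated, but the basic shape of text-based offense classification, mapping free-text charge descriptions onto standard categories, can be sketched as follows. The training examples, labels, and model choice here are illustrative assumptions, not CJARS's system.

```python
# A toy sketch of text-based offense classification: map free-text
# charge descriptions onto standard categories. This is not CJARS's
# actual system; the training examples and labels are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

charges = [
    "THEFT OF PROPERTY UNDER $500", "PETIT LARCENY",
    "ASSAULT IN THE SECOND DEGREE", "AGG ASSAULT W/ DEADLY WEAPON",
    "POSS CONTROLLED SUBSTANCE", "POSSESSION OF MARIJUANA",
]
labels = ["theft", "theft", "assault", "assault", "drug", "drug"]

# Character n-grams cope well with the abbreviations in charge text.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(charges, labels)
print(model.predict(["POSS OF CONTROLLED SUBST", "LARCENY"]))
```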
The Systematic Content Analysis of Litigation EventS (SCALES) was born out of the Northwestern Open Access to Court Records Initiative (NOACRI) and aims to increase the transparency of federal court records.
Their GitHub contains tools for doing so, including Python code for scraping PACER (the federal court records system) and annotating those records.
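The SCALES tools themselves live in their repos; a much-simplified sketch of one common annotation step, tagging docket entries by type with regular expressions, might look like this. The patterns and sample entries are illustrative only.

```python
# A simplified sketch of annotating docket entries by type with regular
# expressions. SCALES's actual tools are far more extensive; the
# patterns and sample entries here are illustrative only.
import re

ENTRY_PATTERNS = {
    "motion":   re.compile(r"\bmotion\b", re.IGNORECASE),
    "order":    re.compile(r"\border\b", re.IGNORECASE),
    "judgment": re.compile(r"\bjudgment\b", re.IGNORECASE),
}

def annotate(entry: str) -> list[str]:
    """Return the entry types whose patterns match this docket text."""
    return [name for name, pat in ENTRY_PATTERNS.items() if pat.search(entry)]

docket = [
    "MOTION to Dismiss filed by Defendant",
    "ORDER granting motion to dismiss",
    "JUDGMENT entered in favor of Defendant",
]
for entry in docket:
    print(annotate(entry), "-", entry)
```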
SCALES also uses machine learning to automate the interpretation of court docket data, and their models are shared on their Hugging Face page.
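Loading a shared model like this follows the standard pattern of the Hugging Face transformers library; the model identifier below is a placeholder, not one of SCALES's actual model names.

```python
# The standard pattern for loading a shared Hugging Face model with the
# transformers library. The model identifier below is a placeholder,
# not one of SCALES's actual model names.
from transformers import pipeline

# Download the model and tokenizer from the Hugging Face Hub.
classifier = pipeline("text-classification", model="org-name/docket-model")

# Classify a (made-up) docket entry.
print(classifier("MOTION for summary judgment filed by Plaintiff"))
```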