Exploring the Data Commons at Our Launch Webinar
James Carey | December 16, 2022
The Civil Justice Data Commons and the CJDC Knowledge Base officially opened for business on November 30, 2022!
To celebrate this launch and introduce the world to the available CJDC resources, we held a launch webinar. We presented our vision for the Commons and how that vision has taken shape as an amazing suite of tools for our users. We were also joined by guests who talked about their early experiences using the commons. Finally, we answered many questions from the more than 120 live attendees. Those questions, along with the ones we did not have the chance to address live, are collected with detailed answers at the bottom of this blog post. If you missed the webinar live, you can view a recording of it here.
The webinar started with Principal Investigators Dr. Amy O’Hara from the Massive Data Institute and Professor Tanina Rostain from the Georgetown University Law Center providing an in-depth look into the story of the commons and why it is needed now. After they met at a conference a few years ago, they realized that, although millions of people cycle through thousands of state and local courts in civil cases each year, there was very little data available to understand how people’s lives were affected by civil courts. Cases that have major impacts on people’s lives, such as evictions and debt collection, were functionally impossible to view through data on a state or national level. This was the case even though more courts were collecting digital data and more researchers were interested in examining it. In response to this problem, Dr. O’Hara and Professor Rostain began work on the Civil Justice Data Commons.
Casey Chiappetta of the Pew Charitable Trusts (a funder of the CJDC) was an early test researcher for the commons and spoke on how the commons allowed for easy access to data and made novel research possible. Their findings that Eviction Court Outcomes in Philadelphia Differ by Type of Landlord were made possible by the hundreds of thousands of local eviction records explorable using the CJDC tools.
These records were available in the commons thanks to Jonathan Pyle of Philadelphia Legal Assistance, who had ethically scraped them as a part of his work and was thrilled to see the insights that were possible once researchers were able to explore them. He spoke at the webinar about the seamless record sharing process as a data provider to the commons.
The CJDC Database Administrator Ellen Moriarty provided a whirlwind tour of applying for access to the commons and the powerful tools it provides through the Redivis platform. CJDC fellow James Carey introduced the CJDC Knowledge Base, an extensive resource of prior research, computer code, and model legal agreements launching alongside the commons.
The webinar concluded by inviting everyone interested in exploring the commons to apply for access today and encouraging stakeholders to explore and contribute to our knowledge base, before opening the floor for questions (reproduced below)!
We welcome you to apply for access to the commons here and explore our Knowledge Base here.
Are there any geographic identifiers for each case record? If so, do they differ based on the jurisdiction? What is the timeline covered by your data so far? Will you try to retrieve time series? How complete are the data? Do you have data for all states?
→ General answer: It varies by jurisdiction. We typically have street addresses for evictions, but geographic data on debt cases varies. Data coverage also varies–we don’t have coverage for all 50 states yet, but check back because we’re getting more sites on a continuous basis. We are hoping to release one full state by the end of the year.
Date coverage also varies: Philadelphia goes back to 1969, Franklin goes back to 2021, other datasets fall in between.
Completeness is a complicated issue. More recent years are often more complete. Changes to technology within a court can impact the availability of data and whether data within a single jurisdiction is comparable across the entire available timeline.
If someone has a time series they would like to share, we’d be happy to chat! But we aren’t specifically targeting time series.
Is there a mechanism to ask questions about nuances in the data from the original data providers?
→ The Commons provides clear information about data sources and we endeavor to provide contacts to the original data providers and other researchers working on the data to help untangle nuances, though we cannot guarantee they will have the resources to answer all questions.
I have students who might be able to use this resource to help researchers answer their questions. Have you considered an intake form where researchers can ask questions that could become student Python or R projects?
→ In the near future, a great option for interested students to get involved would be to join our codefest in February 2023. Participants will get short-term access to a limited dataset that they will have the freedom to explore without going through the study approval process. We also welcome researchers to bring students along to participate as collaborators, or students to submit their own study proposals. We hope to provide more opportunities for potential collaborators to connect as the commons grows.
Has there been any resistance from current or potential data submitters and, if so, what was the reason? How was this addressed?
→ While we were not met with resistance, we did learn that courts faced challenges in providing data. In our planning stage, we interviewed upwards of 50 actors, across courts, legal service providers (LSPs), academia/research, and policy nonprofits, to survey their interest in participating in the Commons. As our NSF Planning Grant report details, courts were interested in sharing data with us. However, when we moved to the implementation stage, we quickly discovered that the courts’ challenges in providing data, outlined in this blog post here, overwhelmed them and prevented any imminent data sharing. We’d love to work with courts directly in future years as their data infrastructure processes improve.
What is your vision for where the CJDC will be in one year? Three years? Five years?
→ We are currently processing several more datasets that will be uploaded to the CJDC in the next 2-3 months. In one year, we intend to expand our number of eviction and consumer datasets to 15-20. As time goes on, court modernization will improve (through projects such as these), and so we hope that in three years, a handful of courts will be able to provide data to the Commons directly. In five years, we envision the CJDC as a collaborative space of users sharing best practices to generate civil justice insights that inform policy initiatives and increase access to justice.
When court records are sealed, are those records removed from CJDC datasets? Will we know which or how many records have been removed?
→ Sealed records are removed from datasets. The removal is made in a new version of the dataset, and the previous versions containing those records are then taken down. Differences in the number of records per version will be documented in the version notes for the new release.
Can the data be exported, or is it only available in this tool? Is an API planned?
→ We don’t allow exports or downloads of the data. This is to preserve the privacy of those involved in these cases, and so that data providers can be assured that their data remains in our control. The beauty of the Redivis tool is that everything is self-contained; it makes powerful cloud resources available for researchers using our data. Python and R notebooks and output can be downloaded when research is complete as long as the outputs comply with the guidelines outlined in our Researcher Access Agreement (see p.15).
How do you get your data? How are the data curated and how often are datasets updated?
→ As a repository, our updates are dependent on our data providers. We work with providers to establish secure delivery methods on a timeline that works for them. Most of our datasets are updated regularly, although the exact schedule varies by jurisdiction. When possible, we update monthly during the first week of the month. When we receive data, we do a cleaning pass to make sure the data is usable–making sure it parsed into tables correctly, that variable types are correct, etc. We also update field names so that there is consistency across the commons where possible. We do not perform any significant cleaning or harmonization, but when a data provider has performed these tasks, we include their provided methodology in the dataset overview. We also make sure that all fields have labels and descriptions where appropriate.
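To make the cleaning pass a little more concrete, here is a minimal Python sketch of the two checks mentioned above: bringing field names into a shared convention and confirming that values match their expected types. It is an illustration only, not the CJDC's actual pipeline; the alias table, the field names, and both function names are hypothetical.

```python
from datetime import date

# Hypothetical mapping from provider-specific column names to shared names.
# The real naming conventions used in the commons are not shown in this post.
FIELD_ALIASES = {
    "case_no": "case_number",
    "casenum": "case_number",
    "filed": "filing_date",
    "file_date": "filing_date",
    "zip": "zip_code",
}


def standardize_field_names(row, aliases=FIELD_ALIASES):
    """Lower-case and trim each field name, then apply any known alias."""
    cleaned = {}
    for name, value in row.items():
        key = name.strip().lower()
        cleaned[aliases.get(key, key)] = value
    return cleaned


def find_type_problems(rows, expected_types):
    """Return (row index, field, value) for each value its parser rejects.

    expected_types maps a field name to a callable that raises ValueError
    or TypeError on bad input. Missing and empty values are skipped.
    """
    problems = []
    for index, row in enumerate(rows):
        for field, parser in expected_types.items():
            value = row.get(field)
            if value is None or value == "":
                continue
            try:
                parser(value)
            except (ValueError, TypeError):
                problems.append((index, field, value))
    return problems


# Two made-up records from providers that name their columns differently.
raw_rows = [
    {"Case_No": "A1", "Filed": "2022-11-30"},
    {"casenum": "A2", "file_date": "11/30/2022"},
]
rows = [standardize_field_names(r) for r in raw_rows]
problems = find_type_problems(rows, {"filing_date": date.fromisoformat})
print(rows)
print(problems)
```

The second record is flagged because its filing date is not in ISO format. In a light-touch pass like the one described above, a flag of this kind would prompt a closer look at how the file was parsed rather than any large-scale harmonization.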
What do you mean you have data that’s been “ethically scraped”?
→ “Ethically scraped” data are data that have been scraped according to the court’s rules (e.g., one cannot download more than 30 days of cases at once), that are securely stored, and that are not downloaded in such great quantities at once that the download crashes the court website.
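As a rough illustration of what following such rules can look like in code, here is a minimal Python sketch. It is not the code used by any CJDC data provider. The 30-day limit comes from the example above, while the function names, the pause length, and the `fetch` callable are hypothetical placeholders for whatever a particular court site requires.

```python
import time
from datetime import date, timedelta


def date_windows(start, end, max_days=30):
    """Split the inclusive range [start, end] into windows of at most max_days days."""
    if max_days < 1:
        raise ValueError("max_days must be at least 1")
    windows = []
    current = start
    while current <= end:
        window_end = min(current + timedelta(days=max_days - 1), end)
        windows.append((current, window_end))
        current = window_end + timedelta(days=1)
    return windows


def fetch_in_windows(start, end, fetch, max_days=30, pause_seconds=5.0,
                     sleep=time.sleep):
    """Call fetch(window_start, window_end) once per window, pausing in between.

    fetch is a hypothetical callable supplied by the caller; the pause keeps
    requests spaced out so the court website is not overloaded.
    """
    results = []
    for i, (w_start, w_end) in enumerate(date_windows(start, end, max_days)):
        if i > 0:
            sleep(pause_seconds)
        results.append(fetch(w_start, w_end))
    return results


if __name__ == "__main__":
    # Stand-in for a real request: just report the window that would be fetched.
    def describe(w_start, w_end):
        return f"{w_start.isoformat()} to {w_end.isoformat()}"

    for line in fetch_in_windows(date(2022, 1, 1), date(2022, 3, 1),
                                 describe, pause_seconds=0):
        print(line)
```

Passing `sleep` in as a parameter is a convenience for testing. In real use, the default `time.sleep` and a pause length appropriate to the site would apply.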
Are there plans to involve community or other stakeholders in deciding who can access datasets?
→ Right now, we are using publicly available civil court datasets that have been ethically scraped. These data were already available from court lookup sites to anyone who sought them out. We also held CJDC design workshops with Amazon Web Services that included community members from nonprofits and the access to justice space. However, we are always open to engaging with more community members about the data and how they can be used to strengthen these organizations’ missions. Please email us with any ideas: firstname.lastname@example.org.
Is there a write-up of research on court data availability, or plans to make it available?
→ On our website, we have created a Knowledge Base–a one-stop-shop of civil justice data resources, such as scholarly research articles that use available court data. We also have a forthcoming 50+ state survey of the landscape of court data access! Keep an eye out for it in 2023.
Are there any plans to include racial and ethnic demographic data as well as gender?
→ Most civil courts do not collect race and ethnicity data, unfortunately. The few that do often collect it nonuniformly, or through secondhand observers (such as process servers or courtroom clerks). Even more frustratingly, default rates are high in civil cases, so courts rarely have the opportunity to ask defendants for this information. Our current protocol is to take the datasets as we receive them, with whatever elements the courts collect. That sometimes includes gender but rarely includes race and ethnicity. We are, however, working on an independent research project to append governmental race and ethnicity data to civil court data. We also welcome CJDC users to bring their own demographic datasets into Redivis to link to our data.
Are there limitations on combining the data with zip codes or other identifying information?
→ It varies. Most data already have zip codes. In general, there is no official restriction on combining the CJDC data with other datasets containing identifying information, as long as users abide by the privacy and security terms of our Researcher Access Agreement (see p.15).
Is the focus on expanding the number of jurisdictions in the CJDC or expanding the types of civil court cases–beyond eviction and consumer debt data?
→ For now, our priority is adding jurisdictions to the existing datasets on eviction and consumer debt. In future years, we would love to explore the possibility of expanding to other civil court case types.
How do I know when each dataset was last updated? What happens to my work in Redivis when that happens?
→ When you go to the dataset in Redivis, you will be able to see a “Last Updated” date in the right-hand panel of the overview. There is also a version number next to the dataset name that you can click on to review version notes and to navigate to different versions of the data. When you are working in a Redivis project, there is a flag at the top of the screen that is highlighted when a new version of the dataset has been released. You can toggle between available versions there–if you want the latest data, you can pull it into your work, but if you want to stick with the version you’ve been working on, there is no impact on your work.