Block Featured Image

Supporting use of computational methods on the Opioid Industry Documents Archive

Short Separator

Welcome to the Opioid Industry Documents Archive (OIDA) Toolbox! This page is designed to help you access the raw data behind OIDA, a digital archive co-created by the University of California, San Francisco and Johns Hopkins University containing millions of documents from the opioid industry that shed light on the root causes of the opioid crisis.

By “raw data” we mean metadata describing the documents, the documents themselves in various file formats, and text extracted from the documents. As described below, the raw data are available through Johns Hopkins University’s SciServer and Amazon Simple Storage Service (Amazon S3), and the metadata can be queried and downloaded in bulk through the Industry Documents Library (IDL).

FeatureSciServerPrepared datasets from IDLAWSInternet ArchiveIDL Solr API
Query metadata and full textYesYes (using Parquet)Yes
Download metadataYesWhole collectionsYes (in Parquet format)YesYes
Download full textYesWhole collectionsYes
Download native formats and page imagesYes except audio and videoNoYes except audio and videoAudio and video
Sample notebooks/codeAvailable: showing flexible query via Solr API and bulk downloading, plus basic text analysis and visualizationAccess and sample code provided
EnvironmentCloud-based (everything included)Bring your own cloud instanceCommand line or PythonAPI-based
File system accessYesYes
Metadata formatsParquet (plus other formats through Solr API)Parquet onlyXML and SQLiteXML, JSON, Python, Ruby, PHP, and CSV
“Batteries included”High (cloud-based, ready to use)Medium (prepared datasets)Low (requires own setup)Medium (basic technologies)Medium (API access)

Due to the need to ensure integrity and appropriate redaction of newly released documents, there can be a delay of a few weeks between when documents are released on the IDL website and when they are available in the OIDA Toolbox as raw data.

Need help beyond what’s below? Please contact [email protected]

What can you do with OIDA data?

Applying computational methods to the documents in OIDA can provide valuable insights into the documents and the opioid crisis. For example:

  • Machine learning and data mining can help reveal hidden patterns, trends, and correlations within the archive, such as how actions by specific people correlated with changes in opioid prescription rates in different regions.
  • Social network analysis can reveal the connections among employees of a company and between doctors, sales reps, pharmacists, drug distributors, and regulators. It can reveal who communicated with whom most frequently, and during which time periods, helping us understand how communication and decision-making work in a large organization.
  • Natural language processing and other linguistic analysis allows for study of written communication in the modern workplace and in relation to specific actions of these companies.

Here is a lexical dispersion plot for a list of common terms in OIDA as found in a sample of 1,000 documents and generated using one of the sample Jupyter notebooks provided through SciServer.

A lexical dispersion plot for some common terms in OIDA documents based on a sample of 1,000 documents
A lexical dispersion plot for some common terms in OIDA documents based on a sample of 1,000 documents. Note that “sackler” was not used over time similarly to the other terms.

Work in the cloud with SciServer

You can work with OIDA’s metadata, documents (except audio and video), and extracted text using SciServer, a web-based sandbox for server-side analysis with extremely large datasets using interactive Jupyter notebooks. We provide sample notebooks that allow you to accomplish a commonly requested task: retrieving a list of OIDA documents matching a query and then downloading those documents for offline use (see a sample set of 4 documents). We also provide a notebook that shows how to perform some basic text analysis and visualization on a sample of 1,000 documents.

SciServer does not require installing additional software beyond a web browser nor downloading big datasets, but working with OIDA’s sample Jupyter notebooks does require some basic familiarity with Python. For performing additional data analysis, SciServer provides users with a virtual machine image in the cloud, pre-installed with Python and important data analysis packages (including Pandas, NumPy, and SciKit-Learn).

Getting started

See Getting Started with OIDA and SciServer for how to get access to OIDA data through SciServer and use our Jupyter notebooks within SciServer for downloading files that match a query and for performing basic text analysis and visualization. You can edit these notebooks to adapt them to your needs. We welcome your suggestions for additional notebooks we might create to help you get started with analyzing OIDA’s raw data; please reach out to us at [email protected]

Download full text and metadata for OIDA collections

If you want the full text of OIDA documents (but not the page images, PDFs, and native file formats) along with associated metadata for one or more collections within OIDA, the simplest option for getting these is to download the appropriate collection-level ZIP file(s) from the IDL.

Download from and work with data on AWS

OIDA’s raw data (except audio and video) is made available through the AWS Open Data Sponsorship Program: see our entry on the Registry of Open Data on AWS. OIDA data can be downloaded for local use or used in the cloud with products such as AWS SageMaker and Google CoLab.

For more information, see the documentation for OIDA Data on AWS. We welcome your suggestions for how to improve our documentation. Please reach out to us at [email protected]

Download audio and video from the Internet Archive

OIDA’s audio and video files are stored in IDL collections in the Internet Archive:

The Internet Archive provides two methods for bulk downloads of content.

Directly query the Industry Document Library’s Apache Solr server

Many metadata fields for documents in OIDA (and the rest of the IDL), plus the full text appearing in documents, can be searched using the IDL’s Solr API. Metadata for matching documents can be retrieved in these formats: XML, JSON, Python, Ruby, PHP array, and CSV.

Metadata for individual documents

To access a document’s metadata, query the IDL’s Apache Solr server with the document’s ID. This unique, 8-character alphanumeric ID consists of four letters followed by four digits, e.g., flpp0234.

For example, to return the information of the document with ID flpp0234, use the following HTTPS request: https://metadata.idl.ucsf.edu/solr/ltdl3/query?q=id:flpp0234. The default response returns XML data.

Searching the full text of documents

The same basic query structure can be used to keyword search all OIDA and retrieve metadata about matching documents. For example, to search for the word “addiction” in all OIDA documents within the IDL, use the following: https://metadata.idl.ucsf.edu/solr/ltdl3/query?q=(addiction AND industry:Opioids).

Additional metadata fields can be added with Boolean logic for advanced searching. For more information and further examples, please see the IDL’s Solr API documentation.

Have you used OIDA data?

We’d love to hear from you about how you’ve used OIDA data and possibly share your work with others on this website! Please reach out to us at [email protected]

Get updates about the OIDA Toolbox

Want to receive occasional emails about new tools and features in the OIDA Toolbox? Please contact us at [email protected] to be added to our mailing list.

Short Separator

The OIDA Toolbox has been made possible in part with support from the UCSF-JHU Opioid Industry Documents Archive.

UCSF-JHU joint logo