Supporting use of computational methods on the Opioid Industry Documents Archive

Welcome to the Opioid Industry Documents Archive (OIDA) Toolbox! This page is designed to help you access the raw data behind OIDA, a digital archive co-created by the University of California, San Francisco and Johns Hopkins University containing millions of documents from the opioid industry that shed light on the root causes of the opioid crisis.

By “raw data” we mean metadata describing the documents, the documents themselves in various file formats, and text extracted from the documents. As described below, parts of the raw data are available through Johns Hopkins University’s SciServer, Amazon Simple Storage Service (Amazon S3), and the Industry Documents Library (IDL).

Need help beyond what’s below? Please contact opioidarchive@jh.edu

What can you do with OIDA data?

Applying computational methods to the documents in OIDA can provide valuable insights into the documents and the opioid crisis. For example:

  • Machine learning and data mining can help reveal hidden patterns, trends, and correlations within the archive, such as how actions by specific people correlated with changes in opioid prescription rates in different regions.
  • Social network analysis can reveal the connections among employees of a company and between doctors, sales reps, pharmacists, drug distributors, and regulators. It can reveal who communicated with whom most frequently, and during which time periods, helping us understand how communication and decision-making work in a large organization.
  • Natural language processing and other linguistic analyses allow for the study of written communication in the modern workplace and in relation to specific actions of these companies.

Work in the cloud with SciServer

You can work with OIDA’s metadata, documents, and extracted text using SciServer, a web-based sandbox for server-side analysis with extremely large datasets using interactive Jupyter Notebooks. We provide sample notebooks that allow you to accomplish a commonly requested task: retrieving a list of OIDA documents matching a query and then downloading those documents for offline use.

For new users

SciServer is well suited to users new to data science, since it requires neither installing software beyond a web browser nor downloading large datasets. You only need to create a free SciServer account.

For expert users

SciServer provides users with a virtual machine image in the cloud, pre-installed with Python and important data analysis packages (including pandas, NumPy, and scikit-learn).

Getting started

See Getting Started with OIDA and SciServer for how to get access to OIDA data through SciServer and access copies of our notebooks for downloading files that match a query. We welcome your suggestions for additional notebooks we might create! Please reach out to us at opioidarchive@jh.edu

Download full text and metadata for OIDA collections

If you want the full text of OIDA documents (but not the page images, PDFs, and native file formats) along with associated metadata for one or more collections within OIDA, the simplest option is to download the appropriate collection-level ZIP file(s) from the IDL.

Download from and work with data on AWS

OIDA’s raw data is made available through the AWS Open Data Sponsorship Program: see our entry on the Registry of Open Data on AWS. OIDA data can be downloaded for local use or used in the cloud with products such as Amazon SageMaker and Google Colab.

For more information, see the documentation for OIDA Data on AWS. We welcome your suggestions for how to improve our documentation. Please reach out to us at opioidarchive@jh.edu

Directly query the Industry Documents Library’s Apache Solr server

Many metadata fields for documents in OIDA (and the rest of the IDL), plus the full text appearing in documents, can be searched using the IDL’s Solr API. Metadata for matching documents can be retrieved in these formats: XML, JSON, Python, Ruby, PHP array, and CSV.

Metadata for individual documents

To access a document’s metadata, query the IDL’s Apache Solr server with the document’s ID. This unique, 8-character alphanumeric ID consists of four letters followed by four digits, e.g., flpp0234.
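If you are handling many IDs in a script, it can help to check their shape before sending any requests. The following is a minimal sketch of such a check based on the format described above. The helper name is hypothetical, and the restriction to lowercase letters is an assumption drawn from the example ID flpp0234.

```python
import re

# Four letters followed by four digits, as described above.
# Assumption: letters are lowercase ASCII, as in the example "flpp0234".
DOC_ID_PATTERN = re.compile(r"[a-z]{4}[0-9]{4}")


def is_valid_doc_id(doc_id: str) -> bool:
    """Return True if doc_id has the four-letters-then-four-digits shape."""
    return DOC_ID_PATTERN.fullmatch(doc_id) is not None
```

A check like this only confirms the format of an ID; it cannot tell you whether a document with that ID exists in the archive.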

For example, to retrieve the metadata for the document with ID flpp0234, use the following HTTPS request: https://metadata.idl.ucsf.edu/solr/ltdl3/query?q=id:flpp0234. By default, the response is returned as XML.
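The same request can be made from a script. The sketch below uses only the Python standard library and asks for JSON through Solr's standard `wt` (response writer) parameter. The function names are hypothetical, and the code assumes the server returns the usual Solr JSON layout, with matching records under `response` and then `docs`; check an actual response before relying on that.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example request above.
SOLR_QUERY_URL = "https://metadata.idl.ucsf.edu/solr/ltdl3/query"


def build_id_query_url(doc_id: str, wt: str = "json") -> str:
    """Build the request URL for a single document ID (hypothetical helper)."""
    params = {"q": f"id:{doc_id}", "wt": wt}
    # urlencode percent-encodes the colon; Solr decodes it on arrival.
    return f"{SOLR_QUERY_URL}?{urlencode(params)}"


def fetch_document_metadata(doc_id: str, timeout: float = 30.0) -> Optional[dict]:
    """Fetch one document's metadata, or None if nothing matched.

    Assumes the standard Solr JSON layout: {"response": {"docs": [...]}}.
    """
    with urlopen(build_id_query_url(doc_id), timeout=timeout) as resp:
        payload = json.load(resp)
    docs = payload.get("response", {}).get("docs", [])
    return docs[0] if docs else None
```

Calling `fetch_document_metadata("flpp0234")` would send the request shown above with `wt=json` appended. The fields present in the returned record depend on the archive's metadata schema, so inspect the keys of a real record rather than assuming particular names.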

Searching the full text of documents

The same basic query structure can be used to run a keyword search across all of OIDA and retrieve metadata about matching documents. For example, to search for the word “addiction” in all OIDA documents within the IDL, use the following: https://metadata.idl.ucsf.edu/solr/ltdl3/query?q=(addiction AND industry:Opioids). A web browser encodes the spaces in this URL automatically; a script should URL-encode the query string itself.

Additional metadata fields can be added with Boolean logic for advanced searching. For more information and further examples, please see the IDL’s Solr API documentation.
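As an illustration of combining a keyword with field clauses, the sketch below builds a URL-encoded search request. The helper name is hypothetical. The only field clause used is `industry:Opioids`, taken from the example above; any other field names should come from the IDL’s Solr API documentation. The `rows` and `start` arguments are standard Solr paging parameters, and whether the IDL server limits them is not something this sketch knows.

```python
from urllib.parse import urlencode

# Endpoint taken from the example request above.
SOLR_QUERY_URL = "https://metadata.idl.ucsf.edu/solr/ltdl3/query"


def build_search_url(keyword, filters=("industry:Opioids",),
                     rows=10, start=0, wt="json"):
    """Build a full-text search URL (hypothetical helper).

    keyword: term to search for, e.g. "addiction"
    filters: extra field:value clauses, each joined to the keyword with AND
    rows/start: standard Solr paging parameters
    wt: standard Solr response-format parameter
    """
    clauses = [keyword, *filters]
    query = "(" + " AND ".join(clauses) + ")"
    params = {"q": query, "wt": wt, "rows": rows, "start": start}
    # urlencode turns spaces into "+" and escapes parentheses and colons.
    return f"{SOLR_QUERY_URL}?{urlencode(params)}"
```

For instance, `build_search_url("addiction")` produces the encoded form of the example query above, and `build_search_url("addiction", start=10)` would request the next page of ten results.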

Have you used OIDA data?

We’d love to hear from you about how you’ve used OIDA data and possibly share your work with others on this website! Please reach out to us at opioidarchive@jh.edu

Get updates about the OIDA Toolbox

Want to receive occasional emails about new tools and features in the OIDA Toolbox? Please contact us at opioidarchive@jh.edu to be added to our mailing list.

The OIDA Toolbox has been made possible in part with support from the UCSF-JHU Opioid Industry Documents Archive.
