Finding Datasets

With the increasing number of publicly available datasets, there is a good chance you will find secondary data sources that can support your research. Before embarking on data collection, check whether someone else has already gathered the data you require for your project. Even if the datasets you find prove unsuitable and you need to collect primary data after all, the search can provide valuable insights for your research and help inform your collection efforts.

When looking for datasets, it is helpful to consider the following approaches:

  1. List possible data owners/providers who might have collected the data of interest. This could include government agencies, nonprofit organizations, or other researchers. Then, check their websites for data.
  2. Check for publications, articles, or government reports that cite the underlying dataset to determine where it is located and how it can be accessed (if at all).
  3. Search for data on engines like Google Dataset Search or browse for data on specific data repositories relevant to your topic. A registry of research data repositories is worth exploring if you need to identify relevant repositories in your field.
  4. Consult with the Library. Visit the DREAM Lab's Data Collections page for more information on available data sources and consult with subject librarians.

Regardless of the chosen approach, when you identify relevant data sources, be sure to document their provenance and citation, including a persistent identifier whenever one is available (see: Citing & Persistent Identification). Additionally, examine the licensing agreements associated with the datasets and identify any restrictions that could affect your ability to use the data as you intend. Refer to Ownership & Licensing for more information. If the dataset can be downloaded without restrictions, preserve an unaltered copy of the original data and create backup copies for safekeeping.
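One way to keep an unaltered copy is to store the downloaded file in a separate folder, record a checksum for it, and mark the copy read-only. The following is a minimal sketch of that idea in Python, not a prescribed workflow. The function names `sha256_of` and `preserve_original`, the `raw` folder layout, and the `.sha256` sidecar file are assumptions made for illustration.

```python
import hashlib
import shutil
import stat
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preserve_original(source, raw_dir):
    """Copy `source` into `raw_dir`, record its checksum, make the copy read-only.

    The folder layout and the ".sha256" sidecar file are illustrative choices.
    """
    source = Path(source)
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)

    target = raw_dir / source.name
    if target.exists():
        # Refuse to overwrite a copy that has already been preserved.
        raise FileExistsError(f"{target} already exists")

    shutil.copy2(source, target)  # copy2 also keeps file timestamps

    checksum = sha256_of(target)
    if checksum != sha256_of(source):
        raise OSError("copy does not match the original file")

    # Store the checksum next to the copy so later changes can be detected.
    sidecar = raw_dir / (target.name + ".sha256")
    sidecar.write_text(f"{checksum}  {target.name}\n")

    # Remove write permission to guard against accidental edits.
    target.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return target, checksum
```

Calling `preserve_original("downloads/survey.csv", "project/raw")` (both paths are placeholders) would leave the working file untouched and place a protected copy in `project/raw`. Backups to a second location still need to be made separately.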

Accessing Data

There might be cases where the publicly available data you are interested in is not offered in an easy-to-use format, is constantly updated, or is too large to download through the web interface. In such scenarios, we recommend first reaching out to the data's owner or provider to explore alternative methods for acquiring the data. We also suggest exploring programmatic approaches to automate the process.

Programmatic Access Methods

Web APIs

An API, or Application Programming Interface, is an interface that a server exposes so that you can retrieve and send data using code. APIs are most commonly used to retrieve data. To receive data from an API, you make a request. Requests are used all over the web: your browser sends one every time it loads a page. API requests work the same way: you send a request to an API server for data, and the server sends back a response.
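To make the request and response cycle concrete, here is a minimal sketch that uses only Python's standard library. The endpoint `https://api.example.org/v1/records` and the parameters `q` and `limit` are hypothetical placeholders, and the helper names `build_url` and `fetch_json` are made up for this example. A real API documents its own URLs, parameters, and authentication requirements.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_url(base_url, params):
    """Attach query parameters to an endpoint URL."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url, timeout=30):
    """Send a GET request and decode the JSON body of the response."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint and parameters, for illustration only.
url = build_url(
    "https://api.example.org/v1/records",
    {"q": "rainfall", "limit": 10},
)
print(url)

# Once `url` points at a real API, the data could be retrieved with:
# records = fetch_json(url)
```

The network call is left commented out because the endpoint does not exist. Many APIs also require an API key and limit how often you may send requests, so check the provider's documentation before automating downloads.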

Web Scraping

Web scraping is a method of extracting large amounts of data from websites. Copying the data by hand is possible, but it is usually faster, more efficient, and less error-prone to automate the task. Web scraping allows you to acquire non-tabular or poorly structured data from websites and convert it into a usable format, such as a .csv file or spreadsheet.
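As an illustration of turning page content into a usable format, the sketch below parses an HTML table with Python's built-in `html.parser` module and writes the rows out as CSV text. The `SAMPLE_HTML` string is made-up placeholder content standing in for a downloaded page, and the `TableParser` class is written for this example only. Real pages are usually messier, and many people use dedicated third-party parsing libraries for them.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up placeholder content standing in for HTML downloaded from a site.
SAMPLE_HTML = """
<table>
  <tr><th>name</th><th>value</th></tr>
  <tr><td>Site A</td><td>1</td></tr>
  <tr><td>Site B</td><td>2</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)

# Write the collected rows as CSV text; a file object would work the same way.
buffer = io.StringIO()
csv.writer(buffer).writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

In practice the HTML would come from a request to the website rather than from a string in the script, and the CSV would be written to a file.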

It is important to recognize that, in certain cases, web scraping can be illegal. If the terms and conditions of the website you are scraping specifically prohibit downloading and copying its content, scraping it violates those terms and may have legal consequences. Review a site's terms before you scrape it.

Please be advised that most vendor licenses do not permit mass downloading of data from the UCSB Library's subscription content. Unauthorized data scraping violates the UCSB Library's licenses and will result in the vendor blocking access to the content from the IP address where the downloading takes place. If this happens, the entire UCSB community will be denied access to the specific databases where the mass downloads occurred.

If you have questions regarding Web APIs or Web Scraping, consult with the DREAM Lab at the Library. 
