Include non-traditional data in the catalog of government data sets and allow agencies to provide knowledge of the existence of these time series data in a systematic way. #72

Open
sofianef opened this issue Jun 29, 2023 · 3 comments
Labels: collection, Data Cube, Final Review, Requirements Request, Statistical Domain, Time Series

Comments

@sofianef
Collaborator

Creator Name: Dan Gillman
Creator Contact Information: Information Scientist, [email protected]
Creator Affiliation: Office of Survey Methods Research, US Bureau of Labor Statistics

Requirement(s)

The requirements for describing time series data are in the cited paper below and here: https://iassistquarterly.com/index.php/iassist/article/view/1038.

For time series, the notion of a data set in the traditional sense must be altered. It does not fit – see problem statement below.

Instead, the data needs to be thought of as a multi-dimensional structure (n-cube) that grows over time. This structure is defined by a measure (see the paper), a set of dimensions (see the paper), and an expanding time interval beginning at some fixed date. The n-cube is identified by its measure together with all the dimensions that apply. This means many time series will be described together.

For example, take the urban Consumer Price Index. It is subdivided by metro area and product, so the dimensions are metro area (MSA) and product category (food, clothing, etc.). Each combination of product and metro area produces a different series, which yields many thousands of series. Even with this approach, there will still be many thousands of different kinds of series, resulting in many thousands of n-cubes. There is probably a way to reduce this complexity further.
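
To make the n-cube idea concrete, here is a minimal Python sketch. It is not taken from the cited paper or from any BLS system: the class names `Dimension` and `NCube`, the placeholder metro areas, and the start date are all assumptions chosen for illustration. It shows how one measure plus a set of dimensions describes many series at once, with one series per combination of categories.

```python
from dataclasses import dataclass
from datetime import date
from itertools import product


@dataclass(frozen=True)
class Dimension:
    """A named dimension with a finite set of categories (hypothetical type)."""
    name: str
    categories: tuple[str, ...]


@dataclass(frozen=True)
class NCube:
    """One measure, its dimensions, and the fixed start of a growing time interval."""
    measure: str
    dimensions: tuple[Dimension, ...]
    start: date

    def series_keys(self):
        """Yield one key per series: one category chosen from each dimension."""
        names = [d.name for d in self.dimensions]
        for combo in product(*(d.categories for d in self.dimensions)):
            yield dict(zip(names, combo))

    def series_count(self) -> int:
        """Number of series described by this single n-cube."""
        count = 1
        for d in self.dimensions:
            count *= len(d.categories)
        return count


# Placeholder categories and start date; real CPI dimensions are far larger.
cpi_u = NCube(
    measure="Consumer Price Index, urban consumers",
    dimensions=(
        Dimension("metro_area", ("MSA A", "MSA B", "MSA C")),
        Dimension("product", ("food", "clothing")),
    ),
    start=date(2000, 1, 1),
)

keys = list(cpi_u.series_keys())
print(cpi_u.series_count(), "series described by one catalog entry")
```

Under this sketch a catalog would hold one entry per n-cube rather than one entry per series, which is the reduction the requirement is after.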

Without such an approach, a lot of government data may be missed or handled in an inefficient way.

Problem Statement

  • Much of the data BLS (US Bureau of Labor Statistics) produces for the public are in the form of time series. There are over 50 million such series on the BLS web site. These series follow a simple descriptive model described in https://iassistquarterly.com/index.php/iassist/article/view/1038.
  • Current access to time series data at BLS is based on knowledge of existing measures. Uninformed users interested in finding data are limited to extensive searching and digging through the BLS web site. Answers to questions such as what data BLS has for both occupations and states are not readily available.
  • Further, all the series BLS produces are extended on a recurring basis. For example, the US Unemployment Rate is updated monthly. Each new measurement is appended to the existing files. Moreover, the files containing the data for each series are intermixed with data from other similarly defined series. A fixed data set does not exist for each series.
  • Each time series is defined by a measure (a quantitative variable), a set of dimensions (each dimension contains a finite set of categories), an initial time stamp, and an update interval. These are defined by a data set in the traditional formulation under Data.Gov. These defining characteristics are also described in the paper linked above.
  • What Data.Gov expects for its data set catalog and what BLS provides are quite different, and this mismatch makes it almost impossible to describe the vast landscape of BLS data there.
  • Other US statistical agencies produce time series, too. Generating a common framework for time series data is an opportunity for the government to provide common access to this large landscape.
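
As a rough illustration of what a systematic description could enable, the sketch below records the four defining characteristics listed above (measure, dimensions, initial time stamp, update interval) and answers the "both occupations and states" question by filtering on dimension names. The `SeriesFamily` type, the `families_with` helper, and every catalog entry are hypothetical; they do not describe actual BLS holdings.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SeriesFamily:
    """Descriptor for a group of series sharing a measure and dimensions."""
    measure: str
    dimensions: frozenset
    first_period: date      # initial time stamp
    update_interval: str    # e.g. "monthly", "annual"


def families_with(catalog, *required):
    """Return the descriptors whose dimensions include every required name."""
    needed = set(required)
    return [f for f in catalog if needed <= f.dimensions]


# Invented entries for illustration only.
catalog = [
    SeriesFamily("example wage measure",
                 frozenset({"occupation", "state"}), date(2001, 1, 1), "annual"),
    SeriesFamily("example price measure",
                 frozenset({"metro_area", "product"}), date(1990, 1, 1), "monthly"),
    SeriesFamily("example employment measure",
                 frozenset({"occupation", "state", "industry"}), date(2005, 1, 1), "monthly"),
]

matches = families_with(catalog, "occupation", "state")
for family in matches:
    print(family.measure, sorted(family.dimensions), family.update_interval)
```

The point of the sketch is that a user needs no prior knowledge of measure names: the query is expressed purely in terms of dimensions.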

Target Audience / Stakeholders

There are two target audiences:

  1. The agencies producing time series data, including:
    a. Subject matter experts
    b. Metadata experts
  2. The potential users in the public for time series data

Intended Uses / Use Cases

The intended use cases include the following:

  • Include non-traditional data in the catalog of government data sets
  • Allow agencies to provide knowledge of the existence of these time series data in a systematic way.

Existing Approaches - Optional

BLS provides several avenues for accessing and describing time series data. Some rather easy questions (see above) are not readily answerable. BLS maintains access to data through:

  1. Download directories of ASCII encoded data
  2. Several online tools for extracting series (at least 3 different ways)
  3. The agency API
  4. Multiple tables downloadable for current series data

These approaches mostly depend on prior knowledge of what the series cover.
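
The dependence on prior knowledge shows up clearly with the API route. The sketch below only builds a request and does not send it. The endpoint URL, the JSON field names (`seriesid`, `startyear`, `endyear`), and the series ID `LNS14000000` for the seasonally adjusted unemployment rate reflect my understanding of the BLS Public Data API v2 and should be checked against the BLS API documentation. The helper name `build_bls_request` is hypothetical.

```python
import json
import urllib.request

# Assumed endpoint for the BLS Public Data API v2; verify before use.
BLS_API_URL = "https://api.bls.gov/publicAPI/v2/timeseries/data/"


def build_bls_request(series_ids, start_year, end_year):
    """Build (but do not send) a POST request for the given series IDs."""
    payload = {
        "seriesid": list(series_ids),
        "startyear": str(start_year),
        "endyear": str(end_year),
    }
    return urllib.request.Request(
        BLS_API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The caller must already know the series ID; nothing here helps discover it.
req = build_bls_request(["LNS14000000"], 2022, 2023)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it. Either way, the series ID has to come from somewhere else, which is the discovery gap described above.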

Additional context, comments, or links - Optional

Original Email Submission:
DCAT-US-3-Requirements-BLS.docx

@sofianef sofianef added the Requirements Request Proposed requirements based on user and stakeholder needs label Jun 29, 2023
@fellahst fellahst added Statistical Domain issues pertaining to statistical data Time Series Issues related to Time Series Data Cube Issues related to Data Cube labels Jun 29, 2023
@fellahst fellahst added the collection issues related to collection of data label Jan 26, 2024
@fellahst
Collaborator

fellahst commented Jan 31, 2024

This requirement is addressed in DCAT-US 3.0. See DatasetSeries usage guide.

@fellahst fellahst added the Final Review Tagged for final review before closing label Jan 31, 2024
@TDabolt

TDabolt commented Jan 31, 2024

@fellahst pls send response to Dan G and copy M.R and I for confirmation the solution meets the requirement

@mrratcliffe

@fellahst @TDabolt Do we want to hold off closing this until Dan G has responded?
