data-sheets-schema

A LinkML schema for Datasheets for Datasets.

URI: https://w3id.org/bridge2ai/data-sheets-schema

Name: data-sheets-schema

Classes

Class Description
FormatDialect Additional format information for a file
NamedThing A generic grouping for any identifiable entity
        DatasetProperty Represents a single property of a dataset, or a set of related properties
                AddressingGap Was there a specific gap that needed to be filled by creation of the dataset?
                CleaningStrategy Was any cleaning of the data done (e
                CollectionConsent Did the individuals in question consent to the collection and use of their da...
                CollectionMechanism What mechanisms or procedures were used to collect the data (e
                CollectionNotification Were the individuals in question notified about the data collection? If so, p...
                CollectionTimeframe Over what timeframe was the data collected, and does this timeframe match the...
                Confidentiality Does the dataset contain data that might be confidential (e
                ConsentRevocation If consent was obtained, were the consenting individuals provided with a mech...
                ContentWarning Does the dataset contain any data that might be offensive, insulting, threate...
                Creator Who created the dataset (e
                DataAnomaly Are there any errors, sources of noise, or redundancies in the dataset?
                DataCollector Who was involved in the data collection (e
                DataProtectionImpact Has an analysis of the potential impact of the dataset and its use on data su...
                Deidentification Is it possible to identify individuals in the dataset, either directly or ind...
                DirectCollection Indicates whether the data was collected directly from the individuals in que...
                DiscouragedUse Are there tasks for which the dataset should not be used?
                DistributionDate When will the dataset be distributed?
                DistributionFormat How will the dataset be distributed (e
                Erratum Is there an erratum? If so, please provide a link or other access point
                EthicalReview Were any ethical or compliance review processes conducted (e
                ExistingUse Has the dataset been used for any tasks already?
                ExportControlRegulatoryRestrictions Do any export controls or other regulatory restrictions apply to the dataset ...
                ExtensionMechanism If others want to extend/augment/build on/contribute to the dataset, is there...
                ExternalResource Is the dataset self-contained or does it rely on external resources (e
                FundingMechanism Who funded the creation of the dataset? If there is an associated grant, plea...
                FutureUseImpact Is there anything about the dataset's composition or collection that might im...
                HumanSubjectCompensation Information about compensation or incentives provided to human research parti...
                HumanSubjectResearch Information about whether the dataset involves human subjects research and wh...
                InformedConsent Details about informed consent procedures used in human subjects research
                Instance What do the instances that comprise the dataset represent (e
                InstanceAcquisition Describes how data associated with each instance was acquired (e
                IPRestrictions Have any third parties imposed IP-based or other restrictions on the data ass...
                LabelingStrategy Was any labeling of the data done (e
                LicenseAndUseTerms Will the dataset be distributed under a copyright or other IP license, and/or...
                Maintainer Who will be supporting/hosting/maintaining the dataset?
                MissingInfo Is any information missing from individual instances? (e
                OtherTask What other tasks could the dataset be used for?
                ParticipantPrivacy Information about privacy protections and anonymization procedures for human ...
                PreprocessingStrategy Was any preprocessing of the data done (e
                Purpose For what purpose was the dataset created?
                RawData Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data...
                Relationships Are relationships between individual instances made explicit (e
                RetentionLimits If the dataset relates to people, are there applicable limits on the retentio...
                SamplingStrategy Does the dataset contain all possible instances, or is it a sample (not neces...
                SensitiveElement Does the dataset contain data that might be considered sensitive (e
                Splits Are there recommended data splits (e
                Subpopulation Does the dataset identify any subpopulations (e
                Task Was there a specific task in mind for the dataset's application?
                ThirdPartySharing Will the dataset be distributed to third parties outside of the entity (e
                UpdatePlan Will the dataset be updated (e
                UseRepository Is there a repository that links to any or all papers or systems that use the...
                VariableMetadata Metadata describing an individual variable, field, or column in a dataset
                VersionAccess Will older versions of the dataset continue to be supported/hosted/maintained...
                VulnerablePopulations Information about protections for vulnerable populations in human subjects re...
        Grant The name and/or identifier of the specific mechanism providing monetary suppo...
        Information Grouping for datasets and data files
                Dataset A single component of related observations and/or information that can be rea...
                        DataSubset A subset of a dataset, likely containing multiple files of multiple potential...
                DatasetCollection A collection of related datasets, likely containing multiple files of multipl...
        Organization Represents a group or organization
                Grantor The name and/or identifier of the organization providing monetary support or...
        Person An individual human being
        Software A software program or library
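
The indentation in the class table encodes inheritance: each class is a child of the nearest less-indented class above it. As a minimal sketch, the parent links for a few of the listed classes can be written out by hand and walked to recover a class's ancestry. The `PARENT` mapping and the `ancestors` helper are illustrative names, not part of the schema or of any LinkML tooling.

```python
# Parent links transcribed by hand from the indentation of the class table.
# Only a handful of classes are included; the full table lists many more.
PARENT = {
    "DatasetProperty": "NamedThing",
    "Purpose": "DatasetProperty",
    "Grant": "NamedThing",
    "Information": "NamedThing",
    "Dataset": "Information",
    "DataSubset": "Dataset",
    "DatasetCollection": "Information",
    "Organization": "NamedThing",
    "Grantor": "Organization",
    "Person": "NamedThing",
    "Software": "NamedThing",
}


def ancestors(cls: str) -> list[str]:
    """Return the chain of parents, nearest first, ending at the root."""
    chain = []
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain


print(ancestors("DataSubset"))     # ['Dataset', 'Information', 'NamedThing']
print(ancestors("FormatDialect"))  # [] (listed at the top level, no parent)
```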

Slots

Slot Description
acquisition_methods
addressing_gaps
affiliation The organization(s) to which the person belongs in the context of this datase...
annotation_platform Platform or tool used for annotation (e
annotations_per_item Number of annotations collected per data item
annotator_demographics Demographic information about annotators, if available and relevant (e
anomalies
anonymization_method What methods were used to anonymize or de-identify participant data? Include ...
archival Indication whether official archival versions of external resources are inclu...
assent_procedures For research involving minors, what assent procedures were used? How was deve...
bytes Size of the data in bytes
categories The permitted categories or values for a categorical variable
cleaning_strategies
collection_mechanisms
collection_timeframes
comment_prefix
compensation_amount What was the amount or value of compensation provided? Include currency or eq...
compensation_provided Were participants compensated for their participation?
compensation_rationale What was the rationale for the compensation structure? How was the amount det...
compensation_type What type of compensation was provided (e
compression Compression format used, if any
confidential_elements
confidential_elements_present Indicates whether any confidential data elements are present
conforms_to
conforms_to_class
conforms_to_schema
consent_documentation How is consent documented? Include references to consent forms or procedures ...
consent_obtained Was informed consent obtained from all participants?
consent_scope What specific uses did participants consent to? Are there limitations on data...
consent_type What type of consent was obtained (e
contact_person Contact person for questions about ethical review
content_warnings
content_warnings_present Indicates whether any content warnings are needed
counts How many instances are there in total (of each type, if appropriate)?
created_by
created_on
creators
credit_roles Contributor roles using the CRediT (Contributor Roles Taxonomy) for the princ...
data_collectors
data_linkage Can this dataset be linked to other datasets in ways that might compromise pa...
data_protection_impacts
data_substrate Type of data (e
data_topic General topic of each instance (e
data_type The data type of the variable (e
data_use_permission Structured data use permissions using the Data Use Ontology (DUO)
delimiter
derivation Description of how this variable was derived or calculated from other variabl...
description A human-readable description for a thing
dialect
discouraged_uses
distribution
distribution_dates
distribution_formats
doi Digital object identifier
double_quote
download_url URL from which the data can be downloaded
email The email address of the person
encoding The character encoding of the data
errata
ethical_reviews
ethics_review_board What ethics review board(s) reviewed this research? Include institution names...
eu_ai_act_risk_category Risk category under the EU AI Act
examples Example values for this variable to illustrate typical data
existing_uses
extension_mechanism
external_resources
format The file format, physical medium, or dimensions of a resource
funders
future_guarantees Explanation of any commitments that external resources will remain available ...
future_use_impacts
gdpr_compliant Indicates compliance with the EU General Data Protection Regulation (GDPR)
grant Name/identifier of the specific grant mechanism supporting dataset creation
grant_number The alphanumeric identifier for the grant
grantor Name/identifier of the organization providing monetary or resource support
guardian_consent For participants unable to provide their own consent, how was guardian or sur...
hash Hash of the data
header
hipaa_compliant Indicates compliance with the Health Insurance Portability and Accountability...
id A unique identifier for a thing
identifiable_elements_present Indicates whether data subjects can be identified
identification
instance_type Multiple types of instances? (e
instances
inter_annotator_agreement Measure of agreement between annotators (e
involves_human_subjects Does this dataset involve human subjects research?
ip_restrictions
irb_approval Was Institutional Review Board (IRB) approval obtained? Include approval numb...
is_data_split Is this subset a split of the larger dataset, e
is_deidentified
is_identifier Indicates whether this variable serves as a unique identifier or key for reco...
is_random Indicates whether the sample is random
is_representative Indicates whether the sample is representative of the larger set
is_sample Indicates whether it is a sample of a larger set
is_sensitive Indicates whether this variable contains sensitive information (e
is_subpopulation Is this subset a subpopulation of the larger dataset, e
is_tabular
issued
keywords
label Is there a label or target associated with each instance?
label_description If labeled, what pattern or format do labels follow?
labeling_strategies
language Language in which the information is expressed
last_updated_on
license
license_and_use_terms
maintainers
maximum_value The maximum value that the variable can take
md5 MD5 hash of the data
measurement_technique The technique or method used to measure this variable
media_type The media type of the data
minimum_value The minimum value that the variable can take
missing Description of the missing data fields or elements
missing_information References to one or more MissingInfo objects describing missing data
missing_value_code Code(s) used to represent missing values for this variable
modified_by
name A human-readable name for a thing
orcid ORCID (Open Researcher and Contributor ID) - a persistent digital identifier ...
other_compliance Other regulatory compliance frameworks applicable to this dataset (e
other_tasks
page
path
precision The precision or number of decimal places for numeric variables
preprocessing_strategies
principal_investigator A key individual (Principal Investigator) responsible for or overseeing datas...
privacy_techniques What privacy-preserving techniques were applied (e
profile The frictionless data profile to which the data conforms
publisher
purposes
quality_notes Notes about data quality, reliability, or known issues specific to this varia...
quote_char
raw_sources
regulatory_compliance What regulatory frameworks govern this human subjects research (e
regulatory_restrictions
reidentification_risk What is the assessed risk of re-identification? What measures were taken to m...
representative_verification Explanation of how representativeness was validated or verified
resources
response Short explanation describing the primary purpose of creating the dataset
restrictions Description of any restrictions or fees associated with external resources
retention_limit
reviewing_organization Organization that conducted the ethical review (e
same_as URL of a reference web resource that is the same as this dataset
sampling_strategies
sensitive_elements
sensitive_elements_present Indicates whether sensitive data elements are present
sha256 SHA-256 hash of the data
source_data Description of the larger set from which the sample was drawn, if any
special_populations Does the research involve any special populations that require additional pro...
special_protections What additional protections were implemented for vulnerable populations? Incl...
status
strategies Description of the sampling strategy (deterministic, probabilistic, etc
subpopulation_elements_present Indicates whether any subpopulations are explicitly identified
subpopulations
subsets
tasks
themes Themes associated with the data
title The official title of the element
unit The unit of measurement for the variable, preferably using QUDT units (http:/...
updates
url
use_repository
used_software What software was used as part of this dataset property?
variable_name The name or identifier of the variable as it appears in the data files
variables Metadata describing individual variables, fields, or columns in the dataset
version
version_access
vulnerable_groups_included Are any vulnerable populations included (e
warnings
was_derived_from
was_directly_observed Whether the data was directly observed
was_inferred_derived Whether the data was inferred or derived from other data
was_reported_by_subjects Whether the data was reported directly by the subjects themselves
was_validated_verified Whether the data was validated or verified in any way
why_missing Explanation of why each piece of data is missing
why_not_representative Explanation of why the sample is not representative, if applicable
withdrawal_mechanism How can participants withdraw their consent? What procedures are in place for...
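
The `bytes`, `md5` and `sha256` slots describe the size and checksums of a data file. The following is a rough sketch of how such values might be computed with Python's standard `hashlib` module. The `describe_bytes` helper is a hypothetical name, and the use of lowercase hexadecimal digests is an assumption; this page does not state which encoding the schema expects.

```python
import hashlib


def describe_bytes(data: bytes) -> dict:
    """Compute candidate values for the bytes, md5 and sha256 slots.

    Assumption: digests are recorded as lowercase hexadecimal strings.
    """
    return {
        "bytes": len(data),
        "md5": hashlib.md5(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Example with a small in-memory payload; a real file would be read in
# binary mode, e.g. open(path, "rb").read(), or hashed in chunks.
record = describe_bytes(b"id,value\n1,42\n")
print(record["bytes"])  # 14
```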

Enumerations

Enumeration Description
AIActRiskEnum Risk categories under the EU AI Act
BiasTypeEnum Types of bias that may be present in datasets
Boolean
ComplianceStatusEnum Compliance status for regulatory frameworks
CompressionEnum
CreatorOrMaintainerEnum
CRediTRoleEnum Contributor roles based on the CRediT (Contributor Roles Taxonomy)
DataUsePermissionEnum Data use permissions and restrictions based on the Data Use Ontology (DUO)
EncodingEnum
FormatEnum
MediaTypeEnum
VariableTypeEnum Common data types for variables
VersionTypeEnum Type of version change using semantic versioning principles
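
VersionTypeEnum is described as classifying a version change by semantic versioning principles. As an illustration only, the sketch below compares two MAJOR.MINOR.PATCH strings and reports which component changed first. The returned labels (`MAJOR`, `MINOR`, `PATCH`, `NONE`) are placeholders chosen for this example; the enumeration's actual permissible values are not shown on this page.

```python
def classify_change(old: str, new: str) -> str:
    """Classify the difference between two MAJOR.MINOR.PATCH versions.

    The returned labels are hypothetical and may not match the
    permissible values of VersionTypeEnum.
    """
    o = tuple(int(part) for part in old.split("."))
    n = tuple(int(part) for part in new.split("."))
    if len(o) != 3 or len(n) != 3:
        raise ValueError("expected versions of the form MAJOR.MINOR.PATCH")
    if n[0] != o[0]:
        return "MAJOR"
    if n[1] != o[1]:
        return "MINOR"
    if n[2] != o[2]:
        return "PATCH"
    return "NONE"


print(classify_change("1.4.2", "2.0.0"))  # MAJOR
print(classify_change("1.4.2", "1.5.0"))  # MINOR
print(classify_change("1.4.2", "1.4.3"))  # PATCH
```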

Types

Type Description
Boolean A binary (true or false) value
Curie A compact URI
Date A date (year, month and day) in an idealized calendar
DateOrDatetime Either a date or a datetime
Datetime The combination of a date and time
Decimal A real number with arbitrary precision that conforms to the xsd:decimal speci...
Double A real number that conforms to the xsd:double specification
Float A real number that conforms to the xsd:float specification
Integer An integer
Jsonpath A string encoding a JSON Path
Jsonpointer A string encoding a JSON Pointer
Ncname Prefix part of a CURIE
Nodeidentifier A URI, CURIE or BNODE that represents a node in a model
Objectidentifier A URI or CURIE that represents an object in the model
Sparqlpath A string encoding a SPARQL Property Path
String A character string
Time A time object represents a (local) time of day, independent of any particular...
Uri A complete URI
Uriorcurie A URI or a CURIE
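
The Curie, Ncname, Uri and Uriorcurie types are related: a CURIE is a prefix and a local part joined by a colon, and it expands to a complete URI through a prefix map. The sketch below shows that expansion for a Uriorcurie value. The `PREFIXES` map and the `expand` helper are assumptions made for this example; the schema declares its own prefixes, which are not listed on this page.

```python
# Hypothetical prefix map for illustration; not taken from the schema.
PREFIXES = {
    "orcid": "https://orcid.org/",
    "example": "https://example.org/",
}


def expand(value: str, prefixes: dict[str, str] = PREFIXES) -> str:
    """Expand a CURIE to a complete URI; return complete URIs unchanged."""
    if "://" in value:
        return value  # already a complete URI
    prefix, sep, local = value.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"cannot expand {value!r}: unknown or missing prefix")
    return prefixes[prefix] + local


print(expand("orcid:0000-0002-1825-0097"))
# https://orcid.org/0000-0002-1825-0097
print(expand("https://w3id.org/bridge2ai/data-sheets-schema"))
# https://w3id.org/bridge2ai/data-sheets-schema
```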

Subsets

Subset Description
Collection The questions in this section are designed to elicit information that may hel...
Composition The questions in this section are intended to provide dataset consumers with ...
DataGovernance The questions in this section relate to how the dataset is governed: how it i...
Distribution The questions in this section pertain to dataset distribution
Ethics The questions in this section address ethical and data-protection concerns, i...
Maintenance The questions in this section are intended to encourage dataset creators to p...
Motivation The questions in this section are primarily intended to encourage dataset cre...
Preprocessing-Cleaning-Labeling The questions in this section are intended to provide dataset consumers with ...
Uses The questions in this section are intended to encourage dataset creators to r...