data-sheets-schema

A LinkML schema for Datasheets for Datasets.

URI: https://w3id.org/bridge2ai/data-sheets-schema

Name: data-sheets-schema

Classes

Class	Description
DatasetProperty	Represents a single property of a dataset, or a set of related properties
AddressingGap	Was there a specific gap that needed to be filled by creation of the dataset?
AnnotationAnalysis	Analysis of annotation quality, inter-annotator agreement metrics, and system...
AtRiskPopulations	Information about protections for at-risk populations in human subjects resea...
CleaningStrategy	Was any cleaning of the data done (e.g., ...
CollectionConsent	Did the individuals in question consent to the collection and use of their da...
CollectionMechanism	What mechanisms or procedures were used to collect the data (e.g., ...
CollectionNotification	Were the individuals in question notified about the data collection? If so, p...
CollectionTimeframe	Over what timeframe was the data collected, and does this timeframe match the...
Confidentiality	Does the dataset contain data that might be confidential (e.g., ...
ConsentRevocation	If consent was obtained, were the consenting individuals provided with a mech...
ContentWarning	Does the dataset contain any data that might be offensive, insulting, threate...
Creator	Who created the dataset (e.g., ...
DataAnomaly	Are there any errors, sources of noise, or redundancies in the dataset?
DataCollector	Who was involved in the data collection (e.g., ...
DataProtectionImpact	Has an analysis of the potential impact of the dataset and its use on data su...
DatasetBias	Documents known biases present in the dataset
DatasetLimitation	Documents known limitations of the dataset that may affect its use or interpr...
DatasetRelationship	Typed relationship to another dataset, enabling precise specification of how ...
Deidentification	Is it possible to identify individuals in the dataset, either directly or ind...
DirectCollection	Indicates whether the data was collected directly from the individuals in que...
DiscouragedUse	Are there tasks for which the dataset should not be used?
DistributionDate	When will the dataset be distributed?
DistributionFormat	How will the dataset be distributed (e.g., ...
Erratum	Is there an erratum? If so, please provide a link or other access point
EthicalReview	Were any ethical or compliance review processes conducted (e.g., ...
ExistingUse	Has the dataset been used for any tasks already?
ExportControlRegulatoryRestrictions	Do any export controls or other regulatory restrictions apply to the dataset ...
ExtensionMechanism	If others want to extend/augment/build on/contribute to the dataset, is there...
ExternalResource	Is the dataset self-contained or does it rely on external resources (e.g., ...
FundingMechanism	Who funded the creation of the dataset? If there is an associated grant, plea...
FutureUseImpact	Is there anything about the dataset's composition or collection that might im...
HumanSubjectCompensation	Information about compensation or incentives provided to human research parti...
HumanSubjectResearch	Information about whether the dataset involves human subjects research and wh...
ImputationProtocol	Description of data imputation methodology, including techniques used to hand...
InformedConsent	Details about informed consent procedures used in human subjects research
Instance	What do the instances that comprise the dataset represent (e.g., ...
InstanceAcquisition	Describes how data associated with each instance was acquired (e.g., ...
IntendedUse	Explicit statement of intended uses for this dataset
IPRestrictions	Have any third parties imposed IP-based or other restrictions on the data ass...
LabelingStrategy	Was any labeling of the data done (e.g., ...
LicenseAndUseTerms	Will the dataset be distributed under a copyright or other IP license, and/or...
MachineAnnotationTools	Automated or machine-learning-based annotation tools used in dataset creation...
Maintainer	Who will be supporting/hosting/maintaining the dataset?
MissingDataDocumentation	Documentation of missing data in the dataset, including patterns, causes, and...
MissingInfo	Is any information missing from individual instances? (e.g., ...
OtherTask	What other tasks could the dataset be used for?
ParticipantPrivacy	Information about privacy protections and anonymization procedures for human ...
PreprocessingStrategy	Was any preprocessing of the data done (e.g., ...
ProhibitedUse	Explicit statement of prohibited or forbidden uses for this dataset
Purpose	For what purpose was the dataset created?
RawData	Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data...
RawDataSource	Description of raw data sources before preprocessing, cleaning, or labeling
Relationships	Are relationships between individual instances made explicit (e.g., ...
RetentionLimits	If the dataset relates to people, are there applicable limits on the retentio...
SamplingStrategy	Does the dataset contain all possible instances, or is it a sample (not neces...
SensitiveElement	Does the dataset contain data that might be considered sensitive (e.g., ...
Splits	Are there recommended data splits (e.g., ...
Subpopulation	Does the dataset identify any subpopulations (e.g., ...
Task	Was there a specific task in mind for the dataset's application?
ThirdPartySharing	Will the dataset be distributed to third parties outside of the entity (e.g., ...
UpdatePlan	Will the dataset be updated (e.g., ...
UseRepository	A repository or registry of known uses of this dataset by third parties
VariableMetadata	Metadata describing an individual variable, field, or column in a dataset
VersionAccess	Will older versions of the dataset continue to be supported/hosted/maintained...
FormatDialect	Additional format information for a file
NamedThing	A generic grouping for any identifiable entity
Grant	The name and/or identifier of the specific mechanism providing monetary suppo...
Information	Grouping for datasets and data files
Dataset	A single component of related observations and/or information that can be rea...
DataSubset	A subset of a dataset, likely containing multiple files of multiple potential...
DatasetCollection	A collection of related datasets, likely containing multiple files of multipl...
File	A single file within a dataset or file collection
FileCollection	A collection of files with shared characteristics (format, purpose, structure...
Organization	Represents a group or organization
Grantor	The name and/or identifier of the organization providing monetary support or...
Person	An individual human being
Software	A software program or library
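
The FormatDialect class above carries "additional format information for a file", and the Slots table lists dialect-style slots such as delimiter, quote_char, double_quote, header, and comment_prefix. This index does not say which class uses which slot, so the following is only a minimal sketch: it assumes those five slots describe a delimited text file, uses made-up sample values, and introduces a hypothetical helper read_rows that feeds them to Python's standard csv module.

```python
import csv

# Hypothetical FormatDialect-style record. The keys are slot names from the
# Slots table; the values are invented for illustration only.
format_dialect = {
    "delimiter": "\t",
    "quote_char": '"',
    "double_quote": True,
    "header": True,
    "comment_prefix": "#",
}


def read_rows(text, dialect):
    """Parse delimited text according to a FormatDialect-style dict."""
    prefix = dialect["comment_prefix"]
    # Drop comment lines before handing the text to the csv module.
    lines = [
        line for line in text.splitlines()
        if not (prefix and line.startswith(prefix))
    ]
    rows = list(
        csv.reader(
            lines,
            delimiter=dialect["delimiter"],
            quotechar=dialect["quote_char"],
            doublequote=dialect["double_quote"],
        )
    )
    if dialect["header"] and rows:
        # Use the first row as column names for the remaining rows.
        column_names, body = rows[0], rows[1:]
        return [dict(zip(column_names, row)) for row in body]
    return rows


sample = '# exported file\nid\tname\n1\t"say ""hi"""\n2\tplain\n'
records = read_rows(sample, format_dialect)
```

Keeping the dialect as plain data means the same parsing function can serve files with different delimiters or quoting rules without code changes.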

Slots

Slot	Description
access_details	Information on how to access or retrieve the raw source data
access_url	URL or access point for the raw data
access_urls	One or more URLs providing access to the distribution channel(s) or format(s)
acquisition_details	Free-text description of how data was acquired for each instance, including i...
acquisition_methods	Methods used to acquire or obtain dataset instances
addressing_gaps	Research or practical gaps this dataset addresses
affected_subsets	One or more specific subsets or features of the dataset affected by this bias...
affiliation	The organization(s) to which the person belongs in the context of this datase...
affiliations	Organizations with which the creator or team is affiliated
agreement_metric	Type of agreement metric used (Cohen's kappa, Fleiss' kappa, Krippendorff's a...
analysis_method	Methodology used to assess annotation quality and resolve disagreements
annotation_analyses	One or more analyses of annotation quality and inter-annotator agreement
annotation_quality_details	Additional details on annotation quality assessment and findings
annotations_per_item	Number of annotations collected per data item
annotator_demographics	One or more demographic characteristics of the annotators, if available and r...
anomalies	Known data quality issues, errors, or irregularities in the dataset
anomaly_details	Free-text description of errors, noise sources, or redundancies in the datase...
anonymization_method	What methods were used to anonymize or de-identify participant data? Include ...
archival	Indicates whether official archival versions of external resources are includ...
assent_procedures	For research involving minors, what assent procedures were used? How was deve...
at_risk_groups_included	Are any at-risk populations included (e.g., ...
at_risk_populations	Information about protections for at-risk populations (e.g., ...
bias_description	Detailed description of how this bias manifests in the dataset, including aff...
bias_type	The type of bias identified, using standardized categories from the Artificia...
bytes	Size of the data in bytes
categories	One or more permitted categories or values for a categorical variable
citation	Recommended citation for this dataset in DataCite or BibTeX format
cleaning_details	Free-text description of data cleaning procedures applied, including criteria...
cleaning_strategies	Data cleaning and quality control procedures applied to the dataset
collection_consents	Consent obtained from individuals for data collection and use
collection_details	Free-text description of whether data was collected directly from individuals...
collection_mechanisms	Mechanisms, instruments, or tools used for data collection
collection_notifications	Notifications provided to individuals about data collection
collection_timeframes	Time periods during which data was collected
collection_type	Type(s) of content in this file collection
collector_details	Free-text description of who was involved in data collection (e.g., ...
comment_prefix	Character(s) used to indicate comment lines (e.g., ...
compensation_amount	What was the amount or value of compensation provided? Include currency or eq...
compensation_provided	Were participants compensated for their participation?
compensation_rationale	What was the rationale for the compensation structure? How was the amount det...
compensation_type	What type of compensation was provided (e.g., ...
compression	Compression format used, if any (e.g., ...
confidential_elements	Confidential or restricted information within the dataset that requires acces...
confidential_elements_present	Indicates whether any confidential data elements are present
confidentiality_details	Free-text description of which data elements are confidential, the basis for ...
confidentiality_level	Confidentiality classification of the dataset indicating level of access rest...
conforms_to	An established standard, specification, or schema to which the resource confo...
conforms_to_class	The specific class or type within a schema to which the resource conforms
conforms_to_schema	The schema or data model to which the resource conforms
consent_details	Free-text description of how consent was requested (e.g., ...
consent_documentation	How is consent documented? Include references to consent forms or procedures ...
consent_obtained	Was informed consent obtained from all participants?
consent_revocations	Mechanisms for individuals to revoke previously given consent
consent_scope	What specific uses did participants consent to? Are there limitations on data...
consent_type	What type of consent was obtained (e.g., ...
contact_person	Contact person for questions about ethical review
content_warnings	Content warnings for potentially harmful, offensive, or disturbing material i...
content_warnings_present	Indicates whether any content warnings are needed
contribution_url	URL for contribution guidelines or process
counts	How many instances are there in total (of each type, if appropriate)?
created_by	The person or organization primarily responsible for creating the resource
created_on	The date and time when the resource was created
creators	Individuals or organizations who created the dataset
credit_roles	One or more contributor roles using the CRediT (Contributor Roles Taxonomy) f...
data_annotation_platform	One or more platforms or tools used for annotation (e.g., ...
data_annotation_protocol	Annotation methodology, tasks, and protocols followed during labeling
data_collectors	Individuals or organizations responsible for collecting the data
data_linkage	Can this dataset be linked to other datasets in ways that might compromise pa...
data_protection_impacts	Data protection impact assessments (DPIAs) conducted for the dataset
data_substrate	Type of data (e.g., ...
data_topic	General topic of each instance (e.g., ...
data_type	The data type of the variable (e.g., ...
data_use_permission	Structured data use permissions using the Data Use Ontology (DUO)
deidentification_details	Details on de-identification procedures and residual risks
delimiter	Field delimiter character (e.g., ...
derivation	Description of how this variable was derived or calculated from other variabl...
description	A human-readable description for a thing
dialect	Specific format dialect or variation (e.g., ...
direct_collection	Whether data was collected directly from individuals or via third parties
disagreement_patterns	Systematic patterns in annotator disagreements (e.g., ...
discouraged_uses	Uses that are not recommended for this dataset due to limitations, risks, or ...
discouragement_details	Free-text description of tasks or applications for which the dataset is not r...
distribution	The distribution of instances across identified subpopulations, including cou...
distribution_dates	Dates when the dataset was or will be distributed or released
distribution_formats	Formats in which the dataset is distributed or made available
doi	Digital Object Identifier (DOI) in format 10
double_quote	Whether quotes within quoted fields are escaped by doubling them
download_url	URL from which the data can be downloaded
email	The email address of the person
encoding	The character encoding of the data
end_date	End date of data collection
errata	Known errors or corrections to the dataset since publication
erratum_details	Free-text description of the error, its scope, the affected data or records, ...
erratum_url	URL or access point for the erratum
ethical_reviews	Ethical reviews and institutional oversight for the dataset
ethics_review_board	What ethics review board(s) reviewed this research? Include institution names...
examples	List of examples of known/previous uses of the dataset
existing_uses	Known existing uses of the dataset at the time of publication
extension_details	Free-text description of how third parties can contribute to the dataset, how...
extension_mechanism	Mechanisms for extending or contributing to the dataset
external_resources	Links or identifiers for external resources
file_collections	Collection of file groups within this dataset
file_count	Number of files in this collection
file_type	Semantic type or purpose of this file (e.g., ...
format	The file format, physical medium, or dimensions of a resource
frequency	How often updates are planned (e.g., ...
funders	Funding mechanisms that supported dataset creation
future_guarantees	Explanation of any commitments that external resources will remain available ...
future_use_impacts	Anticipated impacts of future uses, including risks and benefits
governance_committee_contact	Contact person for data governance committee
grant_number	The alphanumeric identifier for the grant
grantor	Name/identifier of the organization providing monetary or resource support
grants	Grant mechanisms supporting dataset creation
guardian_consent	For participants unable to provide their own consent, how was guardian or sur...
handling_strategy	The primary strategy used to handle missing data (e.g., ...
hash	Cryptographic hash value of the data for integrity verification (e.g., ...
header	Whether the first row of the file contains column headers
hipaa_compliant	Indicates compliance with the Health Insurance Portability and Accountability...
human_subject_research	Information about whether dataset involves human subjects research, including...
id	A unique identifier for a thing
identifiable_elements_present	Indicates whether data subjects can be identified
identification	How subpopulations are identified and defined (e.g., ...
identifiers_removed	List of identifier types removed during de-identification (e.g., ...
impact_details	Free-text description of potential future impacts or risks arising from the d...
imputation_method	Specific imputation technique used (mean, median, mode, forward fill, backwar...
imputation_protocols	Data imputation protocols applied to handle missing values
imputation_rationale	Justification for the imputation approach chosen, including assumptions made ...
imputation_validation	Methods used to validate imputation quality (if any)
imputed_fields	Fields or columns where imputation was applied
informed_consent	One or more records detailing informed consent procedures, including consent ...
instance_type	The type or types of instances in the dataset (e.g., ...
instances	Individual data instances or records in the dataset
intended_uses	List of explicit intended and recommended uses for this dataset
inter_annotator_agreement	Measure of agreement between annotators (e.g., ...
inter_annotator_agreement_score	Measured agreement between annotators (e.g., ...
involves_human_subjects	Does this dataset involve human subjects research?
ip_restrictions	Intellectual property restrictions on dataset use or redistribution
irb_approval	Was Institutional Review Board (IRB) approval obtained? Include approval numb...
is_data_split	Is this subset a split of the larger dataset, e.g., ...
is_deidentified	De-identification status and procedures applied to the dataset
is_direct	Whether collection was direct from individuals
is_identifier	Indicates whether this variable serves as a unique identifier or key for reco...
is_random	Indicates whether the sample is random
is_representative	Indicates whether the sample is representative of the larger set
is_sample	Indicates whether it is a sample of a larger set
is_sensitive	Indicates whether this variable contains sensitive information (e.g., ...
is_shared	Boolean indicating whether the dataset is distributed to parties external to ...
is_subpopulation	Is this subset a subpopulation of the larger dataset, e.g., ...
is_tabular	Whether the dataset is in tabular format (rows and columns)
issued	Date of formal issuance or publication of the resource
keywords	Keywords or tags describing the resource for discovery and classification
known_biases	List of known biases present in the dataset that may affect fairness, represe...
known_limitations	List of known limitations of the dataset that may affect its use or interpret...
label	Is there a label or target associated with each instance?
label_description	If labeled, what pattern or format do labels follow?
labeling_details	Free-text description of the labeling or annotation procedures, including ann...
labeling_strategies	Labeling or annotation methodologies applied to the data
language	Language in which the information is expressed
last_updated_on	The date and time when the resource was most recently modified or updated
latest_version_doi	DOI or URL identifying the latest version of this dataset (e.g., ...
license	The legal license under which the resource is made available (e.g., ...
license_and_use_terms	License and usage terms governing dataset access and use
license_terms	Description of the dataset's license and terms of use, including links, costs...
limitation_description	Detailed description of the limitation and its implications
limitation_type	Category of limitation (e.g., ...
machine_annotation_tools	List of automated annotation tools used in dataset creation
maintainer_details	Free-text description of the organization, team, or individual responsible fo...
maintainers	Individuals or organizations responsible for maintaining the dataset
maximum_value	The maximum value that the variable can take
md5	MD5 hash value of the data (128-bit cryptographic hash)
measurement_technique	The technique or method used to measure this variable
mechanism_details	Free-text description of the specific mechanisms or procedures used to collec...
media_type	The media type of the data
method	Method used for de-identification (e.g., ...
minimum_value	The minimum value that the variable can take
missing	Description of the missing data fields or elements
missing_data_causes	Known or suspected causes of missing data (e.g., ...
missing_data_documentation	One or more records documenting missing data patterns and handling strategies
missing_data_patterns	Description of patterns in missing data (e.g., ...
missing_information	References to one or more MissingInfo objects describing missing data
missing_value_code	Code(s) used to represent missing values for this variable
mitigation_strategy	Steps taken or recommended to mitigate this bias
modified_by	A person or organization that contributed to modifying or updating the resour...
name	A human-readable name for a thing
notification_details	Free-text description of how individuals were notified about data collection,...
orcid	ORCID (Open Researcher and Contributor ID) - a persistent digital identifier ...
other_compliance	Other regulatory compliance frameworks applicable to this dataset (e.g., ...
other_tasks	Additional tasks the dataset may support beyond its original intent
page	A landing page or web page providing access to or information about the resou...
parent_datasets	One or more parent datasets that this dataset is part of or derived from
participant_compensation	One or more records describing compensation or incentives provided to human r...
participant_privacy	One or more records describing privacy protections and anonymization procedur...
path	The file path or URL where the content is located
precision	The precision or number of decimal places for numeric variables
preprocessing_details	Free-text description of preprocessing steps applied to the data, including t...
preprocessing_strategies	Preprocessing steps applied to the raw data
principal_investigator	A key individual (Principal Investigator) responsible for or overseeing datas...
privacy_techniques	What privacy-preserving techniques were applied (e.g., ...
prohibited_uses	List of explicitly prohibited or forbidden uses for this dataset
prohibition_reason	One or more reasons why this use is prohibited (e.g., ...
publisher	The organization or entity responsible for making the resource available
purposes	Purposes for which the dataset was created
quality_notes	Notes about data quality, reliability, or known issues specific to this varia...
quote_char	Character used for quoting fields (e.g., ...
raw_data_details	Free-text description of raw data availability, access procedures, and any co...
raw_data_format	One or more formats of the raw data before any preprocessing (e.g., ...
raw_data_sources	List of raw data sources before preprocessing
raw_sources	Raw, unprocessed source data before any preprocessing was applied
recommended_mitigation	Recommended approaches for users to address this limitation
regulatory_compliance	What regulatory frameworks govern this human subjects research (e.g., ...
regulatory_restrictions	Regulatory and export control restrictions applicable to the dataset
reidentification_risk	What is the assessed risk of re-identification? What measures were taken to m...
related_datasets	List of related datasets with typed relationships (e.g., ...
relationship_details	Free-text description of how relationships between instances are represented ...
relationship_type	The type of relationship (e.g., ...
relationships	Explicit relationships between individual instances in the dataset
release_dates	One or more dates or timeframes for dataset release, in ISO 8601 format (e.g., ...
repository_details	Free-text description of the repository of known dataset uses, including how ...
repository_url	URL to a repository of known dataset uses
representative_verification	One or more explanations of how representativeness was validated or verified ...
resources	Sub-resources or component items
response	Short explanation describing the primary purpose of creating the dataset
restrictions	One or more descriptions of restrictions or fees associated with accessing th...
retention_details	Free-text description of applicable retention limits, legal or ethical basis ...
retention_limit	Data retention policies and limits for the dataset
retention_period	Time period for data retention
review_details	Free-text description of the ethical review process, board decisions, outcome...
reviewing_organization	Organization that conducted the ethical review (e.g., ...
revocation_details	Free-text description of the mechanism provided for individuals to revoke con...
role	Role of the data collector (e.g., ...
same_as	One or more URLs or URIs identifying equivalent or related representations of...
sampling_strategies	Strategies used to select data instances from a larger population
scope_impact	How this limitation affects the scope or applicability of the dataset
sensitive_elements	Sensitive data elements requiring special handling or access controls
sensitive_elements_present	Indicates whether sensitive data elements are present
sensitivity_details	Details on sensitive data elements present and handling procedures
sha256	SHA-256 hash value of the data (256-bit cryptographic hash, recommended)
source_data	One or more descriptions of the larger sets from which the sample was drawn, ...
source_description	Detailed description of where raw data comes from (e.g., ...
source_type	One or more types of raw source (e.g., ...
special_populations	Does the research involve any special populations that require additional pro...
special_protections	What additional protections were implemented for at-risk populations? Include...
split_details	Free-text description of the recommended data splits (e.g., ...
splits	Recommended data splits for this dataset
start_date	Start date of data collection
status	The status of the resource (e.g., ...
strategies	One or more sampling strategies used (e.g., ...
subpopulation_elements_present	Indicates whether any subpopulations are explicitly identified
subpopulations	Subpopulations represented within the dataset
subsets	Subsets or splits of this dataset
target_dataset	The dataset that this relationship points to
task_details	Free-text description of other potential tasks the dataset could support, inc...
tasks	Tasks the dataset is intended to support
themes	One or more themes associated with the data
third_party_sharing	Third-party distribution policies for the dataset
timeframe_details	Free-text description of the data collection period and whether this timefram...
title	The official title of the element
tool_accuracy	One or more known accuracy or performance metrics for the automated tools (if...
tool_descriptions	Descriptions of what each tool does in the annotation process and what types ...
tools	List of automated annotation tools with their versions
total_bytes	Total size of all files in this collection, in bytes (integer)
total_file_count	Total number of files across all file collections in this dataset
total_size_bytes	Total size of all files in bytes across all file collections
unit	The unit of measurement for the variable, preferably using QUDT units (http:/...
update_details	Free-text description of planned update types (e.g., ...
updates	Plans for future updates or versioning of the dataset
url	URL where the software can be found (e.g., ...
usage_notes	A note or caveat about using the dataset for its intended purposes
use_category	One or more categories of intended use (e.g., ...
use_repository	Repositories or registries tracking how the dataset has been used
used_software	What software was used as part of this dataset property?
variable_name	The name or identifier of the variable as it appears in the data files
variables	List of metadata records describing individual variables, fields, or columns ...
version	The version identifier of the resource (e.g., ...
version_access	Information about access to different versions of the dataset
version_details	Free-text description of version support policies, how long older versions wi...
versions_available	List of available versions with metadata
warnings	One or more specific content warnings describing potentially offensive, insul...
was_derived_from	A resource from which this resource was derived, in whole or in part
was_directly_observed	True if the data was directly observed by a researcher or instrument; false i...
was_inferred_derived	True if the data was computationally inferred or derived from other data (e.g., ...
was_reported_by_subjects	True if the data was self-reported directly by the subjects themselves (e.g., ...
was_validated_verified	True if the data underwent a validation or verification process (e.g., ...
why_missing	Explanation of why each piece of data is missing
why_not_representative	One or more explanations of why the sample is not representative of the large...
withdrawal_mechanism	How can participants withdraw their consent? What procedures are in place for...
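
Several slots in the table above describe file size and integrity: bytes, md5, and sha256 (with sha256 marked as recommended). As a hedged illustration of how values for those slots might be produced, the sketch below streams a file through Python's standard hashlib module. The function name integrity_slots and the returned dictionary layout are assumptions made for this example, not part of the schema.

```python
import hashlib
import os
import tempfile


def integrity_slots(path, chunk_size=65536):
    """Return candidate values for the bytes, md5, and sha256 slots of a file."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
            size += len(chunk)
    return {
        "bytes": size,
        "md5": md5.hexdigest(),
        "sha256": sha256.hexdigest(),
    }


# Usage on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello\n")
    tmp_path = tmp.name
slots = integrity_slots(tmp_path)
os.remove(tmp_path)
```

Computing both digests in a single pass avoids reading the file twice; a consumer can later recompute the same values to confirm that a downloaded file matches its metadata.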

Enumerations

Enumeration	Description
BiasTypeEnum	Types of bias that may be present in datasets
Boolean	Three-valued boolean logic supporting true, false, and unknown states
ComplianceStatusEnum	Compliance status for regulatory frameworks
CompressionEnum	Compression algorithms and formats for file compression
ConfidentialityLevelEnum	Confidentiality classification levels for datasets indicating the degree of a...
CreatorOrMaintainerEnum	Types of agents (persons or organizations) involved in dataset creation or ma...
CRediTRoleEnum	Contributor roles based on the CRediT (Contributor Roles Taxonomy)
DatasetRelationshipTypeEnum	Standardized types of relationships between datasets, based on DataCite Metad...
DataUsePermissionEnum	Data use permissions and restrictions based on the Data Use Ontology (DUO)
EncodingEnum	Character encoding schemes for text representation in different languages and...
FileCollectionTypeEnum	Types of file collections within datasets
FileTypeEnum	Types of individual files within datasets
FormatEnum	Common file format extensions for data files and documents
LimitationTypeEnum	Types of limitations that may affect dataset use or interpretation
MediaTypeEnum	MIME media types (Internet Media Types) for file content identification
VariableTypeEnum	Common data types for variables
VersionTypeEnum	Type of version change using semantic versioning principles
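
The Boolean enumeration above is described as three-valued logic with true, false, and unknown states. This index does not list the permissible value names or say how the states combine, so the sketch below is an assumption on both counts: the member names are hypothetical, and the combination rules follow Kleene's strong three-valued logic, one common convention. It shows why an "unknown" answer to a datasheet question should not silently collapse to false.

```python
from enum import Enum


class TriBool(Enum):
    # Hypothetical member names; the schema's actual values may differ.
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


def tri_and(a, b):
    """Kleene AND: false dominates, then unknown, otherwise true."""
    if a is TriBool.FALSE or b is TriBool.FALSE:
        return TriBool.FALSE
    if a is TriBool.UNKNOWN or b is TriBool.UNKNOWN:
        return TriBool.UNKNOWN
    return TriBool.TRUE


def tri_or(a, b):
    """Kleene OR: true dominates, then unknown, otherwise false."""
    if a is TriBool.TRUE or b is TriBool.TRUE:
        return TriBool.TRUE
    if a is TriBool.UNKNOWN or b is TriBool.UNKNOWN:
        return TriBool.UNKNOWN
    return TriBool.FALSE


def tri_not(a):
    """Kleene NOT: unknown stays unknown."""
    if a is TriBool.UNKNOWN:
        return TriBool.UNKNOWN
    return TriBool.FALSE if a is TriBool.TRUE else TriBool.TRUE


# "Is the dataset free of sensitive content?" cannot be answered with a
# definite yes while one of the underlying flags is still unknown.
needs_review = tri_or(TriBool.FALSE, TriBool.UNKNOWN)
```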

Types

Type	Description
Boolean	A binary (true or false) value
Curie	A compact URI
Date	A date (year, month and day) in an idealized calendar
DateOrDatetime	Either a date or a datetime
Datetime	The combination of a date and time
Decimal	A real number with arbitrary precision that conforms to the xsd:decimal speci...
Double	A real number that conforms to the xsd:double specification
Float	A real number that conforms to the xsd:float specification
Integer	An integer
Jsonpath	A string encoding a JSON Path
Jsonpointer	A string encoding a JSON Pointer
Ncname	Prefix part of CURIE
Nodeidentifier	A URI, CURIE or BNODE that represents a node in a model
Objectidentifier	A URI or CURIE that represents an object in the model
Sparqlpath	A string encoding a SPARQL Property Path
String	A character string
Time	A time object represents a (local) time of day, independent of any particular...
Uri	A complete URI
Uriorcurie	A URI or a CURIE
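
The Curie, Uri, and Uriorcurie types above differ only in whether a value is written in compact or complete form. As a minimal sketch of that relationship, the helper below expands a CURIE against a prefix map and passes complete URIs through unchanged. The function name expand and the two-entry prefix map are assumptions for illustration; a real schema declares its own prefixes, which this index does not show.

```python
# Illustrative prefix map; not taken from the schema itself.
PREFIXES = {
    "orcid": "https://orcid.org/",
    "DUO": "http://purl.obolibrary.org/obo/DUO_",
}


def expand(value, prefixes=PREFIXES):
    """Expand a CURIE to a URI; return complete URIs and unknown prefixes as-is."""
    if "://" in value:
        # Already a complete URI.
        return value
    prefix, sep, local = value.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return value


person_uri = expand("orcid:0000-0002-1825-0097")
permission_uri = expand("DUO:0000042")
unchanged = expand("https://w3id.org/bridge2ai/data-sheets-schema")
```

Returning unknown prefixes unchanged keeps the helper lossless: a value is never rewritten unless the prefix map explicitly covers it.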

Subsets

Subset	Description
Collection	The questions in this section are designed to elicit information that may hel...
Composition	The questions in this section are intended to provide dataset consumers with ...
DataGovernance	The questions in this section relate to how the dataset is governed: how it i...
Distribution	The questions in this section pertain to dataset distribution
Ethics	The questions in this section address ethical and data-protection concerns, i...
Maintenance	The questions in this section are intended to encourage dataset creators to p...
Motivation	The questions in this section are primarily intended to encourage dataset cre...
Preprocessing-Cleaning-Labeling	The questions in this section are intended to provide dataset consumers with ...
Uses	The questions in this section are intended to encourage dataset creators to r...