review-article

Datasheets for datasets

Authors:
Timnit Gebru (DAIR Institute, Palo Alto, CA)
Jamie Morgenstern (University of Washington, Seattle, WA)
Briana Vecchione (Cornell University, Ithaca, NY)
Jennifer Wortman Vaughan (Microsoft Research, New York, NY)
Hanna Wallach (Microsoft Research, New York, NY)
Hal Daumé III (University of Maryland, College Park, MD)
Kate Crawford (USC Annenberg, CA)

Communications of the ACM, Volume 64, Issue 12, December 2021, pp. 86–92. https://doi.org/10.1145/3458723

Published: 19 November 2021

Abstract

Documentation to facilitate communication between dataset creators and consumers.

Index Terms

Datasheets for datasets
1. Human-centered computing
  1. Human computer interaction (HCI)
    1. HCI design and evaluation methods
      1. User studies
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Machine learning theory

Published in
Communications of the ACM Volume 64, Issue 12
December 2021
101 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/3502158
Editor: Andrew A. Chien
Association for Computing Machinery, New York, NY
Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- review-article
- Popular
- Refereed