Make data accessible
From Geoscience Paper of the Future
What This Task Involves
The training session and training materials indicate how to:
- Get a permanent unique identifier for your dataset in a public repository
- Specify general (creator, license, version) and domain metadata (categories, tags)
- Upload or specify a pointer to the dataset
Training Materials
This training session will be held on February 20, 2015:
Suggested Readings
- "Ten Simple Rules for the Care and Feeding of Scientific Data." Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, Aleksandra Slavkovic. PLOS Computational Biology, Published: April 24, 2014. DOI: 10.1371/journal.pcbi.1003542
- A brief and practical introduction to publishing and sharing data
- "Achieving human and machine accessibility of cited data in scholarly publications." Joan Starr, Eleni Castro, Mercè Crosas, Michel Dumontier, Robert R. Downs, Ruth Duerr, Laurel Haak, Melissa Haendel, Ivan Herman, Simon Hodson, Joe Hourclé, John Ernest Kratz, Jennifer Lin, Lars Holm Nielsen, Amy Nurnberger, Stefan Pröll, Andreas Rauber, Simone Sacchi, Arthur P. Smith, Michael Taylor, Tim Clark. PeerJ Preprint, Version of 11 February 2015.
- A good description of the different kinds of permanent unique identifiers
- "How to Cite Datasets and Link to Publications." Alex Ball and Monica Duke. DCC How-to Guides. Edinburgh: Digital Curation Centre. Version of 20 June 2012.
- A good overview of the elements in a data citation and how to handle granularity and versioning
- "Data Citation Guidelines for Data Providers and Archives." ESIP Technical Report. Version of 31 December 2011. doi:10.7269/P34F1NNJ
- Provides many examples of alternative formats for citing a dataset
What To Do
We described many options in the training. Here is a sketch of the most common approach:
- Create a public entry for your dataset with a permanent unique identifier.
- Select a repository
- Option 1: Find a repository that your community uses
- Option 2: Create an account at figshare.com, zenodo.org (supported by CERN), or a similar free service. Figshare limits files to 250MB and private storage to 1GB, but open storage is unlimited. Zenodo allows files up to 2GB (potentially larger if you talk to the site managers) and currently has no total storage limit.
- Create an entry for each of your datasets
- Specify the metadata
- Include license information: choose from Creative Commons, for example CC-BY or CC0.
- Upload or point to the data
- The repository should give you a unique identifier (a DOI)
- Create a data citation for each of your datasets
- Include: authors, date of publication, dataset name, repository name, permanent unique identifier, timestamp of retrieval.
- Specify the data citation in the repository entry for each dataset, so others can use it
- Include the data citations in the GPF
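The citation elements listed above can be assembled mechanically. The following is a minimal Python sketch, not a prescribed citation format: the class name, field names, and element order are assumptions made for illustration, and the authors, title, repository, and DOI are made-up placeholders. Check the suggested readings and your repository's guidance for the exact style to use.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DatasetCitation:
    """Citation elements from the list above (field names are illustrative)."""
    authors: list[str]
    published: date      # date of publication
    title: str           # dataset name
    repository: str      # repository name
    identifier: str      # permanent unique identifier, e.g. a DOI
    retrieved: date      # timestamp of retrieval

    def as_text(self) -> str:
        # One possible ordering of the elements; adapt to the style you follow.
        names = ", ".join(self.authors)
        return (
            f"{names} ({self.published.year}). {self.title}. "
            f"{self.repository}. {self.identifier}. "
            f"Retrieved {self.retrieved.isoformat()}."
        )


# Placeholder values only; none of these refer to a real dataset.
example = DatasetCitation(
    authors=["Doe, J.", "Roe, R."],
    published=date(2015, 2, 20),
    title="Example stream gauge readings",
    repository="Example Repository",
    identifier="doi:10.0000/example.12345",
    retrieved=date(2015, 3, 1),
)
print(example.as_text())
```

The resulting string can be pasted into the repository entry for the dataset and into the reference list of the GPF, so both carry the same citation.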
Some interesting cases that you may run into:
- I have several related datasets in several files (e.g., each file has data for a time period)
- Create a DOI for each file and a DOI for the whole set. If there are too many files (dozens or hundreds), it may be best to create a single DOI for the whole set.
- My data is in a public repository, but it is not my data
- Create a DOI for the slice of data that you use. Describe the data by specifying the query that you ran against the repository, and include a pointer to the repository so others can also retrieve it.
- My data is from a database
- Ask for permission to publish the data that you extracted, and mention that you will give appropriate credit. Get an understanding of the appropriate license to use. Put the data in a file and publish it.
- Some of the data that I use is from a colleague
- Encourage them to make the data public in Figshare or any public repository, and offer to help. Explain to them how the license works. If they do not want to make the data public, that is OK. In that case, create an entry that does not contain the data but describes it with all the metadata, including your colleague as the data creator and information about how to get the data from them.
- My data comes from many sources
- Credit each source, create repository entries as needed
- An option is to create in the paper a table with “microattributions” that summarize each data source
- My data has many versions (e.g., sensors that collect more data over time)
- Create an entry for each slice or for each snapshot
- My datasets are very large
- Leave the datasets in a repository that can hold data of that size, or put the data at a publicly accessible URL. Then get a PURL at [1], and create an entry in Figshare or a similar service pointing to that PURL.
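For the case where data comes from many sources, the "microattributions" table mentioned above can be generated from a simple list of records. This is a rough Python sketch under stated assumptions: the column names, the function name `microattribution_table`, and all source entries are hypothetical, chosen only to illustrate the idea. The second entry shows how a colleague's unpublished dataset might still be credited.

```python
def microattribution_table(sources):
    """Render a plain-text table with a header row and one row per data source."""
    columns = ["dataset", "creator", "repository", "identifier"]
    rows = [columns] + [[source[c] for c in columns] for source in sources]
    # Pad each column to the width of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(columns))]
    lines = [
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    ]
    return "\n".join(lines)


# Hypothetical sources; the names and identifiers are placeholders.
sources = [
    {
        "dataset": "gauge-2014",
        "creator": "Doe",
        "repository": "Example-Repository",
        "identifier": "doi:10.0000/example.1",
    },
    {
        "dataset": "soil-cores",
        "creator": "Roe",
        "repository": "none",
        "identifier": "contact-creator",
    },
]
table = microattribution_table(sources)
print(table)
```

Keeping the sources as structured records also makes it easy to reuse the same information when filling in the metadata of each repository entry.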