Work Package |
Organisation |
Dataset Title |
Dataset Identifier |
Status - June 2016 (ongoing, in progress, due date) |
New entry to data management plan? Yes/No |
Generated or collected |
Origin |
Scale |
Who is this useful for? |
Similar existing dataset and possibility for integration? Value of this new dataset? |
What standards and methodologies will be utilised for data collection and management? |
Outline the metadata, documentation or other supporting material that should accompany the data for it to be interpreted correctly |
Status and location of metadata, documentation or other supporting material |
Licensing, data protection, ownership and copyright |
Can the data be published under an open licence? |
Reasons why the data cannot be shared |
How will the data be shared? (including access procedures, dissemination, software/tools needed for enabling reuse) |
Which repository will be used for the data? Why this repository? |
Is it ready to be published? |
Current location of dataset |
Dataset Link |
Licence |
How long should the data be preserved? How will it exceed the length of the project if necessary? |
Approx end volume |
Who is responsible in your organisation for the data management and curation? |
Quality assurance and back up procedures? |
Associated costs and how these will be covered - do you need to purchase storage? How much time will it take for a person to manage the data - how will this be covered? |
WP1 |
ODI |
Corpora of crawled web-based adverts from LinkedIn |
WebSiteHarvest |
|
|
Collected |
LinkedIn |
46 terms
31 languages
47 countries
1 harvest per day
2162 data points per day
|
Internal demand analysis. External research into job and skill demand |
Many datasets are collected in this area; however, due to the specific nature of this study, collection of new data is required and integration with existing datasets is not viable. The value of this dataset comes from the provision of an up-to-date snapshot of current data science skills needs across the EU. |
All data collected is translated into CSV format. |
A README.md file is available detailing the data structure and basic usage. |
|
Usage of the LinkedIn service is bound by the user agreement |
No |
The terms of the LinkedIn user agreement forbid harvesting and collection of data without express permission.
"Use manual or automated software, devices, scripts robots, other means or processes to access, 'scrape,' 'crawl' or 'spider' the Services or any related data or information;"
https://www.linkedin.com/legal/user-agreement?trk=hb_ft_userag |
Data will be available to view on the EDSA dashboard and accessible for free in the EDSA dashboard Github repository. |
Using Github so that the data stays close to its usage and can be used quickly and easily. |
N/A |
N/A |
N/A |
N/A |
As long as Github exists as a minimum. Beyond that a value judgement would have to be made. |
<1GB |
ODI lead data management and curation, other WP1 partners will contribute |
Data is stored with external providers, e.g. Github |
Github free and public. Approximately 1 person-day of effort per month |
WP1 |
ODI |
Aggregated statistics of European skill demand based on web-based job adverts |
WebSiteStatistics |
|
|
Generated |
LinkedIn |
Not yet known |
Internal demand analysis. External research into job and skill demand |
Many datasets are collected in this area; however, due to the specific nature of this study, collection of new data is required and integration with existing datasets is not viable. The value of this dataset comes from the provision of an up-to-date snapshot of current data science skills needs across the EU. |
All data collected is translated into CSV format. |
A README.md file is available detailing the data structure and basic usage. |
|
Usage of the LinkedIn service is bound by the user agreement |
No |
The terms of the LinkedIn user agreement forbid harvesting and collection of data without express permission. There are also restrictions on creating derivative works.
https://www.linkedin.com/legal/user-agreement?trk=hb_ft_userag |
Data will be available to view on the EDSA dashboard and accessible for free in the EDSA dashboard Github repository. |
Github/ EDSA Dashboard on website |
N/A |
N/A |
N/A |
N/A |
As long as Github exists as a minimum. Beyond that a value judgement would have to be made. |
<1GB |
ODI lead data management and curation, other WP1 partners will contribute |
Data is stored with external providers, e.g. Github |
Github free and public |
WP1 |
ODI |
Individual results from demand analysis |
IndividualResponses |
|
No |
Generated |
Interviews and survey |
584 surveys
108 interviews |
Internal demand analysis. |
A number of surveys exist in this domain but their data is not available to this project. This data will enable EDSA to build up a country by country view of current capacity and requirements for data science skills. |
Data collection methods outlined in D1.4. Translated into CSV format. |
Data will not be shared or made available for reuse. The data collected will be used for internal analysis to inform the creation of curriculum. Anonymised data will be publicly available. |
N/A |
Raw data will be owned by the project and unlicensed. It will not be available for reuse. |
No |
Data protection of personal data |
Data will not be shared or made available for reuse |
Internal ODI repository |
N/A |
N/A |
N/A |
N/A |
Until the end of the project |
<100MB |
ODI lead data management and curation, other WP1 partners will contribute |
Backed up to an internal ODI repository |
Approximately 1 day of effort per month |
WP1 |
ODI |
Raw anonymised data from demand analysis |
AnonymisedResponses |
|
No |
Generated |
Interviews and survey |
471 surveys
43 interviews
|
External analysis of results and trends by anyone who wishes to gather survey data in the area of data science |
A number of other surveys have been aggregated; we can compare our results to them and use their results if necessary. This dataset has the same eventual value to others in the area |
Data collection methods outlined in D1.4. Translated into CSV format. |
A README.md file is available detailing the data structure and basic usage. |
http://davetaz.github.io/quantitative-data-from-edsa-demand-analysis-/ |
Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ |
Yes |
N/A |
Data will be available to view on the EDSA dashboard and accessible for free in the EDSA dashboard Github repository. |
Github/ EDSA Dashboard on website |
Yes |
Github/ EDSA Dashboard on website |
http://davetaz.github.io/quantitative-data-from-edsa-demand-analysis-/ |
Creative Commons Attribution (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/ |
As long as Github exists as a minimum. Beyond that a value judgement would have to be made. |
<100MB |
ODI lead data management and curation, other WP1 partners will contribute |
No one; we store everything with external providers, e.g. Github |
Github free and public |
WP1 |
ODI |
Recordings and transcriptions of interviews |
InterviewTranscipts |
|
No |
Generated |
Interviews |
108 transcripts
52 recordings |
Internal demand analysis only |
No similar datasets exist that are usable for this project. The interviews provide insights and data points for use in the demand analysis. |
Qualitative research methodology for collection outlined in D1.4 |
Data will not be shared or made available for reuse. The data collected will be used for internal analysis to inform the creation of curriculum. |
N/A |
Raw data will be owned by the project and unlicensed. It will not be available for reuse. |
No |
Data protection of personal data |
Data will not be shared or made available for reuse. The data collected will be used for internal review to inform the creation of curriculum and will only be available publicly as anonymous data |
Internal ODI repository |
N/A |
N/A |
N/A |
N/A |
Until the end of the project |
< 2GB |
ODI lead data management and curation, other WP1 partners will contribute |
Backed up to an internal ODI repository |
As part of the subcontracting costs of WP1 |
WP1 |
ODI |
ideXlab search platform results |
ExpertIdentification |
|
|
Collected |
Research publications |
Not yet known |
Internal analysis and curriculum development. |
Not in this area. This dataset will provide validation of the demand analysis and form the basis for further insights. |
Sampling approach for data collection outlined in D1.2. CSV data will be created |
N/A |
|
Raw data will be owned by the project and unlicensed. It will not be available for reuse. |
No |
Privacy |
Raw data will not be shared or made available for reuse outside of the project |
ideXlab search platform |
N/A |
N/A |
N/A |
N/A |
Until the end of the project |
Est. 1000 returns |
ideXlab lead data management and curation |
Backed up to an internal ideXlab repository |
Approx 2 person days per month. No other external costs |
WP2 |
ODI |
Related course data regarding similar modules and training offerings across the EU |
DataScienceCourses |
|
No |
Collected |
Course websites |
Not yet known |
Internal use for development of curricula and learning materials. External use for identifying useful courses |
None. The data will provide a useful resource for those wishing to understand what courses are available. |
Systematic search and review of available courses |
Links to other courses etc. |
|
The content is being collected from individual websites. There are likely to be specific terms of use and/or copyright statements that cover the content |
Yes |
The terms of use of the data or content will need to be checked to determine if they can be re-published |
EDSA Website |
Internal Soton repository, EDSA dashboard |
|
|
|
|
Until the end of the project |
< 1GB |
Soton lead data management and curation |
Backed up remotely, hosted on Google Docs |
0.5 days per month |
WP2 |
Persontyle |
Datasets for course examples and exercises |
Using namespace notation to specify R packages: sml::poly4, sml::poly4b, sml::kmeans, sml::seeds, car::Duncan, car::Davis, datasets::car, datasets::HairEyeColor, datasets::Airquality, datasets::swiss, bestGLM::zprostate, MASS::menarche |
|
Yes |
Both |
Various - many from third-party R packages that students download from CRAN. Some in an author-developed package. |
12 small datasets. <1MB |
Students in the "Essentials of Data Analytics and Machine Learning" course. |
Datasets are archived in CRAN (except, currently, for those in the sml package). Used in course examples and exercises. |
None |
The datasets will be used within learning activities offered as part of the "Essentials of Data Analytics and Machine Learning" course. They are stored in the sml R package. |
Package documentation (except, currently, for those in the sml package) |
Various open licenses, see packages for details. |
Yes |
N/A |
Via R packages |
CRAN |
Yes |
CRAN, except for sml package which is currently available on the EDSA portal and will move to CRAN when finished. |
N/A |
Various |
Until the end of the project |
< 1MB |
Persontyle and third parties |
Relying on CRAN |
None |
WP2 |
OU |
Learning Analytics data generated from the EDSA Online Courses portal |
EDSAOnlineCoursesLA |
Generation of data started on February 1st 2016, together with the launch of the first EDSA self-study courses |
|
Generated |
http://courses.edsa-project.eu |
Not yet known |
Course producers can get an understanding of how their courses are being used. Learners can monitor their learning progress. |
Not many Learning Analytics datasets are publicly available. The OU has recently published a similar dataset: https://analyse.kmi.open.ac.uk/open_dataset |
The xAPI specification is used for expressing the data; the open source Learning Locker software is used for storing and visualising the data. |
Introduction to the xAPI (or Tin Can API): https://tincanapi.com/overview/. Introduction to Learning Locker: https://learninglocker.net |
|
Dataset owned by the EDSA project. |
Yes |
N/A |
Access is currently provided to individuals within the EDSA consortium; contact Alex (OU) to get access. |
We have set up a dedicated EDSA Learning Locker. This was chosen for the reasons outlined in https://learninglocker.net/benefits/ |
Yes |
EDSA Learning Locker |
http://analytics.edsa-project.eu |
CC-BY |
At least until the end of project |
Not yet known |
OU lead data management and curation. |
Relying on the backup procedures of the OU, as the dataset is hosted on an OU server. |
Server storage has already been purchased. Effort for analysing the data has been allocated in Task 3.4. |
WP2 |
TU/e |
Event log from a municipality process |
a07386a5-7be3-4367-9535-70bc9e77dbe6 |
Available |
Yes |
Collected |
Dutch municipality |
200 KB |
Users interested in real life event logs. |
We have a large collection of real life event logs at http://data.3tu.nl/repository/collection:event_logs_real |
Management through 3TU data center |
Includes number of traces, events, attributes, timespan, etc. |
http://data.3tu.nl/repository/uuid:a07386a5-7be3-4367-9535-70bc9e77dbe6 |
Open licence (Attribution, non-commercial) |
Yes |
N/A |
http://data.3tu.nl/repository/uuid:a07386a5-7be3-4367-9535-70bc9e77dbe6 |
http://data.3tu.nl/repository/uuid:a07386a5-7be3-4367-9535-70bc9e77dbe6 |
Yes |
http://data.3tu.nl/repository/uuid:a07386a5-7be3-4367-9535-70bc9e77dbe6 |
http://data.3tu.nl/repository/uuid:a07386a5-7be3-4367-9535-70bc9e77dbe6 |
unknown |
past project end |
200 KB |
3TU |
3TU data center |
none |
WP3 |
JSI |
Repository statistics on downloads and views of educational resources |
RepositoryStatistics |
Status March 2016 (available, regularly updated) |
|
Collected |
videolectures.net |
views and comments for each videolecture |
internal analysis, curriculum development, external demand analysis |
None. Provides evidence of resource usage and basis for improving curriculum, content and course structure. |
JSON is used for Videolectures API |
Videolectures REST API documentation |
N/A |
Raw data will be owned by the project and unlicensed. It will not be available for reuse. |
No |
Privacy. Data that does not contain privacy issues might be publishable |
Available to see at videolectures website; described as part of WP3 deliverables
|
videolectures repository. Proximity to data source. |
N/A |
JSI server |
N/A |
N/A |
the data will be available after the project ends as part of the project's learning materials |
< 1GB |
JSI lead data management and curation. OU contribute |
videolectures - relying on internal quality assurance & back up procedures |
1 |
WP3 |
JSI |
Internal logs of elearning systems |
InternalLogs |
Status March 2016 (available, regularly updated) |
|
Collected |
videolectures.net |
for videolectures: 20,000 videos, 17,431 lectures, 12,998 authors, 952 events, 579 categories |
internal analysis, demand analysis |
None. Provides evidence of resource usage and basis for improving curriculum, content and course structure. |
JSON is used for Videolectures API |
Videolectures REST API documentation |
N/A |
Raw data will be owned by the project and unlicensed. It will not be available for reuse. |
No |
Privacy |
Available to see at videolectures website; described as part of WP3 deliverables
|
videolectures repository. Proximity to data source. |
N/A |
JSI server |
N/A |
N/A |
at least until the end of project |
N/A |
JSI lead data management and curation. OU contribute |
videolectures - relying on internal quality assurance & back up procedures |
N/A |
WP3 |
JSI |
Statistics of course registration, participation and completion |
StatisticsForCourses |
Status March 2016 (is being collected) |
|
Collected |
videolectures.net |
for videolectures - available per videolecture, per viewer |
internal analysis, demand analysis |
None. Provides basis for improving curriculum, content and course structure. |
JSON is used for Videolectures API |
Videolectures REST API documentation |
N/A |
Raw data will be owned by the project and unlicensed. It will not be available for reuse. |
No |
Privacy. Data that does not contain privacy issues might be publishable |
Available to see at videolectures website; described as part of WP3 deliverables
|
videolectures repository. Proximity to data source. |
N/A |
JSI server |
N/A |
N/A |
at least until the end of project |
< 1GB |
JSI lead data management and curation. OU contribute |
videolectures - relying on internal quality assurance & back up procedures |
N/A |
WP3 |
JSI |
Aggregated statistics of engagement with the developed courses and educational resources |
AggregatedStatistics |
Status March 2016 (is being generated) |
|
Generated |
videolectures.net |
for videolectures - available per videolecture, per viewer |
internal analysis, demand analysis |
None. Provides evidence of adoption and basis for improving curriculum, content and course structure. |
JSON is used for Videolectures API |
Videolectures REST API documentation |
N/A |
Raw data will be owned by the project and unlicensed. It will not be available for reuse. |
No |
Privacy. Data that does not contain privacy issues might be publishable |
Available to see at videolectures website; described as part of WP3 deliverables
|
videolectures repository. Proximity to data source. |
N/A |
JSI server |
N/A |
N/A |
at least until the end of project |
< 1GB |
JSI lead data management and curation. OU contribute |
videolectures - relying on internal quality assurance & back up procedures |
N/A |
WP3 |
TU/e |
Recorded behavior of students following the first session of the process mining MOOC |
CourseraMOOCprocmin001 |
created |
Yes |
collected |
coursera.org |
several large tables |
learning analytics within EDSA |
every Coursera course has this data recorded |
|
|
|
Raw data is owned by TU/e and cannot be shared due to Coursera restrictions of use. |
No |
Privacy |
Not shared |
N/A |
No |
TU/e internal |
N/A |
N/A |
N/A |
around 1 GB |
Joos Buijs |
N/A |
N/A |
WP4 |
SOTON |
Web server logs and Google analytics of project website access |
WebsiteAnalytics |
|
|
Collected |
http://edsa-project.eu |
1 website |
Internal analysis for dissemination and community analysis. Secondary use for implicit demand analysis. |
None. Provides evidence of engagement and basis for UX improvement. |
Quantitative recording of website traffic via Google Analytics dashboard, analysed using a variety of analytic tools. |
Sessions, Page views, Demographics, User Flow, Bounce rate, |
All available within https://analytics.google.com/ |
Raw data will be owned by the project and unlicensed. It will not be available for reuse. |
No |
User privacy. The data can be aggregated and published under an open license |
Analysed data will be made available through deliverable reports in WP4. |
Internal institutional Soton/OU repositories |
N/A |
https://analytics.google.com/ |
N/A |
N/A |
at least until the end of project |
< 1GB |
OU lead data management and curation. Soton contribute |
Backed up remotely |
Free storage. 0.5 day per month |
WP4 |
SOTON |
Generated social media engagement data |
SocialMediaEngagements |
|
|
Collected |
Twitter |
1 Twitter Account |
Internal analysis for community strength and project dissemination. |
None that relate to EDSA. Provides evidence of engagement with the project and of the effectiveness of dissemination activities. Provides basis for understanding what content users find most engaging. |
Regular access of data from analytics.twitter.com |
Tweets, Impressions, Profile Visits, Followers, Mentions |
https://analytics.twitter.com/user/edsa_project/home |
Data will be licensed in compliance with each social network's terms and conditions |
No |
Data sharing needs to comply with individual site licenses. However, the majority of social networks do not permit collection, harvesting and republication of data |
Dashboard on EDSA website. Deliverable reports in WP4. |
Internal institutional Soton repositories |
|
Data not accessible directly without tools. A Twitter harvester is required. |
|
|
Until the end of the project |
< 1GB |
Soton lead data management and curation. |
Backed up remotely |
Free storage. 1 day per month |
WP5 |
ideXlab |
List of project exploitation results - collaborations, institutional and geographical beneficiaries, |
ProjectExploitation |
|
|
Generated |
Project partners |
Variable |
Internal analysis for results to be exploited and targets |
None. Provides data on dissemination activity, network and results. |
Report detailing results from interviews and exploitation activities |
N/A |
|
Raw data will be owned by the project and unlicensed. It will not be available for reuse. |
No |
Confidentiality |
Deliverable reports in WP5. |
Google docs shared document |
N/A |
|
N/A |
N/A |
Until the end of the project |
< 500MB |
ideXlab lead data management and curation |
Backed up remotely |
Free storage. 1 day per month |