csv,conf,v8

Speakers

We have assembled an exciting lineup of speakers for csv,conf,v6: data makers and enthusiasts representing academia, science, journalism, government, and open source projects.

Dr. Kadija Ferryman

Race Matters in Health Data

Keynote

Dr. Kadija Ferryman (https://www.kadijaferryman.com/) is a cultural anthropologist and bioethicist who studies the social, cultural, and ethical implications of health information technologies. Specifically, her research examines how genomics, digital medical records, artificial intelligence, and other technologies impact racial disparities in health. She is currently Industry Assistant Professor at New York University’s Tandon School of Engineering. As a Postdoctoral Scholar at the Data & Society Research Institute in New York, she led the Fairness in Precision Medicine research study, which examines the potential for bias and discrimination in predictive precision medicine. She earned a BA in Anthropology from Yale University, and a PhD in Anthropology from The New School for Social Research. Before completing her PhD, she was a policy researcher at the Urban Institute where she studied how housing and neighborhoods impact well-being, specifically the effects of public housing redevelopment on children, families, and older adults. Ferryman is a member of the Institutional Review Board for the All of Us Research Program, a Mozilla Open Science Fellow, and an Affiliate at the Center for Critical Race and Digital Studies. Dr. Ferryman has published research in journals such as Paediatric and Perinatal Epidemiology, the Journal of Health Care for the Poor and Underserved, European Journal of Human Genetics, and Genetics in Medicine. Her research has been featured in multiple publications including Nature, STAT, and The Financial Times.

Watch Talk
Julia Kodysh, Michal Mart, Kevin Miller, Kara Schechtman

Keynote

From a spreadsheet to critical data infrastructure: building The COVID Tracking Project

The COVID Tracking Project (https://covidtracking.com/about) began in March 2020 as a stopgap spreadsheet maintained by a handful of journalists, hoping to provide some information on COVID across the US until the federal government stepped in. But that day never came. Powered by over a thousand volunteers collecting data from disparate state and federal systems every day, the project accidentally became an indispensable source of data used by governments, individuals, and institutions to make critical decisions. The decentralized nature of public health infrastructure in the United States, which is mostly managed by overstretched health departments at the local and state level, made it impossible to automate the collection and normalization of data on the pandemic. States, which suddenly found themselves needing to produce and report data out of underfunded and overstretched systems, produced COVID dashboards that were all different from each other, didn't provide APIs, and used technologies that are difficult to scrape. Human gleaning of the data from these systems allowed us to identify sudden changes in reporting, keep an eye on data definitions, and develop a deep well of experience and metadata that informed how we reported every state's data. Volunteers not only do the critical work of data collection, but through their experience in working with the data, are empowered to make key decisions in our analysis and reporting. Our tooling has matured as we have tested the edges of the possible with Google Sheets. We still use spreadsheets, but have developed more powerful tools to ensure data quality and improve our publishing process. This infrastructure allows teams working on our website and API to use a reliable dataset that serves millions of users a day. The COVID Tracking Project became the de facto source of COVID data for so many because of a community built in Slack channels by strangers.
The tools and sheets and websites we have built are impressive and useful for others to learn from. But the biggest legacy of the project will be the thousands of people who caught a glimpse of the best of themselves during a terrible time.

Find Julia Kodysh, Michal Mart, Kevin Miller, Kara Schechtman's work on GitHub · Watch Talk
Simon Willison

Datasette and Dogsheep: Liberating your personal data

Keynote

Datasette is an open source web application for exploring, analyzing and publishing data. I originally designed it to support data journalists working in newsrooms, but quickly realized that it has applications way beyond journalism. I decided to start digging into my own personal data - the data that sites and services collect about and for me, which thanks to regulations like Europe's GDPR is increasingly available for me to export myself. This led to Dogsheep, a collection of tools for importing personal data from Twitter, GitHub, 23AndMe, Foursquare, Google, HealthKit and more. Being able to export your data isn't much good if you can't easily do interesting things with it. I'll show how the combination of Datasette and Dogsheep can help liberate your personal data, and discuss the lessons I've learned about personal data and open source along the way.

Find Simon Willison's work on GitHub · Watch Talk
Abhishek Gupta

The Lab Notebook: Bringing Science Back to Data Science

All of us deal with data. A lot of us do data science. And yet only some of us get a chance to really infuse science into that data science work. Ever visit one of your old experiments and find that you want to pull out your hair because you are not sure how you arrived at some of the models that you ended up selecting, why you transformed your data the way you did, and other choices that now seem arbitrary but were perhaps perfectly reasonable then? While we invent a time machine that allows us to go back and inspect our previous (more brilliant?) selves, I have a simpler proposal: the humble lab notebook. Remember those ruled notebooks we carried around in physical labs at school diligently writing things down as we figured out how to build the best soda volcanoes? Turns out they can help us solve this problem of tracking our decisions as we arrive at different configurations that we run in our data science work so that we don’t need to curse at our past selves for making poor choices. Not only do they act as great supplements for existing AI lifecycle management tools, but they also help us share our learnings better with our colleagues (and future selves!). Come join me on this journey and let’s explore how the lab notebook can bring back science into data science. We’ll look at why you should have a lab notebook for all your data science work, how you should go about maintaining that lab notebook, and what you should and what you should NOT include in that lab notebook.

Find Abhishek Gupta's work on GitHub · View Talk Slides · Watch Talk
William Lyon

Will You Be My Graph Data-bae? A Graph Data Love Story

This talk tells the story of two technologies that were always meant to be together but came from different worlds, yet finally were able to come together and help the world make sense of data through APIs: GraphQL and graph databases. We'll discuss graph databases like Neo4j - what they are, why and how to use them, as well as the tradeoffs of building applications with graph databases. We'll review GraphQL and how to build GraphQL APIs. Then we'll see how GraphQL and Neo4j are even more powerful when used together, each leveraging features of the other (the property graph data model, a strict type system, index-free adjacency, extensibility through schema directives, and the Cypher query language) - resulting in a truly symbiotic and mutually beneficial relationship.

Find William Lyon's work on GitHub
Jeroen Janssens

Set your code free; turn it into a command-line tool

If your data analyses involve coding, then you know how liberating it is to use and create functions. They hide complexity, improve testability, and enable reusability. In this talk I explain how you can really set your code free: by turning it into a command-line tool. The command line can be a very flexible and efficient environment for working with data. It's specialized in combining tools that are written in all sorts of languages (including Python and R), running them in parallel, and applying them to massive amounts of (streaming) data. Although the command line itself has quite a learning curve, turning your existing code into a tool is, as I demonstrate, a matter of a few steps. I discuss how your new tool can be combined with existing tools in order to obtain, scrub, explore, and model data at the command line. Finally, I share some best practices regarding interface design and distribution.
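
As a rough illustration of the idea in this abstract (not the speaker's own code), the sketch below wraps a small Python function in a command-line interface using only the standard library. The function name `scrub`, the `--upper` flag, and the `stdin`/`stdout` parameters are hypothetical, chosen only for this example.

```python
"""Hypothetical example: expose a plain function as a command-line filter."""
import argparse
import sys


def scrub(line: str, upper: bool = False) -> str:
    """Trim surrounding whitespace and collapse inner runs of whitespace."""
    cleaned = " ".join(line.split())
    return cleaned.upper() if upper else cleaned


def main(argv=None, stdin=None, stdout=None) -> int:
    """Parse options, then filter lines from stdin to stdout."""
    parser = argparse.ArgumentParser(description="Clean up lines of text.")
    parser.add_argument("--upper", action="store_true",
                        help="also convert each line to upper case")
    args = parser.parse_args(argv)

    # Default to the real standard streams so the tool can sit in a pipeline;
    # the parameters exist only to make the function easy to test.
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout

    for line in stdin:
        stdout.write(scrub(line, upper=args.upper) + "\n")
    return 0
```

To run this as an actual tool, one would typically save it as a file, call `main()` under an `if __name__ == "__main__":` guard, and then chain it with other tools, for example `cat notes.txt | python scrub.py --upper | sort` (the file names here are placeholders).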

Find Jeroen Janssens's work on GitHub · View Talk Slides · Watch Talk
Brigitte Vezina

Should open content be used to train artificial intelligence?

Developments in artificial intelligence (AI) raise several questions when it comes to the use of copyright material and Creative Commons-licensed content in particular. One of them is whether CC-licensed content (photographs, artworks, text, music, etc.) should be used as input to train AI. This question illustrates the tension between the value of open data vs. legitimate concerns about ethical, moral and responsible use of openly licensed content, which this session will explore.

Watch Talk
John Borghi

Identifying the who, what, and (sometimes) where of research data sharing at an academic institution

In many areas of academic research, there is a growing emphasis on sharing data and other materials. However, tracking which researchers affiliated with a given institution are sharing, what they are sharing, and what tools they are using to share remains extremely challenging. To help inform the support services offered to affiliated researchers, I attempted to track the research data shared by Stanford Medicine in 2020. In this talk, I will describe the methods I used, the (incomplete but hopefully representative) information I was able to collect, and how this information will be used moving forward. Spoiler alert, I sorted through thousands of journal articles and data availability statements and discovered data in some unexpected places.

View Talk Slides · Watch Talk
Vincent Warmerdam

Human-Learn: Let's Draw Machine Learning Models and Train Humans Instead

Back in the old days, it was common to write rule-based systems.

Systems that do; data -> rules -> labels.

Nowadays, it's much more fashionable to use machine learning instead.

Something like; (data, labels) -> ML -> rules.

I started wondering if we might have lost something in this transition. So I made an open source package in an attempt to keep the hype at bay. Maybe natural intelligence is just enough and maybe enough is plenty. In this talk I'll present a toolbox called "human-learn" that allows you to, literally, draw out a machine learning model. I'll demonstrate the benefits of the approach. Not only do we allow the machine to learn from a domain expert, we also force the algorithm designer to actually look at the data.
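
To make the "data -> rules -> labels" direction concrete, here is a minimal sketch in plain Python. It does not use the human-learn package or its API; the feature names, thresholds, and labels are all made up for illustration.

```python
from typing import Dict, List


def rule_based_label(row: Dict[str, float]) -> str:
    """Apply hand-written rules to one record (thresholds are hypothetical)."""
    if row["length"] < 2.5:
        return "class_a"
    if row["width"] < 1.8:
        return "class_b"
    return "class_c"


def predict(rows: List[Dict[str, float]]) -> List[str]:
    """data -> rules -> labels: no training step, only the expert's rules."""
    return [rule_based_label(row) for row in rows]
```

Because the rules are ordinary code, a domain expert can read, question, and change them directly, which is the trade-off against a learned model that the abstract raises.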

Find Vincent Warmerdam's work on GitHub · Watch Talk
Katie Shaw

From opaque to open: untangling apparel supply chains with open data

Historically, the apparel sector has been shrouded in secrecy and mystique, using this to generate an impression of exclusivity and glamour. As the tragic Rana Plaza building collapse in Dhaka, Bangladesh in 2013 showed (in which well over 1,000 workers died manufacturing clothing for many major western brands) the reality is anything but. The aftermath of the Rana Plaza disaster shone a light on how few brands were aware of where their products were being manufactured, let alone the conditions in which they were being made. In response to this, a growing trend has developed for supply chain disclosure in the apparel sector, spurred on by groups such as the Transparency Pledge coalition and Fashion Revolution. However, in response to these calls for greater transparency, supply chain disclosure has been inconsistent, difficult to track from one website to another and data is often locked away in non-machine readable formats such as PDFs or tables embedded in websites. A lack of standard formatting for information as basic as name and address data (coupled with the poor quality of this data) makes it difficult and costly for anyone to compare across datasets and understand shared connections to facilities. Data has been stuck in silos and lacked a universal, central ID through which systems could synchronize, making interoperability between systems impossible. Enter the Open Apparel Registry (OAR). The OAR was launched in March 2019 to address this data challenge. At its heart, the OAR exists to drive improvements in data quality for the benefit of all stakeholders in the apparel sector. As well as many other efficiency and process benefits, the way the OAR organizes and presents data ultimately improves the lives of some of the most vulnerable workers in global supply chains. 
During our talk, we’ll share details about the challenges facing the apparel sector, including low levels of technical sophistication and understanding of open data; collaborative work that’s being done to educate the sector on the power of open data, including the launch of the Open Data Standard for the Apparel Sector (ODSAS); before finishing up with inspiring examples of how data from the Open Apparel Registry being freely shared and used is creating meaningful changes in the lives of some of global society’s most oppressed people.

Find Katie Shaw's work on GitHub · View Talk Slides · Watch Talk
Anthony Auffret, Geoffrey Aldebert and Pavel Soriano-Morales

csv-detective: solving some of the mysteries of open data

Over the last few years, the emphasis on data quality has evolved from being a nice-to-have to an absolute necessity. As already stated in past editions of this venue, the quality of open data is essential to *truly fulfill its promise as a channel of empowerment*. In practical terms, data quality for tabular data entails, among other tasks, checking the structural integrity and schematic consistency of its contents. In order to satisfy these requirements, we need to look into the files' content and first determine whether our files are indeed CSV files; second, and more importantly, we want to discover the type of data we have in order to properly validate it and thus evaluate and then hopefully improve the quality of our datasets. In this talk, we present our work on the automatic detection of data types within the columns of CSV files. Going beyond classic computer science data types (float, integer, date), we are also interested in detecting more specific in-domain data categories. Specifically, given a CSV file, we look into its columns and determine whether they contain postal codes, enterprise identifiers, geographic coordinates, days of the week, and so on. We will show how, applied to data from the French open data platform that we maintain (data.gouv.fr), these specific types allow users to better control data in order to join or link datasets between them. We see our work as a first step and important stepping stone towards data integrity and validation checks while also facilitating the discoverability and contextualization of open datasets. Our approaches are evaluated by annotating and testing over thousands of columns extracted from more than 15,000 CSV files found in data.gouv.fr.
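
The kind of column-level detection described above can be sketched with the standard library alone. This is an illustrative assumption, not csv-detective's actual implementation or API: the detector names, the regular expressions, and the 90% match threshold are all hypothetical.

```python
import csv
import io
import re

# Hypothetical detectors, tried from most specific to most generic.
DETECTORS = [
    ("postal_code", re.compile(r"^\d{5}$")),
    ("day_of_week", re.compile(
        r"^(monday|tuesday|wednesday|thursday|friday|saturday|sunday)$",
        re.IGNORECASE)),
    ("integer", re.compile(r"^-?\d+$")),
    ("float", re.compile(r"^-?\d+\.\d+$")),
]


def detect_column_types(csv_text: str, threshold: float = 0.9) -> dict:
    """Guess a type label per column from the share of matching values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            value = (row.get(name) or "").strip()
            if value:  # ignore empty cells when computing the match share
                columns[name].append(value)

    result = {}
    for name, values in columns.items():
        result[name] = "string"  # fallback when no detector matches
        for label, pattern in DETECTORS:
            matches = sum(1 for v in values if pattern.match(v))
            if values and matches / len(values) >= threshold:
                result[name] = label
                break
    return result
```

Note the ambiguity this simple ordering creates: a column of five-digit integers would be labelled `postal_code`, which is exactly the sort of case that needs evaluation against annotated columns, as the abstract describes.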

Find Anthony Auffret, Geoffrey Aldebert and Pavel Soriano-Morales's work on GitHub · View Talk Slides · Watch Talk
Donny Winston

CSV-LD: Spreadsheet-based Linked Data

Comma-separated-values (CSV) is a useful data serialization and sharing format. This talk introduces the idea of CSV-LD as a CSV-based format to serialize Linked Data, mirroring the way that JSON-LD is a JSON-based format to serialize Linked Data. "CSV" here includes any dialect that uses a different delimiter, such as tab-separated-values (TSV). The syntax of CSV-LD is designed to easily integrate into workflows that already use CSV, and provides a smooth upgrade path from CSV to CSV-LD. It is primarily intended to be a way to use Linked Data as part of spreadsheet-based data entry; to facilitate data validation, display, and conversion of CSV into other formats via use of CSV on the Web (CSVW) metadata; and to build FAIR data services. The term "CSV-LD" was previously used to describe a now-obsoleted precursor to the CSVW specifications; both approaches require a second file, a JSON-LD template document, to be shared along with a CSV file. The approach described here, in contrast, requires only a CSV file from the data producer, one that includes links to CSVW-powered metadata.

Find Donny Winston's work on GitHub · View Talk Slides · Watch Talk
Derek Beaton

The Ontario Neurodegenerative Disease Research Initiative's Neuroinformatics & Biostatistics team's pipelines, partnerships, and products.

The Ontario Neurodegenerative Disease Research Initiative (ONDRI) is a large-scale longitudinal project across five disorders spanning 520 participants, each with 3 or more annual assessments, and from 14 sites across the province. ONDRI’s data span lab measures, genetics, ocular imaging, eye-tracking tasks, many cognitive assessments, various brain imaging modalities, and an extensive array of clinical assessments. These data come in many types, shapes, and sizes. Different domain and data experts have different expectations of data standards, how data are curated, and how data are analyzed. However, ONDRI is more than data: it is also 100s of researchers and clinicians coming together to better understand neurodegenerative and cerebrovascular disorders. So, ONDRI’s Neuroinformatics & Biostatistics (NIBS) team built both an infrastructure (data) and a culture (people) around the standardization and integrity of such diverse data. We did this with (1) pipelines: supporting curation from preparation through to analyses and reporting, (2) partnerships: working closely with others on training & collaboration, and (3) products: the tools we build and use, many of which are publicly available: https://github.com/ondri-nibs. ONDRI’s data standards and outlier analyses are central to the pipelines. Our standards revolve around “data packages” that include data and dictionaries with specific nomenclature and features (e.g., predefined and unambiguous missing data codes). We built the standards around tabular (.csv format) data, then extended standards to “non-tabular” or file-based data, such as neuroimaging. We also perform a battery of (mostly) multivariate outlier analyses. We share standardized reports (written in RMarkdown) for standards and outliers with research and clinical experts to review. These steps identify anomalies and allow us to correct errors.
We were able to build a data-focused culture based on shared clinical & research goals, recognizing the difficulty with & importance of data curation, and with an eye toward public release of the data.

Find Derek Beaton's work on GitHub · Watch Talk
Kelsey Breseman

Environmental Enforcement Watch: Environmental Data Justice through Participatory Data Science

The U.S. EPA maintains an open dataset of information about permits issued under the Clean Air, Clean Water, and Resource Conservation and Recovery Acts, among others. This dataset describes, for instance, how much pollution an industrial facility is permitted to discharge into waterways and how many times they have been fined by the EPA—important information for environmental justice. But though there is a lot of information about permits and pollution, it is not organized in such a way as to answer questions about health or aggregate impacts—the kinds of questions people living in the area might find most meaningful. In the Environmental Enforcement Watch (EEW) project, we work with communities potentially impacted by facilities’ (non)compliance with their permits to co-develop open source Jupyter Notebooks. We use these to explore questions partners have about their environment and their communities’ health. In this presentation, we’ll explain our approach, show Notebooks we’ve developed, and share stories and outcomes from this attempt to model a more participatory approach to environmental governance.

Find Kelsey Breseman's work on GitHub · View Talk Slides · Watch Talk
Alyssa Travitz

Scaling computations and maintaining reproducible projects with signac, an open-source framework

Data science projects are often messy, with complex workflows and large, heterogeneous parameter spaces. Bash scripting and long file names can only get you so far, and can result in a project that is incomprehensible to collaborators or your future self. The open-source data management framework, signac, (https://signac.io) makes it easy to maintain organized and reproducible projects, and integrates with existing file-based workflows. The serverless data management and lightweight workflow model ensure that projects are easily transferable between laptops and high-performance computing environments, and the data model is well suited for high-dimensional parameter searches or hyperparameter optimization of machine learning models. In this talk, we will demonstrate the signac approach to managing data science projects, emphasizing its use in real-world scenarios. The signac framework not only increases research efficiency, but makes best practices and scalability easy and intuitive.

Find Alyssa Travitz's work on GitHub · Watch Talk
Anne Lee Steele

Data visualization and crowdsourced research: experiments in collective storytelling

These days, everyone seems to be trying to harness the power of the crowd. Amnesty International’s “Decoders” brings together “digital volunteers to tackle human rights abuses around the world”. Bellingcat crowdsources information for their online investigations, and has transformed their newsroom into an open source project. Amazon’s Mechanical Turk has long altered how research is conducted through their crowdsourcing marketplace. These tools have paved the way for new kinds of journalism and research, where theoretically anyone can contribute to projects from anywhere around the world. However, while crowdsourcing has grown as a source of data, visual storytelling has usually been left out in the process. Online tools like Datawrapper and Tableau have made data visualization easier than ever, but because storytelling with data is computationally expensive – and usually custom-made for the story – the crowdsourcing process has usually been completed before it is turned into a visual design. They don’t usually walk in lockstep with each other. This talk will attempt to bridge this gap, if only just a little. It reviews a project called supply-chains.us, which is building a sustainable open-source database that anyone can contribute to, with an integrated visualization that updates over time. This talk will describe how the project was built, why visual storytelling is important for research, and how open-source projects can be an opportunity for collective storytelling.

Find Anne Lee Steele's work on GitHub · View Talk Slides · Watch Talk
K Adam White

WordPress as Data, 5 Years In

In 2016, WordPress introduced a REST API to expose site content as JSON. Also in 2016, we had the opportunity to present at csv,conf about this paradigm shift: what would it mean for all the content in any WordPress website to be accessible as data? Since then, WordPress has fundamentally reinvented itself using that new API. The project has been reimagined from the inside out, with the humble content block as the foundational unit of site data. But where was that explosion of data-driven WordPress applications we hoped for? In this talk we will look at what worked as we integrated this new API into WordPress, and at where we stumbled. We'll explore how roles, access, and authentication limit the utility of data. We'll candidly discuss how burnout affects project stewardship. And we'll also celebrate a tremendous, successful change of course for a venerable open source project, and think ahead to what "WordPress as Data" may mean in five more years' time!

Find K Adam White's work on GitHub · View Talk Slides · Watch Talk
Henry Senyondo

Rdataretriever: A platform for downloading, cleaning, and installing publicly available datasets

The rdataretriever provides an R interface to the Python-based Data Retriever software. The Data Retriever automates the core steps of data preprocessing including downloading, cleaning, standardizing, and importing datasets into a variety of relational databases and flat file formats. The rdataretriever additionally supports provenance tracking for these steps of the analysis workflow by taking snapshots of the datasets to be committed at the time of installation and allowing them to be reinstalled with the same data and processing steps in the future. Finally, the rdataretriever supports the installation of spatial datasets into relational databases with spatial support. The rdataretriever provides an R interface to this functionality and also supports importing of datasets directly into R for immediate analysis. These tools are focused on scientific data applications, including several widely used but difficult-to-work-with datasets in ecology and the environmental sciences. The rdataretriever allows R users to access the Python Data Retriever processing platform through a combination of the reticulate package and custom features developed for working in R. Because many R users, including the domain researchers most strongly supported by this package, are not familiar with Python and its package management systems, a strong emphasis has been placed on simplifying the installation process for this package so that it can be done entirely from R. Installation requires no direct use of Python or the command line. Detailed documentation has been developed to support users in both installation and use of the software.
A Docker-based testing system and associated test suite have also been implemented to ensure that the interoperability of the R package and Python package is maintained, which is challenging due to frequent changes in reticulate and complexities in supporting cross-language functionality across multiple operating systems and R programming environments (terminal-based R and RStudio).

Find Henry Senyondo's work on GitHub · Watch Talk
Mike Trizna

Making Smithsonian Open Access Accessible with Python and Dask

In February 2020, the Smithsonian Institution released almost 3 million images and over 12 million collections metadata records under the Creative Commons Zero (CC0) license. The release was made available via a web API, a GitHub repository, and via the Registry of Open Data on Amazon Web Services (AWS). The format of the release on the GitHub and AWS sources made the data well-suited for parallelized analysis, but only with deep knowledge of the complex data structures. In this talk we will discuss how we used the Python Dask library to unlock this parallelization and make the data more accessible, as well as a student intern project that used Python tools to uncover insights specifically into the holdings of the National Museum of American History.

Find Mike Trizna's work on GitHub · View Talk Slides · Watch Talk
Grant Vousden-Dishington

Using GitHub Actions to Accelerate your Data Efforts

Data science work often requires computing resources that aren’t available to practitioners from disadvantaged backgrounds or located in “data deserts” with low technology access. Cloud computing is an option in such situations but can be costly and not friendly to beginners who don’t know what infrastructure to pick. Most of the data science community thinks of GitHub as only a code storage repository with some management features, but GitHub Actions provide a powerful and often free resource for computing that can perform many data collection and analysis needs. This introduction to GitHub Actions will provide an overview of everything needed for GitHub users to get started applying GitHub Actions to their projects. We’ll see both simple and advanced examples of the YAML format that controls various workflows and how it ties into traditional data science needs, like testing and automation. Time permitting, we’ll also see how these workflows can be used for special applications, such as open source intelligence and machine learning.

Find Grant Vousden-Dishington's work on GitHub · View Talk Slides · Watch Talk
Catherine Stihler

Better sharing, brighter future

2021 marks the 20th anniversary of Creative Commons. For 20 years we have helped build a commons of 'open' creative content free of most copyright restrictions. With CC licenses and tools, we created a simple way for creators to opt into a more permissive model of sharing. As we reflect on the past and think of the future, we know that today we must pursue a commons of knowledge and culture that is inclusive, just and which inspires reciprocity - a commons that serves the public interest. To that end we must transition from promoting more sharing to fostering better sharing. In this talk, I want to share what better sharing means in practice and how we can achieve this together.

Watch Talk
Ryan Harter

Getting Credit for Invisible Work

A lot of the hard, important work that goes into good data work is invisible. For example, most of an exploratory data analysis is testing *and discarding* hypotheses. We explore complex data, so we can distill our findings into a simple narrative. We never talk about all the hypotheses we’ve discarded! If we’re doing it right, we make our work look simple. This is super valuable, but can cause problems when we try to demonstrate our value. This talk covers some strategies for getting credit for this super valuable but invisible work.

Find Ryan Harter's work on GitHub · Watch Talk
Bastian Greshake Tzovaras

Using wearables to detect infections: A co-created & community-led pandemic response

In March 2020 – during the early days of the COVID-19 pandemic – a big question popped up in the heads of the data nerds of the Open Humans & Quantified Self community: We are wearing all these sophisticated wearable devices that measure our heart rate, respiratory rate and sometimes even body temperature - can these things be used to find out when we’re falling sick even before we consciously notice having symptoms? Starting from this initial question a community quickly formed to iteratively and collaboratively create “Quantified Flu” – an open source online tool to facilitate individual learning and insights at the intersection of early physiological warning signs and self-observed symptoms. The final prototype enables people to report over 15 different symptoms through daily reporting while monitoring physiological signals from a large variety of the most common wearable devices. We’ll not only showcase the final prototype, but also investigate how the open and iterative development process unfolded and how this helped to create a tool that fits the needs of the self-tracking community. We furthermore see how this fit of digital affordances & needs enabled a level of user engagement that’s highly untypical for the “mobile health” space.

Find Bastian Greshake Tzovaras's work on GitHub · View Talk Slides · Watch Talk
Harris Lapiroff

An API for Tracking Press Freedom

In 2017, as the Trump administration was ramping up a war of open hostility against the press, a collaborative of press freedom organizations launched a database and website to track aggressions against journalists. The U.S. Press Freedom Tracker records incidents including arrests, border stops, physical assaults, subpoenas, equipment damage, and more. In addition to detailed articles, all four years (and counting) of the Tracker data is made available through an API that can be used to track trends, aggregate incidents related to specific events, and more. We'll show off examples of how the API is already used to power interactive data visualizations and give an overview of how anyone can access and use the data.

Find Harris Lapiroff's work on GitHub · Watch Talk
Bernease Herman

Static datasets aren't enough: where deployed systems differ from research

The focus on static datasets in machine learning and AI training fails to translate to how these systems are being deployed in industry. As a result, data scientists and engineers aren't considering how these systems perform in changing, real world environments, or the feedback mechanisms and societal implications that these systems can cause. In the session, we will highlight existing tools that work with dynamic (and perhaps streaming) data. We will suggest some preliminary studies of activities and lessons that may bridge the gap in data science training for realistic data. The goal of the talk is to:
- Point to resources for AI practitioners to engage with dynamic datasets
- Engage in discussion about the impact of feedback loops and other consequences on the real world
- Brainstorm new approaches to teaching skills on dynamic datasets

Find Bernease Herman's work on GitHub · View Talk Slides · Watch Talk
Jay Miller

Making Police Call Data Readable using Pandas and Eland

Many cities in the United States publish public data as part of their oversight initiatives, but these datasets are often incredibly hard to analyze. This talk breaks down how I compiled data from the San Diego Police Department into dataframes and made it observable in Elasticsearch and Kibana.

Find Jay Miller's work on GitHub · Watch Talk
Dr Andrea Wallace and Douglas McCarthy

From tweet to sheet: crowdsourcing a global survey of Open GLAM

Three years ago a simple tweet between friends caught fire, inspiring Douglas and Andrea to start surveying open access in the GLAM (Gallery, Library, Archive, Museum) sector. Since that moment, the tiny acorn of a Google Sheet has matured into an oak tree, growing in scale, complexity and prominence. In this talk, Douglas and Andrea will share the remarkable story of the Open GLAM survey's development and impact. They'll share actionable insight into crowdsourcing and managing data with a global community of researchers, activists, copyright experts and GLAM professionals. Finally, Douglas and Andrea will discuss how they've leveraged the Open GLAM survey into Wikidata and other domains. Twitter: https://twitter.com/CultureDoug + https://twitter.com/AndeeWallace

Watch Talk
Kelsey Montgomery

Save your sanity with Synapse

Synapse is a freely available data repository to organize your research, get credit for it and collaborate with colleagues and the public. Data exploration provokes a curiosity that can lead the analyst through a labyrinth of possible outcomes. Synapse contains features to track file versions with an immutable identifier, string together file transformations, manage the visibility of sensitive data, make data discoverable with structured metadata, describe experimental context with unstructured copy and connect your output with code. All of this can be accessed from the web or programmatically with R, Python or the command line, thus integrating seamlessly into your scientific analysis. Allow Synapse to hold the memory of the places you have been, so that you can instead focus your energy on testing creative hypotheses.

Find Kelsey Montgomery's work on GitHub · Watch Talk
Jason A. Clark

Algorithms as Data: A Case Study for Turning Algorithmic User Experiences into Research Data Objects

Why does your technology seem to know what you want before you do? Increasingly, our digital experiences are mediated by obscure algorithms. But what are algorithms and how can we audit or quantify them? This session introduces an "algorithmic awareness" research module and a generic code application for auditing algorithmic user experiences and quantifying those experiences as datasets for analysis. What if algorithms weren’t the Ghost in the Machine? Can algorithms be understood as part of the open data continuum? Montana State University Library, with grant funding from the Institute of Museum and Library Services, has conducted research in order to develop software and a curriculum to support the teaching of "Algorithmic Awareness": an understanding of the rules that govern our software and shape our digital experiences. Taking our inspiration from investigative data journalists, like The Markup, we are looking to introduce our research module for algorithm auditing practices using code, web scraping methods, and structured data formats to uncover proprietary algorithms and turn them into research data objects for analysis. (Code is available in our #AlgorithmicAwareness GitHub repository.) Our case study for the module will be the YouTube Video Recommendation Algorithm, which has come under criticism for its tactics in drawing parents’ and children’s attention to their videos. Our goal will be to show the generic patterns, data points, and scripts one can use to analyze algorithmic user experiences and demonstrate how code can be used to turn algorithms into datasets for analysis. In the end, attendees will be able to realize actionable steps for seeing algorithms as data objects, gain a sense of the first steps one can take to programmatically audit these systems with code, and take away investigative data techniques for their own work.

Find Jason A. Clark's work on GitHub · View Talk Slides · Watch Talk
Simon Tyrrell

Frictionless Data for Wheat

The international wheat community has embraced the omics era and is producing larger and more heterogeneous datasets at a rapid pace in order to produce better varieties via breeding programmes. These programmes, especially in the pre-breeding space, have encouraged wheat communities to make these datasets available more openly. This is a positive step, but the consistent and standardised collection and dissemination of data based on rich metadata remains difficult, as so much of this information is stored in papers and supplementary information. Furthermore, whilst ontologies exist for data descriptions, e.g. the Environmental Factor Ontology, the Crop Ontology, etc., use of these ontology terms to annotate key development characteristics across disparate data generation methods and growing sites is rarely routine or harmonised. Therefore, we built Grassroots, an infrastructure including portals to deliver large scale datasets with semantically marked-up metadata to power FAIR data in crop research. As part of the UK Designing Future Wheat (DFW) programme, we generate a variety of data ranging from field trial experimental information, sequencing data and phenotyping images, through to molecular biology data about host and pathogen interactions, nitrogen use efficiency, and other key treatment factors. As such, there is an increasing need to be able to manage this data and its metadata to allow for consistent, easy dissemination and integration with other datasets and within analytical tools and workflows. We decided that Frictionless Data was the right framework to use due to its ease of use and open standards, so we developed open source tools for FAIR data sharing to automatically generate Frictionless Data Packages for these datasets on both our Apache/iRODS and CKAN portals.

Find Simon Tyrrell's work on GitHub · View Talk Slides · Watch Talk
Giulia Santarsieri

(Machine) Learning from Open Data platforms

Today, open data platforms host a wide and heterogeneous catalog of datasets. However, these datasets are often neglected in Machine Learning (ML) and other related tasks. This mainly happens because there are few available open data catalogs specialized in ML applications and because it is often unclear whether Machine Learning algorithms would be adequate and well performing on such datasets. Therefore, several open datasets go unused while they could be leveraged by the ML community to explain, evaluate, and challenge existing methods on real open data. For instance, these real-world data could be used by professors teaching ML courses, by students taking these courses, by researchers testing current and novel ML approaches, and possibly to promote the intersection of open data, ML and public policy. In this talk we will show you how we are tackling this issue working on datasets from data.gouv.fr (DGF), the French open data government platform. We aim to answer the question of what makes a dataset suitable and well performing for Machine Learning tasks by leveraging open source tools. Our goal is to establish a first small empirical assessment of the characteristics of a dataset (size, balance of its categorical variables and so on) that make it a “good fit” for Machine Learning algorithms. Specifically, we first manually select an adequate subset of datasets from DGF. Second, we perform statistical profiling on each of these datasets. Third, we automatically train and validate a set of ML algorithms on them and we cluster the datasets according to their evaluation results. These steps help us to better understand the nature of each dataset and thus determine which ones seem suitable for ML applications. Based on these datasets, and inspired by existing resources, we build the first version of a catalog of open datasets for ML. We hope that this platform will be a first stepping stone towards the reuse of open datasets in Machine Learning contexts.
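
As a rough illustration of the profiling step, the sketch below computes a dataset's size and a balance score for each categorical variable. It is not the speakers' pipeline: the use of normalized Shannon entropy as the balance measure, the function names, and the example rows are all assumptions made for this example.

```python
import math
from collections import Counter


def balance(values):
    """Normalized Shannon entropy: 1.0 is perfectly balanced, 0.0 is a single class."""
    counts = Counter(values)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(len(counts))


def profile(rows, categorical_columns):
    """Return the row count and a balance score for each categorical column."""
    return {
        "n_rows": len(rows),
        "balance": {
            col: balance([row[col] for row in rows]) for col in categorical_columns
        },
    }


# Hypothetical rows standing in for one open dataset.
rows = [
    {"region": "north", "label": "yes"},
    {"region": "south", "label": "yes"},
    {"region": "north", "label": "yes"},
    {"region": "south", "label": "no"},
]
report = profile(rows, ["region", "label"])
print(report)
```

A score near 1.0 suggests evenly spread categories, while a low score flags a variable dominated by one value, which is one of the characteristics the abstract lists as relevant to ML suitability.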

View Talk Slides · Watch Talk
Silvia Canelón, Elizabeth Hare

Revealing Room for Improvement in Accessibility within a Social Media Data Visualization Learning Community

We all aim to use data to tell a compelling story, and many of us enjoy sharing how we got there by open-sourcing our code, but we don't always share our story with everyone. Even kind, supportive, and open communities like the #TidyTuesday R learning community on Twitter have a ways to go before the content shared can be accessible to everyone. Lived experiences of blind R users tell us that most data visualizations shared for TidyTuesday are inaccessible to screen reading technology because they lack alternative text (i.e. alt text) descriptions. Our goal was to bring this hidden lack of accessibility to the surface by examining the alternative text accompanying data visualizations shared as part of the TidyTuesday social project. We scraped the alternative text from 6,443 TidyTuesday images posted on Twitter between April 2, 2018 and January 31, 2021. The first image attached to each tweet was considered the primary image and was scraped for alternative text. Manual web inspection revealed the CSS class and HTML element corresponding to the primary image, as well as the attribute containing the alternative text. We used this information and the ROpenSci {RSelenium} package to scrape the alternative text. Our preliminary analysis found that only 2.4% of the images contained a text description entered by the tweet author, compared to 84% which were described by default as "Image". This small group of intentional alternative text descriptions had a median word count of 18 (range: 1-170), and a median character count of 83 (range: 8-788). As a reference point, Twitter allows 280 characters in a single tweet and 1,000 characters for image descriptions. This analysis was made possible thanks to a dataset of historical TidyTuesday tweet data collected using the ROpenSci {rtweet} package, and openly available in the TidyTuesday GitHub repository (https://github.com/rfordatascience/tidytuesday). Twitter: @spcanelon; @DogGeneticsLLC GitHub: @spcanelon; @LizHareDogs
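
The summary statistics described above (share of default descriptions, median word and character counts of author-written ones) can be sketched in a few lines. The authors worked in R; this is a Python illustration only, and the sample strings, the `DEFAULT_ALT` constant, and the function name are assumptions, not the study's data or code.

```python
from statistics import median

# Hypothetical alt-text values, one per scraped image (not the study's data).
alt_texts = [
    "Image",
    "Image",
    "Bar chart of penguin counts by species",
    "",
    "Line chart showing weekly downloads rising over time",
]

DEFAULT_ALT = "Image"  # assumed placeholder shown when no description was entered


def summarize_alt_text(values):
    """Separate default from author-written descriptions and summarize the latter."""
    written = [v for v in values if v and v != DEFAULT_ALT]
    word_counts = [len(v.split()) for v in written]
    char_counts = [len(v) for v in written]
    return {
        "n_images": len(values),
        "share_default": sum(v == DEFAULT_ALT for v in values) / len(values),
        "share_written": len(written) / len(values),
        "median_words": median(word_counts) if written else None,
        "median_chars": median(char_counts) if written else None,
    }


summary = summarize_alt_text(alt_texts)
print(summary)
```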

Watch Talk
Katy Sill, Adam Kariv

The Water Point Data Exchange - using Open Data for Universal Water Access

Water points are still the main - or sometimes only - source of water for millions of people living in rural areas of the world. The Water Point Data Exchange (WPDx) unlocks the potential of water point data to improve the information available to governments, service providers, researchers, and NGOs, and create data-based decision-making for increasing rural water access. Governments and their partners are already collecting substantial amounts of water point data. However, data is often too fragmented to use, with different approaches for data collection and storage. Without harmonization among these different data sources, the true potential of this information remains untapped so it is only used for reporting and not decision support. By establishing a platform for sharing water point data throughout the global water sector, WPDx adds value to the data already being collected. It brings together diverse data sets and establishes an unprecedented understanding of water services. All stakeholders can easily use the WPDx Data Standard to harmonize their existing data structure. Formatting existing data into the WPDx Data Standard typically takes less than 30 minutes. The WPDx Data Standard fits most water point data, even if it was collected with no knowledge of the standard. After formatting, files can be uploaded and published using the WPDx ingestion engine. It removes duplicate entries, links multiple updates to the same water point, and integrates the data into the Global Data Repository. Using this data and advanced GIS and machine learning analysis, several decision support tools were built. Designed in partnership with governments and data scientists, these tools provide concrete insights, like which water point to send a technician to rehabilitate next to reach the most people. In this presentation we will share some of the challenges we face and the methods we use in making this possible.

Watch Talk
Zane Selvans

Distributing Power with Open Data

Some lessons learned from our work getting electricity system data into the hands of activists and researchers who are trying to shift power in US energy policy making. Under what circumstances can data help drive policy changes, and how does that change happen? Is open data always preferable in an advocacy context? What does it really mean for data to be accessible to advocates? US energy regulation is extremely technocratic, and largely happens out of public view. For years we’ve tried to address the information asymmetry that exists between advocates and large utilities by publishing cleaned open datasets detailing the inner workings of the US electricity system. We use Python data science tools to prepare the data and they serve us well, but most advocates -- even those with a lot of domain expertise -- are either hardcore spreadsheet users, or are just starting to dabble in Jupyter Notebooks. Now that we have a decent data pipeline, we are turning our attention to improving the data’s accessibility for our target users, using Datasette, Docker containers that run Jupyter, and Intake data catalogs.

Find Zane Selvans's work on GitHub · View Talk Slides · Watch Talk
Karl Broman

Data cleaning principles

Why don't we teach data cleaning? It has been said that it is difficult to generalize: that what we learn from cleaning Medicare data cannot be readily applied to the cleaning of RNA-seq data. To the contrary, I think there are important general principles for cleaning data, and there are more commonalities in the creative process of data cleaning than in other aspects of data analysis. I will seek to delineate and illustrate these principles, which include:
(1) think about what might have gone wrong and how it might be revealed
(2) study the pattern of missing data and ask why they are missing
(3) use care when merging datasets and focus on labels not position
(4) if things are supposed to match, check that they do
(5) if things are supposed to be distinct, check that they are
(6) look for outliers and other oddities by making lots of plots, particularly scatterplots
(7) look for batch effects by making lots of plots, particularly plots against time and scatterplots colored by batch
(8) ask for the primary data and also metadata
(9) don’t be shy about asking questions
(10) document not just what you did but also why you did it
(11) don’t trust anyone (even yourself)
(12) allocate sufficient time and energy to the effort
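
Several of these principles translate into small, reusable checks. The sketch below is an illustration only, written with Python's standard library; the `subjects` and `measurements` records and all function names are hypothetical and not taken from the talk. It touches on principles (2) through (5).

```python
from collections import Counter

# Hypothetical example records; field names are assumptions for illustration.
subjects = [
    {"id": "S1", "sex": "F"},
    {"id": "S2", "sex": "M"},
    {"id": "S2", "sex": "M"},   # duplicated ID
    {"id": "S4", "sex": None},  # missing value
]
measurements = [
    {"id": "S1", "weight": 20.1},
    {"id": "S2", "weight": 23.4},
    {"id": "S3", "weight": 19.8},  # no matching subject
]


def missing_counts(rows):
    """Principle (2): count missing values per field."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None or value == "":
                counts[field] += 1
    return dict(counts)


def duplicated(rows, key):
    """Principle (5): return key values that appear more than once."""
    tally = Counter(row[key] for row in rows)
    return sorted(k for k, n in tally.items() if n > 1)


def unmatched(left, right, key):
    """Principle (4): return keys present in only one of the two tables."""
    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}
    return sorted(left_keys - right_keys), sorted(right_keys - left_keys)


def merge_by_label(left, right, key):
    """Principle (3): join on the ID label, never on row position."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


print(missing_counts(subjects))
print(duplicated(subjects, "id"))
print(unmatched(subjects, measurements, "id"))
print(merge_by_label(measurements, subjects, "id"))
```

Running the checks before the merge matters here: the join silently drops `S3` and hides the duplicated `S2`, which is exactly the kind of problem the checks are meant to surface first.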

Find Karl Broman's work on GitHub · View Talk Slides · Watch Talk
Emily Riederer

Building a team of R packages

Many case studies demonstrate the benefits of organizations and research groups developing internal R packages. But how do you move your organization from individual internal packages to a coherent internal ecosystem? This talk applies the jobs-to-be-done framework to consider the different roles that internal tools can play, from unblocking IT challenges to democratizing tribal knowledge. Beyond technical functionality, we will explore design principles and practices that make internal packages good teammates and consider how these deviate from open-source standards. Finally, we will consider how to exploit the unique challenges and opportunities of developing within an organization to make packages that collaborate well – both with other packages and their human teammates.

Find Emily Riederer's work on GitHub · Watch Talk
Katerina Drakoulaki, Daniel Alcalá López, Jacqueline Maasch, Sam Wilairat, Lilly Winfree

Frictionless Data workshop led by the Reproducible Research fellows

Birds of a Feather session

This workshop will cover an introduction to the open source Frictionless Data tools. Participants will learn about data wrangling, including how to document metadata, package data into a datapackage, write a schema to describe data and validate data. The workshop is suitable for beginners and those looking to learn more about using Frictionless Data. It will be presented in English, but you can ask questions in English or Spanish.
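
To give a flavor of what writing a schema and validating data against it means, here is a minimal sketch. The `schema` dictionary follows the general layout of a Table Schema descriptor (a list of fields with a name, a type, and optional constraints), but the `validate` function is a hand-rolled stand-in written for this example; it is not the Frictionless Data library, which supports many more types and constraints. The sample CSV and field names are hypothetical.

```python
import csv
import io

# Descriptor in the style of a Table Schema; the fields are made up for illustration.
schema = {
    "fields": [
        {"name": "id", "type": "integer", "constraints": {"required": True}},
        {"name": "site", "type": "string"},
        {"name": "temperature", "type": "number"},
    ]
}

# Only three types are handled in this sketch.
CASTS = {"integer": int, "number": float, "string": str}


def validate(csv_text, schema):
    """Return a list of (row_number, field, message) problems found in csv_text."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    expected = [field["name"] for field in schema["fields"]]
    if reader.fieldnames != expected:
        errors.append((1, None, f"header {reader.fieldnames} != {expected}"))
        return errors
    for row_number, row in enumerate(reader, start=2):  # line 1 is the header
        for field in schema["fields"]:
            value = row[field["name"]]
            required = field.get("constraints", {}).get("required", False)
            if value is None or value == "":
                if required:
                    errors.append((row_number, field["name"], "missing required value"))
                continue
            try:
                CASTS[field["type"]](value)
            except ValueError:
                errors.append((row_number, field["name"], f"not a valid {field['type']}"))
    return errors


data = "id,site,temperature\n1,north,12.5\n,south,warm\n"
problems = validate(data, schema)
for problem in problems:
    print(problem)
```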

Find Katerina Drakoulaki, Daniel Alcalá López, Jacqueline Maasch, Sam Wilairat, Lilly Winfree's work on GitHub
Emily Lescak

Birds of a Feather session

Tracking Impact and Measuring Success in Data Education Events

With the increase in computing power and available technologies to store and analyze data, the demand for data science skills has grown. To address this need, numerous entities are creating training via institutional and community-led events. Data education events upskill and train anyone who works with data, from new researchers to experienced data science practitioners. Organizers of such events strive to measure immediate and long-term impact so that they can improve training efficacy, recruit new partners and event participants, acquire funding, and / or fulfill funding requirements. Measuring the impact of these events can be challenging as the impact is often only shown after the event has completed, as skills are applied in real world scenarios. Also, impact is multi-dimensional and not always quantifiable. This session, co-developed by Emily Lescak, Beth Duckles, Yo Yehudi, Yanina Bellini Saibene, Ciera Martinez, Leslie Alanis, and Reshama Shaikh, will facilitate discussion for newcomers and experienced event outcome measurers alike, covering topics such as: motivation for measuring impact, determining what we can measure and how we can measure it, challenges to measuring impact, and designing impact strategies for different stakeholders (e.g., funders, co-organizers, the larger community). Following the session, we will create a blog post that synthesizes the discussion and collates resources shared by facilitators and attendees.

Find Emily Lescak's work on GitHub
Nick Santos

Birds of a Feather session

Versioning, Sharing, and Managing Medium Data

How should we manage medium data - the data that is too small to justify management with 'big data' tools and too large to store in a source controlled repository? Data formats in this size class, such as large SQLite databases, (Geo)Tiffs, and other text and binary data need versioning, shareability, and strong, accessible metadata as well. But current tools for these formats typically lack one or more of these qualities, are closed source, or require setup of significant external dependencies with need for ongoing funding, maintenance, and security expertise. It seems that no approach fits most use cases since each one has important tradeoffs - if you use tools like SQLite, GitLFS, DVC, and Dolt or work with medium data, then please come discuss tools, methods, and workflows to version and manage medium data of all types.

Find Nick Santos's work on GitHub
Stephen Jacobs

Birds of a Feather session

How can we use data on Open Work for Tenure and Promotion?

Open@RIT is the Open Programs Office for the Rochester Institute of Technology. We are working on a draft set of recommendations for faculty to describe, and administration to understand and evaluate, the impact and translation of work in Open Software, Hardware, Data, Science, Educational Resources, etc. In this BOF session we'll share the current straw man and discuss how it might apply to other colleges and universities.

Sara El-Gebali

Birds of a Feather session

OpenCIDER: Open works, Computational Inclusion, Digital Equity and Regulatory Handbook

To advance active participation in open data practices and to improve the quality of research, we need to adopt inclusive practices with particular attention to communities with limited resources to ensure equitable and effective engagement of underrepresented groups. To that end, OpenCIDER is a knowledge space where we highlight communities and resources related to several aspects under the umbrella of Open Data from a global perspective with a strong focus on Low-Middle Income countries (LMICs). We believe that enriching the Open Data community by propagating the foundations of equity, inclusion, sharing, and FAIRness on a GLOBAL scale will provide significant benefits for research. It will also translate to more effective use of resources, increased rigor, transparency, re-use, participation, accountability, and reproducibility.

Find Sara El-Gebali's work on GitHub
Wesley Bernegger

Discovering and Communicating Empathy in Data

Birds of a Feather session

Over the past decade there’s been a growing movement to rediscover and nurture the empathy underlying data analysis and visualization: to humanize data. It’s natural to become numbed to data that has been abstracted or numbers that are just too large to comprehend. How can we remind ourselves that what we’re looking at are not just dots, lines, or colors but people or animals? This past year, with all its strangeness and tragedy, has made a few things clear: How much we need human connection. How much we rely on quality data, analysis, and reporting in times of crisis. And how difficult it is to combine those two needs. In this hour we’ll get together to discuss the role of empathy in data visualization. Why is it important? Where are we succeeding? Where are we falling short? Which data visualizations have touched you on a deeper level? Which visualizations tried to but somehow missed the mark? How do you meaningfully visualize 500,000 American lives lost to COVID-19? 500,000 dots? 500,000 faces? 500,000 names? Or is there something we’re still missing? Some context that makes the data feel real? Let’s dig into this.

Alexandra Chassanoff, Tim Stallman, Michelle Tackaberry, JT Tabron

Birds of a Feather session

Historic Data Harms: Lessons from Curating a Civic Data Collection

How can civic data and digital tools engage community members and encourage new understandings of history? This session will introduce and describe “Hacking into History”, a year-long research effort aimed at documenting and transcribing racial covenant clauses found in Durham County, North Carolina property deeds. Partly educational and part volunteer transcription, the project began as a collaboration between The School of Library and Information Sciences at North Carolina Central University (NCCU), DataWorks NC and the Durham County Register of Deeds. While the project originally planned to hold multiple in person transcription events and educational workshops, we have transitioned most work to the virtual sphere. We will reflect on some of the unexpected challenges and potential opportunities raised by this move, including some lessons learned in preparing community members for engagement with racially explicit civic documentation in an online space. Anticipated outcomes from our work include the creation of a publicly accessible, transcribed collection of racially restrictive property deeds in Durham to further serve as a primary source for public engagement and historical understanding. The panel will also discuss providing access to historical public records information, digitization, technologies, data, and community partners involved, skills of the community project team that were essential to this type of project, efforts undertaken by the NCCU team, and the suitability of such a project for similar groups. We will close our presentation with lessons learned from making these materials computationally ready for reuse. Time for questions and comments will round out our session.

Project website · Project platform · Project email

Meag Doherty

Exploring the possible of biomedical data visualization

Birds of a Feather session

I would like to bring together a group of practitioners to discuss the ingredients for building a successful biomedical data resource. At the All of Us Research Program, we think about the following core elements of building a sustainable researcher community: Access, Data, Tools, Community, Collaboration, Training, and Sharing. I would like to discuss how others are building similar frameworks and review some tactics and best practices from other disciplines like community advocacy and security-centered design.

Find Meag Doherty's work on GitHub
Ryan Harter

Soft Data Science Practitioners

Birds of a Feather session

I’m convinced that great data scientists aren’t defined by their hard skills (like statistics and programming). Instead, the exceptional data scientists I know demonstrate excellence through soft skills like client management, great writing, or strong intuition. This birds-of-a-feather session is for data scientists who agree that soft-skills rule and want to find a way to legitimize this brand of data science.

Find Ryan Harter's work on GitHub
Agnes Cameron, SJ Klein

Data collaboratives in practice

Birds of a Feather session

This is a time for rethinking data and its stewardship. Advances in AI + ML are expanding how we can extract meaning from data, at the same time as a flourishing in communities of personal knowledge tools (such as Roam & JAMStack blogs), and a growing interest in distributing power over information (on both technological and human levels). We want to facilitate a discussion about building and maintaining data collaboratives -- communities of practice curating structured parts of the open knowledge landscape. Questions we are particularly interested in:
- What tools do you use to facilitate coordinated data curation?
- How do you interlace this work with the broader landscape of public knowledge?
- What are steps you encourage others to implement now in their own research + teams, to make their current knowledge flows more legible to, and compatible with, such collaborative sense-making?
- Which of these steps are not currently prioritized by the knowledge-sharing tools you use, but should be?
We will share perspectives from the Knowledge Futures Group, including data collaboratives for innovation datasets, citation graphs, open-source housing projects, and open journals.