csv,conf,v5

Speakers

We assembled an exciting lineup of speakers for csv,conf,v5: data makers and enthusiasts representing academia, science, journalism, government, and open source projects. Links to session slides and recorded talks are at the bottom of each speaker's blurb.

Rudo Kemper

Mapping and safeguarding indigenous oral histories using an open source tool

For many indigenous and other local communities in the rainforests of South America, survival has always depended on an intimate knowledge of their territory, passed down by their ancestors. Place-based stories help determine where food or resources are located, or where dangers lie hidden. Most importantly, these oral histories reinforce the communities' historical and cultural connection to their homelands, which in turn informs their collective identity. In the contemporary context, however, oral storytelling traditions are at risk of disappearing as younger community members leave their villages, intermittently or permanently, in search of work. To prevent invaluable oral histories from disappearing into oblivion, we developed a methodology and built a FOSS application to help communities map their oral histories. Terrastories was born in 2018, after we realized the need for an offline-first geostorytelling tool that can work in very remote conditions such as the Amazon rainforest and give communities the power to manage their own traditional knowledge and storytelling data. I will share how the project was born and why indigenous peoples needed this tool, explain how Terrastories works, and discuss some tough questions around data sovereignty, protection of sensitive data, and archiving traditional knowledge.

View Slides · Watch Talk
Angela Li

Data Communities and Those Who Build Them

In order to support the use of data in various contexts, it is important to have champions who demonstrate the value of tools, shepherd new users, and provide support to those learning to use data. In this talk, I’ll cover my experiences with data community building and share strategies from open source communities such as R-Ladies and The Carpentries, as well as the spatial data science community I support at my research center. I’ll discuss the value of the data community builder and their role within / outside of institutions for encouraging uptake of technical tools. Some of the skills needed for these roles may not be what you think: building relationships, being empathetic toward use cases (and users!), organizing events, teaching effectively, and setting strategy for expanding data networks, among others.

View Slides · Watch Talk
David Selassie Opoku

Low-Income Data Diaries - How “Low-Tech” Data Experiences Can Inspire Accessible Data Skills and Tool Design

How would you communicate a data visualisation over the radio, or remotely teach data journalism to a team of freelance journalists with 1GB-RAM laptops and spotty internet? How does a civic technologist without access to a credit card get access to cloud services for her community contracts monitoring app? How do you explain the importance of data privacy to a community of farmers who never completed basic schooling? Hopefully, in 2020 most people will not deny the transformative power of data literacy in the digital age. Whether you're a journalist, a business owner, a government official, an activist, a researcher or a student, knowing how to access the right data, transform it into information, and leverage its insights for decision-making and action has become a life skill. Take a journey with me as we travel to and alongside several data community members working in low-income contexts, and hear how their "low-tech" (LOW-TECHnical knowledge and LOW-TECHnological tools/resources) contexts are highlighting gaps in the current data community and driving opportunities to rethink and shape how the community can make data skills and tools accessible to more people.

View Slides · Watch Talk
Emily Riederer

RMarkdown Driven Development

RMarkdown enables analysts to engage with code interactively, embrace literate programming, and rapidly produce a wide variety of high-quality data products such as documents, emails, dashboards, and websites. However, RMarkdown is less commonly explored and celebrated for the important role it can play in helping R users grow into developers. In this talk, I will provide an overview of RMarkdown Driven Development: a workflow for converting one-off analysis into a well-engineered and well-designed R package with deep empathy for user needs. We will explore how the methodical incorporation of good coding practices such as modularization and testing naturally evolves a single-file RMarkdown into an R project or package. Along the way, we will discuss big-picture questions like “optimal stopping” (why some data products are better left as single files or projects) and concrete details such as the {here} and {testthat} packages which can provide step-change improvements to project sustainability.

View Slides · Watch Talk
Wendy Wong

On using AutoML to predict clinical outcomes

AutoML allows users to create high-quality machine learning models to solve real-world problems without much coding. Recently, AutoML has been used in machine learning competitions such as Kaggle and has shown excellent performance. The purpose of this study is to investigate whether biologists who have little experience with machine learning can use AutoML to gain insights from their data. In this study, I will re-analyze a case-control gene expression data set with open-source AutoML frameworks. I will compare the frameworks' performance in creating a predictive model for disease using biomarkers from the expression data. I will demonstrate how to keep track of models and their hyperparameters using MLflow. Finally, I will attempt to gain insights by integrating information from explainable AI tools such as DALEX and biological pathways for gene set enrichment analysis.
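To make the experiment-tracking step concrete, here is a minimal, hedged sketch (not the study's actual code) of logging a model's hyperparameters and a performance metric with MLflow; the synthetic stand-in dataset, parameter values, and run name are all illustrative.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a case-control gene expression matrix.
X, y = make_classification(n_samples=200, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

params = {"C": 0.5, "penalty": "l2"}  # hypothetical hyperparameters

with mlflow.start_run(run_name="baseline-logreg"):
    mlflow.log_params(params)                       # record hyperparameters
    model = LogisticRegression(max_iter=1000, **params).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_metric("test_auc", auc)              # record performance
    mlflow.sklearn.log_model(model, "model")        # store the fitted model
```

Each run then appears in the MLflow tracking UI alongside its parameters and metrics, which is the bookkeeping pattern the talk refers to.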

View Slides · Watch Talk
Philip Ashlock

The State of Open Government Data Infrastructure

This talk will provide an overview of the current state of open government data infrastructure and the broader ecosystem from the perspective of Data.gov and the implementation of open data laws in the United States. It will cover the widespread use of the W3C DCAT metadata standard across all Federal agencies, as well as its widespread use by state and local governments. This same metadata also helps generate the Schema.org variant of the specification, which fuels listings on general-purpose platforms like Google Dataset Search. The European Union has been updating its DCAT Application Profile following the recent development of DCAT 2.0, and the US Government will be revising its DCAT specification to meet updated requirements in the new comprehensive open data law (the "Evidence Act"), with public input on GitHub. The talk will survey this infrastructure and ecosystem and consider how other metadata standards, including CSVW, Tabular Data Packages, DSPL, and SDMX, fit into the mix, as well as how we can better leverage CSVs and tabular data tools and capabilities within platforms like Data.gov and other CKAN-based data catalogs. Since this talk should fall in the midst of the public comment period for revising the US Government-focused profile of DCAT, it will also be a good opportunity to solicit comments and public participation in the update to the metadata specification used across all government agencies. The current legacy version can be found at https://resources.data.gov/schemas/dcat-us/v1.1/
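For readers unfamiliar with this metadata, the sketch below shows roughly what a single dataset entry in an agency's data.json catalog looks like under the DCAT-US v1.1 schema linked above; the field names follow that schema, but every value here is hypothetical.

```python
import json

# Hypothetical dataset entry in the DCAT-US v1.1 (Project Open Data) style
# that Data.gov harvests from agency data.json files.
dataset = {
    "@type": "dcat:Dataset",
    "title": "Example Air Quality Measurements",
    "description": "Hourly PM2.5 readings from example monitoring stations.",
    "keyword": ["air quality", "pm2.5"],
    "modified": "2020-05-13",
    "identifier": "https://example.gov/data/air-quality-2020",
    "accessLevel": "public",
    "publisher": {"@type": "org:Organization", "name": "Example Agency"},
    "contactPoint": {
        "@type": "vcard:Contact",
        "fn": "Open Data Coordinator",
        "hasEmail": "mailto:opendata@example.gov",
    },
    "distribution": [
        {
            "@type": "dcat:Distribution",
            "downloadURL": "https://example.gov/data/air-quality-2020.csv",
            "mediaType": "text/csv",
        }
    ],
}

print(json.dumps({"dataset": [dataset]}, indent=2))
```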

Watch Talk
Amanda Ludden

Around the world in 80 data formats: Re-packaging the Harvard Business Review archive as an accessible, internal database

We often evaluate digitally-published content using event-based metrics (clicks, conversions), persisting only snippets of the original content (headlines, tags) rather than full text. At Harvard Business Review, decades of editorial rigor have yielded an archive that's small but rich and modular relative to, e.g., breaking news sites -- and our strategies ought to reflect that. Here, I'll share outcomes and lessons from an initiative to re-package our archive text and metadata -- from XML to data.frames, .RData, CSV, XLS, and JSON -- as an internal data product for editors, analysts, product managers, and others across the organization.

Watch Talk
Kathleen Sullivan and Andrew Mckenna-Foster

The Complicated Problem of Closing Open Data

Open government data portals can be a valuable source of diverse public-sector information, from lobbyist spending to water quality test results. They also have a lot of aging junk: little-used datasets with cryptic titles and scant documentation. The library domain has years of experience “weeding” out under-used or outdated materials, and our team brought that perspective to the Washington State open data portal (data.wa.gov/browse). But weeding open data turned out to be pretty complicated. What counts as “low quality” or “little” use? Who makes the weeding decision, in a publishing environment that’s usually decentralized and managed anonymously? How do portals, created for transparency, remove content in a transparent way? We examined nearly all U.S. state and many U.S. city government portals and found very few formal data removal policies -- a void that raises questions about authority, usability and management, as many civic data portals enter their second decade. Supported by the Open Data Literacy project (https://odl.ischool.uw.edu/), this work is contributing to a new open data curation partnership between the Washington State Library and the state’s data portal. The talk will cover our findings, recommendations from archivists and other public records experts, and how the Washington State Library and state agencies are working together to improve open data resources for the public.

View Slides · Watch Talk
Gabriele S Hayden

The cultural meaning of programming languages

Human languages develop over time a set of cultural associations. For example, during the early and mid-twentieth century, the Spanish language in the United States was seen both as a European high literary language and as a language associated with femininity and racial mixture. In my dissertation work, I documented how US modernist poets who translated from Spanish used or resisted those associations as they intervened in controversies over race and style in twentieth century poetics. Coding languages, too, can take on gendered and raced cultural associations. This talk explores the changing cultural associations of coding languages and how those cultural associations were expressed in changing coding styles or conventions. It draws on the work of Vikram Chandra (Geek Sublime) and other writers exploring the cultural meanings and stylistic associations of computer code.

View Slides · Watch Talk
Tempest van Schaik, PhD

Building successful collaborations around healthcare data

Besides a few high-profile machine learning for health breakthroughs, what is the state of real-world data science for health, and why do so few algorithms make it into production? We'll explore the rich variety of health data that exists, how to avoid common pitfalls with it, ways to make the most positive impact with health data, and the culture of the different stakeholders who work with data. We'll consider this topic with reference to Project Fizzyo, which aims to improve the lives of children with cystic fibrosis, using data from custom respiratory devices used during physiotherapy.

Watch Talk
Samuel Brice

Demystifying Clearview: Vehicle Tracking with Public CCTV Cameras

Recently the New York Times published an article about Clearview AI, the secretive company that might end privacy as we know it. Using a database of billions of images scraped from websites such as Facebook and Instagram, Clearview can track and identify anyone with a web presence. The tool is actively being used by police agencies around the country, and many citizens are concerned about its potential for abuse. My talk goes into detail explaining how Clearview's system works by demonstrating a similar system for tracking and identifying cars using public CCTV cameras. I will cover the steps of implementing such a pipeline, from collecting training data, to building a neural net model, to tracking the movements of a car in time and space. Lastly, I will cover the implications of such a capability for privacy, as well as what can be done to protect our privacy today.

Rebecca Williams and Hunter Owens

The good, the bad, the extremely obtuse: a survey of government (open) data regulations and how to successfully build accountability programs in government

You've heard of GDPR, but are you aware of how recent legislation from Congress and California may affect your beat? This session will bring together two experts in government data management to discuss the impacts of recent legislation, including the Foundations for Evidence-Based Policymaking Act, the Geospatial Data Act, the Grant Reporting Efficiency and Agreements Transparency Act, and the California Consumer Privacy Act. They will discuss how this legislation affects access to data, tracking down data, formats of data, and other nuances of the governmental data-making process. They may also speculate wildly on how future legislation could impact government data management. Rebecca Williams on Twitter: @internetrebecca. Hunter Owens on Twitter: @hunter_owens

View Slides · Watch Talk
James Perry Evans

Protected health information breaches on GitHub

Medical scientists are encouraged to use GitHub for software development, but without training, they might leak protected health information (PHI) by inadvertently including data in what should be software-only repositories. During the fall of 2016, we attempted to identify obvious breaches of PHI on GitHub as part of an ongoing interest in patient privacy. Searching GitHub for the keywords patient, dob, and ssn uncovered hundreds of repositories, which were further scanned for sensitive information (names, organizations, phone numbers, street addresses, credit cards, IPs, SSNs, and emails) using Python's CommonRegex module and the Stanford Natural Language Toolkit. Manual investigation of the results uncovered four repositories that exposed patient information. 1) A popular health care provider exposed approximately 4,000 patient names. On Dec 1, we were able to identify the provider from both the repository name and doctor names in the repository files; we contacted the organization, and the repositories were taken down within a day of contact. 2) A health collection agency's repository exposed the social security numbers, dates of birth, home addresses, email addresses, and insurance and billing information of roughly 30,000 patients. After we contacted the repository owner, the data and the repository were removed from GitHub, some six months after the data had first been exposed. 3) A crisis center's long-term breach of PHI was discovered in August 2016; judging from the repository commit dates, it had been up for at least three years. The repository held the medical records application developed for the crisis center. We contacted that organization and the repository was taken down within a few days. 4) A contractor for a health insurance wellness program leaked patient data including names, social security numbers, addresses, and health measures such as blood pressure. This organization was contacted by our hospital compliance office and the GitHub repository was removed. Our talk will cover the discovery of these PHI breaches and how we handled them with GitHub, our hospital compliance office, and the organizations involved.
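As a rough illustration of the kind of scan described (this is not the authors' tooling), the sketch below matches a few sensitive-looking patterns in text using Python's built-in re module; the patterns, helper name, and sample string are simplified and purely illustrative.

```python
import re

# Illustrative patterns only; a real scan would combine a regex library such
# as CommonRegex with named-entity recognition for names and organizations.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def scan_text(text):
    """Return {label: [matches]} for any sensitive-looking strings in text."""
    hits = {}
    for label, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[label] = found
    return hits

if __name__ == "__main__":
    sample = "Contact jane.doe@example.org, phone 555-123-4567, SSN 123-45-6789."
    print(scan_text(sample))
```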

View Slides · Watch Talk
Monica Granados and Lily Zhao

How Frictionless Data can help you grease your data

I think we have all been subject to other people's data: the frustration and disappointment that follow when we determine that the data is unusable. The Frictionless Data initiative at Open Knowledge Foundation aims to reduce friction in working with data, with a goal to make it effortless to transport data among different tools and platforms for further analysis, and with an emphasis on reproducible research and open data. As inaugural Reproducible Research Fellows of the program, we will demonstrate how we have applied the principles and tools of Frictionless Data to our own research data, on the octopus trade and on open access bibliometrics, to make our data more reusable by others. This talk is aimed at all data wranglers, and along the way we will talk about our experience in the fellowship, some of the difficulties we encountered, and our triumphs. Monica Granados on Twitter: @monsauce. Lily Zhao on Twitter: @lily_z_zhao.
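For context, a central artifact in the Frictionless Data tooling is a datapackage.json descriptor that travels alongside the CSVs it describes. The sketch below writes a minimal, hypothetical descriptor for a single CSV resource; the file names, fields, and metadata are invented for illustration.

```python
import json

# A minimal, hypothetical Tabular Data Package descriptor for one CSV resource.
datapackage = {
    "name": "octopus-trade-example",
    "title": "Example octopus trade records",
    "licenses": [{"name": "CC0-1.0"}],
    "resources": [
        {
            "name": "trade-records",
            "path": "data/trade_records.csv",
            "format": "csv",
            "schema": {
                "fields": [
                    {"name": "year", "type": "integer"},
                    {"name": "country", "type": "string"},
                    {"name": "volume_tonnes", "type": "number"},
                ]
            },
        }
    ],
}

with open("datapackage.json", "w") as f:
    json.dump(datapackage, f, indent=2)
```

Tools that understand the descriptor can then validate the CSV against the declared schema, which is what makes the data easier to reuse across platforms.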

View Slides · Watch Talk
Wesley Teal

More than manuscripts: Transforming special collections materials into ornithological data

In 2016, the Iowa State University Library was awarded a CLIR grant to digitize several ornithological collections that were part of our special collections to create avIAn: Avian Archive of Iowa Online. What made this project unique was that in addition to creating a fairly traditional digital collection, we would also build a data set out of rare bird sighting documentation, transforming an inaccessible paper archive into online data for anyone to use. This talk will give an overview of how a team of library staff without ornithological expertise tackled the task of building a public data set.

View Slides · Watch Talk
John Muyskens

Data Journalism in the Anthropocene

From networks of thousands of ground-based sensors to constellations of satellites, a flood of data creates an increasingly clear picture of our impact on Earth. I will share my experiences as a journalist who uses data and visuals to cover climate change and other environmental issues. I will also talk about how we at The Washington Post use (and create) publicly accessible data in our journalism.

Watch Talk
Katherine Simeon

Learner-centered teaching for the non-traditional data science classroom

There is considerable variability in how individuals learn computing skills. While many formal courses, both online and in-person, are available to learners, typical routes to learning are acquiring knowledge on the job and participating in short-term workshops. Although these non-traditional contexts can be highly successful, instructors are challenged to make the most of limited time and resources while catering to learners of different backgrounds and experience levels. In this talk, we will discuss how pedagogic research can be applied to computing and programming education in non-traditional settings (e.g., corporate training, one-day workshops, and one-on-one mentorship). Specifically, we will explore how to modify established learning formats to address the specific needs of learners. Finally, we will outline best practices that facilitate a collaborative and inclusive learning environment and can motivate both instructors and learners. Katherine Simeon is the presenting author; Diya Das and Angela Li are co-authors.

View Slides · Watch Talk
Salina Cheuk Ting Ho

Decision making in 'successful' data analysis

As the tech world continues to abstract away complexities and realities into proxy metrics, data and analysis can present themselves as objective and build a reputation of legitimacy. In practice, however, many dynamics contribute to the determination of whether an analysis is 'successful' or not, many of which exist outside of the data itself. This is why the same regression model can be accepted as 'successful' in one context but not in another. Understanding the interplay between data and these different factors can help analysts strategize to deliver a 'successful' analysis, for example by applying design thinking, storytelling, and empathy techniques. Many discussions within the data community have addressed these techniques and their importance, but few have discussed their ethical impact: what are the implications if an analysis crafted through a specific lens is 'successful'? This talk will go through the various dynamics that contribute to a 'successful' analysis, and why intentional decision making at each level of these dynamics is important in data analysis.

View Slides · Watch Talk
Andres Snitcofsky

How simple spreadsheets helped us spread gender perspective in Argentine politics

At Economía Feminista we try to bring a feminist perspective to traditionally male-dominated areas such as politics and economics. Through simple, spreadsheet-born interactive and digital projects, we did just that with minimal resources: a crowdsourced database to pre-whip (figure out) congressional votes on legal abortion, a Twitter bot that shows the lack of women in op-eds every day, an automated scraper to monitor the "cost" of menstruation, a feminist index for candidates in the 2019 election, and other small feminist nerd projects.

View Slides · Watch Talk
Soila Kenya

Error 404: Data Archival from a journalist's perspective

Have you ever read an awesome interactive article online from your favourite news source and clicked on a link that landed you on an error 404 page? Or on a now-deleted tweet? Amid all the rage around data journalism, one thing that needs to be put at the forefront is the necessity of archiving the data used within these news stories for posterity. Decades from now, we should still have full access to these stories in the way they were originally meant to be consumed. This talk will cover a few tips data journalists can start using NOW to ensure the links they use to give more context to their stories will live on.

View Slides · Watch Talk
Vicky Steeves and Sarah Nguyen

commit-ment issues with Git: investigating and archiving y'alls work

Git and platforms like GitLab and GitHub have revolutionized how people track and share their work. This reality brings librarians to an interesting crossroads as they move to understand, inventory, archive, and preserve source code and its contextual ephemera, such as commit messages, merge requests, and issue discussions. This talk from a project team of research scientists and librarians will share how Git breaks barriers and promotes open research and scholarship, and identify ways it can be archived for long-term citation and reproducibility. Part environmental scan, part behavioral study, we will discuss the ins and outs of a researcher's experience with Git hosting platforms and the digital preservation tools that can be creatively used in Git preservation efforts.

View Slides · Watch Talk
Daniel Bernstein

Getting to CSV: Unlocking Public Data in the Civil Justice Space

Many Americans are forced to navigate the civil justice system without an attorney because they cannot afford representation. To better support self-represented individuals, we need detailed information about these individuals’ experiences. Data documenting the parties, events, and outcomes in court cases could help legal aid organizations better address their local needs and court systems better manage public resources. However, state and local courts operate their own data management systems and do not make this information readily available in a format suitable for analytical purposes. Courts also vary drastically in the granularity of their public records, causing inconsistencies when attempting to compare data across jurisdictions. This talk will discuss our work to web scrape, manage, and analyze millions of civil court records to better understand how individuals without attorneys fare in the civil justice system, especially in important areas such as eviction, debt, and domestic violence. This talk will include lessons learned in creating massive datasets and promoting data sharing while maintaining data privacy.

View Slides · Watch Talk
Emmy Tsang and Daniel Ecer

PeerScout: Diversifying peer review with data and machine learning

How does one find the most suitable peer reviewers for a research manuscript? Journal editors face this challenge every day; many resolve to choose based on their own networks and past experience, which introduces biases into the peer-review process. In this talk, we share our journey building PeerScout, an open-source machine learning solution. We will explore the challenges we face in analysing data and research papers, and what we have learnt working with the editorial and wider communities.

View Slides · Watch Talk
Hao Ye

Accessibility and reproducibility in ecological time series analysis

Although many scientific datasets are now shared openly and there are numerous tools for reproducible research, many barriers remain. Datasets are diversely structured and scattered across locations; most domain scientists lack formal training in data management, software development, or reproducible research; and learning these skills or assembling a skilled team requires a large investment of time and/or resources. To address these challenges, we built MATSS (Macroecological Analysis of Time Series Structure), an R package that provides access to over 80,000 ecological time series in a standardized format and promotes best practices in computational research through reproducible and shareable templates of full workflows, bundled as research compendia (Marwick et al. 2018).

View Slides · Watch Talk
Bradly Alicea and Jesse Parent

Epistemological Directories for Research Development and Education

Learning and research involve more than simply looking things up on Wikipedia. The presenters propose a GitHub-based system for advancing research and interdisciplinary education called Epistemological Directories (EDs). We will present two versions of this: one for topics in Cybernetics and Systems, and the other on Frontier Maps for education in AI, Philosophy, and Systems Science. Unlike conventional topical wikis, our approach serves as a set of research nodes, each of which contains a take-off point for other nodes. Each node (defined as a Markdown file or folder) might be based on interpretations of key papers in the field, or might propose new research directions based on a given research group's activities. At some nodes, we also plan to include annotated data in the form of .csv files (data tables) and .ipynb files (Jupyter notebooks), which provide sample data for a topical domain or, for more theoretical topics, show what the data should look like (pseudo-data). Frontier Maps also serve the purpose of providing the minimal information needed to understand jargon and advanced topics in fields unfamiliar to the user. More generally, EDs can serve as a tool for theory-building, self-directed education, and collaboration by bringing people up to speed on niche topics. They might also be used to engage in meta-research on special topics and published papers.

View Slides · Watch Talk
Erica Hagen

Open Mapping and Government Projects in Kenya: Map Kibera and the state of citizen data

Citizens in rural Kenya are collecting data on, and field mapping, projects they've helped to choose through a local government participatory budgeting process. This talk will share how Map Kibera and GroundTruth Initiative recently used OpenStreetMap and ODK/Kobo to give citizens the tools to collect not only locations and data on projects, but also their opinion and assessment of each project's completion status and quality. Doing this jointly with local county governments has unearthed some major data management challenges. Systems are being set up from scratch for governments created only 10 years ago by a new constitution, and flashy tech is mixed with messy paper records, creating a difficult data environment. However, it is also an opportunity for open data and citizen involvement in transparent governance. I will get into the details of the current policy, politics, and geospatial data environment in Kenya, with a focus on how citizens have been included in and excluded from the process to date and how we see things moving forward.

View Slides · Watch Talk
Christa Hasenkopf

Fighting air inequality by building a global air quality open data ecosystem

Air inequality, the unequal access to clean air across the world, is one of the biggest yet most solvable problems of our time. Ninety percent of the world's population breathes air that doesn't meet World Health Organization guidelines, resulting in an estimated 1 out of every 8 deaths on the planet, more than HIV/AIDS and malaria combined. Creating a core data-sharing infrastructure for simple access to air quality information can enable an ecosystem of solutions and solvers across diverse geospatial scales and sectors. This presentation will share the impact of OpenAQ, a non-profit that has built an open-source air quality data-sharing infrastructure and community to fight air inequality around the world. Impact stories will include examples from science, policy, software development, and journalism.

View Slides · Watch Talk
Lucille Moore

Open science in infant neuroimaging research

Functional magnetic resonance imaging (fMRI) has allowed scientists to investigate brain circuitry underlying healthy and maladaptive behavior. As such, it is an essential tool for advancing understanding of brain systems underlying psychiatric disorders. Because fMRI is costly and time-consuming, sharing data and tools in this space is critical for driving forward scientific discovery. However, the field is still developing rigorous standards for not only data sharing, but also quality control in image processing and analysis. Standardization issues are magnified in infant fMRI, which presents unique challenges for processing and analysis due to substantial differences in image properties compared to adults. Infant fMRI is vital for research aimed at supporting healthy brain development and preventing the onset of psychiatric disorders. My talk will highlight burgeoning open science fMRI initiatives that will be vital to drive the field forward, including my group’s pending release of an existing state of the art imaging pipeline modified for infant processing. We are employing this pipeline in a clinical trial aimed at reducing the effects of early adversity on developing brain systems implicated in psychiatric disorders. Such efforts are critical for expanding the impact of neuroimaging in clinical- and policy-level efforts to support healthy brain development.

View Slides · Watch Talk
Christina Gosnell and Pablo Virgo

Getting climate advocates the data they need

When it comes to electricity regulation, a little data can go a long way. There is a wealth of data about the US electric system, but unfortunately it is not accessible, clean, or connected. The Catalyst Cooperative collates and analyzes data to help advocates close coal power plants. We'd like to share what we've learned about how open source data can further the policy conversation around clean energy, using real data and examples from our work.

Watch Talk
Lisa Federer and Maryam Zaringhalam (co-author)

Data and Code for Reproducible Research

The National Library of Medicine held two workshops in 2019 in which National Institutes of Health researchers attempted to reproduce the results of published bioinformatics papers whose authors stated they had made their data and code publicly available. Not a single one of the ten teams across the two workshops could fully reproduce the results of their paper. Despite this, the workshops provided valuable insights into where researchers are falling short in sharing and documenting their data and code. This talk will discuss some of the issues that contributed to the irreproducibility of these papers, to provide a better understanding of how researchers can ensure that their science is transparent and reproducible.

View Slides · Watch Talk
Mateusz Kuzak

Growing and Supporting Open and Inclusive Research Software Culture

Research Software Engineering combines an intricate understanding of research with expertise in programming and software engineering. Making the process of contribution and participation in research software projects more inclusive and welcoming for a diverse audience will lead to better software for all researchers. The complexity and domain specificity of research software pose unique challenges for community building and external contributions. How can we help researchers and Research Software Engineers make their projects more inclusive for users and contributors with different backgrounds? I hope to share some insights I gained working with different communities in the open science, research, and data spaces, from global movements like The Carpentries and Mozilla Open Leaders to local hacky hours, study groups, and ReproHacks. I wish to discuss and explore with the community some of the practices that I think are transferable to the research software community and can help lower the barriers to contribution.

View Slides · Watch Talk
Lai Yi Ohlsen

Measurement Lab - Open Internet Data

Measurement Lab (M-Lab) is the largest open internet measurement platform in the world, hosting internet-scale measurement experiments and releasing all data into the public domain (CC0). We are an open source project with contributors from civil society organizations, educational institutions, and private sector companies, and are a fiscally sponsored project of Code for Science and Society. Our mission is to measure the internet, save the data, and make it universally accessible and useful. M-Lab works to advance network research and empowers the public with useful information about broadband and mobile connections by maintaining a scalable, global platform for conducting internet measurements, and by supporting an ecosystem of external partners and users around the world interested in using the resulting open data. Our users are researchers, activists, analysts, journalists, experiment developers, hosting providers, regulators, municipalities, and everyday consumers. M-Lab works to enhance internet transparency and helps to promote and sustain a healthy, innovative internet by supporting our users in their research and data analyses, developing and publicizing new use cases for our datasets, forming collaborative partnerships, and building open source measurement tools. Last year we introduced our platform and data to csv,conf; this year we would like to introduce the project to those who are new to the community, as well as dive deeper into our open source alternative for on-premise measurement devices that contribute to our public data set. We would also like to share our community engagement methodology, in hopes of starting a discussion around how best to make data open through accessibility and usability.

Watch Talk
Serena Peruzzo

Improving law interpretability using NLP

Laws define how people may or may not behave in society, but are often hard to interpret and inaccessible to the public. Data Scientists from Bardess, in collaboration with a research group from the Government of Ontario, have investigated how NLP can be applied to understand linguistic patterns in legislative texts and extract information that is meaningful for the public. The methodology developed provides us with a framework for representing legal texts that can be used to simplify the way information in the law is accessed and, at the same time, inform legislators on how to write clearer and more accessible legislation by highlighting parts of the law that are particularly hard to interpret.

Watch Talk
Melissa Santos

Time-to-Event Analysis for Non-Medical Applications

How do you estimate the time until an event, especially if the event might never happen? The statistical methods for this come from studying the time from disease diagnosis to death, but we can use these methods for much more cheerful data. For example, how long does a subscription customer continue to pay you? How long does it take for someone who comments on your open source code to become a contributor? How long does it take from a user's first visit to their becoming a paid customer? Kaplan-Meier survival curves are non-parametric estimates of the time to an event. They make no assumptions about the distribution of the time to the event, and they handle samples of various ages that may or may not have reached the event. As well as the theory behind these curves, we'll dive into how to calculate them directly in SQL. To finish, I'll share some ways we've been using Kaplan-Meier curves to make decisions at a Software as a Service company.
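The talk computes these curves in SQL; as a language-agnostic illustration of the same product-limit idea, here is a small Python sketch (the function and the sample subscription data are hypothetical) that steps the survival estimate down at each observed event time while censored subjects simply drop out of the risk set.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of S(t).

    durations: time until the event or until censoring, per subject
    observed:  True if the event happened, False if the subject was censored
    Returns (time, survival_probability) pairs at each observed event time.
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)           # everyone leaves the risk set at their time
    at_risk = len(durations)
    survival, curve = 1.0, []
    for t in sorted(exits):
        d = events.get(t, 0)             # events at time t
        if d:
            survival *= 1 - d / at_risk  # KM step: multiply by (1 - d_i / n_i)
            curve.append((t, survival))
        at_risk -= exits[t]              # remove both events and censored subjects
    return curve

# e.g. months subscribed; False means the customer is still paying (censored)
months = [2, 3, 3, 5, 8, 8, 12, 12]
churned = [True, True, False, True, True, False, False, False]
print(kaplan_meier(months, churned))     # [(2, 0.875), (3, 0.75), (5, 0.6), (8, 0.45)]
```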

View Slides · Watch Talk

Schedule

WEDNESDAY - MAY 13 - DAY 1

Welcome to Day 1
Join Session: crowdcast.io/e/csvconfv5-0-welcome-day1
6.45am PST · 9.45am EST · 1.45pm UTC · 2.45pm BST · 4.45pm EEST Welcome/intro
Session 0
Join Session: crowdcast.io/e/csvconf5-0-session-0
7am PST · 10am EST · 2pm UTC · 3pm BST · 5pm EEST The State of Open Government Data Infrastructure
Philip Ashlock
7.20am PST · 10.20am EST · 2.20pm UTC · 3.20pm BST · 5.20pm EEST Growing and Supporting Open and Inclusive Research Software Culture
Mateusz Kuzak
Session 1
Join Session: crowdcast.io/e/csvconf5-0-session-1
8am PST · 11am EST · 3pm UTC · 4pm BST · 6pm EEST Learner-centered teaching for the non-traditional data science classroom
Katherine Simeon
8.20am PST · 11.20am EST · 3.20pm UTC · 4.20pm BST · 6.20pm EEST Data Communities and Those Who Build Them
Angela Li
Session 2
Join Session: crowdcast.io/e/csvconf5-0-session-2
9am PST · 12noon EST · 4pm UTC · 5pm BST · 7pm EEST The good, the bad, the extremely obtuse: a survey of government (open) data regulations and how to successfully build accountability programs in government
Rebecca Williams and Hunter Owens
9.20am PST · 12.20noon EST · 4.20pm UTC · 5.20pm BST · 7.20pm EEST Demystifying Clearview: Vehicle Tracking with Public CCTV Cameras
Samuel Brice
9.40am PST · 12.40noon EST · 4.40pm UTC · 5.40pm BST · 7.40pm EEST Improving law interpretability using NLP
Serena Peruzzo
Keynote
Join Session: crowdcast.io/e/csvconf5-0-keynote-sisi
10am PST · 1pm EST · 5pm UTC · 6pm BST · 8pm EEST KEYNOTE
Sisi Wei: How data has transformed journalism. Inside and out.
CommaLlama Cam
Join Session: crowdcast.io/e/csvconf5-0-day-1-llama
11am PST · 2pm EST · 6pm UTC · 7pm BST · 9pm EEST lunch/munch
11.30am PST · 2.30pm EST · 6.30pm UTC · 7.30pm BST · 9.30pm EEST Llama cam
Session 3
Join Session: crowdcast.io/e/csvconf5-0-session-3
Sessions 3 and 4 happening at the same time
12noon PST · 3pm EST · 7pm UTC · 8pm BST · 10pm EEST Getting climate advocates the data they need
Christina Gosnell and Pablo Virgo
12.20pm PST · 3.20pm EST · 7.20pm UTC · 8.20pm BST · 10.20pm EEST Accessibility and reproducibility in ecological time series analysis
Hao Ye
12.40pm PST · 3.40pm EST · 7.40pm UTC · 8.40pm BST · 10.40pm EEST Data Journalism in the Anthropocene
John Muyskens
Session 4
Join Session: crowdcast.io/e/csvconf5-0-session-4
Sessions 3 and 4 happening at the same time
12noon PST · 3pm EST · 7pm UTC · 8pm BST · 10pm EEST On using AutoML to predict clinical outcomes
Wendy Wong
12.20pm PST · 3.20pm EST · 7.20pm UTC · 8.20pm BST · 10.20pm EEST The cultural meaning of programming languages
Gabriele S Hayden
12.40pm PST · 3.40pm EST · 7.40pm UTC · 8.40pm BST · 10.40pm EEST More than manuscripts: Transforming special collections materials into ornithological data
Wesley Teal
Session 5
Join Session: crowdcast.io/e/csvconf5-0-session-5
1pm PST · 4pm EST · 8pm UTC · 9pm BST · 11pm EEST Data and Code for Reproducible Research
Lisa Federer
1.20pm PST · 4.20pm EST · 8.20pm UTC · 9.20pm BST · 11.20pm EEST Decision making in 'successful' data analysis
Salina Cheuk Ting Ho
1.40pm PST · 4.40pm EST · 8.40pm UTC · 9.40pm BST · 11.40pm EEST Epistemological Directories for Research Development and Education
Bradly Alicea
Closing Session
Join Session: crowdcast.io/e/a3r7n5bd
2pm PST · 5pm EST · 9pm UTC · 10pm BST · 12am EEST END OF DAY 1

THURSDAY - 14 MAY - DAY 2

Session 6
Join Session: crowdcast.io/e/csvconf5-0-session-6
7am PST · 10am EST · 2pm UTC · 3pm BST · 5pm EEST Low-Income Data Diaries - How “Low-Tech” Data Experiences Can Inspire Accessible Data Skills & Tool Design
David Selassie Opoku
7.20am PST · 10.20am EST · 2.20pm UTC · 3.20pm BST · 5.20pm EEST PeerScout: Diversifying peer review with data and machine learning
Emmy Tsang and Daniel Ecer
7.40am PST · 10.40am EST · 2.40pm UTC · 3.40pm BST · 5.40pm EEST Fighting air inequality by building a global air quality open data ecosystem
Christa Hasenkopf
Session 7
Join Session: crowdcast.io/e/csvconf5-0-session-7
8am PST · 11am EST · 3pm UTC · 4pm BST · 6pm EEST commit-ment issues with Git: investigating & archiving y'alls work
Vicky Steeves and Sarah Nguyen
8.20am PST · 11.20am EST · 3.20pm UTC · 4.20pm BST · 6.20pm EEST Open science in infant neuroimaging research
Lucille Moore
8.40am PST · 11.40am EST · 3.40pm UTC · 4.40pm BST · 6.40pm EEST RMarkdown Driven Development
Emily Riederer
Session 8
Join Session: crowdcast.io/e/csvconf5-0-session-8
9am PST · 12noon EST · 4pm UTC · 5pm BST · 7pm EEST Mapping and safeguarding indigenous oral histories using an open source tool
Rudo Kemper
9.20am PST · 12.20noon EST · 4.20pm UTC · 5.20pm BST · 7.20pm EEST Getting to CSV: Unlocking Public Data in the Civil Justice Space
Daniel Bernstein
9.40am PST · 12.40noon EST · 4.40pm UTC · 5.40pm BST · 7.40pm EEST Measurement Lab - Open Internet Data
Lai Yi Ohlsen
Keynote
Join Session: crowdcast.io/e/csvconf5-0-keynote-emily
10am PST · 1pm EST · 5pm UTC · 6pm BST · 8pm EEST KEYNOTE
Emily Jacobi
CommaLlama Cam
Join Session: crowdcast.io/e/csvconf5-0-day-2-llama
11am PST · 2pm EST · 6pm UTC · 7pm BST · 9pm EEST lunch/munch
11.30am PST · 2.30pm EST · 6.30pm UTC · 7.30pm BST · 9.30pm EEST Llama cam
Session 9
Join Session: crowdcast.io/e/csvconf5-0-session-9
Sessions 9 and 10 happening at the same time
12noon PST · 3pm EST · 7pm UTC · 8pm BST · 10pm EEST Error 404: Data Archival from a journalist's perspective
Soila Kenya
12.20pm PST · 3.20pm EST · 7.20pm UTC · 8.20pm BST · 10.20pm EEST How Frictionless Data can help you grease your data
Monica Granados and Lily Zhao
12.40pm PST · 3.40pm EST · 7.40pm UTC · 8.40pm BST · 10.40pm EEST Around the world in 80 data formats: Re-packaging the Harvard Business Review archive as an accessible, internal database
Amanda Ludden
Session 10
Join Session: crowdcast.io/e/csvconf5-0-session-10
Sessions 9 and 10 happening at the same time
12noon PST · 3pm EST · 7pm UTC · 8pm BST · 10pm EEST How simple spreadsheets helped us spread gender perspective in Argentine politics
Andres Snitcofsky
12.20pm PST · 3.20pm EST · 7.20pm UTC · 8.20pm BST · 10.20pm EEST Open Mapping and Government Projects in Kenya: Map Kibera and the state of citizen data
Erica Hagen
12.40pm PST · 3.40pm EST · 7.40pm UTC · 8.40pm BST · 10.40pm EEST The Complicated Problem of Closing Open Data
Kathleen Sullivan and Andrew Mckenna-Foster
Session 11
Join Session: crowdcast.io/e/csvconf5-0-session-11
1pm PST · 4pm EST · 8pm UTC · 9pm BST · 11pm EEST Protected health information breaches on GitHub
James Perry Evans
1.20pm PST · 4.20pm EST · 8.20pm UTC · 9.20pm BST · 11.20pm EEST Time-to-Event Analysis for Non-Medical Applications
Melissa Santos
1.40pm PST · 4.40pm EST · 8.40pm UTC · 9.40pm BST · 11.40pm EEST Building successful collaborations around healthcare data
Tempest van Schaik, PhD
Closing Session
Join Session: crowdcast.io/e/a3r7n5bd
2pm PST · 5pm EST · 9pm UTC · 10pm BST · 12am EEST OUTRO / END OF CONFERENCE