csv,conf,v3

Schedule

We are assembling an exciting team of data makers and enthusiasts representing academia, science, journalism, government, and open source projects.

We are also running a series of workshops in Room D - B101 called Data Tables.

TUESDAY - May 2 - Day 1
Rooms: Room A - A108; Room B - B102; Room C - Eliot Chapel; Room D - B101

9:00 AM Coffee/Breakfast/Registration/Hangout time in Atrium (9-10:00am)
10:00 AM Intros/Hello in Eliot Chapel (10-10:30am)

10:30 AM
Room A: Empowering people by democratizing data skills (Erin Becker)
Room B: Designing with data: prototyping at the speed of learning (Michelle Hertzfeld)
Room C: Smelly London: visualising historical smells through text-mining, geo-referencing and mapping (Deborah Leem)
Room D: Data Tables - Bionode

11:00 AM
Room A: Teaching quantitative and computational skills to undergraduates using Jupyter Notebooks (Brian Avery)
Room B: Designing data exploration: How to make large data sets accessible (and fun to use) (Simon Jockers)
Room C: Our Cities, Our Data (Kate Rabinowitz)
Room D: Data Tables - Bionode

11:30 AM
Room A: Data in the Humanities Classroom (Miriam Posner)
Room B: The Journey to a better bar graph, and beyond (Daniel Orbach)
Room C: Open Contracting Data in Mexico City (Gabriela Rodriguez Beron)
Room D: Data Tables - Bionode

12:00 PM Lunch in Atrium
12:30 PM Heather Joseph. Keynote in Eliot Chapel

1:30 PM
Room A: Urban/Information: Mapping the US Postal Service Vacancy Survey (Dare Anne Brawley)
Room B: Data & Abuse of Power (Moiz Syed)
Room C: Analyzing The Trumpworld Graph: Applying Network Analysis to Public Data (William Lyon)
Room D: Data Tables - Stencila

2:00 PM
Room A: When Data Collection Meets Non-technical CSOs in Low-Income Areas (David Selassie Opoku)
Room B: Data Lovers in a Dangerous Time (Brendan O'Brien)
Room C: Scratching Someone Else's Itch (Adam Hyde)
Room D: Data Tables - Stencila

2:30 PM
Room A: What's next in open data? The limits of the publication discourse (Paul Walsh)
Room B: How to mislead the public (Philipp Burckhardt)
Room C: In search for the ideal csv template to map elections (Roger Fischer)
Room D: Data Tables - Stencila

3:00 PM Break

3:30 PM
Room A: Mantle: A Proposal for Automated Metadata and Linked Data for Scientific Users (Toph Allen)
Room B: Continuous Data Validation for Everybody (Adrià Mercader)
Room C: How do I get that? Strategies for requesting data and negotiating for it (Carli Brosseau)
Room D: Data Tables - data.world

4:00 PM
Room A: Metatab: Metadata for Mortals (Eric Busboom)
Room B: DDF: Gapminder's data model for collaborative harmonization of multidimensional statistics (Jasper Heeffer)
Room C: Role of data intermediaries in the civic data ecosystem (Bob Gradeck)
Room D: Data Tables - data.world

4:30 PM Angela Bassa. Keynote in Eliot Chapel
5:30 PM Reception in Buchan Hall until 7:00pm

WEDNESDAY - May 3 - Day 2
Rooms: Room A - A108; Room B - B102; Room C - Eliot Chapel; Room D - B101

9:00 AM Coffee/Breakfast/Registration/Hangout time in Atrium (9-10:00am)
10:00 AM Laurie Allen. Keynote in Eliot Chapel

11:00 AM
Room A: A Better "Edit" Button for GitHub-Hosted Data (Matt "Potch" Claypotch)
Room B: The evolution of a GTFS data pipeline (Danny Whalen)
Room C: ✊s, 🌹s, and major 🔑s: Emoji data science & journalism (Hamdan Azhar)
Room D: Data Tables - Data Package Pipelines

11:30 AM
Room A: DIPpering the data: open source tools for data integration (Lilly Winfree)
Room B: A match made in heaven: domain expert meets csv; gives birth to ontologies (Nicole Vasilevsky and Daniel Keith)
Room C: Location - Trends in Time and Place (Christopher Moravec)
Room D: Data Tables - Data Package Pipelines

12:00 PM
Room A: Data Modeling for Humans: A Learnable Conceptual Model for Relational Data (Jason Crawford)
Room B: Towards a Taxonomy of Government Data (Hunter Owens)
Room C: Using Web Archives to Enrich the Live Web Experience Through Storytelling (Yasmin AlNoamany)
Room D: Data Tables - Data Package Pipelines

12:30 PM Lunch in Atrium
1:00 PM Comma Llama in Courtyard
1:30 PM Mike Bostock. Keynote in Eliot Chapel

2:30 PM
Room A: The Art and Science of Generative Nonsense (Mouse Reeve)
Room B: Opinionated Analysis Development (Hilary Parker)
Room C: Surveying the Commune Cloud: Joining Hands to Decentrally Process Decentralized Data (Noah Cawley)
Room D: Data Tables - Fieldbook

3:00 PM
Room A: The world is weird and wonderful! (Dan Phiffer)
Room B: Applying software engineering practises to data analysis (Emil Bay)
Room C: The power of fuzzy (Max Harlow)
Room D: Data Tables - Fieldbook

3:30 PM Break

4:00 PM
Room A: Innovation to facilitate data sharing in the life sciences and biomedicine (Naomi Penfold)
Room B: Open Data Networks with Fieldkit (Eric Buth)
Room C: Mapping Data in Jupyter Notebooks with Pixiedust (Raj Singh)
Room D: Data Tables - Dat

4:30 PM
Room A: Reproducible and reusable research: Data sharing policies & community driven strategies for improving the status quo (Robin Champieux)
Room B: Math, Numeric Computing, and JavaScript (Athan Reines)
Room C: Machine Learning is for Everyone (Melissa Santos)
Room D: Data Tables - Dat

5:00 PM Outros/Goodbye in Eliot Chapel
5:30 PM Hangout time

Speakers

Erin Becker

Empowering people by democratizing data skills

Although petabytes of data are now available, most scientific disciplines are failing to translate this sea of data into scientific advances. What is missing between data collection and research progress is training for scientists in the crucial skills needed to manage and analyze large amounts of data effectively and reproducibly. Already faced with a deluge of data, researchers themselves are demanding this training and need to learn while on the job. They require training that is immediate, accessible, appropriate for their level and relevant to their domain. This training needs to include not only technical skills, but also ways of thinking about data that give learners the knowledge of what is possible along with the confidence to continue self-guided learning. Short, intensive, hands-on Software and Data Carpentry workshops give researchers the opportunity to engage in deliberate practice as they learn these skills, starting with strong foundational skills and receiving feedback as they learn. This model has been shown to be effective, with the vast majority (more than 90%) of learners saying that participating in the workshop was worth their time and led to improvements in their data management and data analysis skills. We have trained over 20,000 learners since 2014 on 6 continents with over 700 volunteer instructors, with the goal of providing effective training that empowers researchers to turn data into knowledge and discovery.

Michelle Hertzfeld

Designing with data: prototyping at the speed of learning

We've all done it. Mocked up a perfect dashboard with evenly-spaced bars on the bar charts, each with a readable label because of course all the bars are big enough for that. All the boxes look so nice, evenly spaced. Then, in comes the data, and it's a jungle out there! The biggest value on the bar chart is huge, rendering the rest of the bars unreadable. We forgot to design for what we thought were mere edge cases (data-equals-zero versus data-equals-no-data, anyone?). Everything lining up evenly still? Ha! With some labels coming in at one word, and others hefting across at seven, forget it! ...But it doesn't have to be this way. I'm going to tell you about the user-research- and data-driven design and prototyping process that created https://useiti.doi.gov/. By creating a team prototyping process that moved from testing core ideas in words and quick data prototypes; to making more robust, visually-designed prototypes to test with core users; to incorporating those learnings back into the next prototypes and research plans, we were able to enjoy a team flow that created designs that fit the data, and data that communicated clearly. A++, would employ process again.

Danny Whalen

The evolution of a GTFS data pipeline

The General Transit Feed Specification is one of the primary formats that public transit agencies use to communicate changes to their systems. The specification is simple and flexible, leaving much to be interpreted by the publisher. In order to build a GTFS data pipeline that is similarly flexible, you need to understand what makes a particular transit system different from others, and how those differences are expressed through the data standard. I’d like to share stories of a few fascinating transit systems and the wonderful people that work hard to share them with the world. (This is a big part of what we do at Remix.)

Dare Anne Brawley

Urban/Information: Mapping the US Postal Service Vacancy Survey

In this talk I will use one dataset, and 165 maps, as a lens to examine the interplay between information landscapes and urban landscapes. In an era when the largest technology companies have established urban think tanks which seek to develop modes of urban governance by algorithm, the talk aims to reveal the ways that sources of data often invisibly, and perniciously, shape public policy through the methods behind their collection. The dataset – the ‘U.S. Postal Service Vacancy Survey’ – is collected by US postal workers on their daily routes delivering mail and was used to help determine which cities received federal aid in the aftermath of the 2008 foreclosure crisis. At each home or business on her route, a mail carrier notes whether a house is occupied (whether the occupants have been picking up their mail). The residences left empty for 90 days or longer are represented in this dataset. The USPS Vacancy Survey powerfully reveals the logic of abstraction at work in all “data”—in this case, the “national vacancy rate” is an abstraction that is both fully dependent on and masks the daily routines of individual postal workers making observations about specific bundles of mail left to languish.

Hamdan Azhar

✊s, 🌹s, and major 🔑s: Emoji data science & journalism

Emojis have been called a new type of language. According to statistics cited by Ad Week, as much as 92% of the online population uses emojis. Twitter reports that since 2014 alone, over 110 billion emojis have been tweeted. Yet despite the profusion of emojis in digital life, little research has been done that leverages emojis to understand popular sentiment around current issues. I’ll talk about how emoji data science (a largely unexplored field) might be a powerful new methodology for both the computational social sciences and fast data journalism. I'll share preliminary research based on an analysis of millions of tweets that explores the relevance of emoji analytics to fields ranging from pop culture (e.g. the Kanye West vs. Taylor Swift dispute) to politics (the US presidential elections as well as Brexit) to gender norms to the Olympics and more.

Carli Brosseau

How do I get that? Strategies for requesting data and negotiating for it

Governments of all sizes create and maintain data sets that they don’t put on the internet. But all that data is public, and it could be yours to analyze and visualize – well, most of it. We’ll talk about how to identify what public data sets exist and how to access them using state and federal public records laws. We’ll also survey common roadblocks and how to navigate around them.

Deborah Leem

Smelly London: visualising historical smells through text-mining, geo-referencing and mapping.

The Smelly London project (www.londonsmells.co.uk) aims to bring together historical data with modern digitisation and visualisation to give us a unique, revealing and visceral glimpse into a London of the past and what it tells us about London today. Text-mining and analysing the MOH reports tells the intimate narratives of the everyday experiences of 19th and 20th century Londoners through the 'smellscape'. The Smelly London project provides a great opportunity to demonstrate how new knowledge and insights have arisen from the use of powerful digital applications. All outputs generated from the project will be open access and open source. Our data is available in a public GitHub repository (https://github.com/Smelly-London) and on other platforms such as Layers of London (http://layersoflondon.blogs.sas.ac.uk/about-the-project/) and Smelly Maps: Good City Life (http://goodcitylife.org/index.html).

William Lyon

Analyzing The Trumpworld Graph: Applying Network Analysis to Public Data

A few weeks ago BuzzFeed released a public dataset of people and organizations connected to Donald Trump and members of his administration. As they say in their blog post announcing the data: "No American president has taken office with a giant network of businesses, investments, and corporate connections like that amassed by Donald J. Trump. His family and advisers have touched a staggering number of ventures, from a hotel in Azerbaijan to a poker company in Las Vegas."

In this talk we will show how to model this data as a graph, write Cypher queries to find interesting connections and visualize the results. In addition, we will show how to add public data from USASpending.gov on government contracts and campaign finance from the FEC, allowing us to answer questions like:

* How are members of the Trump administration connected to vendors of government contracts?
* Who are the most influential people in the network and how are they connected to Trump?

Matt 'Potch' Claypotch

A Better 'Edit' Button for GitHub-Hosted Data

GitHub is a fantastic way to encourage collaboration on your work, and more than ever people put their data in text-based formats on GitHub. This allows anyone to suggest changes and fix mistakes, but it raises new technical barriers for people who want to make those changes themselves. We need a way to put a better editing UI on textual data formats, and I'd like to talk about how that might work (and show a very rough demo!).

Adam Hyde

Scratching Someone Else's Itch

Why has open source done so well in producing developer tools and infrastructure, but so poorly in the user-facing world? How can we improve what we do to better fulfill user needs and produce user applications that will kill the proprietary competition?

Naomi Penfold

Innovation to facilitate data sharing in the life sciences and biomedicine

The drive for more open data in the life sciences and biomedical research is supported by research funders and policymakers as a means to address the ‘reproducibility crisis’, derive more value from research projects, and avoid duplication of effort in research. Journal and funder policies encouraging the deposition of datasets in repositories, where appropriate, help to enforce data-sharing practices but do not ensure that the data itself is of sufficient quality. The physics, computer science, and computational biology communities have developed tools that facilitate the publication of reproducible computational analysis, but these are geared towards users with a high degree of digital literacy. For biologists who rely on basic spreadsheets and desktop statistical analysis software, the barrier to entry to these tools remains high. Further, the effort required to translate a myriad of data objects into a well-curated dataset for deposition is not directly paired to the benefits that may come from sharing such data. At eLife, we are interested in tools and technologies that will help and incentivise life science and biomedical researchers to share their data more openly. We are actively seeking and showcasing open-source tools, such as Binder, that augment the static research article with richer, dynamic and reproducible artefacts, and cater for biologists in doing so. We are also interested in processes that reduce the work required to share datasets, improve their discovery, facilitate their reuse, and pair the effort of sharing with meaningful rewards, such as credit, citation, and reputation gain. In this talk, I will discuss specific examples where data sharing has gained traction in the biology community, and outline key pain points remaining. I invite the community to join me in a conversation about how research technologists and publishers can best contribute towards open data in the life sciences and biomedicine.

Toph Allen

Mantle: A Proposal for Automated Metadata and Linked Data for Scientific Users

Scientists work with increasing volumes of data, and managing that data is a growing part of their day-to-day workload. One set of time-consuming tasks is data integration; that is, bridging gaps of incompatibility to merge datasets from different sources. Even datasets in the same format may refer to the same conceptual entity with different language. The way to solve this problem is to connect datasets via their metadata. Much progress has been made in metadata standards, practices, and formats (CSV on the Web a.k.a. CSVW, Ecological Metadata Language (EML), and JSON-Linked Data). However, much data lacks robust metadata due to the limits of manual metadata curation. We propose an automated approach to metadata discovery that aims to facilitate data discovery and linking, joining, and merging of heterogeneous datasets. We outline the challenges to this problem and our strategy for addressing them, including a set of machine-learning classifiers and heuristics to identify the correct ontology or ontologies for fields and entities in datasets (e.g., location names, species taxonomies), and link these data across datasets with different sources and formats. We discuss how this tool can fit into scientific workflows and an open data ecosystem.

Christopher Moravec

Location - Trends in Time and Place

Location is often dismissed as a way to 'put something on a map', but can we see more than just points on a map? In this talk we will explore methods of extracting location and time trends from data, plotting that data, and drawing out meaning for our users beyond just location. We will cover everything from hot-spot analysis versus heat maps to plotting location trends over time and even filtering common locations. We'll also take a special look at some Google Location History data to figure out what types of information we can extract from knowing where a person has been before and what we can predict from it.

Moiz Syed

Data & Abuse of Power

We need data on public institutions so that we can keep them accountable. But when that data is collected and shared by the institution itself, the institution is in a powerful position to control the narrative. What can we as data practitioners do to help identify potentially biased narratives that could exist in this type of data? After all, blindly trusting a biased dataset is just as harmful as blindly trusting a biased source. I would like to talk a little about a vetting process, and about a few important types of bias in data that you should have in mind. This list is by no means complete and is still a work in progress, but it includes some of the major types of bias that I’ve noticed in data, and what I think you can do about them.

Hilary Parker

Opinionated Analysis Development

Traditionally, statistical training has focused primarily on mathematical derivations and proofs of statistical tests. The process of developing the technical artifact -- that is, the paper, dashboard, or other deliverable -- is much less frequently taught, presumably because of an aversion to cookbookery or prescribing specific software choices. In this talk, I argue that it's critical to teach generalized opinions for how to go about developing an analysis in order to maximize the probability that an analysis is reproducible, accurate and collaborative. By encouraging the use of and fluency in tooling that implements these opinions, we as a community can foster the growth of processes that fail the practitioners as infrequently as possible.

Brendan O'Brien

Data Lovers in a Dangerous Time

In recent months archive-a-thons aimed at saving government data have sprung up all around the country in response to a shift in politics. These "DataRescues" have drawn together librarians, anthropologists, developers, data scientists, scientist-scientists, all in the name of preserving a shared interest in open knowledge. Hear a first-hand account of how this group is growing to meet the challenge of classifying, monitoring, and archiving everything ".gov".

Mouse Reeve

The Art and Science of Generative Nonsense

Teaching machines to understand and communicate in human language is one of the great challenges of computer science. But what about the reverse -- what role can computers play in taking perfectly reasonable text and rendering it unreasonable? Consider Lewis Carroll's Jabberwocky, asemic writing, or invented languages. The tools for discerning semantic meaning in text can be used to dismantle it, and the corpora of written text leveraged towards generative confusion. This talk explores basic computational linguistics concepts, how to use them to make things make less sense, and the simple joys, subversive charms, and conceptual merits of nonsense.

Simon Jockers

Designing data exploration: How to make large data sets accessible (and fun to use)

When we talk about data exploration, we usually think of something expert users do: business analysts using statistical software, scientists wrangling research data, or journalists digging for stories. But data exploration can also be a great tool for making data accessible and interesting to a wider audience. By publishing your data sets through web applications or interactive data visualizations, you can enable users to explore them on their own terms and to understand how they relate to their personal lives. But how can we tell stories with large data sets? What are good practices for making certain forms of data tangible? In this session, we will discuss design strategies and patterns for effective, compelling, and engaging exploratory data visualizations, based on examples from data-driven journalism and academia.

Lilly Winfree

DIPpering the data: open source tools for data integration

Scientific research has a data problem: researchers spend their careers producing data that is rarely shared and that remains siloed, both physically and semantically. To address these data integration problems, the Monarch Initiative (monarchinitiative.org), a collaborative, international open science effort, is semantically integrating and curating biological data from many species and sources. Here we discuss Monarch’s Extract-Transform-Load (ETL) workflow, focusing on our open source data ingest pipeline, called DIPper (https://github.com/monarch-initiative/dipper). With DIPper, Monarch extracts data, often in CSV form, from several biomedical databases, transforms this data to a shared data model, and loads it into the Monarch graph database, where it can be accessed by varied users in a Web application. Using this workflow, we are able to semantically integrate a large number of data sources into Resource Description Framework (RDF) graphs, and we have implemented a full-featured API. Bioinformaticians and other data lovers from varied backgrounds will learn how to use DIPper to integrate diverse data and make this data more queryable and useful for researchers and clinicians.

Robin Champieux

Reproducible and reusable research: Data sharing policies & community driven strategies for improving the status quo

There is wide agreement in the biomedical research community that research data sharing is a primary ingredient for ensuring that science is more transparent, reproducible, and productive. Publishers could play an important role in facilitating and enforcing data sharing; however, many journals have not yet implemented data sharing policies and the requirements vary widely across journals. We analyzed the pervasiveness and quality of data sharing policies in the biomedical literature. We'll talk about our findings, but concentrate on community driven strategies for tracking this landscape over time and improving the status quo!

Emil Bay

Applying software engineering practises to data analysis

In this talk I'll show concrete techniques from software engineering and how they can be applied to a data analysis pipeline. I'll cover "decoupling" to gain a clear contract between analysis steps, "pure functions" for reusability and clarity in auditing, and "testing" as a means to audit the data pipeline. I will use a recent R data project as a vehicle for showcasing these techniques.

Noah Cawley

Surveying the Commune Cloud: Joining Hands to Decentrally Process Decentralized Data

There seems to be something of a decentralized data revolution afoot. The story of centralized data is coming to be seen as one of exploitation and fragility: exploitative in the ways data is used to generate revenue and fragile in the way political change can cause crucial data to be pulled from public view. In an attempt to realize alternative visions of data's future, communities are working to build decentralized applications that avoid exploitation and fragility. These applications, however, largely concern the storing and sharing of data, with no general platform for decentralized applications having taken root. This talk explores the idea of a commune cloud as a general solution to the centralization of both data and its processing. What is a commune cloud? Why might we want one? What might it look like? This talk explores these questions, seeking to develop one perspective on what the answers to them might be.

David Selassie Opoku

When Data Collection Meets Non-technical CSOs in Low-Income Areas

The ability of civil society organisations (CSOs) to collect relevant data for their decision-making and actions has never been more crucial than in this age of increased societal challenges and limited resources. In many low-income areas such as Africa, where there are significant gaps in open data, more and more CSOs would like to leverage tools and methodologies to collect relevant data on their target audiences. However, do these common methodologies and tools for data collection still hold when the CSO is non-technical and/or in a low-income area? The Open Data for Development (OD4D) team at Open Knowledge International, through its work with CSOs in low-income areas, has learned valuable lessons on what it takes to build this much-needed capacity for non-technical CSOs in low-income areas. The talk will draw on two cases from Nigeria and Ghana to expand on lessons and possible ways forward.

Jason Crawford

Data Modeling for Humans: Developing A Learnable Conceptual Model for Relational Data

Relational databases are designed by and for developers, and optimized for performance and scaling. To do so, they put cognitive load on the user, during both schema creation and querying. Thus the non-technical user, who can easily set up a spreadsheet, often finds a SQL database inaccessible. What would a database look like if it were designed for end users, and optimized for discoverability and usability? What if the system did the hard work, to present the user with the most natural, learnable conceptual model of relational data? What if non-technical users could create a relational model as easily as setting up a spreadsheet? At Fieldbook, we've done hundreds of usability tests to design a conceptual model and UI for relational data that non-technical users can pick up easily. In this talk, I'll share the story of these tests, what we've learned from them, the bad ideas we discarded, and the system we've evolved as a result.

Daniel Orbach

The Journey to a better bar graph, and beyond

Thinking about well-designed data visualization often conjures images of complex charts whose beauty lies in the dense texture of data they bring to life. However, these visualizations are often lacking when it comes to being easily read and consumed by another human being. In this talk, I’ll discuss how to meet in the middle by visualizing information in a practical but also beautiful way. We’ll walk through a brief history of data visualization and look at various types of graphs and charts as well as tips and tricks for getting visualizations “past default”. Attendees will walk away with a better understanding of why graphs are the way they are, as well as tips and tricks they can use right away to bring compelling visualizations into their everyday work.

Paul Walsh

What's next in open data? The limits of the publication discourse

In 2017, discourse around open data still revolves around publication of open data. While clearly important, it is like talking about building roads without thought towards their ability to facilitate transport. We need a radical shift in discourse towards data quality in order to facilitate the ability to use open data. The talk will cover a little history to show the forces that got us to where we are today (lots of open data portals full of barely reusable data), and present a vision for where we need to be and how we could get there, in order for open data to truly fulfil its promise as a channel of empowerment.

Bob Gradeck

Role of data intermediaries in the civic data ecosystem

Data intermediaries are important actors in many civic data ecosystems. My talk will explore mechanisms data intermediaries have used to improve the quality of civic data, and help to make it more accessible and useful to people. Libraries and non-library intermediaries in many communities enhance data quality by encouraging publishers to adopt a stewardship ethic, along with sound governance and management practices. Intermediaries also build vital feedback mechanisms to connect publishers with users. They also support users by helping foster data and technological literacy, strengthening relationships within the ecosystem, and providing context to data users. My talk will draw on lessons learned through my work managing the Western Pennsylvania Regional Data Center at the University of Pittsburgh, a regional open data program that includes Allegheny County and the City of Pittsburgh as partners. I will also share lessons captured through participation in communities of practice for data intermediaries, notably the National Neighborhood Indicators Partnership. It is my hope that my talk will encourage other communities to explicitly include a role for data intermediaries in their civic data ecosystem.

Brian Avery

Teaching quantitative and computational skills to undergraduates using Jupyter Notebooks

As the data that we collect dramatically increases in both quantity and complexity, all college graduates will need more quantitative and computational skills to be productive and successful members of society. I will present my experience using the open source Jupyter Notebook system as an undergraduate instructor and mentor. My colleagues and I have developed inquiry-based, active learning materials in Jupyter Notebooks to teach coding and quantitative skills to undergraduate students at several different levels. Our materials address basic calculation and graphing skills and reproducible research, and include a semester-long scientific computing course using Python. Jupyter Notebooks have several advantages for teaching and learning over traditional coding in the shell or an IDE. The system is relatively easy to install and combines text, code, and output in one place that is easily exportable. It also makes it easy to guide students through solving problems with code and to see students’ thought process as they work through everything from simple exercises to complex data analysis projects. We see Jupyter Notebooks as an easily accessible tool to get students at various levels engaged in doing data science.

Eric Buth

Open Data Networks with Fieldkit

Fieldkit is an open data platform for field researchers. The project began as part of a series of expeditions to the Okavango river delta in Botswana where a team deployed an array of data collection and communication devices, allowing a global audience to follow travel, sightings, and sensor readings in real time using a map-based website and a public API. Building on that hardware and software, we’re releasing a shared version of the same system with the idea of making it straightforward and inexpensive to participate in open science and conservation on any scale. This presentation will focus on our strategies for flexible data ingestion and output and address how anyone can get involved. https://fieldkit.org/

Athan Reines

Math, Numeric Computing, and JavaScript

JavaScript and number crunching may seem an odd pair, but this is rapidly changing. In this talk, I will discuss the current state-of-the-art for numeric computation in JavaScript and highlight emerging technologies and libraries for neural networks and multidimensional data structures. I will discuss what to look for in numeric computing libraries, common implementation mistakes, and how to avoid portability issues. I will explain why JavaScript is poised to become the next big thing for data science and numeric computing. And to conclude, I will outline future steps and identify opportunities for community development of next-generation tools.

Philipp Burckhardt

How to mislead the public

Philipp Burckhardt will discuss ways in which analyses and visualizations of raw data can be misleading. Drawing from real-life examples, such as the alleged gender bias in graduate admissions at UC Berkeley in the 1970s and the claim that small high schools have better student performance, which led the Bill and Melinda Gates Foundation to invest hundreds of millions of dollars in creating such schools, he will shed light on the common pitfalls one encounters when working with data. He will discuss ceiling effects as well as the impact of missing data and outliers on statistical conclusions. Finally, he will discuss remedies for these issues, including data transformations, statistical modeling, and, most importantly, how to ask the right questions.

Eric Busboom

Metatab: Metadata for Mortals

Metatab (http://metatab.org/) is a method of storing structured metadata in tabular form. It makes it much easier to create data packages with good metadata, helps non-technical users read the metadata, and allows data packages to be created in Excel and Google Spreadsheets. I'll present the system, how it is used in data packaging, the effort to integrate it with Open Knowledge's data packages, and the larger project to improve how public data is created, packaged and distributed, using examples from the California Department of Health and Human Services.

Kate Rabinowitz

Our Cities, Our Data

Cities are collecting more and more data, and occasionally sharing it with the public. Open data provides a powerful lens to better understand our cities. I'll discuss the current urban open data landscape, what data analysis and visualization can tell us about our cities, and the challenges to working with this data.

Adrià Mercader

Continuous Data Validation for Everybody

Automated testing of code as part of Continuous Integration has become commonplace in modern software development, and a lot of its concepts and benefits can be applied to data publishing as well. Continuous Data Validation as a service will ensure that issues are flagged early on and data quality is maintained throughout the publishing process. We are building an open, extensible platform that integrates with different data sources and automatically validates data on updates, providing detailed reports and notifications. It is a platform for data hackers, researchers and organizations to seamlessly put data quality at the core of their work.
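To make the idea of validating data on every update concrete, here is a minimal sketch in Python using only the standard library. It is not the platform described in this talk: the function name validate_csv and its parameters are hypothetical, chosen for illustration, and a real validation service would check far more than this.

```python
import csv
import io


def validate_csv(text, required_columns, numeric_columns=()):
    """Return a list of human-readable problems found in CSV text.

    Hypothetical helper for illustration; an empty list means no problems were found.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []

    # Structural check: every required column must appear in the header.
    for column in required_columns:
        if column not in header:
            problems.append(f"missing column: {column}")

    for row in reader:
        line_number = reader.line_num  # physical line the reader has reached
        # DictReader files surplus cells under the key None and fills
        # absent cells with the value None, so both signal a ragged row.
        if None in row or any(value is None for value in row.values()):
            problems.append(f"line {line_number}: wrong number of cells")
            continue
        # Content check: numeric columns must parse as numbers.
        for column in numeric_columns:
            if column not in row:
                continue  # column absent from the header; nothing to check
            try:
                float(row[column])
            except ValueError:
                problems.append(
                    f"line {line_number}: {column} is not numeric: {row[column]!r}"
                )
    return problems
```

A check like this could run in a Continuous Integration job whenever a data file changes, failing the build if the returned list is not empty.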

Max Harlow

The power of fuzzy

Like our reality, our data is often messy. Finding meaningful connections between messy datasets often means using fuzzy matching algorithms. We will take a high-level look at some of the most commonly used of these algorithms, their pros and cons, and how they can be used in practice. I'll also touch on how this approach has been used by journalists to find stories that might otherwise have gone untold.
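As a rough illustration of what fuzzy matching looks like in practice, the sketch below shows two common approaches in Python: Levenshtein edit distance, written out by hand, and the similarity ratio from the standard library's difflib module. The abstract does not say which algorithms the talk covers, so these are assumed examples, and the function names levenshtein and best_match are hypothetical.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (free if the characters match)
            ))
        previous = current
    return previous[-1]


def best_match(name, candidates):
    """Return (candidate, ratio) for the candidate most similar to name.

    The ratio comes from difflib.SequenceMatcher and lies between 0 and 1.
    Comparison is case-insensitive.
    """
    scored = [
        (candidate, SequenceMatcher(None, name.lower(), candidate.lower()).ratio())
        for candidate in candidates
    ]
    return max(scored, key=lambda pair: pair[1])
```

Edit distance suits short strings with typos, while a ratio between 0 and 1 is easier to threshold when record lengths vary. Either way, a human should review matches near the cutoff.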

Nicole Vasilevsky and Daniel Keith

A match made in heaven: domain expert meets csv; gives birth to ontologies

Ontologies logically structure information about a domain or knowledge base and are used in a wide variety of applications, from the recommender service on Amazon to rare disease diagnostics. Good ontology development requires deep knowledge from domain experts, but the tools for building ontologies can be cumbersome and unintuitive. In many such contexts, domain experts revert to use of the plethora of csv-based tools. Fundamentally, the problem then becomes how to best utilize tools to enable domain-specific concept visualization, table-based editing, and to output semantically consistent computable artifacts for use in software applications and data analytics. Here we explain our approach to viewing and editing ontologies within a lightweight spreadsheet-style web application (https://incatools.github.io/table-editor). This presentation will be given by a domain expert and the tool developer as a demonstration of the team science required to realize the dream of good ontologies.

Roger Fischer

In search for the ideal csv template to map elections

Currently nobody has a correct election map below the state level for last year's presidential election. Unfortunately this also makes it easy to throw around wild claims of voter fraud. We at Datamap started on the journey to solve this problem. There are 3142 counties in the US, so there is a lot of work to do. Let's get started!

Raj Singh

Mapping Data in Jupyter Notebooks with Pixiedust

The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. The Jupyter stack is built from the ground up to be extensible and hackable. The Developer Advocacy team at IBM Watson has developed an open source library of useful time-saving and anxiety-reducing tools we call "Pixiedust". It was designed to ease the pain of charting, importing CSVs into Spark, saving data to the cloud, and exposing Python data structures to Scala code. I'll talk about how I built mapping into Pixiedust, putting data from Spark-based analytics on maps. Attendees will learn how to programmatically extend Jupyter Notebooks, use the Mapbox API, and combine Python with JavaScript using the Jinja2 Python template engine. You will also learn how to use Pixiedust to automate some of the drudgery of data exploration, and get a gentle introduction to joining the Pixiedust developer community. References: Pixiedust: https://github.com/ibm-cds-labs/pixiedust/ Jupyter: http://jupyter.org/ Jinja2: http://jinja.pocoo.org/

Jasper Heeffer

DDF: Gapminder's data model for collaborative harmonization of multidimensional statistics.

At Gapminder we spread a fact-based worldview everyone can understand. Our fact base is largely statistics. For us to compare and visualize statistics from many sources, we need to harmonize the data, both syntactically and semantically. Our solution is an automated toolset around a data model called DDF. The model aims to be both human and machine readable, contains a query language, is format-agnostic (though our main format is csv), and is flexible enough for anyone to fit their data in. The toolset contains solutions for Data QA, a declarative scripting language to transform and merge datasets, a data reader for our visualization framework Vizabi, a database server and more. We'd love to show it to you and get your feedback to make it even better! It's all open source!

Yasmin AlNoamany

Using Web Archives to Enrich the Live Web Experience Through Storytelling

Archiving Web pages into themed collections is a method for ensuring these resources are available for posterity. Services such as Archive-It exist to allow institutions to develop, curate, and preserve collections of Web resources. Understanding the contents and boundaries of these archived collections is a challenge for most people, resulting in a paradox: the larger the collection, the harder it is to understand. Meanwhile, as the sheer volume of data grows on the Web, "storytelling" is becoming a popular technique in social media for selecting Web resources to support a particular narrative or "story". I’ll explain a proposed framework that integrates "storytelling" social media and Web archives. In the framework, we identify, evaluate, and select candidate Web pages from archived collections that summarize the holdings of these collections, arrange them in chronological order, and then visualize these pages using tools that users are already familiar with, such as Storify.

Gabriela Rodriguez Beron

Open Contracting Data in Mexico City

How is the local government in Mexico City releasing open contracting data? How involved are civil society and the community in the process? What are the challenges in opening data in Mexico? This talk will narrate the story of implementing the open contracting data standard in the Secretaria de Finanzas in Mexico City. This was the beginning of the implementation in Mexico and one of the first initiatives to open public contracts at the city level.

Hunter Owens

Towards a Taxonomy of Government Data

Working with data for a local government means that you spend a non-trivial amount of time trying to find data. This talk introduces a taxonomy of gov data for US governments that helps us find data that we need to provide services for residents of the 2nd largest city in the United States. In the talk, I hope to explain how to think about government data and access it. Additionally, we will talk about what meta-portals and other resources are available for collecting information about your community.

Dan Phiffer

The world is weird and wonderful!

It starts from a simple premise: flat text files are the most stable format, and GeoJSON seems like a reasonable way to encode geographic descriptions of places. Let’s say ALL of the places in the world—continents, time zones, countries, cities, venues, voting constituencies, a large rock on a hill that got painted to look like the poop emoji that one time—every place we can find online in an open license or create ourselves as Creative Commons-zero. The project is called Who's On First, and my job is to develop the web-based editor for it, called Boundary Issues. This talk covers a summary of the Who's On First project generally, and some of the challenges and quirks of working with ~26 million text files managed in git repositories (hosted at github.com/whosonfirst-data).
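The premise of one place per flat GeoJSON text file can be sketched in a few lines of Python. This is a generic illustration, not the Who's On First schema or tooling: the helper names and the single "name" property are hypothetical, and only the Feature structure itself follows the GeoJSON specification (RFC 7946).

```python
import json


def make_feature(place_id, name, longitude, latitude):
    """Build a GeoJSON Feature for a point-like place (hypothetical helper)."""
    return {
        "type": "Feature",
        "id": place_id,
        # GeoJSON positions list longitude first, then latitude.
        "geometry": {"type": "Point", "coordinates": [longitude, latitude]},
        "properties": {"name": name},
    }


def serialize(feature):
    """Serialize with sorted keys and fixed indentation.

    Deterministic output means re-saving an unchanged place produces
    identical text, which keeps diffs in version control small.
    """
    return json.dumps(feature, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
```

Writing the result of serialize to one file per place gives plain text that any editor can open. With millions of such files, stable serialization matters because every spurious formatting change would otherwise show up as a change in git.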

Miriam Posner

Data in the Humanities Classroom

Data specialists might be surprised to learn that students are increasingly working with data in college humanities classes. In UCLA's Digital Humanities classes, we assign our students a wide range of data sets, which they explore, refine, research, visualize, and map. (See, for example: http://nyphilcollection.com/.) The goal is to balance data analysis with open-ended humanities inquiry, and a number of other DH programs are also picking up this pedagogical strategy. We're all constantly searching for awesome datasets for our students. I'd like to talk about what our students do with data, and I'd also like to explain how people can build datasets that are "student-ready" for the digital humanities classroom.

Melissa Santos

Machine Learning is for Everyone

Machine Learning isn't a cure-all that will fix all your problems, but it is a tool you can try fairly quickly and easily! And you probably can't do much worse than that guy at Amazon who thinks you need to buy more garbage cans after buying one. Scikit-learn is the Python library for machine learning. It is well-tested, well-supported, and well-documented, complete with thorough examples. Using a simple data file and a Jupyter notebook, we will follow an example program and inspect the data in each step. After this talk, you'll be comfortable exploring, transforming, and modeling simple datasets in Python.