Save the Data: Houstonians Join a National Effort to Archive Federal Data Sets
To the casual onlooker, it would seem everyone inside the library lecture room was simply working behind their laptops with snacks and coffee. Only the role-defining tri-folds on each table and soft conversations suggested it was actually a virtual assembly line. As one of four Data Rescue events hosted on the same day across the country, this session at Rice University drew more than 70 librarians, researchers, scientists, software engineers and students for one common purpose: backing up federal data in case it one day disappears.
Concerns about losing access to environmental data have recently been fueled by President Donald Trump’s seemingly hostile attitude toward the Environmental Protection Agency and other scientific offices within the U.S. government. But it was Canadian scientists’ experiences of being muzzled by former Prime Minister Stephen Harper, through legislation enacted in 2006, that first spurred international action. In 2014, 800 scientists from 32 other countries sent an open letter to Harper calling for an end to "burdensome restrictions on scientific communication and collaboration faced by Canadian government scientists."
Two years later, in December 2016, Canadian scientists warned their U.S. colleagues of the negative consequences that could come with an anti-science government and urged them to safeguard their data. That same month, the first Data Rescue event aimed at saving American data took place in Toronto. Since then, various organizations have worked together to develop the standardized infrastructure and guidelines needed to build a cohesive movement.
“There was concern that data might be removed because of expressed political perspectives of the new administration,” said Lisa Spiro of Rice University’s digital scholarship services, who co-organized the event. “This is an effort to ensure that information is freely available.”
Librarians and archivists, driven by their professional values and ethics, felt particularly compelled to ensure that existing information remains available to all. Still others found strong reasons to help preserve public information whether as researchers, as data geeks, or simply as citizens.
“I am concerned about the possibility of losing many years of work. Our current government seems to think that the facts provided are not necessary, which is, to me, disrespectful to all of the work that has been done,” said Ethan, a graduate student at the University of Houston who declined to give his last name and who had previously conducted a study based entirely on federal data.
Depending on their skill sets, participants were assigned to one of eight teams, each specializing in one part of an eight-part process. Following agency lists (“primers”) provided by an international data archiving network, “seeders” explored government web pages and flagged “uncrawlable” ones, often pages containing drop-down menus or links to datasets, that existing archiving software couldn’t pick up automatically. Researchers then went through those pages to identify exactly what needed to be retrieved before handing it off to the harvesters, who wrote computer programs to extract the raw data.
To ensure the integrity and usability of the data, checkers, “baggers” and “describers” triple-checked that the data uploaded was correctly labeled and easily understood. In a separate room, “surveyors” developed additional primers by looking at federal agencies and identifying key webpages for future events. Throughout the process, “storytellers” kept track of what was being done and brainstormed ways to broadcast their results to the general public.
Neeraj Tandon and Jeff Reichman, two local community organizers active in data visualization, believe that with a little bit of help, the harvested data could prove itself important even at a local level. Water data, for instance, illuminates flooding problems that Houstonians are all too familiar with.
“There’s reliance between all these different government entities to paint the full picture, to tell the whole story,” Reichman said. “And so we can use the EPA data, and we can use state data, and we can use local data sets that the City of Houston has prepared to get a better sense of rising tides, or run-offs, or things that we are investigating today.”
After a full day of work, volunteers were rewarded with the announcement that at least 70 links had been thoroughly researched and partly harvested.
“I think it’s a really great way to bring people together that don’t normally do so. I’ve been mostly communicating within our own research group, since we can ask questions and support each other because we are all learning together, really,” said Eréndira Quintana Morales, a researcher in archaeology.
Though concrete plans for a second citywide event are still being discussed, data rescuers will be able to continue their efforts at home, joining a much larger network of archivists and technologists across the country—people who share an understanding and appreciation for data and its increasing importance in an era of “alternative facts.”
“This winter—December, January, February—was the hottest on record,” said Kathy Hart Weimer of the Kelley Center for Government Information, Data and Geospatial Sciences, another sponsor of the event. “Because we have a hundred plus years of weather data, we can actually say that that’s a fact.”