In this new blog post series — “A Month In Data” — I have again curated a (slightly larger) set of interesting articles, links and resources that I have come across this month relating to data, algorithms and policy: from data science, AI and machine learning, through to ethics, society and governance. As before, alongside the main list — which is presented in no specific order or precedence — I also offer a set of short links to posts, academic papers and other relevant resources.
Part IV: November 2017
In this fourth set of posts, we have everything from fairer machine learning, website tracking and regulation, through to open data in the public sector, supporting environmental (and food) sustainability, and the big spreadsheet that runs the international postal system:
The problems that occur when health data is not used
Health data is more than just statistics or numbers; it can be collected, used and shared in a multitude of ways. But ignoring certain medical data has the potential to change the way you are treated, how your care is provided and what happens to you as a result. A study by Swansea University (see paper) has found that there are many reasons for the non-use of health data, and that this non-use is strongly implicated in the deaths of many thousands of people and the potential waste of billions of pounds.
Examples of how open data can improve public sector performance
Tangible examples of how open data can make government more efficient; in other news, the public sector has been urged to be more assertive in its data dealings.
Data Journalism will save Open Data
Talk of transparency and efficiency has disappeared from the political rhetoric; data journalism can be the catalyst of a public sector data revival.
Project Common Voice by Mozilla
The Common Voice project is Mozilla’s initiative to help teach machines how real people speak. They are building an open and publicly available dataset of voices that everyone can use to train speech-enabled applications. Also see: Project DeepSpeech, an open source Speech-To-Text engine, using a model trained by machine learning techniques.
The Enormous Spreadsheet that Runs the World’s Mail
Find out about the Universal Postal Union and how they set the prices for shipping mail around the world; also see the master UPU spreadsheets.
How data can keep fish and chips on the menu
Have you ever questioned the environmental or economic sustainability of the flathead you order from your local fish and chips shop? Do you know where it’s from? Not all fish are caught in an ecologically sustainable way, but scientists are working with fisheries managers to address this. Data modelling is key to managing fisheries sustainably, with the latest research uncovering the secrets of Rock Flathead growth (see paper).
How Facebook Figures Out Everyone You’ve Ever Met
Spurious connections seem inexplicable if you assume Facebook only knows what you’ve told it about yourself; they’re less mysterious if you know about the other file Facebook keeps on you—one that you can’t see or control. Behind the Facebook profile you’ve built for yourself is another one — a shadow profile — built from the inboxes and smartphones of other Facebook users. Contact information you’ve never given the network gets associated with your account, making it easier for Facebook to more completely map your social connections.
Facebook Workers, Not an Algorithm, Will Look at Volunteered Nude Photos First to Stop Revenge Porn
Recent reports indicate that a Facebook pilot program would let users volunteer nudes to an algorithm to stop revenge porn, but those nudes will be viewed by a human at the company first. The approach has many similarities with how Silicon Valley companies tackle child abuse material, but with a key difference: there is no already-established database of non-consensual pornography.
Can A.I. Be Taught to Explain Itself?
As machine learning becomes more powerful, the field’s researchers increasingly find themselves unable to account for what their algorithms know — or how they know it. This article links back to a previous post in A Month In Data on a facial recognition experiment to distinguish between gay and heterosexual people.
Robots will be biased: live with it
Interesting editorial from The Law Society Gazette on the challenges of algorithmic regulation, perhaps through a new regulatory body to vet and approve decision-making algorithms. However, if the government is serious about making the UK a world leader in artificial intelligence, it would be a bizarre step to risk stifling innovation just as we leave the stifling embrace of the EU’s ‘precautionary principle’.
Why we need to regulate the tech platforms
The US Senate Intelligence Committee’s grilling of Facebook, Google, and Twitter told us something we already knew: Russia manipulated the US election results. It also told us something that we knew, but had forgotten: industry self-regulation rarely works. From turn-of-the-century railroads, through energy markets in the 1990s, to the financial industry circa 2007, there are many examples that bear this out. The tech industry is only the latest case in point: companies should be made to open up the black box of their algorithms.
Paper: Fairer machine learning in the real world: Mitigating discrimination without collecting sensitive data
Decisions based on algorithmic, machine learning models can be unfair, reproducing biases in the historical data used to train them. While computational techniques are emerging to address aspects of these concerns through communities such as discrimination-aware data mining (DADM) and fairness, accountability and transparency in machine learning (FATML), their practical implementation faces real-world challenges. Furthermore, for legal, institutional or commercial reasons, organisations might not hold the data needed to diagnose and mitigate emergent indirect discrimination-by-proxy, such as redlining. This paper by Michael Veale and Reuben Binns presents and discusses three potential approaches to deal with such knowledge and information deficits in the context of fairer machine learning. For example, trusted third parties could selectively store the data necessary for performing discrimination discovery and incorporating fairness constraints into model-building in a privacy-preserving manner. (Also see a related paper by Veale: Enslaving the Algorithm: From a ‘Right to an Explanation’ to a ‘Right to Better Decisions’?)
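To make "discrimination discovery" a little more concrete, here is a minimal sketch of one common check: comparing the rate of favourable decisions across groups. It is not taken from the paper; the function names, the toy data and the choice of metric (a ratio of selection rates) are all assumptions made for illustration. In the trusted-third-party setting described above, the `groups` column would be held by the third party and not by the organisation building the model.

```python
from collections import defaultdict


def selection_rates(decisions, groups):
    """Return the share of favourable (1) decisions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for decision, group in zip(decisions, groups):
        totals[group] += 1
        positives[group] += int(decision)
    return {group: positives[group] / totals[group] for group in totals}


def selection_rate_ratio(rates):
    """Lowest group rate divided by the highest; 1.0 means equal rates."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody received a favourable decision in any group
    return min(rates.values()) / highest


# Hypothetical model outputs (1 = favourable) and a hypothetical
# protected attribute with two groups, "a" and "b".
decisions = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
groups = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"]

rates = selection_rates(decisions, groups)
ratio = selection_rate_ratio(rates)
print(rates, ratio)
```

A ratio well below 1.0, as in this toy data, would only be a prompt for closer inspection; it says nothing on its own about why the rates differ.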
AutoML for large scale image classification and object detection
A few months ago, Google announced their AutoML project, an approach that automates the design of machine learning models. While it is able to design small neural networks that perform on par with neural networks designed by human experts, these results were initially constrained to small academic datasets; however, the method has now been extended to larger, more challenging datasets, such as ImageNet image classification and COCO object detection.
More than a Million Pro-Repeal Net Neutrality Comments were Likely Faked
An analysis using natural language processing techniques (see code and data) of net neutrality comments submitted to the FCC from April to October 2017 produced disturbing results: one pro-repeal spam campaign used mail-merge to disguise 1.3 million comments as unique grassroots submissions; there were likely multiple other campaigns aimed at injecting what may total several million pro-repeal comments into the system; finally, it’s highly likely that more than 99% of the truly unique comments were in favour of keeping net neutrality.
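As a rough illustration of why mail-merged comments are detectable, the sketch below groups comments whose word sequences largely overlap, using only Python's standard library. This is not the method used in the linked analysis (see its code and data for that); the example comments, the function names and the 0.6 threshold are invented for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1] based on matching subsequences."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def group_by_template(comments, threshold=0.6):
    """Greedily group comments that closely resemble a group's first member."""
    groups = []  # each group is a list of indices into `comments`
    for i, text in enumerate(comments):
        for group in groups:
            if similarity(comments[group[0]], text) >= threshold:
                group.append(i)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([i])
    return groups


# Hypothetical comments: the first three swap a few synonyms into the
# same sentence skeleton, as a mail-merge template would.
comments = [
    "I urge the FCC to repeal the Obama-era rules on internet regulation",
    "I ask the Commission to repeal the Obama-era rules on internet regulation",
    "I urge the FCC to undo the Obama-era rules on broadband regulation",
    "Please keep net neutrality, my small business depends on an open internet",
]

groups = group_by_template(comments)
print(groups)
```

Pairwise comparison like this would not scale to millions of comments, but it shows the underlying idea: swapping synonyms into a fixed skeleton leaves most of the word sequence intact, so the "unique" comments still cluster together.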
Over 400 of the World’s Most Popular Websites Record Your Every Keystroke
The idea of websites tracking users isn’t new, but recent research from Princeton University indicates that online tracking is far more invasive than most users understand: third-party scripts that run on many of the world’s most popular websites track your every keystroke and then send that information to third-party servers. For a technical breakdown, see: No boundaries: Exfiltration of personal data by session-replay scripts.
You might also like…
- Announcement of a new UK Geospatial Commission: Chancellor to unlock hidden value of government data (also here and here)
- Launch of the LSE Truth, Trust and Technology (T3) Commission to deal with the crisis in public information
- Paper: Gerrymandering and Computational Redistricting by Olivia Guest, Frank J. Kanayet and Bradley C. Love (also see: redistrict.science)
- Thirty countries use ‘armies of opinion shapers’ to manipulate democracy
- Where to find data: a(n incomplete) list of places to find/access publicly available data (HT @storywithdata)