Precise Altruism

Mar 21, 2015

Precise Altruism is a service that reads a number of news feeds of effective altruism organizations and general news aggregators, classifies news articles according to their relevance to altruism and effective altruism, and posts matching articles to Tumblr, Twitter, and Facebook under the name of Altrunews.

Summary

Precise Altruism is a service that reads a number of news feeds of effective altruism organizations and general news aggregators and classifies the news articles according to their relevance to altruism and effective altruism. Articles classified as relevant are then linked and summarized on Tumblr and posted to Twitter and Facebook under the name of Altrunews. (A post is by no means to be understood as an endorsement.)

You can follow Altrunews on Tumblr, Twitter, and Facebook.

I’ve replaced this service with a Resyndicator instance.

Introduction

Precise Altruism is a university project by Lea Helmers and me, which we worked on throughout a data science course by Dr. Kashif Rasul at the Freie Universität Berlin.

The service reads the feeds of the following sources and classifies their articles based on a hand-annotated corpus of a few hundred news articles:

The Against Malaria Foundation
GiveWell (two feeds)
GiveDirectly
Giving What We Can
The Life You Can Save
Charity Science
80,000 Hours
David Roodman’s blog
Julia Wise’s blog (Giving Gladly)
Ben Kuhn’s blog
Brian Tomasik’s blog (Reducing Suffering)
My own blog (claviger.net)
The Effective Altruism Forum
Animal Charity Evaluators
The Abdul Latif Jameel Poverty Action Lab (three feeds)
Center for Global Development
Sentience Politics
The Global Priorities Project
Gates Notes
Evidence Action
Your Siblings
The World Health Organization
Raising for Effective Giving
Good Ventures
Innovations for Poverty Action
Vegan Outreach
The Future of Humanity Institute
Animal Equality
The Google News feed of English-language news articles containing certain keywords
The Kuerzr feed of English-language news articles containing a similar set of keywords

Unfortunately, I couldn’t find the feeds of the Schistosomiasis Control Initiative, the Copenhagen Consensus Center, and Mercy For Animals. I’m open to further source feed suggestions, preferably Atom rather than RSS.

By the way, Peter Hurford runs an unfiltered feed that aggregates only EA blogs, and I once wrote a tool, the Resyndicator, that could be used for something similar (especially in scenarios where no such feed exists yet).

The Classifier

The heart of our application is a classification pipeline built with scikit-learn, which uses tf-idf to generate a feature matrix from our news data and then a Stochastic Gradient Descent classifier to assign each article to one of our two categories.
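A minimal sketch of such a pipeline might look like the following. It is not our original code: the toy texts and labels are made up to stand in for the annotated corpus, the step names "tfidf" and "sgd" are arbitrary, and the parameter values are only plausible picks from the results described further down.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = relevant to altruism, 0 = not relevant.
texts = [
    "Charity distributes malaria nets to families in need",
    "Donors give cash transfers to poor households",
    "New study evaluates cost-effectiveness of deworming charity",
    "Volunteers donate to effective global health charity",
    "Football club wins the league title after dramatic final",
    "Tech company releases new smartphone with faster chip",
    "Stock markets fall as oil prices rise sharply",
    "Film festival announces award winners for best picture",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def build_pipeline():
    """Chain tf-idf feature extraction and a linear SGD classifier."""
    return Pipeline([
        # Unigrams and bigrams are turned into tf-idf weighted features.
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        # A linear model trained with stochastic gradient descent.
        ("sgd", SGDClassifier(loss="hinge", penalty="l2",
                              shuffle=True, random_state=0)),
    ])


model = build_pipeline().fit(texts, labels)
predictions = model.predict(texts)
print(list(predictions))
```

Wrapping both steps in one Pipeline means the vectorizer is always fitted on the same data as the classifier, which matters once cross-validation enters the picture.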

We used grid search and cross-validation to determine the optimal classifier and an optimal set of parameters for it. Using only a small set of plausible parameters and only three splits for the cross-validation, we quickly determined the four out of initially ten classification algorithms that performed best on our data: Stochastic Gradient Descent, Logistic Regression, and two variations of the Support Vector Machine classifier. In our final, most finely tuned run, Stochastic Gradient Descent achieved an F1 score of 93%, about two percentage points more than the best of the other three classifiers.

The clearest takeaways from the grid search over a plausible SGD parameter set were that the loss functions log, hinge, modified_huber, and perceptron performed well; that the penalties l2 and elasticnet performed well; that activating shuffling helped; that using bigrams in addition to unigrams was useful, whereas adding 3-grams did not improve the F1 score; and that the best values for alpha and n_iter varied widely among the best configurations.
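A search of this kind can be sketched roughly as follows. This is an illustration under assumptions, not our original setup: the corpus is a made-up toy set, the grid is far smaller than the one we searched, and the parameter names follow a recent scikit-learn, where n_iter has been replaced by max_iter and the log loss goes by a different name, so only hinge and modified_huber are listed here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = relevant to altruism, 0 = not relevant.
texts = [
    "Charity distributes malaria nets to families in need",
    "Donors give cash transfers to poor households",
    "New study evaluates cost-effectiveness of deworming charity",
    "Volunteers donate to effective global health charity",
    "Foundation funds vaccines for children in poor countries",
    "Charity evaluator recommends giving to malaria charity",
    "Football club wins the league title after dramatic final",
    "Tech company releases new smartphone with faster chip",
    "Stock markets fall as oil prices rise sharply",
    "Film festival announces award winners for best picture",
    "Car maker recalls vehicles over faulty brakes",
    "Pop star announces world tour dates for next year",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("sgd", SGDClassifier(shuffle=True, random_state=0)),
])

# Keys use the "<step name>__<parameter>" convention of scikit-learn pipelines.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "sgd__loss": ["hinge", "modified_huber"],
    "sgd__penalty": ["l2", "elasticnet"],
    "sgd__alpha": [1e-4, 1e-3],
}

# Three cross-validation splits, scored by F1 as in the text above.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

On a corpus this small the scores mean nothing; the point is only the shape of the search, with one grid spanning both the vectorizer and the classifier.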

It’s been almost a year since I implemented this, so please don’t quiz me on the details.

The Daemon

The daemon is the service that runs continuously on the server and continually checks the source feeds. It sends If-Modified-Since and If-None-Match headers whenever possible to minimize server load and traffic. The feed entries are then compared to those in the database to filter out known ones, whereby we also compute the Jaccard distance between the preprocessed titles to avoid posting the same press releases over and over.
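The two ideas in this paragraph can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: the function names, the tokenization used as "preprocessing", and the threshold of 0.3 are hypothetical and not taken from our code.

```python
import re


def conditional_headers(etag=None, last_modified=None):
    """Build request headers that let the server answer 304 Not Modified."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def preprocess_title(title):
    """Lowercase a title and reduce it to a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B| for two token sets."""
    union = a | b
    if not union:
        # Two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def is_near_duplicate(title, known_titles, threshold=0.3):
    """Check whether a title is within the threshold of any known title."""
    tokens = preprocess_title(title)
    return any(
        jaccard_distance(tokens, preprocess_title(known)) <= threshold
        for known in known_titles
    )


known = ["GiveWell updates its top charities"]
print(is_near_duplicate("GiveWell Updates Its Top Charities!", known))
print(is_near_duplicate("WHO publishes new malaria report", known))
```

A distance of 0 means the token sets are identical and 1 means they share nothing, so a lower threshold makes the filter stricter about what counts as a repeat.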

The articles that are typically associated with these entries are then fetched, stripped of boilerplate using Readability, summarized using Sumy, and finally posted to Tumblr. We extended the extraction step with one that also extracts a featured image and added a naive keyword extraction for the post tags on Tumblr.
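For the tags, a naive keyword extraction can be as simple as counting words. The following is a hypothetical sketch of that idea, not our actual implementation; the stop word list, the minimum word length, and the function name are all assumptions.

```python
import re
from collections import Counter

# A tiny, hypothetical stop word list; a real one would be much longer.
STOP_WORDS = frozenset({
    "about", "after", "from", "have", "more", "than", "that", "their",
    "there", "these", "they", "this", "were", "which", "will", "with",
})


def extract_keywords(text, limit=5):
    """Return the most frequent words that are neither short nor stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        word for word in words
        if len(word) > 3 and word not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(limit)]


sample = "Malaria nets save lives. Malaria nets are cheap. Malaria is deadly."
print(extract_keywords(sample, limit=2))
```

Counting raw frequencies favors words that are common everywhere, so a less naive version would weight them against a background corpus.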
