Friday, February 22, 2019

Don’t Trust Algorithms You Can’t Read

I admit that I am a sucker for “the best [whatever] in each state” lists. I want to see how many I have visited and what is “the best” in my state. Most of these are based on someone’s personal opinion, but others are based on some sort of data and an algorithm. I think most people are aware of bias in the subjective lists and take them with a grain of salt. It is tempting, though, to see lists based on data and computer algorithms as more accurate. I mean, look, it is supported by data!

The most recent example of this was a list of the most boring towns in every state. I was sure the town I live in had a shot at that one. There is nothing to do. One country store is the only retail operation I know of in town. There are no attractions, unless you count the beehive hut or maybe our historic meeting house. We didn’t make the cut though.

The town that did make the cut for New Hampshire was Bartlett. Now Bartlett has an outstanding ski resort, Attitash, an amusement park, Storyland, and a lot of good places to eat. There is a lot to do in Bartlett. It is anything but a boring place.

The problem, of course, is the data selected for the algorithm. They based their decision in large part on average age and population density. My observation is that a lot of resort areas, which are almost by definition not boring, have low full-time population densities and older permanent residents. Now they also look at “things to do” but it is far from clear where they get that data.
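To make the point concrete, here is a minimal sketch of how a score built mostly on age and population density could go wrong. The list's actual formula was not published, so everything here is an assumption for illustration: the function names, the weights, and the two made-up towns with made-up numbers.

```python
from dataclasses import dataclass


@dataclass
class Town:
    name: str
    median_age: float          # years, full-time residents only
    people_per_sq_mile: float  # full-time population density
    attractions: int           # count of "things to do"


def naive_boring_score(town: Town) -> float:
    """Hypothetical score: older and emptier means 'more boring'.

    This is NOT the formula the list used; it only mimics the two
    inputs the post mentions (age and population density).
    """
    age_part = town.median_age / 100.0
    density_part = 1.0 / (1.0 + town.people_per_sq_mile / 100.0)
    return age_part + density_part


def attraction_aware_score(town: Town, weight: float = 0.1) -> float:
    """Same hypothetical score, reduced for each attraction.

    The weight is an arbitrary illustrative choice.
    """
    return naive_boring_score(town) - weight * town.attractions


# Invented example data, not real census figures.
towns = [
    Town("Resort Town", median_age=52, people_per_sq_mile=40, attractions=12),
    Town("Mill Town", median_age=38, people_per_sq_mile=900, attractions=1),
]

most_boring_naive = max(towns, key=naive_boring_score).name
most_boring_aware = max(towns, key=attraction_aware_score).name

print("Naive score picks:", most_boring_naive)
print("Attraction-aware score picks:", most_boring_aware)
```

With these invented numbers the naive score names the resort town as the most boring, and the answer flips once attractions are counted. The algorithm did not change much; the data going into it did.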

Choosing the right data is part and parcel of getting a good result from any sort of algorithm. This is why transparency of data and algorithms is important in evaluating any conclusion. This is an important concept for our students to understand. To me it is as much a computing topic as it is a societal topic. As we get more and more data, and more and more people try to make sense of it, we have to understand that having data going into an algorithm and a result coming out is not enough to make the result right.

I haven’t heard it said much recently but we used to say it a lot.

Garbage in, garbage out!


Mike Zamansky said...

We see black boxing on the algorithm level and bad (or wrong) data on the input level all the time and this is nothing new.

I can't tell you how many people I've run into over my lives who have taken a "stat course" for their major where they would run a canned SPSS script on some data set and draw conclusions without really understanding the underlying stats or whether the data was the right data to be using in the first place.

It's only getting worse with all the machine learning and data mining.

In our own field, reformers have been calling public schools and public school teachers failing for years based on test results but they're starting with the wrong data, measuring it the wrong way and of course coming to the wrong conclusions. Actually they knew the conclusions they wanted all along and just needed to come up with data and algorithms to prove their original hypothesis correct.

Steven Jong said...

Interesting post, Alfred! I think that in addition to GIGO, there's more to be explored in your title.

Can you trust an algorithm if you don't know anything about it? Neural networking is giving us computers that make decisions based on algorithms that aren't programmed in but "learned," algorithms that even the creators of the systems don't know or understand.

In particular, the creators of Google AlphaZero, a chess-playing program, programmed in just the rules of the game, and set it playing games against itself. In four hours of CPU time it became apparently the strongest computer chess player available (with the possible exception of IBM's Deep Blue). Literally no one can say why the program makes the moves it does; chess players are going through its match games against commercial chess programs like Stockfish (which it absolutely crushed in a 100-game match) trying to understand how it evaluates positions and selects moves.

Chess is just a game, but this technology is going to be making decisions in business soon. Future mortgage decisions will be made by the same class of computer. On what basis will it approve or deny loans, or make medical decisions? We will have no idea.