Tuesday, October 22, 2019

How Many Words in That Text?

One of the projects I have used for years is a letter counter program. The idea is to count the occurrences of each individual letter. It’s a nice project that includes arrays, loops, and some string manipulation. This is the sort of thing that does have some real world utility. Cryptography uses letter counts to try to crack substitution cyphers. Linguists use them to study languages. And those are just two uses that come to mind.
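For anyone who has not seen the project, here is a minimal sketch of a letter counter in Python. The function name count_letters is just something I made up for illustration, and it assumes only the 26 unaccented English letters matter; everything else is skipped.

```python
def count_letters(text):
    """Return a list of 26 counts: index 0 is 'a', index 25 is 'z'."""
    counts = [0] * 26                      # one array slot per letter
    for ch in text.lower():                # treat 'A' and 'a' as the same letter
        if "a" <= ch <= "z":               # skip digits, spaces, punctuation
            counts[ord(ch) - ord("a")] += 1
    return counts


if __name__ == "__main__":
    counts = count_letters("Hello, World!")
    # Print only the letters that actually appear in the text.
    for index, total in enumerate(counts):
        if total > 0:
            print(chr(ord("a") + index), total)
```

The array, the loop, and the bit of character arithmetic are the whole assignment in miniature. A dictionary would work too, but the fixed-size array keeps the focus on indexing.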

The next logical (to me anyway) step is to count words. I’ve been thinking about adding this in for a while. It is actually something I was assigned as a project many years ago when I was an undergraduate. It’s not as simple as counting letters. The most obvious method involves counting spaces. What happens if someone is old school and places two spaces after every period? Well, that is something to take into consideration. And what about other white space like tabs or line feeds? Or special characters?
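To make those questions concrete, here is a rough sketch of three ways a program might count words. The function names and the regular expression are my own assumptions about what counts as a word, not a definitive answer. In particular, the pattern only knows about unaccented letters and digits, optionally joined by an apostrophe or a hyphen.

```python
import re


def count_by_spaces(text):
    # The obvious method: every space ends a word, so count spaces and add one.
    # Double spaces inflate the count; tabs and line feeds are missed entirely.
    return text.count(" ") + 1


def count_by_whitespace(text):
    # str.split() with no argument splits on any run of whitespace,
    # so double spaces, tabs, and line feeds are all handled.
    # A stray "--" still counts as a word, though.
    return len(text.split())


def count_by_pattern(text):
    # Assumed definition of a word: letters or digits, optionally joined
    # by an apostrophe or hyphen ("isn't", "well-known").
    return len(re.findall(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*", text))


if __name__ == "__main__":
    samples = ["One.  Two\tthree\nfour", "well-known -- isn't it"]
    for sample in samples:
        print(repr(sample),
              count_by_spaces(sample),
              count_by_whitespace(sample),
              count_by_pattern(sample))
```

The three functions disagree on both samples, which is sort of the point: each one encodes a different idea of what a word is.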

Doug Peterson related in a recent post (About words) that two different programs gave him two different word counts for the same piece of text. The counts were off by 3 on a text of about 486 words. Not a huge percentage, but on a book-length text that could make a difference. Some articles in magazines are paid by the word. So getting the count right means money.

Now people can count words with greater accuracy, though I don’t want to do it myself. At some point someone is going to feed a lot of data into some artificial intelligence. Long sections of text that have accurate (human counted perhaps) word counts will be fed in and the AI will learn what words are and how to count them. It’s not going to happen until someone decides that developing this is worth the time and money. I wonder if it will be an academic or an industry researcher?

For the time being I think this will make an interesting conversation in class. Maybe we’ll have a contest to see who can come up with the most accurate algorithm?

4 comments:

Anonymous said...

FWIW Word, Google Docs, and Open Live Writer all gave the same word count for this post. Maybe I made it too easy?

Doug said...

Thanks for the cheap plug to my post, Alfred!

I used similar problems in my CS classes. As you note, there is a considerable jump in difficulty in moving from letters to words. As an aside, you'll note that I'm old school by your metrics but that's a discussion for another day.

As there always seem to be, there were students who wanted to go above and beyond. They all agreed that there would be words with double letters but that there shouldn't be words with the same letter three or more times in a row, and so flagged those as invalid data. More Computer Sciencey than "spelling mistake". It all came about when one of the students decided to try three spaces between words just "to see what the algorithm would do". Don't you love thinking about that?

Since Canada is bilingual, they also wanted to test/modify it for French words with diacritical characters.

It was one of those assignments that had more mileage than what I had initially expected.

Doug said...

I've been thinking about this all day. Going back to the original premise of doing the letter counting, I remember some students also taking their collected data and drawing a frequency graph. They had to predict the result; it was a great deal of fun.

Garth said...

It is not like I do not have enough to do. Now I have to think about this.