Statistics – Killer Tofu

This semester I am taking a course called Probability and Statistics for Engineers and Scientists, which is actually pretty cool. I’d never taken a stats class before, so this stuff is mostly new to me. It’s a powerful form of math and probably one of the most relatable and useful in “real life.” This post isn’t about stats as a class, though; it’s about stats as a project.

We were assigned a project in which we had to find the average publication date of all the books in our library. You couldn’t brute-force it by checking every book and recording the copyright date, either. You had to come up with a method for taking a random sample (deciding how many books to sample was also part of the project) and then do calculations on that sample. Here was my plan:

The library has all its books cataloged online.

All books (in the catalog) have ISBNs.

Randomly generate ISBNs and have a script automatically search the catalog, parse the results, and record each book’s date.

Voilà! Tons of randomly selected years from books.

In theory this is great! It is impartial and, once set up, can be repeated as many times as needed with minimal effort on my part. Before I get into why it’s the worst plan, let me show you how I went about randomly generating ISBNs, because it’s not straightforward at all.

Let me preface this by saying I now know way more about ISBNs than I ever wanted to. I will give you the abridged version so as not to bore you to tears. (Although the thought of any of my writing bringing you, the audience, to tears is tempting.)

Here is a breakdown of an ISBN-13 number (13 being the number of digits):

9781999186135 <—Random ISBN

So, the first 3 digits:

978

This apparently refers to a fictitious place called “Bookland,” where all books come from, no joke. It sits where a barcode’s “country code” would normally go, so you can also think of it as an industry code, in this case book publishing.

The next number, 1, denotes the language region. English-speaking regions use either 0 or 1.

The numbers that follow are weird because there is no fixed split: it’s not as if exactly 3 digits are the publisher code. The lengths vary a lot, but suffice it to say the next digits are the publisher number followed by the title number.

The last digit, 5, is a check digit. It is the result of a checksum calculation on the first 12 digits, so it confirms that the number as a whole is valid.
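For the curious, here is a minimal Python sketch of the standard ISBN-13 checksum (not my MATLAB script; the function names are just ones I made up for illustration). The first 12 digits are weighted 1, 3, 1, 3, and so on from the left, and the check digit is whatever brings the weighted total up to a multiple of 10.

```python
def isbn13_check_digit(first12: str) -> int:
    """Return the ISBN-13 check digit for the first 12 digits."""
    if len(first12) != 12 or not (first12.isascii() and first12.isdigit()):
        raise ValueError("expected exactly 12 digits")
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    # The check digit tops the weighted sum up to a multiple of 10.
    return (10 - total % 10) % 10


def is_valid_isbn13(isbn: str) -> bool:
    """True if the string is 13 digits and its last digit matches the checksum."""
    if len(isbn) != 13 or not (isbn.isascii() and isbn.isdigit()):
        return False
    return isbn13_check_digit(isbn[:12]) == int(isbn[12])


print(isbn13_check_digit("978199918613"))  # 5, the last digit of the example above
print(is_valid_isbn13("9781999186135"))    # True
```

Running it on the example number gives a weighted sum of 125, and 125 needs 5 more to reach 130, which is where the final 5 comes from.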

Without getting into too much detail, the digits follow very strict rules and do not just range freely from 0 to 9 across all 13 positions, and this made my MATLAB script go from 10-15 lines to about 100 lines. I spent about 3 hours researching and creating the script, and it works! The problem, and I feel really stupid for having missed it, is that even within those guidelines there are a ton of possible numbers. This means that almost every value I generated, while valid within the confines of the standard, was not a number that had been assigned to a real book.
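To give a flavor of the generation step, here is a heavily simplified Python sketch. It is not my actual script: it assumes a 978 prefix, a language digit of 0 or 1, and 8 freely chosen digits, and it ignores the stricter publisher-range rules that made the real thing balloon to 100 lines. The function name is made up for illustration.

```python
import random


def random_isbn13(rng: random.Random) -> str:
    """Build a random, checksum-valid ISBN-13 in the English-language groups.

    Assumption: the 8 digits after the group digit are drawn uniformly,
    which ignores the real publisher-range rules.
    """
    body = "978" + rng.choice("01") + "".join(
        rng.choice("0123456789") for _ in range(8)
    )
    # Standard ISBN-13 checksum: weights alternate 1, 3, 1, 3, ...
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(body))
    return body + str((10 - total % 10) % 10)


rng = random.Random(0)  # seeded so a run can be repeated
for _ in range(3):
    print(random_isbn13(rng))
```

Every number this prints passes the checksum, which is exactly the trap: being valid says nothing about whether a publisher ever assigned that number to a book.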

Using only books in English, there are a whopping 199,920,000 different combinations to be made. The chances of me hitting on a real book, especially one held by a small university library, are slim to none. Needless to say, I didn’t end up using this method. Even the best-laid plans can go awry. On the plus side I have all this useless information on ISBNs now; come on, Cash Cab.
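To put a rough number on “slim to none,” here is a back-of-the-envelope Python sketch. It assumes every one of the 199,920,000 combinations is equally likely to be generated and that every title in the catalog is one of them. The holdings figure of 50,000 is a made-up placeholder, not my library’s real count, and the function names are hypothetical too.

```python
TOTAL_COMBINATIONS = 199_920_000  # the English-language count from this post


def expected_attempts_per_hit(titles_in_catalog: int,
                              total: int = TOTAL_COMBINATIONS) -> float:
    """Average number of random ISBNs tried before one matches a held title."""
    if titles_in_catalog <= 0:
        raise ValueError("catalog must hold at least one title")
    # Each attempt hits with probability titles/total, so the mean wait is total/titles.
    return total / titles_in_catalog


def expected_searches(sample_size: int, titles_in_catalog: int,
                      total: int = TOTAL_COMBINATIONS) -> float:
    """Average number of catalog searches needed to collect sample_size hits."""
    return sample_size * total / titles_in_catalog


holdings = 50_000  # hypothetical catalog size, purely for illustration
print(expected_attempts_per_hit(holdings))  # 3998.4
print(expected_searches(100, holdings))     # roughly 400,000 searches for 100 books
```

Under those assumptions, only about 1 generated ISBN in 4,000 would find a book, so even a modest sample of 100 books would take on the order of 400,000 catalog searches.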

*Edit*

I feel like the dumbest. I did my math wrong, and there are actually more combinations than I had previously stated. It’s not 22,952,230 different combinations, it’s 199,920,000, and this has been reflected above. I should really be better at math.