Today’s guest blogger is Sean Boisen, senior information architect at Logos.

Logos Bible Software is continually undertaking new projects to expand our tools for Bible study. Many of these involve wading through data, usually lots and lots of data.

For example, the Biblical People feature (described in this previous post) provides Bible references, family relationships, social roles, and other information for every person mentioned in the Bible, some 3000 different individuals in all.

I’m currently working to enrich this data set much further to include place names, other named entities (like ethnic groups and languages), and an even richer set of relationships: people who knew each other or collaborated, places they lived or visited, their beliefs, and many other kinds of information.

But too many projects chasing too little time means you have to prioritize. This raises an interesting question: how to prioritize development for our people data so we spend the most effort on the names that will matter most to those studying the Bible?

Since I’m inherently a data-driven, quantitative type of guy, my practical answer is to:

  • assign a numeric weight to each name
  • start at the top and work my way down the list in order
  • stop when the available resources, enthusiasm, or both are exhausted

Since we’ve got the data that connects people to the passages that refer to them, a good starting place is simply to go through and count how many times each person is mentioned in the Scriptures. There’s an important technical detail here: I really do mean references to people, not just names (as strings). To see why this matters, consider:

  • the same person can be known by several different names (Peter, Simon, Simeon and Cephas are all names used in the New Testament for Jesus’ disciple)
  • the same name can be used for several different people, or even different kinds of things

As an example of this second point, it’s not enough to find the string “Judah” in a verse: you want to know when it’s Judah the person, as opposed to a cover term for Israel or the Southern Kingdom. For hard cases like Judah, the only way to know is to go through verse by verse by hand and decide. (This investment of effort is one thing that makes Logos’ Biblical People data such a uniquely valuable resource.)

For many other cases, while the name is only used to refer to people, there are numerous individuals with the same name. Zechariah is the toughest case here: there are 30 distinct ones in our database. So just counting occurrences of the string “Zechariah” doesn’t get it right: you need to know whether it’s the prophet Zechariah (from the Old Testament book of the same name), the father of John the Baptist, or one of the 28 others (most of whom are only mentioned once in the entire Bible). So some pretty detailed data is required to do a reasonable job with this computation.
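To make the difference between counting people and counting strings concrete, here's a minimal sketch. The person IDs, the tuple layout, and the function names are all made up for illustration (this is not the actual Logos data format); the point is only that counts should be keyed on a disambiguated person ID, and that a verse is counted once no matter how many times the name appears in it.

```python
from collections import defaultdict

# Hypothetical annotated mentions: (person_id, name_as_written, verse).
# IDs and layout are illustrative assumptions, not the Logos data format.
MENTIONS = [
    ("peter", "Simon", "Luke 22:31"),
    ("peter", "Simon", "Luke 22:31"),      # "Simon, Simon" in a single verse
    ("peter", "Peter", "Matthew 16:18"),
    ("peter", "Cephas", "John 1:42"),
    ("zechariah_prophet", "Zechariah", "Zechariah 1:1"),
    ("zechariah_father_of_john", "Zechariah", "Luke 1:5"),
]


def verse_counts_by_person(mentions):
    """Count distinct verses per disambiguated person ID."""
    verses = defaultdict(set)
    for person_id, _name, verse in mentions:
        verses[person_id].add(verse)
    return {person_id: len(refs) for person_id, refs in verses.items()}


def verse_counts_by_string(mentions):
    """Count distinct verses per name string (the naive approach)."""
    verses = defaultdict(set)
    for _person_id, name, verse in mentions:
        verses[name].add(verse)
    return {name: len(refs) for name, refs in verses.items()}


by_person = verse_counts_by_person(MENTIONS)
by_string = verse_counts_by_string(MENTIONS)
```

In this toy data, the string count lumps the two Zechariahs together and splits Peter across three different names, while the per-person count keeps each individual separate.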

There are many different ways you could count and compute weights on a per-person basis. Here’s one (there are other reasonable possibilities too):

  • Let frequency be a count of the number of verses that mention a given individual (only counting one for verses like Luke 22:31, “Simon, Simon, Satan has desired to sift you like wheat”, which shouldn’t really count as two observations of Simon’s significance as a Biblical character).
  • Let book dispersion be the number of books of the Bible that mention the individual. The intuition here is that, for two individuals with the same frequency, the one that’s mentioned in more books is probably more important, broadly speaking.
  • Let chapter dispersion similarly be the number of chapters in which a mention occurs. This helps distinguish people mentioned frequently but within a relatively shorter range of verses.
  • Normalize these values by their maximums (frequency=1370, book mentions=31, chapter mentions=258) just to scale things more nicely.
  • Assign a weight to each of these three factors (I used 0.6 for frequency, 0.2 for book dispersion, and 0.2 for chapter dispersion: clearly this choice affects the outcome).
  • Multiply each factor by its weight, and add the results to get a number between 0 and 1.
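The steps above can be sketched in a few lines of Python. The maximums and weights are the ones quoted in the list; the function and constant names are my own for this illustration, and the counts passed in at the end are invented inputs rather than figures for any real person.

```python
# Maximum observed values, as quoted above.
MAX_FREQUENCY = 1370   # verses mentioning the most-mentioned person
MAX_BOOKS = 31         # largest number of books mentioning one person
MAX_CHAPTERS = 258     # largest number of chapters mentioning one person

# Weights for (frequency, book dispersion, chapter dispersion).
WEIGHTS = (0.6, 0.2, 0.2)


def composite_weight(frequency, books, chapters):
    """Weighted sum of the three factors, each normalized by its maximum."""
    w_freq, w_books, w_chapters = WEIGHTS
    return (
        w_freq * frequency / MAX_FREQUENCY
        + w_books * books / MAX_BOOKS
        + w_chapters * chapters / MAX_CHAPTERS
    )


# Invented counts, purely to show the call: 137 verses, 31 books, 129 chapters.
example = composite_weight(137, 31, 129)
```

Because each factor is divided by its own maximum and the weights sum to 1, the result always lands between 0 and 1. A score of exactly 1 would require one person to hold the maximum on all three factors at once.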

Here’s a graph that shows this metric for the top 50 people, along with the individual factors. (The image is linked to a larger version where the names can be read.)

While the top names (Jesus, David, Moses, Jacob, Abraham) are no surprise, there are some interesting observations farther down.

First, the composite metric really does change the rankings: Levi is #15 by this method, but would be #52 if you ranked only by frequency. Likewise, King Saul would be #51 if you ranked only by book mentions, because he’s mentioned in just a few books. But he’s clearly one of the most important characters in those books, so it seems fitting that incorporating frequency and chapter dispersion boosts him up to #10 in the composite metric rank.

Graphically, the places where the lines approach each other are the cases where the various factors are more nearly equal, and the places where they’re farthest apart (Judah’s a good example) are where they’re most skewed. Back to the previous point about counting genuine person name instances versus strings: only 99 of the approximately 780 occurrences of “Judah” actually refer to Jacob and Leah’s son, so counting strings would be highly misleading here.

Since names, like many linguistic phenomena, typically follow a Zipfian distribution (sometimes called a “long tail” or power law distribution), it’s no surprise that the majority (1634 of the 2987) of these names occur exactly once in the Bible, and the 59 most frequent names account for about half of all the name mentions in the Bible. So clearly these top names deserve much more attention than the long tail. Important disclaimer: I’m not making any claims here about theological or historical importance. That’s a subjective matter, and you’d get different answers depending on your perspective.
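Both long-tail figures (how many names occur exactly once, and how many top names it takes to cover half of all mentions) fall out of a simple pass over the per-person counts. Here's a rough sketch; the function names and the toy counts are invented for illustration and are not the real data.

```python
def singleton_count(counts):
    """Number of people mentioned exactly once."""
    return sum(1 for c in counts.values() if c == 1)


def names_covering(counts, share=0.5):
    """Smallest number of top-ranked people whose mentions reach `share` of the total."""
    total = sum(counts.values())
    running = 0
    for rank, c in enumerate(sorted(counts.values(), reverse=True), start=1):
        running += c
        if running >= share * total:
            return rank
    return 0  # only reached when counts is empty


# Toy counts with a long tail (hypothetical people "a" through "f").
toy_counts = {"a": 10, "b": 5, "c": 2, "d": 1, "e": 1, "f": 1}
singletons = singleton_count(toy_counts)
top_half = names_covering(toy_counts)
```

In the toy data, half of the six names are singletons, yet a single name accounts for half of all mentions, which is the same shape as the real distribution on a much smaller scale.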

One advantage of making ideas explicit and quantifiable is that you can check their predictions against your intuitions. Some other factors that might improve the estimate even further (and remember, this is just an estimate):

  • Though we value the whole of Scripture, there’s a sense in which certain sections are broader in their implications. For example, anyone mentioned in the first chapters of Genesis should probably get an extra measure of importance: these are the foundational stories of Hebrew and Christian history.
  • We’re only counting proper names here: other descriptions and pronouns would help refine these measurements even further (we don’t have this data yet, however)
  • External sources (like Bible dictionaries) are a rich and quantifiable source of judgments about importance: the more words or sentences used to describe an individual, the more important they’re likely to be. By consulting several dictionaries, you can overcome the biases of an individual work or editorial slant. The key feature here is making the connection between the described individual (often in a numbered paragraph) and the Biblical character: we don’t have that data yet, but it’s in our plans for the future, and an approximation with a bit of programming ought to be possible at better than 90% accuracy.


