Hebrew Regular Expression Searching: Two Real-World Examples
When studying the Hebrew Bible, it is not uncommon to want to find some specific pattern of vowel points, or to find a given word only where it is spelled in a particular way. Logos Bible Software supports a powerful pattern-matching search syntax called Regular Expressions. This training article will walk through two Hebrew Regular Expression searches that I’ve been asked to compose by real students of the Hebrew Bible. For both of these searches, I am going to use the BHS with Westminster 4.2 Morphology as my text, though the same principles will work in any tagged Hebrew Bible.
Find every instance of the lexeme אֱלוֹהַּ (eloah – God) that is spelled ‘defectively’ (with just a holem, not a holem-waw).
Logos Bible Software has a lot of ways to search for every instance of a word. One of the easiest methods is to right click on one instance of the word and choose the ‘selected text’ option followed by the ‘(lemma)’ form and then choose ‘Speed Search This Resource’. This will look at the lexical form tag in the morphological database and find every word that is tagged to the same lexical form no matter how it is spelled or inflected in the Bible text. If the Bible you are using supports homograph indicators to distinguish between words that are spelled the same but have different meanings, those will be used in the Speed Search.
However, in this test case we don’t want to find every instance of our word – we only want to find instances of the word that are spelled a certain way. To start out, we’ll duplicate the simple search for every instance of the word.
‘lemma:’ is a field search operator. Many Bibles support named fields that can help restrict searches to higher quality hits. To know what fields a book supports, open the book and click ‘Help | About This Resource’. ‘lemma’ here refers to the field tag around the lexical form of the word in the morphological database. A search on the ‘lemma’ field will only find hits on lexical form tags, not on inflected or surface form text. Field searches are performed by typing the name of the field, followed by a colon, followed by the term you want to search.
The marks() operator specifies that we want to match vowel points on Hebrew and Aramaic words. (It also means ‘match accent marks’ in Greek searches.) By default, Logos Bible Software doesn’t require users to type vocalized Hebrew, but the use of the marks() operator allows you to specify exact matching on vowel points for more precise searching when you need it.
The Hebrew word itself must match the spelling in the morphological database we’re searching against. You can switch to your Hebrew keyboard using the F2 key if you want, or you can use the right-click option and choose ‘selected text’ followed by the ‘(lemma)’ form then choose ‘copy’ to copy the lexical form to your clipboard. Then type Control+v to paste the word into your query. This copy-paste method ensures that the spelling of the word in the query matches the spelling in the database because you copied it from the database.
Here is one tip that will make your Hebrew queries much easier to write: type all the non-Hebrew characters first, then fill in the Hebrew. Hebrew is a right-to-left language – the opposite of English (and Regular Expressions for that matter – more on that later). Since punctuation like periods, parentheses and question marks can be used in Hebrew as well as English, it can be hard for computers to interpret these symbols when placed adjacent to Hebrew text. Microsoft does a rather nice job of knowing when a period should be treated as left-to-right or right-to-left punctuation in prose, but in something as esoteric as search syntax, the marks will often jumble around when you start typing Hebrew. The secret is, if you type in all the non-Hebrew first, you can ignore the display of the non-Hebrew marks when you type in the Hebrew. Even if they look wrong when you are done, the search will work, because the order of the marks in the search string doesn’t necessarily reflect what Microsoft’s rendering code is showing on the screen. If you get in the habit of typing non-Hebrew first, you won’t have any problems.
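The difference between stored order and displayed order can be seen outside Logos as well. The following is a minimal Python sketch, not Logos syntax, and the variable names are hypothetical. It lists the characters of a mixed Latin and Hebrew string in the order they are stored, which is the order a search engine reads them no matter how a bidirectional renderer draws them on screen.

```python
import unicodedata

# A mixed string: Latin search syntax wrapped around one Hebrew letter (aleph).
# The escape \u05d0 keeps this source line free of right-to-left text.
query = "marks(/.*\u05d0.?/)"

# Character names in logical (stored) order, independent of on-screen rendering.
logical_order = [unicodedata.name(ch) for ch in query]

# Position of the aleph within the stored string.
aleph_index = query.index("\u05d0")

for position, name in enumerate(logical_order):
    print(position, name)
```

Whatever the screen shows, the two characters stored immediately after the aleph are still the period and the question mark.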
This means that first we type:
lemma:marks()
and then we paste or type the Hebrew word inside the parentheses:
lemma:marks(אֱלוֹהַּ)
In the Westminster morphological database, our Hebrew word here has no homographs, but at this point if you needed to restrict hits to one homograph you could follow the instructions in the article on Homograph Indicators to limit search hits to one member of a homograph set.
Now we want to restrict our hits to inflected forms of the word that are ‘defective’ in spelling – that have just a holem vowel not a holem-waw. To do this we’re going to use the ANDEQUALS or @ operator to add another term to our search that is indexed at the same location as the lexical form the first half of our search finds. We want this second half of the search to match vowel points as well, but we’re not looking in the lemma field now – we’re looking at the actual text of the Bible, so there will be no field tag. So the skeleton of our search will be:
lemma:marks(אֱלוֹהַּ) ANDEQUALS marks()
lemma:marks(אֱלוֹהַּ) @ marks()
The @ sign is just shorthand for ANDEQUALS. We’re going to put our Regular Expression inside the second marks() operator. Use two forward slashes to delimit a regular expression, like so:
lemma:marks(אֱלוֹהַּ) @ marks(//)
The expression is placed between the two slashes. A period (.) in a regular expression will match any character. An asterisk (*) means the previous element of the Regular Expression can occur zero or more times (as opposed to a plus sign (+), which means an element can occur one or more times, or a question mark (?), which means an element can occur zero or one times). So if we want to allow any prefix or suffix (including no prefix or suffix) on our Hebrew word, we can place a period and an asterisk at the beginning and end of the query to mean ‘match any character zero or more times’, like so:
lemma:marks(אֱלוֹהַּ) @ marks(/.*X.*/)
To make it easier to follow along, I’ve placed an X where the rest of the query will go. The X has no special meaning in a finished Regular Expression – it’s just the letter X.
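The behaviour of the period, asterisk, plus sign and question mark can be tried in any regular expression engine. Here is a small sketch using Python’s re module rather than Logos itself. The helper name matches is hypothetical, and the use of fullmatch (the whole string must fit the pattern) is an assumption made to mimic matching one whole word at a time.

```python
import re

def matches(pattern, word):
    """Return True when the entire word fits the pattern."""
    return re.fullmatch(pattern, word) is not None

# '.*' on both sides allows any prefix or suffix, including none at all.
bare = matches(r".*X.*", "X")
wrapped = matches(r".*X.*", "preXsuf")

# '.+' demands at least one character before the X, so a bare X fails.
needs_prefix = matches(r".+X.*", "X")

# '?' lets the preceding element appear zero or one times, but not twice.
optional_absent = matches(r"ab?c", "ac")
optional_doubled = matches(r"ab?c", "abbc")

print(bare, wrapped, needs_prefix, optional_absent, optional_doubled)
```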
From here, remembering that it is best to type the non-Hebrew marks first, and that the Regular Expression is going to read the query from left to right, let’s sketch out the rest of the query. Eloah starts with an aleph, and usually has a vowel point under it. But some prefixes that could go before the aleph might ‘steal’ its vowel point. So we need to have an aleph that may or may not be followed by a vowel point. I’ll put a Y in the place where we’re going to put our aleph below:
lemma:marks(אֱלוֹהַּ) @ marks(/.*Y.?X.*/)
Note the use of the .? to allow a mark (vowel) to follow the aleph, but not to require a mark (since a question mark means to match the previous element zero or one time). Technically, a period could match a consonant as well as a vowel, but since we know that we’re going to put a consonant (lamedh) first in the X section, and you’re never going to have aleph-consonant-lamedh with no vowels intervening, we can safely assume that a single period with a question mark will only match a vowel, if one is present. You’ll learn how to match only vowels more explicitly in the next example.
The query is now ready to add the Hebrew. We’re going to type an aleph where the Y is, and the string ‘lamedh-holem-he’ where the X is. There is no fancy Regular Expression syntax being used in the X section, and we can type Hebrew in the logical order – we don’t have to type it backwards. Even though the Y is being typed to the left of the X because the query is read from left to right, the software will read a string of Hebrew letters in the right order.
Because Microsoft’s rendering code will try to interpret the .? as Hebrew or right-to-left marks, the display of the query will shift once the Hebrew is entered, making it look as if the aleph was typed to the right of the rest of the query. Don’t worry about this – the query will work just fine, because we typed the non-Hebrew structure of the query in first. So our completed query looks like so:
lemma:marks(אֱלוֹהַּ) @ marks(/.*א.?לֹה.*/)
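The regular expression half of this query can be tried outside Logos. What follows is a minimal sketch in Python’s re module, not the Logos engine. It assumes that holem-waw is stored as waw followed by the holem point, that fullmatch is a fair stand-in for matching one whole word, and that the words carry no accents. The test strings are constructed for illustration and are not quotations from the Bible text; all names are hypothetical.

```python
import re

# Consonants and points, written as escapes to avoid right-to-left display issues.
ALEPH, LAMEDH, HE, WAW = "\u05d0", "\u05dc", "\u05d4", "\u05d5"
HATEPH_SEGHOL, TSERE, PATACH, HOLEM, MAPPIQ = "\u05b1", "\u05b5", "\u05b7", "\u05b9", "\u05bc"

# Same shape as the query above: any prefix, aleph, one optional character
# (normally a vowel point), then lamedh + holem + he, then any suffix.
DEFECTIVE = re.compile(".*" + ALEPH + ".?" + LAMEDH + HOLEM + HE + ".*")

# Constructed test strings (assumptions for illustration only).
defective = ALEPH + HATEPH_SEGHOL + LAMEDH + HOLEM + HE + PATACH + MAPPIQ
plene = ALEPH + HATEPH_SEGHOL + LAMEDH + WAW + HOLEM + HE + PATACH + MAPPIQ
# A prefix that has taken over the aleph's vowel, leaving aleph directly before lamedh.
prefixed = LAMEDH + TSERE + ALEPH + LAMEDH + HOLEM + HE + PATACH + MAPPIQ

def is_defective(word):
    """True when the whole word fits the defective-spelling pattern."""
    return DEFECTIVE.fullmatch(word) is not None

for word in (defective, plene, prefixed):
    print(word, is_defective(word))
```

In this sketch the plene form is rejected because a waw sits between the lamedh and the holem, while the prefixed form is accepted because the .? is allowed to match nothing.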
Executing the search against the Westminster database yields hits in four verses. Excellent! Let’s turn to query number 2.
Find every word that has the vowel pattern of a half-vowel (not including vocal shewa) followed by a patach or a qamets followed by a patach or a qamets.
Most students of the Hebrew Bible are satisfied with searching for words based on stem labels found in the morphological databases, such as the Qal stem or the Hiphil stem. But sometimes students and scholars interested in morphology or orthography want to find words of a very particular pattern of vowels. We can use all the Regular Expression tricks learned above to find these patterns. We just need one new trick.
Regular Expressions use square brackets to create character classes. So if you want to match any vowel in English, you’d have a Regular Expression like:
[aeiou]
This would match any single instance of a lower-case vowel. If you wanted to match one or more vowels, you can use the syntax trick we’ve already learned, the plus sign:
[aeiou]+
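For readers who want to try the English vowel classes just described, here is a short sketch in Python’s re module (an illustration outside Logos; the sample word is arbitrary). The class [aeiou] matches one lower-case vowel at a time, and [aeiou]+ matches each run of adjacent vowels.

```python
import re

text = "beautiful"

# One match per lower-case vowel.
single_vowels = re.findall(r"[aeiou]", text)

# One match per run of adjacent vowels.
vowel_runs = re.findall(r"[aeiou]+", text)

print(single_vowels)
print(vowel_runs)
```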
We’ll start the search similar to the previous example, using the marks() operator to match vowel points, and the // to delimit a new Regular Expression:
marks(//)
Again, we’ll allow for any prefix or suffix in our Regular Expression:
marks(/.*.*/)
This time we’re going to use the period (which matches any character) in place of our three consonants. Then we’re going to use square brackets to create character classes for all the vowel points that are allowed in each slot after the consonants.
marks(/.*.[X].[Y].[Z].*/)
Now all we have to do is replace the X, Y and Z with the vowels we want to allow in each position. The X will be replaced with all the half vowels: hateph-seghol, hateph-patach, and hateph-qamets. The Y and Z will both be replaced with qamets and patach. The resulting query looks like:
marks(/.*.[ֱֲֳ].[ַָ].[ַָ].*/)
(Inside Logos Bible Software, the order of the character classes is not reversed when only vowel points are used in a query, but your web browser may display the query slightly differently. Again, typing all non-Hebrew characters first solves the problems associated with these rendering difficulties.)
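The vowel-pattern expression described above can also be sketched in Python’s re module. This is an illustration under stated assumptions, not the Logos engine: the names are hypothetical, fullmatch stands in for whole-word matching, and the sample words are unaccented. In plain Python a dagesh or shin dot stored between a consonant and its vowel would not be covered by a single period; whether Logos treats such marks differently is not something this sketch claims to know.

```python
import re

HALF_VOWELS = "\u05b1\u05b2\u05b3"  # hateph-seghol, hateph-patach, hateph-qamets
A_VOWELS = "\u05b7\u05b8"           # patach, qamets

# any prefix, then consonant + half vowel, consonant + a-vowel,
# consonant + a-vowel, then any suffix
VOWEL_PATTERN = re.compile(f".*.[{HALF_VOWELS}].[{A_VOWELS}].[{A_VOWELS}].*")

# Sample words spelled out by code point (chosen for illustration).
adamah = "\u05d0\u05b2\u05d3\u05b8\u05de\u05b8\u05d4"               # hateph-patach, qamets, qamets
shalom = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"               # no half vowel at all
anashim = "\u05d0\u05b2\u05e0\u05b8\u05e9\u05c1\u05b4\u05d9\u05dd"  # third vowel is a hireq

def has_pattern(word):
    """True when the whole word fits half vowel + a-vowel + a-vowel."""
    return VOWEL_PATTERN.fullmatch(word) is not None

for word in (adamah, shalom, anashim):
    print(word, has_pattern(word))
```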
All the features of Logos Bible Software Regular Expressions can be found in the help files under Searching | Advanced Searching, but I hope these real world examples that combine Regular Expressions with field searches and the marks() and ANDEQUALS/@ operators will make it easier for Hebrew students to take advantage of these powerful features.
Advanced alternative to typing Hebrew
One can avoid the complexities of having to mix the right-to-left Hebrew script with left-to-right Regular Expression syntax by typing the Unicode values of the characters in question. Unicode is the international standard for multi-lingual encoding. In Logos regular expressions, one can type \u followed by the 4 hexadecimal digits that define a Hebrew character instead of switching keyboards to type in Hebrew. So, for example, the query above could be rendered as:
marks(/.*.[\u05b1\u05b2\u05b3].[\u05b7\u05b8].[\u05b7\u05b8].*/)
It is also easier to use ranges in Hebrew regular expressions using this syntax. For example, matching a half vowel in the expression above could be shortened to [\u05b1-\u05b3].
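The equivalence between the range and the explicit list of half vowels can be checked with a short Python sketch (outside Logos; the names are hypothetical). It tests every Hebrew point from U+05B0 to U+05BB against both forms of the class and prints the standard Unicode name of each point the range accepts.

```python
import re
import unicodedata

# The half-vowel class written two ways: listed explicitly, and as a range.
explicit = re.compile("[\u05b1\u05b2\u05b3]")
ranged = re.compile("[\u05b1-\u05b3]")

# Every Hebrew point from U+05B0 through U+05BB.
points = [chr(cp) for cp in range(0x05B0, 0x05BC)]

explicit_hits = [p for p in points if explicit.fullmatch(p)]
ranged_hits = [p for p in points if ranged.fullmatch(p)]

# Show the escape and the Unicode character name for each half vowel.
for p in ranged_hits:
    print(f"\\u{ord(p):04x}", unicodedata.name(p))
```

The same unicodedata.name lookup is a convenient way to confirm which character a given \u value refers to before putting it in a query.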
The code chart for the Unicode Hebrew block (U+0590 to U+05FF) is published by the Unicode Consortium.