Entropic Timer: The Information Entropy of Crossing the Street

You know those countdown timers at crosswalks? Sometimes when crossing the street, I like to try to guess what number it’s on even when I can’t see the whole thing (like when approaching the intersection at an oblique angle).

Crosswalk signal countdown timer

This got me (over)thinking: if I want to know how much time is left, is it better to see the right side of the countdown timer (approaching from the left), or the left side (approaching from the right)? In other words, does the left or right side of the display carry more information?

These timers use seven-segment displays. Even if you didn’t know they were called seven-segment displays, you see them all over the place. They use seven separate segments, labeled A–G, to create each of the 10 digits from 0–9.


To form each of the ten digits, the seven segments are turned on (1) or off (0) in different combinations. Here are the standard representations of 0–9.


The seven segments aren’t all turned on an equal number of times over the course of the ten digits. That means some segments are more likely to be seen turned on than others.

Segment    On for how many digits?
A          8/10
B          8/10
C          9/10
D          7/10
E          4/10
F          6/10
G          7/10

So how can we tell which of these seven segments communicates the most information?

Information entropy

The segments that are on or off for close to half the digits contain more information than those that are either on or off for most digits. This is intuitive for the same reason a fair coin toss contains more information than tossing a coin with heads on both sides: you’re less certain what you’re going to get, so you learn more by observing the value.

Claude Shannon’s[1] concept of entropy from information theory is a good way to quantify this problem. Entropy, \(H\), is defined as

\[H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)\]

Oh no.

Here’s what that means in the case of a seven-segment display. \(X\) is a random variable representing whether a segment is on or off. Since a segment can only have two states, the random variable \(X\)’s actual values are either on or off. \(P\) is the probability operator, so \(P(x_i)\) really means the probability that a segment is on or off. (\(b\) is the base of the logarithm. We’re going to use 2 because we like bits.)

Let’s take segment A as an example. It’s on for 8 out of 10 digits, and off for 2 out of 10. That means the probability of seeing it on is 0.8, and the probability of seeing it off is 0.2. In other words (well, symbols), \(P(x_{\mathrm{on}}) = 0.8\) and \(P(x_{\mathrm{off}}) = 0.2\).

Plugging that in,

\[H(X) = -\left(0.8 \log_2 0.8 + 0.2 \log_2 0.2\right) \approx 0.722\]

In Shannon’s terms, there are 0.722 bits of information communicated by segment A of a seven-segment display.

Doing this for all seven segments, we get these entropy values:

Segment    Shannon entropy (bits)
A          0.721928
B          0.721928
C          0.468996
D          0.881291
E          0.970951
F          0.970951
G          0.881291
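The table above can be reproduced with a few lines of Python. This is a minimal sketch, not the code behind the post: the names `DIGITS`, `entropy`, and `segment_entropy` are made up for illustration, and the digit shapes assume the common encoding in which 6 and 9 have tails and 7 has no upper-left segment (which is what the on-counts listed earlier imply).

```python
from math import log2

# Segments lit for each digit. Assumed encoding: 6 and 9 have tails,
# 7 has no upper-left segment (consistent with the on-counts above).
DIGITS = {
    0: "ABCDEF", 1: "BC", 2: "ABDEG", 3: "ABCDG", 4: "BCFG",
    5: "ACDFG", 6: "ACDEFG", 7: "ABC", 8: "ABCDEFG", 9: "ABCDFG",
}

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

def segment_entropy(segment):
    """Entropy of one segment's on/off state across the ten digits."""
    p_on = sum(segment in lit for lit in DIGITS.values()) / len(DIGITS)
    return entropy([p_on, 1 - p_on])

for segment in "ABCDEFG":
    print(f"Segment {segment}: {segment_entropy(segment):.6f} bits")
```

Each digit is treated as equally likely, which is the same assumption the hand calculation for segment A makes.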

It sure looks like segments E and F carry the most information. That makes sense because they’re the closest to being on/off 50% of the time. Guess it’s better to approach an intersection from the right in order to see the left-hand segments.

But wait.

When approaching an intersection, you can see both right segments (B and C), or both left segments (E and F). A pair of segments from a single display are anything but independent because they’re both showing part of the same digit, so we can’t just add up their entropies.

Instead, treat each pair as if it holds a single value. Taken together, two segments can take on any of four values (off–off, off–on, on–off, on–on), which is binary for 0–3.

Digit    Segments B & C    Binary    Decimal
0        On – On           11        3
1        On – On           11        3
2        On – Off          10        2
3        On – On           11        3
4        On – On           11        3
5        Off – On          01        1
6        Off – On          01        1
7        On – On           11        3
8        On – On           11        3
9        On – On           11        3

Digit    Segments E & F    Binary    Decimal
0        On – On           11        3
1        Off – Off         00        0
2        On – Off          10        2
3        Off – Off         00        0
4        Off – On          01        1
5        Off – On          01        1
6        On – On           11        3
7        Off – Off         00        0
8        On – On           11        3
9        Off – On          01        1

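In this case, our random variable \(X\) can take on four possible values rather than just two. Taking segments E and F as an example, the joint value is 0 for 3/10 digits, 1 for 3/10 digits, 2 for 1/10 digits, and 3 for 3/10 digits. Going back to the initial definition of entropy, we get

\[H(X) = -\left(\tfrac{3}{10} \log_2 \tfrac{3}{10} + \tfrac{3}{10} \log_2 \tfrac{3}{10} + \tfrac{1}{10} \log_2 \tfrac{1}{10} + \tfrac{3}{10} \log_2 \tfrac{3}{10}\right) \approx 1.90\]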

So we get 1.16 bits of information in joint segments B–C, and 1.90 bits in joint segments E–F. So there you have it: it’s still better to approach an intersection from the right.
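Here is a rough Python sketch of that joint calculation, treating each pair of segments as a single four-valued variable. The helper name `joint_entropy` and the `DIGITS` encoding are assumptions for illustration (the encoding is the common one where 6 and 9 have tails and 7 has no upper-left segment).

```python
from collections import Counter
from math import log2

# Assumed digit encoding, consistent with the per-digit tables above.
DIGITS = {
    0: "ABCDEF", 1: "BC", 2: "ABDEG", 3: "ABCDG", 4: "BCFG",
    5: "ACDFG", 6: "ACDEFG", 7: "ABC", 8: "ABCDEFG", 9: "ABCDFG",
}

def joint_entropy(segments):
    """Entropy in bits of the combined on/off state of several segments."""
    # One joint state per digit, e.g. (True, False) for on-off.
    states = Counter(
        tuple(s in lit for s in segments) for lit in DIGITS.values()
    )
    total = sum(states.values())
    return -sum(n / total * log2(n / total) for n in states.values())

print(f"B-C: {joint_entropy('BC'):.2f} bits")
print(f"E-F: {joint_entropy('EF'):.2f} bits")
```

Counting joint states directly sidesteps the independence problem: nothing is added up per segment, so the correlation between the two segments is already baked into the counts.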

But wait!

When was the last time you walked up to an intersection and only saw the timer on one number? If you look for at least half a second (on average), you’ll see it tick down.


Luckily, Wikipedia says that

For a first-order Markov source (one in which the probability of selecting a character is dependent only on the immediately preceding character), the entropy rate is:

\[H(\mathcal{S}) = -\sum_i p_i \sum_j p_i(j) \log p_i(j)\]

where \(i\) is a state (certain preceding characters) and \(p_{i}(j)\) is the probability of \(j\) given \(i\) as the previous character.

But actually, I don’t like this notation, so I’m going to rewrite it as

\[H(\mathcal{S}) = -\sum_i P(x_i) \sum_j P(x_j|x_i) \log_b P(x_j|x_i)\]

Alright, then. The probability of seeing a given state is the same as before. As for the conditional probabilities, let’s go back to the 0–3 binary values and assume 0 loops back to 9[2]. If we see segments B and C in a 1 state (off–on), on the next tick they will be in a 1 state half the time, and a 3 state half the time. Going through the rest of the states and transitions, we get these transition probabilities:

State transition probabilities

So for segments E and F, when \(i = 0\) and \(j = 2\), \(P(x_i) = \frac{3}{10}\) as before, and \(P(x_j|x_i) = \frac{1}{3}\) because, as those circles show, a 0 transitions to a 2 a third of the time.

Now it’s just a matter of an inelegant nested for loop to determine that the first-order entropy rate of segments B–C is 1.00 bits, and 1.03 bits for segments E–F.
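For the curious, one possible version of that nested loop in Python. This is a sketch rather than the original code: `entropy_rate` and `DIGITS` are illustrative names, the digit encoding is assumed (6 and 9 with tails, 7 without an upper-left segment), and the countdown is assumed to wrap from 0 back to 9 as described above.

```python
from collections import Counter, defaultdict
from math import log2

# Assumed digit encoding, consistent with the per-digit tables above.
DIGITS = {
    0: "ABCDEF", 1: "BC", 2: "ABDEG", 3: "ABCDG", 4: "BCFG",
    5: "ACDFG", 6: "ACDEFG", 7: "ABC", 8: "ABCDEFG", 9: "ABCDFG",
}

def entropy_rate(segments):
    """First-order entropy rate, in bits, of a countdown seen through `segments`."""
    state = {d: tuple(s in lit for s in segments) for d, lit in DIGITS.items()}

    # Count transitions from each digit to the next one down (0 wraps to 9).
    transitions = defaultdict(Counter)
    for digit in DIGITS:
        transitions[state[digit]][state[(digit - 1) % 10]] += 1

    rate = 0.0
    for i, following in transitions.items():
        n_i = sum(following.values())
        p_i = n_i / len(DIGITS)            # P(x_i)
        for j, n_ij in following.items():
            p_j_given_i = n_ij / n_i       # P(x_j | x_i)
            rate -= p_i * p_j_given_i * log2(p_j_given_i)
    return rate

print(f"B-C: {entropy_rate('BC'):.2f} bits")
print(f"E-F: {entropy_rate('EF'):.2f} bits")
```

The outer loop walks the current states and the inner loop walks the states that can follow, which mirrors the double sum in the entropy-rate formula.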

So, if you can manage to stare at either the left or right segments for a whole second, you’re still better off looking at the left segments, but not by much.

I’ll leave figuring out the entropy rates for looking at it longer as an exercise for the reader, because I’m done overthinking this (for now).

The 7-segment display CSS is on CodePen.

  1. Shannon and I both got undergrad degrees in EE from the University of Michigan, but he went on to create information theory, and I went on to write this stupid blog post.

  2. This makes sense for the 1s place for segments B–C, but not for E–F.

Is yontef early this year (dot com)

If two points (or posts) make a trend, interactive data visualizations of the Hebrew calendar are a thing I blog about now. In the long and storied tradition of single-use novelty sites, I’ve created isyontefearlythisyear.com (and its evil twin, isyonteflatethisyear.com). Now you can point to real data when the conversation inevitably comes up before every holiday.

screenshot showing an early Chanukah 2018

When is Passover (or Easter)?

A few weeks ago, @iamreddave tweeted a plot of Easter dates since 1600. I thought it was a very cool looking pattern, with very clear cyclicality.

Being Jewish, I immediately thought of the calendrical connection between Easter and Passover. Specifically, since Easter is usually around Passover, does the 19-year cycle of Hebrew leap years play a role in when Easter falls?

Very briefly (and approximately), a solar year is aligned with the seasons (because a year is one orbit of the earth around the sun), but the Hebrew calendar is based on a lunar calendar in which a month is determined by one cycle through the phases of the moon. The solar year is approximately 365 days, while 12 lunar months are approximately 354 days, or 11 days shorter. If the Hebrew calendar were a pure lunar calendar, over time the months would drift around the year. To make up for this shortfall, a 30-day leap month is added to the Hebrew calendar every two to three years, seven times in a 19-year cycle (years 3, 6, 8, 11, 14, 17, and 19). (30 days × 7 years ≈ 11 days × 19 years. Hey, I said this explanation is approximate.)
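The leap-year pattern is easy to express in code. The sketch below is only an illustration: `cycle_position` and `is_leap_year` are made-up names, and the cycle position is assumed to be the Hebrew year number counted modulo 19 (which agrees with 5777 being the first year of a cycle).

```python
# Positions in the 19-year cycle that get a leap month, as listed above.
LEAP_POSITIONS = {3, 6, 8, 11, 14, 17, 19}

def cycle_position(hebrew_year):
    """Position (1-19) of a Hebrew year within its 19-year cycle (assumed mod 19)."""
    return (hebrew_year - 1) % 19 + 1

def is_leap_year(hebrew_year):
    """True if the year falls on one of the seven leap positions."""
    return cycle_position(hebrew_year) in LEAP_POSITIONS

# The approximation from the paragraph above.
days_added = 30 * 7    # seven 30-day leap months per cycle
days_short = 11 * 19   # roughly 11 days short in each of 19 years
print(days_added, days_short)  # 210 209

for year in range(5777, 5783):
    print(year, cycle_position(year), is_leap_year(year))
```

The one-day gap between 210 and 209 is the "approximate" part; the real calendar has further adjustments that this sketch ignores.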

To see the effect of Hebrew leap years on Easter dates, I recreated iamreddave’s graph, but with larger points for leap years and points colored by position in the 19-year cycle.

Interact with these graphs at https://projects.noahliebman.net/pesach-easter/

Easter dates

What jumps out to me is that all of the late Easter dates are Hebrew leap years, which is what you’d expect when an additional month has recently been inserted, but all of the early Easter dates are also Hebrew leap years.

Passover, on the other hand, always occurs late in a leap year, as you’d expect:

Passover dates

Toggling between the two, it looks like it’s years with the latest Passovers that get leap-year–early Easters.

Animating between Easter and Passover

Zoom in a bit and you’ll find that the early Easter dates are always years 8, 11, and 19 of the 19-year cycle:

Easter, zoomed in on about 20 years

I thought maybe this happens because the Christian 19-year cycle is shifted by three years from the Jewish cycle (2014 was the first year of the Christian cycle, while 2017/5777 is the first year of the Jewish cycle), but this isn’t the case. Here’s what seems to be happening:

Easter is (by definition) the first Sunday after the full moon after the vernal (in the northern hemisphere) equinox. Typically, that’s the full moon of Nissan (the Hebrew month which contains Passover), but in those three years the leap month pushes Passover so late that it’s a full month later than the equinox. In other words, in those years the new moon that marks the start of Nissan is at least ~14 days after the equinox, which puts a full moon very shortly after the equinox, which is still in Adar II (the month before Nissan).

Shout out to the Hebcal team for their amazing tools!

Interact with these graphs at https://projects.noahliebman.net/pesach-easter/

On fear, nationalism, and oppression in Shmot

With parshat Shmot coinciding with the inauguration (err, Put-in) of Donald Trump, this image from Yossi Fendel has been making the rounds on social media. It quotes the eighth verse of the parsha (and book):

וַיָּ֥קָם מֶֽלֶךְ־חָדָ֖שׁ עַל־מִצְרָ֑יִם אֲשֶׁ֥ר לֹֽא־יָדַ֖ע [אֶת־יוֹסֵֽף]׃

A new king arose over Egypt who did not know [Joseph].

A new king arose who did not know. Image by Yossi Fendel

It’s an ominous image, and makes an important point, but it’s the next couple of sentences that have really stuck with me for the last several years:

וַיֹּ֖אמֶר אֶל־עַמּ֑וֹ הִנֵּ֗ה עַ֚ם בְּנֵ֣י יִשְׂרָאֵ֔ל רַ֥ב וְעָצ֖וּם מִמֶּֽנּוּ׃

And he said to his people, “Look, the Israelite people are much too numerous for us.

הָ֥בָה נִֽתְחַכְּמָ֖ה ל֑וֹ פֶּן־יִרְבֶּ֗ה וְהָיָ֞ה כִּֽי־תִקְרֶ֤אנָה מִלְחָמָה֙ וְנוֹסַ֤ף גַּם־הוּא֙ עַל־שֹׂ֣נְאֵ֔ינוּ וְנִלְחַם־בָּ֖נוּ וְעָלָ֥ה מִן־הָאָֽרֶץ׃

Let us deal shrewdly with them, so that they may not increase; otherwise in the event of war they may join our enemies in fighting against us and rise from the ground.”

The liturgy talks a lot about the exodus from Egypt, but focuses far less on why the Israelites became enslaved in the first place. The answer, this parsha makes clear, is fear. Fear of shifting demographics. Fear of an ethnic group that looked different, spoke differently, and had different practices and customs — yet served an important economic function by doing the job no Egyptian was willing to do.

Faced with that fear from shifting demographics, the Pharaoh had at least a couple of courses of action. He could have pushed an agenda of multiculturalism, encouraging the Egyptians and Israelites to get to know one another, thereby mitigating their fear. Instead, he felt that it was more important to maintain what he considered the fundamentally Egyptian character of Egypt.

The United States — at least in theory — was founded not as “a place for a people”, but as a place for all people. Sadly, there are people who believe that America was a white country (back when it was great or something 🙄), and they are now feeling the same fear and oppressive urges the biblical Pharaoh felt.

This is precisely the danger that comes along with ethnic, racial, or religious nationalism. A nation founded as “a place for a people” cannot simultaneously offer full and equal rights/privileges to all, and continue to exist should that people become a minority. And the only ways to maintain the “desired” demographics are exclusion and oppression. Whether it’s in the context of Trump-emboldened white nationalism here in America, or Zionism, its moral equivalent, let’s learn from this week’s well-timed parsha: national ideals that depend on maintaining certain demographics are inherently oppressive.

In a place like America, although changing demographics can bring up a natural fear of the stranger, it also provides us with an opportunity to not be like Pharaoh and to strive for a multicultural ideal. The Torah reminds readers that, because the Israelites were strangers in Egypt, not only is one forbidden to oppress the stranger [1, 2], but it explains how: by loving that stranger [3]. But loving the stranger is abstract. Perhaps it’s better to take a cue from the JPS translation and befriend the stranger. Friends are way less scary than strangers.

Some thoughts on the Electoral College

I just read Madison’s Federalist Paper #10. Very interesting stuff. At a high level, the purpose of electors is to mitigate the effect of “factions”, which he defines (all emphasis in block quotes is mine):

By a faction, I understand a number of citizens, whether amounting to a majority or a minority of the whole, who are united and actuated by some common impulse of passion, or of interest, adverse to the rights of other citizens, or to the permanent and aggregate interests of the community.

A p’shat interpretation through a contemporary lens would seem to be a strong argument in favor of the electors being unfaithful and voting against Trump. After all, he explicitly threatened the rights of several groups of citizens, and his authoritarian tendencies pose a threat to the “aggregate interests of the community.”

Indeed, it is common to say that the purpose of the Electoral College is to protect the public good from the irresponsible or uneducated will of the people, and that’s also true:

The effect of [a Republic] is, on the one hand, to refine and enlarge the public views, by passing them through the medium of a chosen body of citizens, whose wisdom may best discern the true interest of their country, and whose patriotism and love of justice will be least likely to sacrifice it to temporary or partial considerations. Under such a regulation, it may well happen, that the public voice, pronounced by the representatives of the People, will be more consonant to the public good, than if pronounced by the People themselves, convened for the purpose.

However, Madison’s actual concern, it seems, is that non–land-owning voters[1] would overwhelm the landed class. He even explicitly calls out “an equal division of property” as exactly the type of “wicked project” a representative republic can protect against.

But the most common and durable source of factions has been the various and unequal distribution of property. Those who hold, and those who are without property, have ever formed distinct interests in society. Those who are creditors, and those who are debtors, fall under a like discrimination. A landed interest, a manufacturing interest, a mercantile interest, a moneyed interest, with many lesser interests, grow up of necessity in civilized nations, and divide them into different classes, actuated by different sentiments and views. The regulation of these various and interfering interests forms the principal task of modern Legislation, and involves the spirit of party and faction in the necessary and ordinary operations of the Government.

It’s also impossible to ignore the effects of media and technology. We are hardly a united country, but the divisions depend on sociological environment (racial, religious, and ethnic diversity, wealth, rural–urban, etc.), not proximity.

The influence of factious leaders may kindle a flame within their particular States, but will be unable to spread a general conflagration through the other States: A religious sect may degenerate into a political faction in a part of the Confederacy; but the variety of sects dispersed over the entire face of it, must secure the National Councils against any danger from that source.

A conflagration can now easily spread across the continent.

Here in 2016 we have a situation where the Electoral College is about to vote for a candidate who is “adverse to the rights of other citizens, or to the permanent and aggregate interests of the community” when they are supposed to be the ones “whose wisdom may best discern the true interest of their country.” Therefore, it is easy to argue that they should vote counter to the will of the voters in their states. On the other hand, had Bernie Sanders been elected (if only!), someone reading the very same document could argue that the citizens whose rights are being infringed upon are the wealthy 1% whose property would be at risk of “[more] equal division”.

My take is that the threats to the Republic in the face of a Trump presidency are sufficient, and the adverse effects on the rights of citizens substantial enough, that the electors should vote for Hillary Clinton. The argument that a more left-leaning economic policy would infringe on the right of the 1% to hold their wealth breaks down because the effect would not be sufficiently “adverse”, and a better-functioning, more equitable economy is in “the true interest of their country.”

I’m sure there are other historical arguments for and against the Electoral College, but based on this one, I believe the electors should elect Hillary Clinton.

  1. Franchise was being slowly extended to non–land-owning white men in various states at the time. Wikipedia

Vote your conscience

What’s happening at the Republican National Convention doesn’t feel real, but it’s real. The self-aggrandizing nominee for president claimed, “I alone can fix it.” Later, chants of “Yes you will, yes you will.” This is not about policies; it’s fear and cult of personality.

Fascism is the following (copied and pasted from here):

  • Glorification of the past (before the debasement of the nation); past seen as glorious, source of inspiration for the present.
  • Exaltation of force, strength, violence: slogans, symbols, costumes, insignias, military. Promotes discipline, sacrifice, blind obedience to the leader.
  • A reaction (defines itself through reaction to something else): against those that have debased the nation, those that disunite it, that cannot defend it against its enemies.
  • In fascism, the enemies of the nation are old corrupt politicians, foreigners, especially Jews, communism (promoted by Jews).

And let’s not forget the calls of “America First”, which is a reference to the America First Committee, whose most prominent spokesman was Nazi sympathizer Charles Lindbergh.

By all means, vote your conscience. I just hope that your conscience tells you that, above all, this man and his party must be defeated.

Quantified cantillation III: sequences

First post
Second post

Earlier this year I published a couple of blog posts with some descriptive statistics of trop in the Torah. One of the biggest shortcomings of those posts was that they didn’t deal with the order of trop at all. This is a pretty big shortcoming when you consider that many trop come in pairs/groups, or that certain trop frequently or necessarily follow certain other trop. So, this time around I created an interactive tool I’m calling (for lack of creativity) the Trop Sequence Explorer. If you haven’t checked it out yet, I’d suggest playing around with it a bit; it’ll give you context for the rest of this post.

Basically, it shows each trop listed in order from most to least common. When you click one, it shows you all trop that can follow it and how often each one occurs in that sequence. In other words, it shows transition probabilities to each trop conditional on all trop that come before it in a sequence. There’s also a graph at the bottom that shows how often the selected sequence occurs in each perek of the Torah. Clicking a bar in the graph shows the text of the p’sukim in that perek that contain the current sequence.
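To make "conditional on all trop that come before it" concrete, here is a small Python sketch of the counting involved. It is not the explorer's actual code: the function names, the data layout (one list of trop names per pasuk), and the three sample psukim are all invented for illustration, and it assumes a sequence may start anywhere in a pasuk.

```python
from collections import Counter

def followers(psukim, sequence):
    """Count the trop that directly follow each occurrence of `sequence`."""
    sequence = list(sequence)
    n = len(sequence)
    counts = Counter()
    for trop in psukim:
        # Stop early enough that there is always a trop after the match.
        for start in range(len(trop) - n):
            if trop[start:start + n] == sequence:
                counts[trop[start + n]] += 1
    return counts

def transition_probabilities(psukim, sequence):
    """Share of occurrences of `sequence` that continue with each trop."""
    counts = followers(psukim, sequence)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}

# Made-up sample data, not real psukim.
psukim = [
    ["merkha", "tipkha", "etnakhta", "merkha", "tipkha", "sof pasuk"],
    ["munakh", "zarka", "munakh", "segol", "merkha", "tipkha", "sof pasuk"],
    ["zarka", "segol", "tipkha", "etnakhta", "tipkha", "sof pasuk"],
]
print(followers(psukim, ["merkha", "tipkha"]))
print(transition_probabilities(psukim, ["zarka"]))
```

Each click in the tree corresponds to extending `sequence` by one trop and counting again.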

Trop Explorer screenshot

What follows is a bit of the thought process that went into its creation, some issues I ran into, and some interesting observations. Feel free to jump to the section that’s most interesting to you.

The Jewish Nerd section

Back in the fall, I was gabbaiing and noticed two tevirs in a row. “How often does that happen?”, I wondered. Seven times, it turns out. It’s pretty well known that a zarka has to be followed by a segol or a munakh segol, but it turns out that the latter is actually more common (by a 13-point margin).

Beyond the factoids, there are other fun things to come across. Parallel sentence structures often have parallel trop, even when the trop itself is not that common. In B’midbar 26, gadol is used at a much higher rate than normal, mostly on names in a genealogy; it really pops out in the bar graph.

Gadol in B'midbar 26

One of the most surprising things for me, though, is how relatively unique each pasuk is. Once you get more than three or four levels deep in the tree, there are surprisingly few p’sukim that match that sequence. This is even true for seemingly common sequences. A pasuk that is merkha tipkha etnakhta merkha tipkha sof pasuk only happens 43 times in the entire Torah.

As I was creating the Sequence Explorer, I encountered some challenges and needed to make some decisions about how it used trop data. One question several people have raised is: Why are there ever trop following a sof pasuk? Shouldn’t a sof pasuk, by definition, be the end of a pasuk? The answer is that there are two sets of trop used for the 10 Commandments, the takhtonim, which are used for private study, and the elyonim, which are used for public readings. I chose to use the elyonim because I wanted to examine how trop are read out loud. The problem is that the two sets of trop also have different pasuk divisions. Even though I used the elyon trop, I had to use the takhton pasuk divisions, because the takhton divisions seem to be more standard, and are the ones returned by the Sefaria API, which is what I used to pull in the actual pasuk text when you click on a perek’s bar in the bar graph. Perhaps at some point I’ll add a setting so people can explore both versions.

Many authorities consider munakh legarmeh a separate trop. I decided not to count it separately for two reasons. The simple technical reason is that there is not a different Unicode character for it (distinct from munakh), so I would have to detect it based on context. The other is that, by definition, the munakh legarmeh is a munakh that precedes another munakh. Since that’s exactly the type of data this app shows, it felt both redundant and somewhat circular to distinguish a trop by what follows it. If you click the munakh, the number of munakhs that follow it should be equal to the number of munakh legarmehs.

Seeing sequences also helped me find issues in the data that I couldn’t see otherwise. For example, I found a couple instances where the data showed four pashtas in a row, but this wasn’t really the case. Trop typically indicate where the stress should fall in a word, but some trop must be placed at either the beginning or the end of a word regardless of stress. To help readers, many sources, including — I found out — the Tanach.us data source I used, put such trop on a word twice: once in the required position, and once where the stress falls. I cleaned out those doublings by searching for any word with two trop on it, and if the two trop were the same, I deleted one of them. Hopefully there was no collateral damage from that.
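A cleanup along those lines might look like the following Python sketch. It is a guess at the approach, not the script that was actually run: `is_trop` and `dedupe_doubled_trop` are hypothetical names, trop are identified by the Unicode "HEBREW ACCENT" code points (U+0591 through U+05AE), and the choice to drop the second copy rather than the first is arbitrary.

```python
def is_trop(ch):
    """True for Unicode HEBREW ACCENT code points (U+0591 through U+05AE)."""
    return 0x0591 <= ord(ch) <= 0x05AE

def dedupe_doubled_trop(word):
    """If a word carries exactly two trop and they are identical, drop one."""
    marks = [ch for ch in word if is_trop(ch)]
    if len(marks) == 2 and marks[0] == marks[1]:
        last = word.rindex(marks[1])  # assumption: remove the later copy
        return word[:last] + word[last + 1:]
    return word

# Made-up example: alef + pashta, bet + pashta (U+0599 is pashta).
doubled = "\u05D0\u0599\u05D1\u0599"
print(len(doubled), len(dedupe_doubled_trop(doubled)))  # 4 3
```

Words with two different trop, or with more than two, are left alone, which matches the rule described above.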

Another oddity was that there were ten tsinnorits and one geresh mukdam. This was odd because those trop aren’t used in the Torah, even if their lookalikes, zarka and geresh are. It seems like they were used for typesetting reasons — their placement on a word is slightly different — so I just lumped them in with their respective lookalikes.

There were also a number of p’sukim with no sof pasuk. I’m not sure exactly why, but I fixed them. Being able to see the bar graph across the bottom was hugely helpful in seeing that this was an issue.

Speaking of the bar graph at the bottom, aggregating by perek is somewhat arbitrary. At some point I would like to try aggregating in other ways, such as by parshah.

The Design Nerd section

I knew pretty early on that I wanted to do some sort of Markov chain–like visualization of transition probabilities, but I set the idea aside to do real work, which, fortunately, happened to involve learning D3. When I turned my attention back to this, I realized two things:

  1. Pairwise transition probabilities aren’t that interesting in isolation; sequences are much more interesting. (In other words, you need memory in your Markov chain.)

  2. As in the previous posts, we have the complete dataset. Descriptively exploring that is very different from wanting to make predictions or generate new sequences, which is a more typical use of Markov chains.

So, I settled on the basics of a design, but without a few key features. The original idea was a tree, where each level would show the conditional probability of going to a particular trop given all those that had come before it. The plan was just to show simple squares with a trop symbol, its name, conditional probability, and conditional count. And, there was no bar graph at the bottom to show where a given sequence occurred.

Original whiteboard sketch (or what's left of it)

It wasn’t until I was sketching out the visual design for the squares — well after I had it actually working — that I came up with the idea of shading them in, making them into a histogram of sorts. Since they seem to follow something not entirely unlike a Poisson distribution, I thought about log-weighting them, but decided it would be more straightforward not to since I’m also showing raw counts.

Once I could play with building sequence trees, I pretty quickly wanted to know where in the Torah those sequences were. And so, the bar graph at the bottom was born. For most of the time I was building it, clicking a bar would just open that perek on Sefaria. Using the Sefaria API to pull in the text of the actual p’sukim was one of the last features to go in.

The Programming Nerd section

When I first started thinking about how to implement this, my intuition was to have the data structure match the tree structure of the interface. It felt elegant, and it seemed like a good idea at the time. I wrote a recursive function (after fighting with mutable container objects in Python) to go through the trop strings and build a giant JSON file shaped like this:

  {
    "name": "munakh",
    "count": 5456,
    "children": [
      {
        "name": "revii",
        "count": 1410,
        "children": […]
      },
      {
        "name": "katan",
        "count": 4350,
        "children": […]
      }
    ]
  }

Well, that turned out to be 8.6 MB — way too big to download as part of a web app. A similar file that listed which prakim had which sequences was over two gigabytes. I wrote most of the UI (locally) with these two files. Thankfully, I finally realized that I could just download a 760 kB list of raw trop strings and search for sequences on demand in the browser. And that, folks, is why I’m in HCI, not real computer science. Derp.

Finally, D3 was great to work with. Being able to define a simple linear scale like this

var x = d3.scale.linear()
    .domain([0, width])
    .range([width, 0]);

even made it easy to work right-to-left when SVG objects have their origins in the upper left-hand corner.

Future work

I’m a grad student, so how can I resist a Future Work section? There are a number of features I’d like to add at some point. As I hinted at earlier in this post, it would be nice to be able to aggregate the bar graph by parshah instead of just perek. Combining other aggregations, like sefer, with the ability to limit sequence queries to certain parts of the text would open the door to adding the rest of the Tanakh. (The Emet books would be outta control!) And color coding disjunctive and conjunctive trop would be a nice way to see more structure in sequences. If you want to take a stab at any of these things, have a look at the issues list for this project on GitHub.

And if you’ve made it this far without actually using the app, go play with it now!

Unlocking an iPhone: Do You Have to Restore?

When you unlock an iPhone, the official instructions say something a little odd:

If you have a SIM card from a carrier other than your current carrier, follow these steps:

  1. Remove your SIM card and insert the new SIM card.
  2. Complete the setup process.

If you don’t have another SIM card you can use, follow these steps to complete the process:

  1. Back up your iPhone.
  2. When you have a backup, erase your iPhone.
  3. Restore your iPhone from the backup you just made.

Wait, what? Why would I want to unlock the phone if I didn’t have a SIM from another carrier? And isn’t doing a full restore kind of a lot to ask?

As far as I can tell, here’s what’s going on: when you request an unlock from your original carrier, they don’t unlock your phone, they tell Apple’s activation server that your phone is now unlocked. In order to finish the unlock process, your phone has to check in with Apple’s activation server.

There are apparently only two ways to force an iPhone to re-activate with the server: put in a new SIM, or restore the phone. But why would you go the restore route? Because if you’re traveling abroad, when you arrive at your destination and install your newly acquired SIM, it’ll try to contact the activation server. But it can’t reach the activation server because you don’t have data service on your new carrier yet. At this point you’ll be dropped into Activate mode and won’t be able to do anything with your phone until you activate it. If you happen to be somewhere with wifi that doesn’t require any sort of web-based authentication (so, not most airports or hotels) you may be able to activate that way. Otherwise, you’ll have to use iTunes on your computer — assuming your computer can get wifi.

If you won’t be traveling with your computer, or may not have access to a non-cellular internet connection, you’ll want to do that restore at home before your trip. Otherwise, skip the restore and activate through iTunes or wifi.

Quantified cantillation II: word counts

First post
Third post: Trop Sequence Explorer

A lot of discussion around my last post was about the role of sentence structure. For example, there’s a heuristic that psukim with fewer than five words don’t have an etnakhta, while those with more than five words do. This visualization lets you explore these types of relationships.
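The quantity being plotted is, roughly, the average number of times a trop appears in psukim of a given length. A minimal Python sketch of that aggregation follows; the function name, the `(word_count, trop_names)` data layout, and the sample values are all hypothetical.

```python
from collections import defaultdict

def mean_count_by_length(psukim, trop):
    """Average occurrences of `trop` per pasuk, grouped by pasuk word count."""
    # word count -> [total occurrences of trop, number of psukim seen]
    totals = defaultdict(lambda: [0, 0])
    for word_count, trop_names in psukim:
        totals[word_count][0] += trop_names.count(trop)
        totals[word_count][1] += 1
    return {n: s / k for n, (s, k) in sorted(totals.items())}

# Made-up sample data: (number of words, trop in the pasuk).
psukim = [
    (3, ["tipkha", "sof pasuk"]),
    (3, ["merkha", "tipkha", "sof pasuk"]),
    (7, ["tipkha", "etnakhta", "tipkha", "sof pasuk"]),
]
print(mean_count_by_length(psukim, "etnakhta"))
print(mean_count_by_length(psukim, "tipkha"))
```

With real data, a step from about 0 to about 1 near five words would be the etnakhta heuristic showing up in the averages.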

By word count visualization static

We see that, indeed, etnakhtas do approximately follow this pattern, while other trops’ counts naturally vary more linearly with word count (e.g., mapakh and pashta). Other trop, though, like tipkha, quickly hit a ceiling regardless of how long a pasuk gets.

Note that I’ve cut off the x axis at 33 words. While there are much longer psukim, there aren’t enough of them to get meaningful averages.

Wordcount distribution

Click in the legend to turn a trop on and off; double-click to solo it. As with the first post, there’s nothing revolutionary here, but I think it’s still interesting to see and explore. (Also, I'm no expert on D3/NVD3, so don’t judge me too harshly. And if you’re on IE and it doesn’t work, tough luck.)

Quantified cantillation

Second post: Word counts
Third post: Sequence Explorer

When read publicly, the Torah is often sung using a system of cantillation marks, or trop in Yiddish. There are many different cantillation marks, each of which has a name and a unique sound (or sounds), and comes in combination with other trop.

When the cycle of readings started over this year after Simchas Torah1, it seemed like there were more telisha gedolahs in Bereshit (Genesis), whereas there were more telisha ketanas in D’varim (Deuteronomy). I decided to find out whether or not this was really the case.

First, I needed a dataset. Tanach.us offers the entire Tanakh in XML form, including trop and vowels. I was only interested in the Torah, so I downloaded XML files for each of the five sfarim (books). I went through the XML and tabulated how many of each trop were present in each pasuk (sentence).
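For a rough idea of what the tabulation step might look like, here is a minimal Python sketch. It assumes the trop appear in each pasuk's text as Unicode Hebrew accent characters (U+0591 through U+05AE), and it deliberately leaves out the XML parsing, since the layout of the Tanach.us files isn't described here. The function name `count_trop` is hypothetical, and the Unicode character names (e.g. "TELISHA QETANA") differ from the transliterations used in this post.

```python
import unicodedata
from collections import Counter

# Range of Hebrew accent (cantillation) characters within the Unicode Hebrew block.
FIRST_ACCENT, LAST_ACCENT = 0x0591, 0x05AE

def count_trop(pasuk_text):
    """Count each cantillation mark in one pasuk, keyed by Unicode name."""
    counts = Counter()
    for ch in pasuk_text:
        if FIRST_ACCENT <= ord(ch) <= LAST_ACCENT:
            counts[unicodedata.name(ch)] += 1
    return counts

# Toy string (not a real pasuk): three letters, one U+05A0 and two U+05A9 marks.
sample = "\u05D0\u05A0\u05D1\u05A9\u05D2\u05A9"
for name, n in count_trop(sample).items():
    print(name, n)
```

Running this over every pasuk and stacking the resulting counters gives a pasuk-by-trop table of the kind described above.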

Aggregating by sefer to consider my original question about the relative frequencies of telisha gedolahs and telisha ketanas, we see that my intuition was somewhat correct: while ketanas outnumber gedolahs throughout, there are more ketanas overall in D’varim.

Telisha Ketana and Telisha Gedola by Sefer

However, the ratio of telisha gedola to telisha ketana is actually not substantially different in D’varim and Bereshit. So while overall counts are higher, the relative frequencies are not so different.

Ratio of Telisha Ketana to Telisha Gedola by sefer

Aggregating by sefer is interesting, but I wanted to see more continuous variations. Looking at a series of what for most trop would be zeros and ones, with an occasional two or three, isn’t that useful, but Zach (a Ph.D. student in Statistics) suggested a moving average, and that worked quite nicely. We used a 500-pasuk-wide window, which struck a balance between detail and low-pass filtering. (I come from a signal processing background, not time-series analysis.)
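A moving average over per-pasuk counts can be sketched as follows. This is an illustration rather than the code we actually ran: it assumes a plain list of counts in pasuk order and only emits values for full windows, and whether the real window was centered or trailing isn't specified here.

```python
def moving_average(counts, window=500):
    """Mean of each full `window`-long run of per-pasuk counts."""
    if window <= 0 or window > len(counts):
        return []
    total = sum(counts[:window])
    averages = [total / window]
    for i in range(window, len(counts)):
        total += counts[i] - counts[i - window]  # slide the window by one pasuk
        averages.append(total / window)
    return averages

# Toy series with a small window so the result is easy to check by hand.
print(moving_average([0, 1, 0, 1, 2], window=2))
```

A wider window smooths more (stronger low-pass filtering) at the cost of blurring short-lived changes, which is the trade-off behind settling on 500.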

As with the initial bar graph, you can really see the number of telisha ketanas explode in D’varim. But more interestingly, we can get a sense of how they track each other through the Torah.

Telisha Ketana and Telisha Gedola through the Torah

Seeing how different trop track each other is fun. There are some things that you’d expect. For example, munakh is often associated with katan, revi’i, and mapakh-pashta, and we see that clearly here.

Common associated trop through the Torah

Particularly striking is the tight correlation between zarka and segol.

Zarka and segol through the Torah

Other combinations, though, like darga-tevir, are more loosely correlated.

Darga and tevir through the Torah

(For more correlations, here are the pasuk by pasuk and moving window correlation tables.)
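To make "tightly" versus "loosely" correlated concrete, here is a small Python sketch of a correlation coefficient between two trop series. It assumes Pearson correlation, which may or may not be what the linked tables use, and the function name `pearson` is just for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Toy series: the second is exactly double the first, so they track perfectly.
print(pearson([1, 2, 3], [2, 4, 6]))
```

The same function works on raw per-pasuk counts or on the moving-window series; the smoothed series will generally show stronger correlations because the pasuk-level noise has been averaged out.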

While these patterns are intuitive, the fact that trop — especially common ones like merkha and tipkha — aren’t uniformly distributed across the Torah was, to me, somewhat less expected. A big reason for this is changes in sentence structure. This becomes extremely obvious when looking at etnakhta, which essentially functions as a comma.

Etnakhta through the Torah

The reason for the rather dramatic plunge toward the beginning of B’midbar seems to be a shift in sentence structure. Checking the text, this part of the Torah contains quite a bit of genealogy, which contains many single-phrase sentences (“So-and-so begat so-and-so”2), and many occurrences of the common pasuk “וידבר ה` אל־משה לאמר”.

I did a bit of digging into this, and oddly, it looks like the drop in words per pasuk actually lags the drop in etnakhtas. I’m not sure why.

Etnakhta and wordcount through the Torah

I could imagine running a logistic regression to see whether words per pasuk predicts the presence of an etnakhta, but I’m going to cut myself off now.
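For anyone who wants to pick up where I left off, such a regression might look roughly like the sketch below. It is a bare-bones, single-feature logistic regression fit by gradient descent, written from scratch so it has no dependencies. The data is synthetic and made up purely to exercise the code, and the names `fit_logistic`, `lr`, and `steps` are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Fit P(y=1 | x) = sigmoid(w * (x - mean) / std + b) by gradient descent."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n) or 1.0
    zs = [(x - mean) / std for x in xs]  # standardize so one step size works
    w = b = 0.0
    for _ in range(steps):
        errors = [sigmoid(w * z + b) - y for z, y in zip(zs, ys)]
        w -= lr * sum(e * z for e, z in zip(errors, zs)) / n
        b -= lr * sum(errors) / n
    return lambda x: sigmoid(w * (x - mean) / std + b)

# Synthetic toy data, NOT from the Torah dataset: (words in pasuk, has etnakhta).
words = [2, 3, 4, 5, 6, 8, 10, 12, 15, 20]
has_etnakhta = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
predict = fit_logistic(words, has_etnakhta)
print(round(predict(3), 2), round(predict(15), 2))
```

On the real data, the interesting questions would be how sharp the transition is and where the predicted probability crosses one half.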

If you’re interested in playing around with this yourself, everything is on GitHub. If you just want to cut to the chase, here’s a CSV file of the raw data. And here’s an IPython Notebook.

  1. Benjamin, please forgive my transliterations.

  2. No wife required, incredibly.