Wednesday, December 16, 2015

On the Hunt for the Lost Bits

In an age in which information plays such an important role, we should know exactly what happens to the information in the technical systems we use every day. How much of it is lost on its way from the source to the receiver? And can we build our systems such that this information loss is minimized?

Your bank's smartphone app is very handy when, walking along the Herrengasse in Graz, a nice sweater in a shop window catches your eye: a quick glance at your account balance tells you whether you can afford this certainly unnecessary purchase. To make things simple, the app displays negative balances as a red number and positive ones as a black number. Due to a strange (and admittedly unrealistic) display defect, however, all you see is a blue number. The display has destroyed an important part of the account information.

But what actually is information? Essentially, information is our knowledge or ignorance about something random, depending on whether we possess the information or not. Information can be measured, for example, by the average number of yes/no questions that have to be asked to remove our ignorance: taking a random coin flip as an example, a single question suffices to learn the outcome: heads or tails? For two coin flips we need two questions to learn both outcomes, for three flips three questions, and so on. In his groundbreaking paper "A Mathematical Theory of Communication", Claude E. Shannon presented a mathematical formula to compute the information of a coin flip (its entropy) and defined the bit as its unit of measurement.

The information of a coin flip is exactly one bit: you need exactly one yes/no question to determine the outcome of the flip. Things look different for a loaded coin that shows heads in nine out of ten cases. Our (prior) knowledge is larger than in the case of a fair coin, and "on average" we need fewer questions to determine the outcome of the flip. Concretely, this can be illustrated as follows: if we flip a fair coin twice, we always have to ask two questions to determine the outcomes of both flips. If we flip the loaded coin twice, in many cases a single, cleverly chosen question suffices: "Did both flips show heads?" If this question is answered with yes (and it will be in 81% of all cases), no second question needs to be asked. Using Shannon's formula one can show that the information of this loaded coin flip is roughly 0.47 bits: "on average", slightly less than half a question suffices to remove our ignorance about one coin flip, slightly less than one question for two coin flips, and slightly less than one and a half questions for three flips.
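To make the numbers concrete, this is Shannon's formula applied to the loaded coin:

$$ H = -0.9\log_2 0.9 - 0.1\log_2 0.1 \approx 0.469\ \text{bits}. $$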

But how many bits of account information were destroyed by your display defect? This is the question I dealt with in my PhD thesis at TU Graz, in which I investigated information loss in technical systems. Since a single question suffices to determine whether the balance is positive or negative, the information loss of your display can be at most one bit. That the exact value essentially depends on the randomness of your account balance is less obvious, but easy to understand with the loaded coin in mind. If you are a very frugal person who always keeps a small buffer in the account, you are not missing much information: you can be quite sure that the number should actually be black and that you can afford the sweater. If you are a very wasteful person whose balance is positive for at most a few days per month, you are also not missing much information: the number is red with high probability. (In that case you would probably buy the sweater anyway.) If, however, you are somewhere between these two extremes, your display may have destroyed a considerable part of the information, in the worst case one bit.
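A small back-of-the-envelope calculation (my own addition, assuming for simplicity that the sign is independent of everything the display still shows): if $q$ denotes the probability that your balance is positive, the lost information is simply the entropy of the sign,

$$ H(q) = -q\log_2 q - (1-q)\log_2(1-q) \le 1\ \text{bit}, $$

which is close to zero for $q$ near 0 or 1 (the frugal and the wasteful person) and reaches its maximum of one bit at $q=1/2$.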

One bit is not much, you say? That depends entirely on the bit! If you know that your account balance lies between -5000 and 5000 €, then according to Shannon's formula you have to ask roughly 20 questions to learn the exact amount down to the cent. Your display answers 19 of these 20 questions for you, just unfortunately not the most important one: is the balance positive or negative? So, as interesting as it is to study the information loss of a system, this theory only acquires practical significance once the notion of relevance is included: which part of the lost information is relevant to us, and which part is irrelevant or disturbing? In my thesis I extended the theory in this direction and used the results to do what an engineer is meant to do: build systems!
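Where does the number 20 come from? Between -5000 and 5000 € there are about one million different cent amounts, and if we treat them as (roughly) equally likely, Shannon's formula gives

$$ \log_2 1\,000\,000 \approx 19.93 \approx 20\ \text{questions}. $$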

Systems are built according to certain requirements in order to fulfill certain tasks. An electronic filter, for example, can be designed to suppress disturbing noise during a phone call while affecting the relevant speech signal of the other party as little as possible. For historical reasons (and simply because it is easiest) filters are designed according to energy criteria: the energy of the disturbing noise should be as small as possible after filtering. At the same time, the energy of the distortions the filter introduces into the speech signal must not become too large, in order to guarantee pleasant communication between the two parties.

With my work I tried to take a different path: in an age in which many of our systems process or transmit information, information should be used as a criterion for system design. The information theory developed by Shannon states that a system can only reduce information, never increase it. The information we hear from the loudspeaker of our smartphone was previously contained in the electromagnetic wave in the air, in the digital signal in our conversation partner's smartphone, and in the sound waves between their mouth and the microphone. More than that: in each of these processing systems (microphone, digital circuits, loudspeaker) information was lost. What design criterion could therefore be better suited than the information loss? Applied to the example above, the task is thus to design a filter that destroys as little speech information as possible while reducing the "information" of the disturbing noise as far as possible.

That a filter designed with respect to information loss is better suited for transmitting information than one designed according to energy criteria should not surprise you anymore: a different method simply yields a different result. Nevertheless, during my PhD I kept coming across passages in the literature in which energy was equated with information. In statistics, for example, principal component analysis has been used for decades to reduce the complexity of large data sets. The multidimensional data sets are transformed and data with low energy are discarded. This procedure is justified by the claim that the data with the largest energy also contain the most relevant information. This claim is not always correct (and whoever shouts loudest is not always right either): for principal component analysis, for example, I could show that it minimizes the information loss only if the relevant information is related to the data sets in a particular way, a condition that does not always hold in statistics. It is high time to rethink.

Besides filter design, principal component analysis, and the analysis of your defective display, there is of course a multitude of further applications of a theory of information loss: for example, together with other researchers I developed a method to simplify Markov chains without destroying information. Markov chains (sequences of random numbers that are statistically related to each other) are important mathematical models and are used in speech processing, as models of chemical reactions, in genetics, in bioinformatics, and in queueing theory.

Information is everywhere. To make it usable, our technical systems (computers, smartphones, etc.) should destroy as little of it as possible. And even if we do not manage to build these systems accordingly, we should at least know how much information is lost in poorly designed systems. The importance of the information theory founded by Shannon, which I was allowed to complement a tiny bit with my results on information loss, should not be underestimated. Today, more than ever, Norbert Wiener's claim holds: "Information is information, not matter or energy. No materialism which does not admit this can survive at the present day."

Friday, October 2, 2015

Writing for Wikipedia: A Teaching Experiment

Writing reports and presenting results is an important part of an engineer's life; hence, teaching these skills is an important (implicit or explicit) part of engineering curricula. At my former affiliation, the SPSC at Graz University of Technology, we taught the course "Verfassen wissenschaftlicher Arbeiten" to bachelor's students in their fifth semester, preparing them for the challenging task of writing and presenting their bachelor's thesis. There, the approach was (roughly) as follows:
  • Students grouped in pairs or triples and chose a topic related to the scientific process (e.g., writing good introductions, writing abstracts, giving a scientific presentation, plagiarism and literature searches, etc.).
  • They had to give a presentation on this topic, teaching their colleagues the respective skills (flipped classroom).
  • They chose a simple topic from signal processing (e.g., filter design) and wrote a four-page scientific LaTeX article presenting the topic as if it were their own invention.
Let me stress that again: Students should not write a review or a summary, but a scientific paper with a "novel" contribution. Why? Because that's what they have to do when they stay in academia.

Here at the LNT of Technische Universitaet Muenchen, the offered Graduate Seminar Mobile Communications and Coding is a very similar course (albeit for master's students): the expected outcomes are again presentation and writing skills, together with acquiring knowledge in a particular field within communications and coding. In recent years, the approach was as follows:
  • Each student chose a scientific topic and had to write a four-page scientific LaTeX article summarizing the core aspects (in the structure of a scientific paper).
  • Students had to present these core aspects in a 20-minute presentation.
The difference is apparent: students at LNT had to write and present summaries rather than papers claiming an original contribution. Why? Probably because that's what they have to do when they do NOT stay in academia.

But both approaches have one thing in common: guess what happens to these four-page articles the students write. Nothing. I strongly doubt that any of our students ever took a look at their paper after the end of the course. That does not mean that these articles are useless. They are not only formal requirements for achieving the degree, but also valuable stepping stones for acquiring important skills: writing scientifically, learning about communications or signal processing, etc. In other words, the student must, so to speak, throw away the ladder after he has climbed up on it.

In an effort to reduce the number of ladders thrown away, my colleague last semester required the students to copy their articles into a wiki accessible only to registered members. This was an amazing idea, one that made the seminar much more sustainable than it was before, and browsing through last year's student wiki articles gave me a good idea of what the students are capable of. Nevertheless, while the ladders are not thrown away anymore, they are still neatly locked up in a room in the basement.

This winter term, in which I am co-responsible for the course, I'd like to go one step further: of all the ladders the students climb this term, my colleagues and I will select the most useful ones and try to make them fit for others to climb: the best student articles should end up as articles on Wikipedia. The idea is not new; there have been several studies investigating the success of this method (for example, this one; for the supplementary material you need a subscription).

I'm not sure if we will succeed in this. Honestly, I would not want to write a Wikipedia article. But of the eight topics we are going to provide, at least four will be appropriate for an article, and at least two others could extend existing articles. Just to give you an example: as of September 23rd, 2015, there is no Wikipedia article on information dimension. If such an article appears in Wikipedia by the end of January 2016, then the teaching experiment will have been successful. I'll keep you posted!

Tuesday, September 8, 2015

My Story with Polar Codes

It was the end of August 2010, and the weather was unusually good in Dublin. During the breaks at the IEEE Information Theory Workshop we could easily go outside, sit in the grass, and discuss the previous talks. Well, some people discussed, some just listened. Since this was my first conference in information theory (I did not even have a paper there; my boss allowed me to go without actively participating -- an opportunity I am extremely grateful for), I clearly belonged to the second group. But then, already on the second day, there were some talks I felt comfortable with: Erdal Arikan gave a lecture on Polar Codes, a new coding technique he had recently invented. What followed were several talks related to polar coding, including talks by Emmanuel Abbe (who was my main source of inspiration for my investigation of the fractality of polar codes), Tanaka (who looked at the polar coding procedure from a stochastic dynamical systems point of view), Ido Tal and Alexander Vardy (whose code construction technique was extremely useful in proving fractality), and Eran Hof. While I did not understand a lot from his talk, I owe Eran big time: being a total noob, I did not know anybody at this conference, and he was kind enough to introduce the "big players" to me. That was absolutely necessary, because during one of the breaks I approached Sergio Verdu and asked him about the polar code construction technique he had just presented. Well, apparently I confused him with Alexander Vardy (I have no idea why). Anyway, although this was one of the best conferences I've ever been to (an excursion to Malahide Castle, whiskey tasting at Jameson's, a Riverdance show during the conference dinner, a pleasant stay at a cute little hotel, great weather, nice people, and all this together with my back-then future wife), this was just the beginning of my interest in polar coding.

It was in November 2011 that I presented my first paper at an information theory-related conference, the ISWCS in Aachen (during one of the short coffee breaks I met Gerhard Kramer, my current boss, for the first time; I also met Georg Böcherer, who was just finishing his PhD and won the best paper award). As usual for these conferences, the first day featured tutorials in the afternoon. One of these tutorials was on polar codes, held by Emmanuel Abbe. I still have the printed handouts; they are among the few paper documents I moved from Graz to Munich. It was an unforgettable experience, partly because of the awesome presentation and Emmanuel's great didactic skills, partly because of his lively discussion with the audience (particularly with Tor Aulin). And these great impressions led me to think about possible research directions on the train ride home to Graz. In a document called "Resarch Ideas.txt" I wrote:
If a channel (one of the partitioned channels) is already very good (close to perfect), is it possible to assume all its descendant channels to be good? [...] Can we use this to find an approximation of a "good" polar code?
And then nothing happened for quite some time, as I continued my PhD research on information loss in deterministic systems. All this time, these polar codes stayed in the back of my mind and came to the front whenever I attended a conference (there were ALWAYS talks on polar codes). But time went by fast: I finished my PhD and -- totally unexpectedly -- started working at the Institute for Communications Engineering in Munich, becoming a member of Gerhard's lab and Georg's colleague. A few months in, with not much research done (I was busy writing project proposals), the whole lab took a skiing trip to the Alps in South Tyrol, disguised under the title "Joint Conference on Communications and Coding". Well, in fact we did more research than skiing, something I am quite grateful for (there are some very embarrassing photos of me trying cross-country skiing...). Instead of presenting work already done, I presented open problems, among them some ideas about polar codes. The preparation for the talk inspired me to write a blog article, and soon after that all the questions in this article were answered. Emmanuel suggested via email extending the analysis to Reed-Muller codes, which was completed almost equally fast. It was one of the very few lucky streaks I have had in my research career so far, and some open questions still pester me.

Although I am not sure whether any of the obtained results are useful, I'm extremely happy to have these things out of my mind (and in a paper on arXiv). I also got to present these results at the NEWCOM# Emerging Topics Workshop in Cambridge. One of the participants of this workshop was Erdal Arikan, the father of polar codes, and the very same person we sat next to during the whiskey tasting at Jameson's in Dublin five years earlier.

Wednesday, July 8, 2015

Candy Crush Codes

Candy Crush Saga is a game in which... no, I don't think I have to tell you! But I bet you didn't know that the game is actually NP-hard! The original proof can be found on arXiv (or, again, on arXiv; these authors claim to prove a slightly more general result), but there are also plenty of pop-sci articles around. (Now, that is quite interesting: Walsh posted his paper on March 8th, 2014, and the press wrote articles about it only five days later.) Anyway, that's not really what I wanted to tell you.

A typical game field of Candy Crush Saga (copyright with King Inc., taken from Wikipedia)
If you look at the picture above, you will see no three candies in a row or in a column having the same flavor. Why? Because they would disappear immediately (that's more or less the only game rule). But this game rule more or less immediately leads to the following question:

Given an $n\times m$ field filled with candies of $k$ different flavors that are chosen randomly, independently, and uniformly, what is the probability that there is at least one horizontal or vertical triple of the same flavor?

In other words, what is the probability that, when starting a new level, something happens without you actually touching the screen? The question is obviously combinatorial and interesting in itself. But there is even more to it.

If we look at $n\times m$ fields filled with candies of $k$ different flavors, there are in total $k^{nm}$ possible arrangements. The rules of Candy Crush Saga, however, allow only a subset of these, namely $k^{nm}$ minus the number of arrangements containing at least one such triple. For example, if we take just one row with three columns and two flavors, there are eight possible arrangements, of which two are excluded. Taking two rows allows 36 out of 64 arrangements, and three rows permit 102 out of 512. (At least that's what my first calculations show.) In other words, the set of allowed fields carries redundancy: if we want to transmit messages by sending pictures of Candy Crush fields, a $3\times 3$ field allows you to transmit only up to 102 different messages, although there are 512 different possible arrangements. If now, during the transmission, an error occurs -- the flavor of a candy changes -- you might end up with an invalid field! Candy Crush Saga can be used as a nonlinear block code! Super-exciting, right?
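If you want to check these small numbers yourself, a brute-force count is enough; the following little Python sketch (my own, not part of any serious implementation) enumerates all $k^{nm}$ arrangements and keeps those without a horizontal or vertical triple:

    from itertools import product

    def is_valid(field, n, m):
        """Return True if the flat n*m field contains no horizontal or vertical triple."""
        for i in range(n):
            for j in range(m):
                if j + 2 < m and field[i*m + j] == field[i*m + j + 1] == field[i*m + j + 2]:
                    return False
                if i + 2 < n and field[i*m + j] == field[(i+1)*m + j] == field[(i+2)*m + j]:
                    return False
        return True

    def count_allowed(n, m, k):
        """Brute-force count of the k**(n*m) arrangements that satisfy the game rule."""
        return sum(is_valid(f, n, m) for f in product(range(k), repeat=n*m))

    print(count_allowed(1, 3, 2))  # 6 of 8
    print(count_allowed(2, 3, 2))  # 36 of 64
    print(count_allowed(3, 3, 2))  # 102 of 512

This only works for very small fields, of course; the number of arrangements grows as $k^{nm}$.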

That, in turn, allows us to ask several questions:

  • How should we encode a message efficiently? Is there anything better than just keeping a table of, say, 102 fields? (There is.)
  • How should we decode the message efficiently? That's a more difficult question if you don't want to store a table of all allowed arrangements. Maybe belief propagation can help here, as it does for Sudoku codes.
  • What are the basic properties of an $(n,m,k)$ Candy Crush Code? What is the maximum possible rate given the game rule, i.e., what is the ratio between the number of allowed and possible arrangements? What is the minimum Hamming distance, etc.? How can we improve our encoder in order to guarantee a positive Hamming distance?
  • Given that we change the flavor of every candy with a given probability $p \ll 1$ (and choose the new flavor uniformly among the remaining ones), what is the error probability? I.e., what is the probability that the changed arrangement is not detected as erroneous, because it also satisfies the game rules? (A small simulation sketch follows this list.)
  • What about the error propagation of this code? I.e., if a given candy changes its flavor, how many "information bits" are affected? Can we still reconstruct part of the information, or is the whole block destroyed? (I guess the answer is no.)
  • Connecting to the last question: what happens if we let one dimension of the field go to infinity? We would need sequential encoding and decoding techniques, as are usual for convolutional codes.
  • What happens to all of these topics if we span the field on a cylinder (such that the rules extend over one side of the field: two candies of a given flavor on one side rule out a single candy of the same flavor on the other side) or on a torus?
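Regarding the error probability question above: since I do not have a closed-form answer, here is a small Monte Carlo sketch (again my own; the exact definition of "undetected" -- the field changed but still satisfies the game rule -- is my choice):

    import random

    def is_valid(field, n, m):
        """True if the flat n*m field contains no horizontal or vertical triple."""
        return not any(
            (j + 2 < m and field[i*m+j] == field[i*m+j+1] == field[i*m+j+2]) or
            (i + 2 < n and field[i*m+j] == field[(i+1)*m+j] == field[(i+2)*m+j])
            for i in range(n) for j in range(m))

    def undetected_error_rate(n, m, k, p, trials=100_000):
        """Estimate the probability that a valid field, with every candy flipped to a
        different flavor with probability p, is changed but still passes the game rule."""
        undetected = done = 0
        while done < trials:
            field = [random.randrange(k) for _ in range(n * m)]
            if not is_valid(field, n, m):
                continue                      # sample uniformly among valid fields
            done += 1
            noisy = [c if random.random() > p else
                     random.choice([f for f in range(k) if f != c]) for c in field]
            if noisy != field and is_valid(noisy, n, m):
                undetected += 1
        return undetected / trials

    print(undetected_error_rate(3, 3, 2, p=0.05))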
Most probably, the code thus constructed is crappy. But it is nevertheless a nice topic for a research internship, I guess. And it makes learning about nonlinear codes for symmetric channels as much fun as Sudoku codes do for erasure channels!

Wednesday, June 17, 2015

Polar Codes ARE Fractal!

This is a post about very recent work that has not yet been reviewed, but is publicly available on arXiv. This is not meant as science by press conference, but rather as an invitation for discussions and comments from other researchers in the field.

At least, that is what I found out -- I'm pretty confident of the correctness of the results, but, you know, mistakes happen all the time. If you feel that some of my derivations may have flaws, I'd be happy if you could contact me!

Alright, regarding the questions in one of my previous blog posts, what have I found out? Here's the short version; for more details, please take a look at the paper on arXiv:

  • The set of good channels $\mathcal{G}$ is Lebesgue measurable, and its Lebesgue measure is $I(W)$, the capacity of the (unpolarized) channel. This is intuitive in the sense that a fraction $I(W)$ of the channels will be polarized to perfect channels. From this it immediately follows that the Hausdorff dimension of this set is one.
  • The dyadic rationals are good and bad. This is because dyadic rationals have two possible binary expansions, one terminating (with infinitely many zeros) and one repeating (with infinitely many ones). While the former polarizes the channel to a useless one, the latter polarizes it to a perfect one.
  • Even if we do not take into account the dyadic rationals, the good channels still form a dense subset of the unit interval. The reason is that between any two dyadic rationals we can find a rational number (with repeating binary expansion) which leads to a good channel. For the binary erasure channel we can also find a rational number leading to a bad channel, so we can say a little more about BECs.
  • And finally, the set of good channels is indeed self-similar, in the sense that it is a subset of its (scaled and shifted) right half (and left half for BECs). This can be most easily explained by a picture:

This picture (generated by the code linked in this post; the polynomial composition is from this forum) shows the thresholds on the Bhattacharyya parameter for some values inside the unit interval: if the Bhattacharyya parameter is smaller than the threshold, then the channel will polarize to a perfect channel. It can be seen that the right half (top) is an upper bound on the original set (center), which is (for BECs) an upper bound on its left half (bottom).
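For readers who want to play with this themselves: the following is a rough sketch of mine (not the code referenced above, and restricted to the binary erasure channel, where the Bhattacharyya parameter equals the erasure probability and polarizes exactly as $Z\mapsto 2Z-Z^2$ for the worse and $Z\mapsto Z^2$ for the better channel). It estimates the threshold for a repeating bit pattern by bisection:

    def polarize(z, bits, n_periods=200):
        """Iterate the BEC Bhattacharyya recursion along a repeating bit pattern:
        bit 1 -> better channel (Z -> Z^2), bit 0 -> worse channel (Z -> 2Z - Z^2)."""
        for _ in range(n_periods):
            for b in bits:
                z = z * z if b == 1 else 2 * z - z * z
        return z

    def threshold(bits, tol=1e-12):
        """Bisection for the largest initial Z that still polarizes to a perfect channel (Z -> 0)."""
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if polarize(mid, bits) < 0.5:
                lo = mid          # still converges to 0, so the threshold is larger
            else:
                hi = mid
        return (lo + hi) / 2

    # b = 1/3 has binary expansion 0.010101..., i.e., the repeating pattern (0, 1).
    # Expected output: (3 - sqrt(5))/2 = 0.3819..., which corresponds to a capacity
    # threshold of 1 - 0.3819... = (sqrt(5) - 1)/2 for the BEC.
    print(threshold((0, 1)))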

The most useful tools for deriving these results were facts about the binary expansion of real numbers, facts about fixed points of dynamical systems, and the Tal-Vardy polar code construction technique.

Thursday, April 23, 2015

Convolution with Deltas - A Word on Notation

Mathematical notation is important. This is even more true when we consider the convolution of two functions or of two sequences. In particular, if $f$ and $g$ are two functions, we write the convolution as

$$ h(x) = (f * g)(x) $$

whereas if $\{a_n\}$ and $\{b_n\}$ are two sequences, the convolution can be written as

$$ \{c_n\} = \{a_n\}*\{b_n\}. $$

What should not be done under any circumstances is to write the convolution of two functions as

$$ h(t) = f(t) * g(t). $$

Why not? Well, because the convolution operator $*$ does not operate on values but on functions; $f(t)$ and $g(t)$, however, are values. This can easily be seen by inserting $t=3$ into the above equation. We get

$$ h(3) = f(3) * g(3) $$

which obviously does not make any sense. The equation

$$ h(3) = (f*g)(3) $$

is the only appropriate way of writing this. In stark contrast, take a different binary operator, like addition. For addition one can easily argue that the value of the sum of two functions at a given point equals the sum of the values of the individual functions: addition works on values, i.e.,

$$ h(t) = (f+g)(t) = f(t)+g(t). $$

Still, many excellent textbooks on signal processing (even the ones by Oppenheim and Schafer and by Vetterli) make use of this "abuse of notation" for convolution. The reason is that as soon as a Dirac (or Kronecker) delta enters the game, the famous convolution property pops up somewhere. In incorrect notation, this property can be stated as follows:

$$ h(t)=f(t) * \delta(t-T) = f(t-T) $$

Finding a correct notation is not easy. I found a few helpful comments on stackexchange. I will add my own idea at the bottom; please judge for yourself which one you like most. I have to admit that none of these is really satisfactory...

  1. One commenter suggested introducing the shift operator $T_T$, i.e., $T_T$ is the function mapping $t$ to $t-T$. Then, $\delta(t-T)=\delta(T_T(t))=(\delta\circ T_T)(t)$ and we get $h(t)=(f*(\delta\circ T_T))(t)$.
  2. Another commenter proposed to write $h(t) = (f*\delta(\cdot-T))(t)$.
  3. My idea is to define a family of functions $\delta_T(t):=\delta(t-T)$ and then write the convolution as $h(t)=(f*\delta_T)(t)$.
While my idea might give the shortest notation, it is not as general as the one with the shift operator. I would have problems, e.g., writing the convolution of a function with its time reversal, i.e., (in bad notation) $h(t)=f(t)*f(-t)$. If we introduce a time-reversal operator $R$ with $R(t)=-t$, we can easily write this convolution as $h(t)=(f*(f\circ R))(t)$.
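Just to convince yourself numerically that the convolution property does what it promises, here is a tiny discrete-time check (my own sketch, using a Kronecker delta):

    import numpy as np

    # Convolving a sequence with a Kronecker delta shifted by T
    # reproduces the sequence, delayed by T samples.
    f = np.array([1.0, 2.0, 3.0, 4.0])   # some finite-length sequence
    T = 2
    delta_T = np.zeros(T + 1)
    delta_T[T] = 1.0                      # Kronecker delta located at n = T

    h = np.convolve(f, delta_T)           # (f * delta_T)[n]
    print(h)                              # [0. 0. 1. 2. 3. 4.]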

Wednesday, April 1, 2015

Properties of a Special Iterated Function System (open)

This post is loosely coupled with my recent post on the fractality of polar codes; although the problem posed there is already solved (I might write about that later), there are still a few open questions which might be of independent interest.

Assume you are given two functions $g_0{:}\ [0,1]\to[0,1]$ and $g_1{:}\ [0,1]\to[0,1]$, which are defined as
$$ g_0(x)=2x-x^2 $$
and
$$ g_1(x)=x^2. $$

Now assume that we have a sequence $a^k:=(a_1,a_2,\dots,a_k)\in\{0,1\}^k$ which determines the polynomial
$$ p_{a^k}(x) := g_{a_k}(g_{a_{k-1}}(\dots g_{a_1}(x)\dots )). $$

We are now interested in the properties of this polynomial. Since the case where all $a_i$ are equal (i.e., all are either zero or one) is trivial to analyze, we will assume that $a^k$ contains both zeros and ones. W.l.o.g. (see below) we may assume that $a_1=1$ and $a_2=0$, hence $p_{a^2}(x)=2x^2-x^4$. We are looking for the number of fixed points of $p_{a^k}$ for fixed $k$, i.e., points $x^*\in[0,1]$ which satisfy $x^*=p_{a^k}(x^*)$. Can't be that complicated, can it?

What do we know about $p_{a^k}$?

Only a few things, unfortunately:
  • Since $g_0$ and $g_1$ are from $[0,1]$ to $[0,1]$, so is $p_{a^k}$.
  • Since $g_0$ and $g_1$ are strictly monotonically increasing, so is $p_{a^k}$.
  • Since $g_0$ and $g_1$ have fixed points at 0 and 1, so has $p_{a^k}$.
  • For $a_1=1$ and $a_2=0$, the derivative of $p_{a^k}$ vanishes at 0 and 1; hence, these two fixed points are attractive.
  • $p_{a^k}$ is a polynomial in $x$ of degree $2^k$; hence, by the fundamental theorem of algebra, it has exactly $2^k$ roots (counted with multiplicity); hence, so does $p_{a^k}(x)-x$, which means that there are exactly $2^k$ fixed points.


What do we expect?

Well, this is easy: I think (and extensive numerical analyses confirmed that) $p_{a^k}$ has exactly one (repelling) fixed point on $(0,1)$. This can be understood intuitively as follows: The fixed point on $(0,1)$ is the intersection of $p_{a^k}$ with the identity function ($f(x)=x$), both of which pass through 0 and 1. Since the slope of $p_{a^k}$ is zero at 0 and 1, it must lie above the identity function for values close to 1 and below it for values close to 0. Hence, there should be one point in between where $p_{a^k}$ crosses the identity function. The properties of $p_{a^k}$ ensure that there can be only an odd number of fixed points on $(0,1)$.

What about the rest of the $2^k$ fixed points? They either have to lie outside $[0,1]$ or have to be complex (and conjugated).
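Since I mentioned numerical evidence, here is a small sketch (mine, purely for illustration) that builds $p_{a^k}$ by composition and counts the sign changes of $p_{a^k}(x)-x$ on a fine grid over $(0,1)$:

    import numpy as np

    g = {0: lambda x: 2*x - x**2,    # g_0
         1: lambda x: x**2}          # g_1

    def p(x, a):
        """Evaluate p_{a^k}(x): apply g_{a_1} first and g_{a_k} last."""
        for bit in a:
            x = g[bit](x)
        return x

    def interior_fixed_points(a, grid=1_000_001):
        """Count sign changes of p_{a^k}(x) - x on a fine grid over (0,1)."""
        x = np.linspace(0.0, 1.0, grid)[1:-1]
        d = p(x, a) - x
        return int(np.sum(np.sign(d[:-1]) != np.sign(d[1:])))

    for a in [(1, 0), (1, 0, 0, 1), (1, 0, 1, 1, 0, 0, 1, 0)]:
        print(a, interior_fixed_points(a))   # the conjecture says: always 1

Of course, counting sign changes on a grid is no proof; it can in principle miss pairs of very close roots.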

What have I already tried?

Not much, I'm afraid, but I spent a few days on this problem. I tried essentially two different approaches, but neither led (quickly) to the desired result:

The first approach relied upon mathematical induction: for $p_{a^2}$ it is easy to show that the derivative is a quasi-concave function, which admits the application of certain inequalities. If one could then show that $g_0(p_{a^k})$ and $g_1(p_{a^k})$ also have quasi-concave derivatives, the problem would be solved. But unfortunately, this cannot be shown (there are counter-examples). The approach could be refined by using the fact that the derivative of $p_{a^2}$ is convex on intervals $[0,a]$ and $[b,1]$, but concave on the interval $[a,b]$ (for $a$ and $b$ chosen appropriately). That this property is preserved under applying $g_0$ or $g_1$ is not as easy to show. And even if we succeeded in that, it is not clear whether this delivers the desired result.

This brings us to the second, more direct approach: it relies on counting the number of real roots of $p_{a^k}(x)-x$ in the interval $(0,1)$. While Descartes' rule of signs is of not much use here, and while Sturm's method is too difficult to apply in the general case, Fourier's theorem looks very promising, given that the derivatives need to be evaluated only at the fixed points 0 and 1. Since Fourier's theorem involves higher-order derivatives of a composition of functions, the most straightforward way would be to apply Faa di Bruno's formula to either $g_i(p_{a^k}(x))$ or to $p_{a^k}(g_i(x))$. "Straightforward" in this case does not mean "easy", and I had to quit my efforts soon (mental note: I should offer a student project...).

A third, related approach is to show that the second derivative of $p_{a^k}$ has exactly one root in $(0,1)$, showing that $p_{a^k}$ has only one inflection point in this interval. This would do the job, but again requires tedious calculations with Faa di Bruno's formula.

What does this have to do with Polar Codes?

I'm glad you asked. You might remember the example from my previous post:

Example: Consider the binary erasure channel and let $b=2/3$. Then, the binary expansion of $b$ is 101010101.... A binary erasure channel with capacity $C>(\sqrt{5}-1)/2$ will converge to a perfect channel, one with a capacity strictly lower than this number will converge to a useless channel, and one with a capacity exactly equal to this number will stay at this number.

Essentially, all rational numbers have binary expansions that are recurring, i.e., that repeat after a given period. We do not consider dyadic rationals here, although they can be called recurring, too. In general, i.e., if the rational is not dyadic, the period will contain both zeros and ones. And since the period can be started at any position (a cyclic shift), we can choose it such that $a_1=1$ and $a_2=0$.

Polar coding also suggests that what we expect is in fact true: Lemma 11 in this paper states that, for almost every binary expansion $a^\infty$, there is exactly one fixed point of $p_{a^\infty}$ in $(0,1)$. This "almost every" unfortunately does not tell us anything about binary expansions of rational numbers, since in this context the rational numbers are "almost nothing" (a set of measure zero). Hence, if we were able to show the above property for $p_{a^k}$, we could improve our knowledge about the behavior of polar codes, or of the polarization effect in general.

Let me know if you want to work on this -- it is a nice problem, and I would love to see a definite answer to it.

Tuesday, March 10, 2015

Are Polar Codes Fractal?

Alright, this is very preliminary. I do not yet believe most of the things I say below myself; I just hope that they are true. I also hope that in the near future we will be able to prove that these things are true, or that we will prove them false. Let's see. In any case, I will keep you posted!

Polar codes are a beautiful new scheme to achieve the capacity of a binary-input memoryless channel (BIMC) without requiring iterative decoders, and without relying on random coding arguments (which are beautiful mathematical methods, too). They were first introduced by Erdal Arikan in 2009, so the topic of polar coding is relatively new and still hot. Although the encoding and decoding complexity of these codes is low compared to other methods with similar performance, the construction of polar codes is a difficult (and computationally costly) problem. An approximate code construction method with sub-exponential complexity was proposed by Tal and Vardy. That polar codes might be fractal was mentioned in a blog post by Emmanuel Abbe a long time ago. Kahraman, Viterbo, and Celebi observed that the coding matrix (very similar -- or identical? -- to a Hadamard matrix over GF(2)) used there resembles the Sierpinski triangle, a well-known fractal.

What are Polar Codes?

I'd like to keep this one very short, and instead refer the interested reader to Arikan's paper or to Sasoglu's monograph. But on a very abstract level, we can interpret these codes in the following way: assume you have a BIMC with capacity $0<C<1$. You are interested in a code that uses the channel multiple times, such that you can transmit at a rate sufficiently close to $C$ and with a very small probability of error.

There are two ways you can use the channel twice: either as it is, leading to a total capacity of $C+C=2C$, or with preprocessing of the channel inputs. The preprocessing leads to one channel use with a higher capacity $C^1>C$ and to one channel use with a lower capacity $C^0<C$, which satisfy $C^1+C^0=2C$. Similarly, if you want to use the channel four times, you can either take $C+C+C+C=4C$, or $C^1+C^1+C^0+C^0=4C$, or you can preprocess the two better channels $C^1$ into two channels with $C^{11}>C^1>C^{10}$, and the two worse channels $C^0$ into two channels with $C^{01}>C^0>C^{00}$. Note that here you already have

$$C^{11} > C^1 >C>C^0>C^{00}.$$

In other words, out of multiple uses of a given channel with capacity $C$, you can generate multiple channels with different capacities by preprocessing; some of these channels will have a higher capacity, some of them will have a lower capacity. Instead of using the same channel multiple times, you use preprocessing to get better channels and worse channels. And indeed, if we do infinitely many preprocessing steps (doubling the number of channel uses in each step), we end up either with a perfect channel ($C=1$) or with a completely useless one ($C=0$): the channels have been polarized, hence the name for this type of code.

$$C^{110101000101101001101100101010100101011101101000...} \text{ is either 0 or 1}$$

The interesting fact is that if we do these preprocessing steps randomly, i.e., if we randomly choose between making the channel better (a 1 in the superscript) or worse (a 0 in the superscript), then with probability $C$ we get a perfect channel, and with probability $1-C$ a useless one. The idea of polar coding is to transmit data on the perfect channels (no error occurs), and to set all the other, bad channels to fixed or "frozen" bit values. Since a fraction $C$ of the channels will always be good according to the polar coding theorem, the rate of the code is equal to (or at least close to) the capacity $C$ of the original channel.

Which Channels are Good, Which are Bad?

That's the interesting question! But before we go on to talk about fractality, let me mention that in practice we never go all the way to the perfect or useless channels, as this would require infinite block lengths. In fact, we fix a number $n$ and perform exactly $n$ preprocessing steps, each applied to pairs of channels, ending up with $2^n$ channel uses, i.e., a block length of $2^n$. However, even in this case it is often not easy to decide which of these $2^n$ channels are close to perfect! But I hope to convince you that the true problems appear as we let $n$ go to infinity.

The picture below shows the channels obtained by preprocessing a binary erasure channel with a capacity of $C=0.433$. Preprocessing, or polarization, has been performed seven times, leading to 128 channels. As you can see, some of these channels are already quite close to having a capacity of 1, and some of them have a capacity close to 0. Intuition would suggest that if we continue polarizing, channels on the left will stay bad, while channels on the right will become perfect.

The capacity of 128 channels derived from a single binary erasure channel with capacity 0.433
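For the binary erasure channel, the capacities in this figure are easy to recompute; the following few lines are my own sketch (not the code used for the plot), based on the standard BEC recursion $C^0=C^2$ and $C^1=2C-C^2$:

    def polarize_bec(c, steps):
        """Capacities of the 2**steps synthetic channels obtained from a BEC with
        capacity c, using the recursion C^0 = C^2 (worse) and C^1 = 2C - C^2 (better)."""
        caps = [c]
        for _ in range(steps):
            caps = [v for u in caps for v in (u * u, 2 * u - u * u)]
        return caps

    caps = polarize_bec(0.433, 7)      # 128 synthetic channels, as in the figure
    print(min(caps), max(caps))        # some are already very close to 0 and to 1
    print(sum(caps) / len(caps))       # the average stays 0.433: polarization preserves capacity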

To show that this is not the case, let's focus on the channel marked with a red point:

Polarizing a channel with low capacity (shown in red)...

This channel has a very low capacity, and from now on we will start polarizing this channel only. Starting from a very bad channel, even the "better" channel derived from it will be relatively bad (i.e., have a low capacity). But after sufficiently many polarization steps, this channel with a capacity close to zero will have generated several channels with a high capacity, and in the long run some of these channels will even be perfect ($C=1$): Indeed, if we start polarizing this channel with capacity $C^{--\cdots--}$, after infinitely many steps a relative fraction of $C^{--\cdots--}$ channels will be perfect.

...will eventually give rise to good channels.
Thus, even if after some steps a channel appears to be very bad (good), this channel will in the long run give rise to good (bad) channels.

Let's Talk Business!

I hope by now I have given you some intuition about why I believe that polar codes could be fractal: wherever you start, wherever you look, you will always find good and bad channels. But let's make this all a bit more precise: take a real number $b\in [0,1]$, and let $b_2$ be the fractional part of its binary expansion. In the general case, the binary expansion of $b$ is an infinite sequence of zeros and ones. To be precise, if $b$ is irrational, $b_2$ is non-repeating and infinitely long, and if $b$ is rational, then $b_2$ is either terminating or recurring. Let $C^b$ be the capacity of the channel obtained by preprocessing/polarizing the original channel with capacity $C$ according to the sequence specified by $b_2$. Then, the polar coding theorem says that

$$ \text{for almost all } b\in[0,1]{:}\ C^b\in\{0,1\}.$$

The term "for almost all" is well-defined and relates to the fact that there may be $b$ such that $C^b$ is neither zero or one, but that these $b$ will be extremely rare. The quest for a good polar code is the quest for finding good channels, i.e., it is the quest for specifying the set

$$G:=\{b\in[0,1]{:}\ C^b=1\}.$$

Let's see what we can say about this set.

$G$ is Lebesgue-measurable and its Lebesgue measure is $C$?

Why do I think that this is true? Well, I hope that we can replace the standard probability space $(\Omega,\mathcal{A},P)$ by $([0,1],\mathfrak{B}([0,1]),\lambda)$, where $\lambda$ is the Lebesgue measure and where $\mathfrak{B}([0,1])$ is the Borel sigma algebra for the unit interval $[0,1]$. Clearly, $([0,1],\mathfrak{B}([0,1]),\lambda)$ is a probability space. The original proof for channel polarization either relies on the theory of martingales, or on stochastic dynamical systems. As far as I know, both define the (random) sequence of polarizations leading to a specific channel $C^b$ as a realization of an i.i.d. Bernoulli process. One such realization corresponds to one element of the probability space $\Omega$. If we can show that, equivalently, one such realization corresponds to exactly one element $b$ in $[0,1]$, we can use the probability space associated with the unit interval. The claim that $G$ is measurable follows from the fact that $C^b$ is a well-defined random variable (hence measurable). And since we know that

$$ C=Prob[C^b=1]=\lambda(\{b\in[0,1]{:}\ C^b=1\})=\lambda(G)$$

this would show that, indeed, the Lebesgue measure of $G$ is $C$.

$G^c$ is Dense in $[0,1]$?

I think this should be relatively easy to show: The complement of $G$, $G^c$, contains all $b$ such that the corresponding channel has capacity $C^b<1$. Now assume that the binary expansion of $b$ is terminating, i.e., after finitely many zeros and ones we can be sure that there will be only zeros left, and infinitely many of them. These zeros will polarize the channel to a capacity of zero, hence, all $b$ with a terminating binary expansion will be in $G^c$. But all dyadic rationals, i.e., rationals that can be written as $p/2^q$ for integers $p$ and $q$, have terminating binary expansions. Consequently, the set of dyadic rationals in $[0,1]$ is a subset of $G^c$. Moreover, since the set of dyadic rationals is dense in the set of reals, the set of dyadic rationals in $[0,1]$ is dense in $[0,1]$. Hence, $G^c$ is dense in $[0,1]$.

What does this mean? It means that every interval $I\subset [0,1]$ will contain at least one "bad" channel. More abstractly, if a polar code is a stochastic dynamical system with two attractors, then the basins of attraction (in terms of the number $b$) are not intervals. (I'm not yet sure if this is true, and even if it is, I have no idea what this means.)

$G$ is Self-Similar?

Self-similarity is typical for fractals, so showing that $G$ is self-similar may suggest that $G$ is fractal, too. Fractality is a concept related to the dimension of a set, namely to the fact that the dimension of this set is non-integer. Since there are many measures of dimensionality, the definition of fractality may depend on the measure used. To have a strong statement that (and why!) the construction of polar codes is difficult, I think it suffices to show that $G$ is self-similar. To prove self-similarity, we would need the following result:

Claim: Let $C_1$ and $C_2$ be the capacities of two different channels, and let $C_1<C_2$. Then, for all $b\in[0,1]$, $C_1^b<C_2^b$. Hence, if $G_1$ and $G_2$ are the sets of good channels associated with $C_1$ and $C_2$, respectively, then $G_1 \subseteq G_2$.

In other words, we would need that if a specific sequence polarizes a given channel to zero (one), then it will polarize all worse (better) channels to zero (one) as well. This is equivalent to saying that a dynamical system iterating according to a number $b$ has exactly two attractors and two basins of attraction. That is, for a given $b$ we can find a number $\theta$ such that $C^b=1$ for all $C>\theta$ and $C^b=0$ for all $C<\theta$. $C=\theta$ may of course be another fixed point, but this fixed point is unstable.

Example: Consider the binary erasure channel and let $b=1/3$. Then, the binary expansion of $b$ is $0101010101...$. A binary erasure channel with capacity $C>\frac{\sqrt{5}-1}{2}$ will converge to a perfect channel, one with a capacity strictly lower than this number will converge to a useless channel, and one with a capacity exactly equal to this number will stay at this number. (This is why the polar coding theorem can only show convergence almost surely, or for almost all $b$: There may be $b$ for which a given channel does not converge to either zero or one.)
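For the binary erasure channel, the threshold in this example can be verified by hand: using the standard BEC recursion $C^0=C^2$ and $C^1=2C-C^2$, one period of the sequence (first the worsening, then the improving step) maps the capacity $C$ to $2C^2-C^4$. A fixed point of this map satisfies

$$ C = 2C^2-C^4 \quad\Longleftrightarrow\quad C(C-1)(C^2+C-1)=0, $$

and the only root in $(0,1)$ is $C=\frac{\sqrt{5}-1}{2}\approx 0.618$; below it the iteration drifts towards 0, above it towards 1.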

If this claim holds, then we can say the following: clearly, $[0,0.5]$ is similar to $[0,1]$ in the sense that the unit interval is obtained by a simple left shift of the binary expansions in $[0,0.5]$, or $[0,1]=2\cdot[0,0.5]$. Similarly, the set $[0.5,1]$ differs from the set $[0,0.5]$ only in the first digit of the binary expansion. Moreover, $[0,1]=2\cdot[0.5,1]-1$. We consider again a channel with capacity $C$, and let the set of good channels be $G$. Preprocessing the channel once either improves it or worsens it, so we get capacities $C^1>C>C^0$. Loosely speaking, $G_1$ will be the set of good channels when starting with channel $C^0$, and $G_2$ the set of good channels when starting with $C^1$; formally, let

$$ G_1:=\{b\in[0,0.5]{:}\ C^b=1\}$$

and

$$ G_2:=\{b\in[0.5,1]{:}\ C^b=1\}.$$

But since the set $[0,0.5]$ is similar to the set $[0,1]$, the set $G_1$ will also be similar to the set $G$: In fact,

$$ 2\cdot G_1=\{b\in[0,1]{:}\ C^{0b}=1\}$$

and

$$ 2\cdot G_2-1=\{b\in[0,1]{:}\ C^{1b}=1\}.$$

Since the first preprocessing step changes the capacity of the channel with which we start polarizing, we can apply the claim to get $2\cdot G_1 \subset G \subset 2\cdot G_2-1$. At the same time, however, $G=G_1\cup G_2$. In other words, the set $G$ consists of a better (right) half and a worse (left) half, both of which are similar to $G$ in the sense of the above set inclusions. (If the sets were equal rather than included in each other, the set $G$ would be composed of exactly two scaled copies of itself.) Of course, this analysis can be iterated ad infinitum. $G$ is self-similar.

Fractal Dimension of $G$?

Alright, this might be a tough one. Since $G$ will probably be dense, the box-counting dimension will probably be 1. Since $G$ consists of (more or less) two copies of itself, scaled by a factor of $1/2$, the scaling dimension will be 1, too. The Hausdorff dimension is beyond my understanding.

Wednesday, February 4, 2015

The 9 Balls Problem (still open)

Recently, my lab mate Patrick came to my office and proposed the following question for an exam in information theory: explain why exactly two weighings on a standard balance scale suffice to determine which of nine balls is heavier than the other eight (which all have the same weight).

Everybody knows the puzzle, and most know the solution: split the nine balls into groups of three and weigh two of the triples against each other. The heavier triple must contain the heavy ball, so you can then weigh two balls of the heavier triple against each other. If they balance, the third ball must be the heavy one. Similarly, if the first two triples have the same weight, the heavy ball must be in the third triple, which you did not weigh in the first step.

In fact, it turns out that in this case two weighings are not only sufficient but also necessary: each weighing gives us three possible results (right heavy, left heavy, balanced), so two weighings give us at most nine distinguishable results. And there are exactly nine balls, one of which we have to identify. Hence, if we want to put it in terms of information theory, the number of bits needed to encode the position of the heavy ball is equal to the number of bits obtained from the scale in two weighings:

$$ H(balls) = \log 9 = 2 \log 3 = 2 H(scale) $$
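If you prefer to let a computer confirm the counting argument, here is a tiny brute-force sketch (my own) that runs the 3-3-3 strategy for every possible position of the heavy ball and checks that the two outcomes always differ:

    def weigh(left, right, heavy):
        """Outcome of one weighing: -1 if the left pan is heavier, +1 if the right pan is, 0 if balanced."""
        if heavy in left:
            return -1
        if heavy in right:
            return +1
        return 0

    def two_weighings(heavy):
        """The 3-3-3 strategy: weigh balls {0,1,2} against {3,4,5}, then two balls of the suspect triple."""
        first = weigh({0, 1, 2}, {3, 4, 5}, heavy)
        triple = {-1: (0, 1, 2), +1: (3, 4, 5), 0: (6, 7, 8)}[first]
        second = weigh({triple[0]}, {triple[1]}, heavy)
        return first, second

    outcomes = {two_weighings(h): h for h in range(9)}
    print(len(outcomes))   # 9 distinct outcome pairs, so the heavy ball is always identified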

Things get a bit more difficult if we do not know whether the odd ball is heavier or lighter than the rest. Assume that the ball is in one of the first two triples which we weigh. Then, the scale does not immediately tell us in which of these triples it is: if the odd ball is lighter, it is in the triple going up; if it is heavier, it is in the triple going down. But there is a very nice scheme which allows us to determine the position of the odd ball and whether it is heavier or lighter in three weighings. And again, information theory tells us how to interpret this. Given that there are two possibilities for the weight, $H(weight)=\log 2$, we have

$$ H(balls,weight) = \log 9 + \log 2 = \log 18 < \log 27 = 3\log 3 = 3H(scale) $$

So, three weighings give us more than enough information about whether the odd ball is heavier or lighter, and where it is. I don't want to reproduce the algorithm for finding the odd ball here; just let me tell you that three weighings are sufficient for more than nine balls, too. We only have to guarantee that $H(balls,weight)\le 3H(scale)$. The solution for finding the odd ball out of twelve is given in the Wikipedia article about the Twelve-Coin Problem.

It turns out that if we are not interested in the weight of the odd ball, but just in its position, an ordinary balance scale might not be the right type of measurement device. Although thirteen balls would still satisfy the above inequality, giving

$$ H(balls,weight) = \log 26 < \log 27 = 3H(scale) $$

three weighings tell us which ball is the odd one, but not whether it is heavier or lighter (see Manish's comment to this blog entry or this website). But why is that so? Raj on Quora wrote that with $n$ weighings you can find the odd one out of at most $(3^n-1)/2$ balls, and the odd one together with its weight out of at most $(3^n-3)/2$ balls. For $n=3$ this evaluates to 13 and 12, respectively, so the result seems to be correct.

Still, I'm not totally happy with it. If it's true, then there should be a sequence of results expressing this also in terms of entropy, or in terms of the conditional entropy of, say, the position given the results of the first two weighings. There is some work to do to improve intuition about this problem, although Raj already did a very good job. Let me know when you're done!