I our work, we not only looked at the fundamental limits of this caching scheme, but we also got a good idea of what should actually be stored in the user's hard drive ("cache"). The answers are decades old: One should store either what is known as Gacs-Körner's common randomness, or what is known as Wyner's common information.
For simplicity, we assume that there are only two files available on Netflix, namely X and Y. The common information of these two random variables (RVs; in information theory, everything is random) is another RV W that minimizes
I(X,Y; W)
subject to X-W-Y, i.e., subject to the fact that, given W, X and Y are conditionally independent. The common randomness of these two RVs is an RV U that maximizes
I(X,Y; U)
subject to the constraint that X-Y-U and Y-X-U. It can be shown that
I(X,Y; U) \le I(X;Y) \le I(X,Y; W).
Common randomness is something that is truly common, in the usual sense of the word. In particular, if A, B, and C are independent RVs, and if X=(A,B) and Y=(A,C), then U=A. So much for the math. Now here comes an interpretation (and probably one for which an information theorists wants to punch me in the face): If X and Y are two different episodes of your favorite series, then U is the opening theme. If you have a small hard drive, storing the opening theme is probably the most useful idea. Common information is a bit more intricate, but an intuitive explanation is the following: Suppose X and Y are two different episodes of your favorite series, and suppose the only thing X and Y have in common is an actress called A. In A, some parts of A are seen, while in Y other parts of A are seen. (Say, X shows A running, while Y shows A talking to a person.) Obviously, if you could store all information about actress A as a whole on your hard drive, the small computer in your player could reconstruct running scenes and talking scenes in which A participates. Thus, if your hard drive is large enough, that's what you should store.
Let's take this one step further. Suppose that you are fine with watching not X or Y, but lower-quality versions \hat X or \hat Y, respectively. This requires the definition of lossy versions of common randomness and common information. And indeed, such definitions have been proposed in 1403.8093. Namely, the lossy common information is an RV W that minimizes
I(X,Y; W)
subject to \hat X - W - \hat Y and (X,Y) - (\hat X,\hat Y) - W. Lossy common randomness is an RV U that maximizes
I(X,Y; U)
subject to the constraints X-Y-U, Y-X-U, X-\hat X - U, and Y-\hat Y-U. These definitions are all meaningful because they solve communication problems: The get their operational characterization in the Gray-Wyner problem (1403.8093) or in lossy caching (1610.07304).
The lossy version of common information seems fine to me: I only need to store as much information about actress A in order to be able to reconstruct approximate depictions of her running and speaking. This might need more or less space on the hard drive than the perfect reconstruction: I might be able to store not only this reduced version of the actress A, but I might also store a, say, generic building, that allows to reconstruct two different but similar houses in episodes X and Y. And indeed, lossy common information may be larger or smaller than lossless common information (see Lemma 1 in 1403.8093).
The lossy version of common randomness, however, is not fully satisfactory to me. If I am not interested in perfect reconstruction, should I not be able to store more on my internal hard drive? For example, if the credits of my two episodes X and Y differ only in a few actors' names, can I not store a "noisy" version of the credits in addition to the opening theme? Nevertheless, the lossy version of common randomness is usually smaller than the lossless version (see Corollary 2 in 1403.8093).
I would thus suggest different definitions of lossy common randomness: The lossy common randomness is an RV U' that maximizes
I(X,Y; U') \text{ or } I(\hat X,\hat Y; U')
subject to the constraints \hat X-\hat Y-U', \hat Y-\hat X-U', X-\hat X - U', and Y-\hat Y-U'. The difference is subtle and mirrors the definition of lossy common information. While lossy common information requires W to make the reconstructions \hat X and \hat Y conditionally independent, the original definition of lossy common randomness requires that U is a common part of X and Y (the first two Markov constraints). The definition proposed here would require that U' is only a common part of the reconstructions \hat{X} and \hat Y. Given this definition, we may ask the following questions:
- Does the new definition satisfy the desired property that it is larger than the lossless common randomness?
- Is there a communications problem for which this definition is a single-letter characterization?