In our work, we not only looked at the fundamental limits of this caching scheme, but we also got a good idea of what should actually be stored on the user's hard drive ("cache"). The answers are decades old: One should store either what is known as the Gács-Körner common randomness or what is known as Wyner's common information.
For simplicity, we assume that there are only two files available on Netflix, namely $X$ and $Y$. The common information of these two random variables (RVs; in information theory, everything is random) is another RV $W$ that minimizes
$$I(X,Y; W)$$
subject to $X-W-Y$, i.e., to the constraint that $X$ and $Y$ are conditionally independent given $W$. The common randomness of these two RVs is an RV $U$ that maximizes
$$ I(X,Y; U) $$
subject to the constraints $X-Y-U$ and $Y-X-U$. It can be shown that
$$ I(X,Y; U) \le I(X;Y) \le I(X,Y; W).$$
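For readers who like to compute: both definitions are optimizations over joint pmfs of three variables, with an ordinary mutual information as the objective and Markov chains as the constraints. Below is a minimal numerical sketch of these two ingredients, assuming finite alphabets; it is my own illustration, not code from any paper.

```python
import numpy as np

def mutual_information(p_ab):
    """I(A; B) in bits for a joint pmf given as a 2-D array p_ab[a, b]."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float((p_ab[mask] * np.log2(p_ab[mask] / (p_a @ p_b)[mask])).sum())

def objective(p_xwy):
    """I(X,Y; W) for a joint pmf p_xwy[x, w, y]: flatten (X, Y) into one symbol."""
    pair_vs_w = np.moveaxis(p_xwy, 1, 2).reshape(-1, p_xwy.shape[1])
    return mutual_information(pair_vs_w)

def cond_mi_given_w(p_xwy):
    """I(X; Y | W) in bits; the Markov chain X - W - Y holds iff this is zero."""
    cmi = 0.0
    for w in range(p_xwy.shape[1]):
        p_w = p_xwy[:, w, :].sum()
        if p_w > 0:
            cmi += p_w * mutual_information(p_xwy[:, w, :] / p_w)
    return cmi
```

Wyner's common information is then the smallest value of `objective` over all joints $p(x,w,y)$ that agree with the given $p(x,y)$ and make `cond_mi_given_w` vanish; the Gács-Körner quantity is a maximization instead, with the axes permuted so that the two constraints $X-Y-U$ and $Y-X-U$ can be checked in the same way.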
Common randomness is something that is truly common, in the usual sense of the word. In particular, if $A$, $B$, and $C$ are independent RVs, and if $X=(A,B)$ and $Y=(A,C)$, then $U=A$.

So much for the math. Now here comes an interpretation (and probably one for which an information theorist wants to punch me in the face): If $X$ and $Y$ are two different episodes of your favorite series, then $U$ is the opening theme. If you have a small hard drive, storing the opening theme is probably the most useful idea. Common information is a bit more intricate, but an intuitive explanation is the following: Suppose $X$ and $Y$ are two different episodes of your favorite series, and suppose the only thing $X$ and $Y$ have in common is an actress called $A$. In $X$, some parts of $A$ are seen, while in $Y$ other parts of $A$ are seen. (Say, $X$ shows $A$ running, while $Y$ shows $A$ talking to a person.) Obviously, if you could store all information about actress $A$ as a whole on your hard drive, the small computer in your player could reconstruct both the running scenes and the talking scenes in which $A$ participates. Thus, if your hard drive is large enough, that's what you should store.
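To sanity-check the $X=(A,B)$, $Y=(A,C)$ toy example numerically (my own back-of-the-envelope computation, assuming $A$, $B$, and $C$ are independent fair bits): choosing $U=A$ gives $I(X,Y;U)=H(A)=1$ bit, which equals $I(X;Y)$, so $U=A$ attains the upper bound in the chain of inequalities above; and since $B$ and $C$ are independent, $X-A-Y$ holds as well, so $W=A$ is a feasible (and here optimal) choice for the common information too.

```python
import numpy as np

def mi(p_ab):
    """I(A; B) in bits for a 2-D joint pmf."""
    pa = p_ab.sum(axis=1, keepdims=True)
    pb = p_ab.sum(axis=0, keepdims=True)
    m = p_ab > 0
    return float((p_ab[m] * np.log2(p_ab[m] / (pa @ pb)[m])).sum())

# Independent fair bits A, B, C; X = (A, B) and Y = (A, C),
# encoded as the integers 2*A + B and 2*A + C. Axes of p: (X, A, Y).
p = np.zeros((4, 2, 4))
for a in range(2):
    for b in range(2):
        for c in range(2):
            p[2 * a + b, a, 2 * a + c] = 1 / 8

p_xy = p.sum(axis=1)                            # joint pmf of (X, Y)
p_pair_a = np.moveaxis(p, 1, 2).reshape(16, 2)  # joint pmf of ((X, Y), A)

print(mi(p_xy))      # I(X;Y)    = 1.0 bit (= H(A))
print(mi(p_pair_a))  # I(X,Y; A) = 1.0 bit -> U = A meets the upper bound
# Given A, X and Y only depend on the independent bits B and C, so I(X;Y|A) = 0
# and W = A satisfies the Markov chain X - A - Y as well.
print(sum(p[:, a, :].sum() * mi(p[:, a, :] / p[:, a, :].sum()) for a in range(2)))
```

In this particular example the whole chain collapses to equality; in general, the three quantities can all be different.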
Let's take this one step further. Suppose that you are fine with watching not $X$ or $Y$, but lower-quality versions $\hat X$ or $\hat Y$, respectively. This requires the definition of lossy versions of common randomness and common information. And indeed, such definitions have been proposed in 1403.8093. Namely, the lossy common information is an RV $W$ that minimizes
$$ I(X,Y; W)$$
subject to $\hat X - W - \hat Y$ and $(X,Y) - (\hat X,\hat Y) - W$. Lossy common randomness is an RV $U$ that maximizes
$$ I(X,Y; U) $$
subject to the constraints $X-Y-U$, $Y-X-U$, $X-\hat X - U$, and $Y-\hat Y-U$. These definitions are all meaningful because they solve communication problems: They get their operational characterization in the Gray-Wyner problem (1403.8093) or in lossy caching (1610.07304).
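Only the constraint sets change between the lossless and the lossy definitions, so here is a small sketch of how one could check such constraints numerically (again my own illustration, not code from 1403.8093 or 1610.07304): over finite alphabets, a Markov chain $A-B-C$ holds if and only if $I(A;C\mid B)=0$, and the four constraints of the lossy common randomness definition become four such checks on a five-dimensional joint pmf.

```python
import numpy as np

def cond_mi(p, a, b, c):
    """I(A; C | B) in bits for an N-dim joint pmf p; a, b, c are axis indices.
    The Markov chain A - B - C holds iff this value is (numerically) zero."""
    others = tuple(i for i in range(p.ndim) if i not in (a, b, c))
    q = p.sum(axis=others) if others else p
    order = sorted((a, b, c))                     # axis order after marginalizing
    q = np.moveaxis(q, [order.index(i) for i in (a, b, c)], [0, 1, 2])
    cmi = 0.0
    for m in range(q.shape[1]):                   # condition on B = m
        pm = q[:, m, :].sum()
        if pm > 0:
            r = q[:, m, :] / pm
            ra = r.sum(axis=1, keepdims=True)
            rb = r.sum(axis=0, keepdims=True)
            mask = r > 0
            cmi += pm * float((r[mask] * np.log2(r[mask] / (ra @ rb)[mask])).sum())
    return cmi

# Axes of the joint pmf: 0 = X, 1 = Y, 2 = Xhat, 3 = Yhat, 4 = U.
lossy_cr_constraints = [(0, 1, 4),   # X - Y - U
                        (1, 0, 4),   # Y - X - U
                        (0, 2, 4),   # X - Xhat - U
                        (1, 3, 4)]   # Y - Yhat - U
# A candidate joint pmf p is feasible iff cond_mi(p, *t) is zero for each triple t.
```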
The lossy version of common information seems fine to me: I only need to store as much information about actress $A$ as is necessary to reconstruct approximate depictions of her running and speaking. This might need more or less space on the hard drive than perfect reconstruction: I might be able to store not only this reduced version of actress $A$, but also, say, a generic building that allows the player to reconstruct two different but similar houses in episodes $X$ and $Y$. And indeed, lossy common information may be larger or smaller than lossless common information (see Lemma 1 in 1403.8093).
The lossy version of common randomness, however, is not fully satisfactory to me. If I am not interested in perfect reconstruction, should I not be able to store more on my internal hard drive? For example, if the credits of my two episodes $X$ and $Y$ differ only in a few actors' names, can I not store a "noisy" version of the credits in addition to the opening theme? Nevertheless, the lossy version of common randomness is usually smaller than the lossless version (see Corollary 2 in 1403.8093).
I would thus suggest different definitions of lossy common randomness: The lossy common randomness is an RV $U'$ that maximizes either
$$ I(X,Y; U') \text{ or } I(\hat X,\hat Y; U')$$
subject to the constraints $\hat X-\hat Y-U'$, $\hat Y-\hat X-U'$, $X-\hat X - U'$, and $Y-\hat Y-U'$. The difference is subtle and mirrors the definition of lossy common information. While lossy common information requires $W$ to make the reconstructions $\hat X$ and $\hat Y$ conditionally independent, the original definition of lossy common randomness requires that $U$ be a common part of $X$ and $Y$ (the first two Markov constraints). The definition proposed here would require that $U'$ be a common part only of the reconstructions $\hat{X}$ and $\hat Y$; the two constraint sets are listed side by side in the short sketch after the questions below. Given this definition, we may ask the following questions:
- Does the new definition satisfy the desired property that it is larger than the lossless common randomness?
- Is there a communication problem for which this definition provides a single-letter characterization?
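For concreteness, and in the same axis convention as the constraint-checking sketch above, the constraint set I am proposing differs from the one in 1403.8093 only in the first two triples:

```python
# Same axis convention as before: 0 = X, 1 = Y, 2 = Xhat, 3 = Yhat, 4 = U'.
proposed_constraints = [(2, 3, 4),   # Xhat - Yhat - U'
                        (3, 2, 4),   # Yhat - Xhat - U'
                        (0, 2, 4),   # X - Xhat - U'
                        (1, 3, 4)]   # Y - Yhat - U'
# The last two triples (the reconstruction constraints) are shared by both
# definitions; only the "common part" requirement moves to the reconstructions.
```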