1. Information Entropy
Entropy is a measure of the uncertainty in a random variable. Let the random variable X have n outcomes {x1, ..., xn}, where each outcome xi appears with its own probability pi. The average information per outcome is called the information entropy and is denoted by H(X):
H(X) = -Σ pi log2(pi), where i ∈ [1, n] (1.1)
The entropy H(X) represents the average uncertainty that exists a priori concerning the source's emission.
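To make equation (1.1) concrete, here is a minimal Python sketch (the function name entropy and the coin example are mine, not part of any of the sources):

```python
from math import log

def entropy(probs, base=2):
    """H(X) = -sum(p_i * log_b(p_i)); outcomes with p_i = 0 contribute nothing."""
    return -sum(p * log(p, base) for p in probs if p > 0)

# A fair coin has two equally likely outcomes, so one bit of uncertainty:
print(entropy([0.5, 0.5]))  # 1.0
```

A certain outcome (probability 1) gives zero entropy, i.e. no uncertainty at all.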
Equation (1.1), given by Claude E. Shannon in 1948 in his paper "A Mathematical Theory of Communication", has a close analogy with the entropy of thermodynamics, established by L. Boltzmann and J. W. Gibbs in the 1870s, which is why it was named information entropy.
Maximum Entropy
We will not give a detailed explanation of how the maximum entropy is derived; instead we will use an example. Entropy is maximized when all n symbols are equally likely, that is, when each symbol has probability 1/n. The entropy then becomes:
Hmax(X) = logb(n)
Source: Fundamentals in Information Theory and Coding - By Monica Borda
Source: Shannon, Claude E. (July/October 1948). "A Mathematical Theory of Communication". Bell System Technical Journal 27 (3): 379–423. (PDF)
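As a quick sanity check of Hmax(X) = logb(n), here is a sketch (the choice of n = 8 and base 2 is an arbitrary example of mine):

```python
from math import log

n = 8
uniform = [1 / n] * n                          # every symbol equally likely
h_max = -sum(p * log(p, 2) for p in uniform)   # entropy in bits
print(round(h_max, 10))  # 3.0, i.e. log2(8)
```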
The choice of base (b)
The choice of a logarithmic base corresponds to the choice of a unit for measuring information. If the base 2 is used the resulting units may be called binary digits, or more briefly bits, a word suggested by J.W.Tukey. If the base 10 is used the units may be called decimal digits.
Source: Shannon, Claude E. (July/October 1948). "A Mathematical Theory of Communication". Bell System Technical Journal 27 (3): 379–423. (PDF)
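Changing the logarithmic base only rescales the entropy by a constant factor, since logb(x) = loga(x) / loga(b). A short sketch (the three-outcome distribution is an arbitrary example of mine):

```python
from math import log2, log10

probs = [0.5, 0.25, 0.25]
h_bits = -sum(p * log2(p) for p in probs)    # base 2: binary digits (bits)
h_dits = -sum(p * log10(p) for p in probs)   # base 10: decimal digits

# The units convert by a constant factor: H_bits = H_dits / log10(2).
print(round(h_bits, 4), round(h_dits / log10(2), 4))  # 1.5 1.5
```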
Entropy in Recommendation Systems
Recommendation systems depend on a model of a user (generally in the form of user ratings) to provide personalized recommendations. Users who do not have a model, therefore, cannot receive personalized ratings. The challenge of how to serve these users, and how best to learn their preferences, is known as the "new user problem" in recommendation systems.
Rashid et al. (2002) explored six algorithms for soliciting ratings from new users. They found that the best performance was a hybrid that selected items (movies in this experiment) that were popular, but that also had high ratings entropy. Entropy, a measure of disagreement among raters, proved to be a good predictor of the amount of information provided to the system by a rating.
Source: Joseph A. Konstan, John Riedl - Recommender Systems: from algorithms to user experience
2. My Theoretical Examples
For the following examples we will assume the following: we have a movie database and some users, and at some point we want to calculate the entropy of the ratings that the users gave to one movie. For simplicity, let's say that we have only twenty users, which means we have twenty ratings for that movie. We also decided that a movie in our database can only receive integer ratings from 1 to 10 inclusive. Notice that zero is not among the possible symbols, so in our case there are ten symbols; if zero were included there would be eleven. We will calculate some convenient examples of entropy using 10 as the logarithmic base.
Example 1
| User | Rate | User | Rate | User | Rate | User | Rate |
|---|---|---|---|---|---|---|---|
| User 01 | 8 | User 06 | 9 | User 11 | 10 | User 16 | 7 |
| User 02 | 8 | User 07 | 9 | User 12 | 10 | User 17 | 7 |
| User 03 | 8 | User 08 | 9 | User 13 | 10 | User 18 | 7 |
| User 04 | 8 | User 09 | 9 | User 14 | 10 | User 19 | 7 |
| User 05 | 8 | User 10 | 9 | User 15 | 10 | User 20 | 7 |
Table 1 for Example 1
A user can rate a movie with any of ten symbols, but Table 1 shows that the users rated the movie with only four of them. This is a contrived example in which each symbol that the users actually used has the same probability of appearing, namely 5/20; the probability of symbols 1 to 6 is zero. So the entropy is:

H(X) = -4 · (5/20) · log10(5/20) = log10(4) ≈ 0.6021
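We can verify this value with a few lines of Python (a sketch of mine, not part of the original calculation):

```python
from math import log10

# Example 1: symbols 7, 8, 9 and 10 each appear 5 times out of 20 ratings.
probs = [5 / 20] * 4
h = -sum(p * log10(p) for p in probs)
print(round(h, 4))  # 0.6021, i.e. log10(4)
```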
Example 2
| User | Rate | User | Rate | User | Rate | User | Rate |
|---|---|---|---|---|---|---|---|
| User 01 | 8 | User 06 | 9 | User 11 | 10 | User 16 | 7 |
| User 02 | 8 | User 07 | 9 | User 12 | 10 | User 17 | 7 |
| User 03 | 8 | User 08 | 9 | User 13 | 10 | User 18 | 7 |
| User 04 | 5 | User 09 | 5 | User 14 | 6 | User 19 | 6 |
| User 05 | 5 | User 10 | 5 | User 15 | 6 | User 20 | 6 |
Table 2 for Example 2
For symbols 7, 8, 9 and 10 the probability is 3/20, and for symbols 5 and 6 the probability of appearing is 4/20. So the entropy is:

H(X) = -(4 · (3/20) · log10(3/20) + 2 · (4/20) · log10(4/20)) ≈ 0.7739
So the entropy increased. This tells us that the more the users disagree, the higher the entropy.
Example 3
From the previous examples we can safely assume that if all the users but three give the same rate to the movie, then the entropy will be less than the entropy of Example 1.
| User | Rate | User | Rate | User | Rate | User | Rate |
|---|---|---|---|---|---|---|---|
| User 01 | 8 | User 06 | 8 | User 11 | 8 | User 16 | 8 |
| User 02 | 8 | User 07 | 8 | User 12 | 8 | User 17 | 8 |
| User 03 | 8 | User 08 | 8 | User 13 | 8 | User 18 | 9 |
| User 04 | 8 | User 09 | 8 | User 14 | 8 | User 19 | 7 |
| User 05 | 8 | User 10 | 8 | User 15 | 8 | User 20 | 6 |
Table 3 for Example 3
So the probability of the symbol 8 is 17/20, and the probability of each of 6, 7 and 9 is 1/20. Let's calculate the entropy:

H(X) = -((17/20) · log10(17/20) + 3 · (1/20) · log10(1/20)) ≈ 0.2551

As expected, the entropy is less than in Example 1.
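Checking with Python again (my sketch):

```python
from math import log10

# Example 3: symbol 8 appears 17 times; symbols 6, 7 and 9 once each.
probs = [17 / 20, 1 / 20, 1 / 20, 1 / 20]
h = -sum(p * log10(p) for p in probs)
print(round(h, 4))  # 0.2551
```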
Example 4
Now let's examine how the maximum entropy can be achieved with 20 users.
| User | Rate | User | Rate | User | Rate | User | Rate |
|---|---|---|---|---|---|---|---|
| User 01 | 1 | User 06 | 3 | User 11 | 6 | User 16 | 8 |
| User 02 | 1 | User 07 | 4 | User 12 | 6 | User 17 | 9 |
| User 03 | 2 | User 08 | 4 | User 13 | 7 | User 18 | 9 |
| User 04 | 2 | User 09 | 5 | User 14 | 7 | User 19 | 10 |
| User 05 | 3 | User 10 | 5 | User 15 | 8 | User 20 | 10 |
Table 4 for Example 4
We can see that each symbol has probability 2/20 = 1/10. So the entropy is:

H(X) = -10 · (1/10) · log10(1/10) = log10(10) = 1

This is exactly the maximum entropy Hmax(X) = log10(10) for ten symbols.
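A quick check that this really is the maximum (my sketch):

```python
from math import log10

# Example 4: all ten symbols appear twice, so each has probability 1/10.
probs = [1 / 10] * 10
h = -sum(p * log10(p) for p in probs)
print(round(h, 4))  # 1.0, the maximum entropy log10(10) for ten symbols
```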
Example 5
What do we expect to happen to the entropy if the number of users grows considerably but their ratings remain similar to, say, Example 2?
Let's say that now for symbols 7, 8, 9 and 10 the probability is 20/134, and for symbols 5 and 6 the probability of appearing is 27/134. So the entropy is:

H(X) = -(4 · (20/134) · log10(20/134) + 2 · (27/134) · log10(27/134)) ≈ 0.77355
The entropy was 0.773933246301 when we had 20 users and is 0.7735522392774656 now that we have 134 users who voted similarly to the first twenty. So the entropy didn't change remarkably! That tells us that the extra 114 users didn't add much information with their ratings. Also, if the probabilities had been 21/140 = 3/20 and 28/140 = 4/20, the entropy would have been exactly the same as with 20 users.
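Both numbers can be reproduced with a short sketch (the helper entropy10 is mine):

```python
from math import log10

def entropy10(counts, total):
    # Each symbol's probability is count/total; sum over the symbols used.
    return -sum((c / total) * log10(c / total) for c in counts)

h20 = entropy10([3, 3, 3, 3, 4, 4], 20)          # Example 2: 20 users
h134 = entropy10([20, 20, 20, 20, 27, 27], 134)  # Example 5: 134 users
print(round(h20, 7), round(h134, 7))  # 0.7739332 0.7735522
```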
3. Calculating entropy - MySQL
For the sake of this example, let's say that we have a table app_favorites where we store the id of the user, the movie URL, the rate that the user gave to the movie, and so on, and another table called app_movies where the details of each movie are stored. The following SELECT shows the ratings of 7 users for one movie:
```sql
SELECT * FROM app_favorites WHERE mUrl = "http://url1";
+-----------+-------------+------+-----+
| id        | mUrl        | rate | pos |
+-----------+-------------+------+-----+
|      1994 | http://url1 |    0 |   0 |
|      1995 | http://url1 |    0 |   0 |
|      6666 | http://url1 |   10 |   1 |
|    345352 | http://url1 |   10 |   1 |
|  20130614 | http://url1 |    9 |   2 |
|  20130615 | http://url1 |    7 |   3 |
| 201306142 | http://url1 |    3 |   3 |
+-----------+-------------+------+-----+
```
And the following SELECT shows the info for the same movie:
```sql
SELECT mtitle,mUrl,entropy,avRate FROM app_movies WHERE mUrl = "http://url1";
+------------+-------------+---------+--------+
| mtitle     | mUrl        | entropy | avRate |
+------------+-------------+---------+--------+
| The Matrix | http://url1 |  0.0000 | 7.8000 |
+------------+-------------+---------+--------+
```
So we need a clever way of calculating the probability of each rate (symbol): how many tens are there, how many nines, and so on. In cases like these, count() and GROUP BY come to the rescue. The following SELECT counts how many times each rate appears (excluding zero ratings):
```sql
SELECT rate,count(*) AS rateSum
FROM app_favorites
WHERE mUrl = "http://url1" AND rate > 0
GROUP BY rate;
+------+---------+
| rate | rateSum |
+------+---------+
|    3 |       1 |
|    7 |       1 |
|    9 |       1 |
|   10 |       2 |
+------+---------+
```
Notice that 3, 7 and 9 each appear once, so their probabilities will be the same. What if we grouped them again, the way a human would when calculating the entropy by hand? It seems like a good idea!
```sql
SELECT rate,rateSum,count(*) AS rateSumMult
FROM (SELECT rate,count(*) AS rateSum
      FROM app_favorites
      WHERE mUrl = "http://url1" AND rate > 0
      GROUP BY rate) AS innerTable
GROUP BY rateSum;
+------+---------+-------------+
| rate | rateSum | rateSumMult |
+------+---------+-------------+
|    3 |       1 |           3 |
|   10 |       2 |           1 |
+------+---------+-------------+
```
By using GROUP BY we lose some information. We can still conclude (from column rateSumMult) that there are 3 rates, (from rateSum) that these rates appear 1 time each, and (from column rate) that one of them is the symbol '3', but we don't know what the other two rates are. So we lost the information that the other two symbols appearing only once are '7' and '9'. Nevertheless, we don't even care! All we actually need to calculate the entropy are the two columns rateSum and rateSumMult, plus the total number of non-zero rates.
So the SELECT that we are going to use is the following (it is the same as before, but with the rate column removed):
```sql
SELECT rateSum,count(*) AS rateSumMult
FROM (SELECT count(*) AS rateSum
      FROM app_favorites
      WHERE mUrl = "http://url1" AND rate > 0
      GROUP BY rate) AS innerTable
GROUP BY rateSum;
+---------+-------------+
| rateSum | rateSumMult |
+---------+-------------+
|       1 |           3 |
|       2 |           1 |
+---------+-------------+
```
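These two columns plus the total are all the entropy formula needs. The same computation in a Python sketch (the rows and the total are taken from the SELECT results above):

```python
from math import log10

rows = [(1, 3), (2, 1)]  # (rateSum, rateSumMult) pairs from the grouped SELECT
tot = 5                  # total number of non-zero rates

# Each group contributes rateSumMult identical terms of -(s/tot)*log10(s/tot).
h = -sum(m * (s / tot) * log10(s / tot) for s, m in rows)
print(round(h, 4))  # 0.5786
```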
The SELECT for the total number of non-zero rates is trivial SQL, but here it is:
```sql
SELECT count(*) FROM app_favorites WHERE mUrl = "http://url1" AND rate > 0;
+----------+
| count(*) |
+----------+
|        5 |
+----------+
```
Unfortunately, this blog post has gotten too long, so I will not go into more detail. Below is the code of a stored procedure that takes the URL of a movie and then uses a cursor to calculate the entropy of that movie. Note that the users' rates are stored in table app_favorites and that, before the procedure ends, the entropy is stored in table app_movies.
```sql
/* Testing Procedure Entropy */

/* If a procedure with the name "entropy" already exists, drop it. */
DROP PROCEDURE IF EXISTS entropy;

/* Set the delimiter from ';' to '$'. */
DELIMITER $

/* Create the procedure that calculates the entropy of the ratings of a
 * movie. In our database the primary key for each movie is the movie url. */
CREATE PROCEDURE entropy(IN mUrlIN VARCHAR(128))
BEGIN
    /* ------------------- Declarations ---------------------- */
    /* done: flag for the continue handler.
     * tot:  total number of non-zero rates. */
    DECLARE done INT DEFAULT FALSE;
    DECLARE tot INTEGER;
    /* s -> rateSum, m -> multiplicity of the sum (rateSumMult)
     * and en -> entropy. */
    DECLARE s, m INTEGER;
    DECLARE en FLOAT(7,4) DEFAULT 0;
    /* Declare the cursor for the select. This select does counting and
     * grouping at the same time: the inner select counts how many times
     * a rating appears (rateSum) and the outer one counts how many times
     * the same rateSum appears (rateSumMult). */
    DECLARE curs1 CURSOR FOR
        SELECT rateSum, count(*) AS rateSumMult
        FROM (SELECT count(*) AS rateSum
              FROM app_favorites
              WHERE mUrl = mUrlIN AND rate > 0
              GROUP BY rate) AS innerTable
        GROUP BY rateSum;
    /* Declare the continue handler. */
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
    /* ---------------- End of Declarations ------------------ */

    /* Count only the non-zero rates of the movie and store the result
     * in tot. */
    SELECT count(*) INTO tot
    FROM app_favorites
    WHERE mUrl = mUrlIN AND rate > 0;

    /* Open the cursor and loop over its rows. */
    OPEN curs1;
    sum_loop: LOOP
        FETCH curs1 INTO s, m;
        IF done THEN
            LEAVE sum_loop;
        END IF;
        /* Accumulate the entropy terms. */
        SET en = en + m * (s/tot) * LOG10(s/tot);
    END LOOP;
    CLOSE curs1;

    /* Store the entropy. */
    SET en = -en;
    UPDATE app_movies SET entropy = en WHERE mUrl = mUrlIN;
END$

DELIMITER ;

/* You can then call the procedure like this: */
CALL entropy("http://url1");
```
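As a sanity check on the procedure, the same entropy can be computed directly from the raw rates (the rate list below is the one returned by the first SELECT of this section):

```python
from collections import Counter
from math import log10

rates = [0, 0, 10, 10, 9, 7, 3]              # column `rate` from app_favorites
counts = Counter(r for r in rates if r > 0)  # drop the zero (unrated) rows
tot = sum(counts.values())

h = -sum((c / tot) * log10(c / tot) for c in counts.values())
print(round(h, 4))  # 0.5786
```

If everything works, app_movies.entropy should hold this same value (0.5786) after the call.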



