User Tools

Site Tools


gibson:teaching:fall-2014:math445:lab5

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
gibson:teaching:fall-2014:math445:lab5 [2014/11/12 12:54]
gibson
gibson:teaching:fall-2014:math445:lab5 [2015/03/09 13:39] (current)
gibson
Line 17: Line 17:
   * Use ''​L = zeros(8,​8)''​ to allocate an 8 x 8 matrix of zeros, and then insert 1s at the required spots with statements like ''​L(4,​3) = 1''​. ​   * Use ''​L = zeros(8,​8)''​ to allocate an 8 x 8 matrix of zeros, and then insert 1s at the required spots with statements like ''​L(4,​3) = 1''​. ​
   * Better yet, store the set of nonzero indices as a ''​N x 2''​ matrix in a file ''​links.asc'',​ then load that matrix into Matlab with ''​load links.asc'',​ and then loop over the ''​N''​ rows of ''​links''​ and assign ones into the specified elements of ''​L''​ with ''​L(links(n,​2),​ links(n,1)) = 1''​ where ''​n''​ is the loop index.   * Better yet, store the set of nonzero indices as a ''​N x 2''​ matrix in a file ''​links.asc'',​ then load that matrix into Matlab with ''​load links.asc'',​ and then loop over the ''​N''​ rows of ''​links''​ and assign ones into the specified elements of ''​L''​ with ''​L(links(n,​2),​ links(n,1)) = 1''​ where ''​n''​ is the loop index.
-  * The transition matrix ''​T''​ has a ''​1/​n''​ everywhere ''​L''​ has a 1, where ''​n''​ is the number of nonzero entries in the same column of ''​L''​. Instead of calculating ''​T''​ yourself and typing it in, use Matlab to calculate it from ''​L''​. Hints: ''​s = sum(L)''​ will give you an 8d row vector ''​s''​ whose elements are the sums of each of the eight columns of ''​L''​. Then the ''​j''​th column of ''​T''​ equals the ''​j''​th column of ''​L''​ divided by ''​s(n)''​.+  * The transition matrix ''​T''​ has a ''​1/​n''​ everywhere ''​L''​ has a 1, where ''​n''​ is the number of nonzero entries in the same column of ''​L''​. Instead of calculating ''​T''​ yourself and typing it in, use Matlab to calculate it from ''​L''​. Hints: ''​s = sum(L)''​ will give you an 8d row vector ''​s''​ whose elements are the sums of each of the eight columns of ''​L''​. Then the ''​j''​th column of ''​T''​ equals the ''​j''​th column of ''​L''​ divided by ''​s(j)''​.
  
 ---- ----
Line 25: Line 25:
  
 **(a)** ​ Go to the website [[http://​www.mathworks.com/​moler/​chapters.html|Numerical Computing with Matlab]] ​ **(a)** ​ Go to the website [[http://​www.mathworks.com/​moler/​chapters.html|Numerical Computing with Matlab]] ​
-by Clive Moler and download the “surfer.m” program. Set ''​N=100''​ and run the command  +by Clive Moler and download the “surfer.m” program. Set ''​M=100''​ and run the command  
-''​[U,​L] = surfer('​http://​www.unh.edu',​N);''​. This might take a few minutes. ​+''​[U,​L] = surfer('​http://​www.unh.edu',​M);''​. This might take a few minutes. ​
  
-**(b)** ​ After some time a vector ''​U''​ of websites will be generated and a matrix ''​L''​ of links will +**(b)** ​ After some time a vector ''​U''​ of websites will be generated and a matrix ''​L''​ of links will be generated. All the entries of ''​L''​ should be either 0 or 1. But ''​L''​ will have 100^2 == 10,000 entries, so you can't check this by eye. Write a short piece of Matlab code that double-checks that all entries of ''​L''​ are either 0 or 1. 
-be generated. All the entries of ''​L''​ should be either 0 or 1. But ''​L''​ will have 100^2 == 10,000 entries, +
-so you can't check this by eye. Write a short piece of Matlab code that double-checks that all entries of +
-''​L''​ are either 0 or 1. +
  
  
-**%%(c)%%** Now generate a ''​T''​ matrix as described in problem 1. You will probably run into an error in  +**%%(c)%%** Now generate a ''​T''​ matrix as described in problem 1. You will probably run into an error in part 2, namely a webpage might contain no links or only links outside the set of pages we are working with. In that case we will just assume that the person randomly selects a new website among all the ''​M''​ pages in our set. Revise your calculation of ''​T''​ so that if a column of ''​L''​ sums to zero, then each of the entries in the same column of ''​T''​ is ''​1/​M''​.
-part 2, namely a webpage might contain no links or only links outside the set of pages we are working with.  +
-In that case we will just assume that the person randomly selects a new website among all the ''​N''​ pages  +
-in our set. Revise your calculation of ''​T''​ so that if a column of ''​L''​ sums to zero, then each of the  +
-entries in the same column of ''​T''​ is ''​1/​N''​.+
  
-**(d)** Suppose that instead of always clicking on a random link within a page, a web surfer sometimes ​ +**(d)** Suppose that instead of always clicking on a random link within a page, a web surfer sometimes (with probability alpha) jumps to a random page chosen from all the pages in the network. We can model this behavior by modifying the transition matrix ''​T''​ as follows
-(with probability alpha) jumps to a random page chosen from all the pages in the network. We can  +
-model this behavior by modifying the transition matrix ''​T''​ as follows+
  
-''​= (1-alpha) T + alpha **1**/N;''​+''​= (1-alpha) T + alpha **1**/M;''​
  
-where **1** is a matrix of ones. Rumor has it that Google uses alpha = 0.15, so use that value +where **1** is a matrix of ones. Rumor has it that Google uses alpha = 0.15, so use that value and calculate ''​G''​. ​
-and calculate ''​T''​. ​+
  
-**(e)** Double-check that the sum of each column of ''​T''​ is 1. Again, be clever and get Matlab to do  +**(e)** Double-check that the sum of each column of ''​G''​ is 1. Again, be clever and get Matlab to do the work, rather than listing out the sums of all the columns and verifying manually that each of the 100 numbers is 1! 
-the work, rather than listing out the sums of all the columns and verifying manually that each of the  +
-100 numbers is 1! +
  
-**(f)** Let’s assume we start at the first webpage, ''​p=zeros(N,​1);​ p(1)=1;''​. If we applying ​ ''​T'' ​ +**(f)** Let’s assume we start at the first webpage, ''​p=zeros(N,​1);​ p(1)=1;''​. If we applying ​ ''​G''​ to ''​p''​ many times (say 1000 times), the resulting vector will give the probability that we end up on a particular website and after a long session of random web surfing. We can do this with a for-loop
-to ''​p''​ many times (say 1000 times), the resulting vector will give the probability that we end up  +
-on a particular website and after a long session of random web surfing. We can do this the slow way +
-by using the code+
  
 <​code>​ <​code>​
-m=1000+N=100
-for j=1:m+for n=1:N
-  p=T*p;+  p=G*p;
 end end
 </​code>​ </​code>​
  
-or a faster equivalent way  
  
-<code+After that is complete the elements of ''​p''​ give the probabilities that a random web surfer will end up at the web pages listed in ''​U'',​ and you can rank the pages in ''​U''​ according to their probabilities in ''​p''​. Look up the ''​sort''​ command in the Help menu and find a way to print the list of websites in order of their rank. Turn in the list of the top 10, the fraction of the time a random web surfer would spend in each one of those 10, and your code to complete the Lab. 
-m=1000; +
-x=(T^m)*x;​ +
-</​code>​+
  
- +**(g)** ​ If you are interested: Our calculation involved two free parameters: the probability alpha of jumping to an entirely random page in the network, and ''​N'',​ the number that specifies the length of our long session of random web surfing. How robust is the page rank algorithm with respect to these two numbers? If you change alpha to 0.1 or 0.05, do the top 10 pages change? 
-After that is complete the values of ''​x''​ give the probabilities that a random web surfer will end  +How about if you change ''​N''​ to 10^3 or 10^4?
-up at the web pages listed in ''​U'',​ and you can rank the pages in ''​U''​ according to their probabilities +
-in ''​x''​. Look up the ''​sort''​ command in the Help menu and find a way to print the list of +
-websites in order of their rank. Turn in the list of the top 10, the fraction of the time a random +
-web surfer would spend in each one of those 10, and your code to complete the Lab. Note: There is  +
-a faster way to compute the solution for very large matrices, but we will not worry about that here.  +
-Our method would work fine until you reached the scale of 8,​000-10,​000 websites. Don't try it!  +
- +
-**(g)** ​ If you are interested: Our calculation involved two free parameters: the probability alpha  +
-of jumping to an entirely random page in the network, and ''​m'',​ the number that specifies the +
-length of our long session of random web surfing. How robust is the page rank algorithm with  +
-respect to these two numbers? If you change alpha to 0.1 or 0.05, do the top 10 pages change? +
-How about if you change ''​m''​ to 10^4?+
  
 ---- ----
  
-**Problem 3:**  Suppose that, instead of representing links between web pages, you have a directed graph that tells who likes whom all students at UNH. You could then use the same algorithm to compute a numerical measure of each UNH student'​s likeability ​at UNH, and an overall ranking of the popularity of all UNH students. ​+**Problem 3:**  Suppose that, instead of representing links between web pages, you have a directed graph that tells who likes whom among all students at UNH. You could then use the same algorithm to compute a numerical measure of each UNH student'​s likeability,​ and an overall ranking of the popularity of all UNH students. ​
  
 What do you think of this proposal? Would the numerical values produced be meaningful measures of likeability or popularity? Why or why not? What does your answer here say about Google Page Rank as a method of ranking web pages? ​ What do you think of this proposal? Would the numerical values produced be meaningful measures of likeability or popularity? Why or why not? What does your answer here say about Google Page Rank as a method of ranking web pages? ​
Line 97: Line 66:
 ---- ----
  
-** Problem 4:** In 2(f) above, it was claimed that  
  
-<code+**Problem 4:** This lab has mentioned three different algorithms for calculating the page rank. For this problem, you will compare the execution time T of these three algorithms as a function of M, the number of pages in your web of links. The end result will be a plot of execution time versus M, with three lines, one for each algorithm.  
-m=1000+ 
-p=(T^m)*p;+It will be easiest to do the comparisons if you convert your page rank script to a Matlab function and then write three different algorithms as three different versions of that function. 
 + 
 + 
 +^function^algorithm^ 
 +| ''​pagerank_fullpower'' ​    | matrix power method with full matrix | 
 +| ''​pagerank_fulliterate'' ​  | iterated matrix multiplication with full matrix |  
 +| ''​pagerank_sparseiterate''​ | iterated matrix multiplication with sparse matrix |  
 + 
 + 
 +** 1. matrix power method with full matrix** 
 + 
 +Write a Matlab function 
 + 
 +  function [i,p] = pagerank_fullpower(L);​ 
 + 
 +that takes a matrix of links L as an input and returns the indices i of webpages sorted by their page ranks, and the page ranks p sorted the same way (so that p(1) >= p(2) etc). 
 + 
 + 
 +To be sure that the function operates with full matrices, make this the first line of the function body 
 + 
 +  L full(L)
 + 
 +Compute T and G as discussed above, and then compute the page rank G^N by the straightforward Matlab code 
 + 
 +  N 100; 
 +  p = G^N p; 
 + 
 + 
 +** 2. iterated matrix multiplication with full matrix ** 
 + 
 +Write a Matlab function  
 + 
 +  function [i,p] = pagerank_fulliterate(L);​ 
 + 
 +just as before, except compute G^N p with the for-loop 
 + 
 +  N=100; 
 +  for n=1:N 
 +    p = G*p 
 +  end 
 + 
 + 
 +** 3. iterated matrix multiplication with sparse matrix ** 
 + 
 +Write a Matlab function  
 + 
 +  function [i,p] = pagerank_sparseiterate(L);​ 
 + 
 +that converts the input L to a sparse matrix in the first line with  
 + 
 +  L = sparse(L) 
 + 
 +and computes G^N p with the for-loop 
 + 
 +  N = 100; 
 +  for n=1:N 
 +    p = (1-alpha)*(T*p) + alpha*ones(M,​1)/​M;​ 
 +  end 
 +   
 +or equivalently 
 + 
 +  N = 100; 
 +  one_alpha_T = (1-alpha)*T;​ 
 +  alpha_ones ​ = alpha*ones(M,​1)/​M;​ 
 +  for n=1:N 
 +    p = one_alpha_T*p + alpha_ones;​ 
 +  end 
 + 
 +Note that the sparse algorithm avoids the matrix of 1's that is present in the other two algorithms, because that matrix is necessarily full. 
 + 
 + 
 +Now write a script that tests and times these algorithms on link matrices L of sizes M = 32, 64, 128, 256, ...., up to whatever power of two your computer can handle. It's smart to develop and test your code on a smaller set of M, perhaps just M = 32, 64, 128. As a starting point, here's the code we developed in class to measure execution time of full versus sparse matrices. 
 + 
 +<code matlab>​ 
 + 
 +% script to compare full versus sparse matrix multiplication 
 + 
 +for n=1:12 
 +   
 +  % set up random M x M Ax=b problem 
 +  m = 2^n;  
 +  A = randomL(m)
 +  x = randn(m,​1);​ 
 + 
 +  % measure execution time of sparse matrix multiplication 
 +  tic; 
 +  b = A*x   
 +  Tsparse(n) = toc; 
 + 
 +  % measure execution time of full matrix multiplication 
 +  A = full(A); 
 +  tic; 
 +  b = A*x;    
 +  Tfull(n) = toc; 
 + 
 +  M(n) = m; 
 +end  ​
 </​code>​ </​code>​
  
-is faster than +You can use this Matlab function to produce synthetic data for testing any given value of M
  
-<​code>​ +<​code ​matlab
-m=1000+function L randomL(M)
-for j=1:m+% return random link matrix for testing pagerank algorithms 
-  p=T*p;+ 
 +  maxlinks = 10;  % maximum of number of links per page 
 +  L = sparse(M,​M);​ 
 +   
 +  ​for j=1:M                     
 +    i = randi(M, maxlinks, 1)% get vector of random numbers between 1 and M  
 +    for n 1:​maxlinks 
 +      L(i(n),j) = 1          % create links from page j to the random pages i 
 +    end 
 +  end 
 +  ​
 end end
 </​code>​ </​code>​
  
-as way of computing ​the steady-state distribution ​of websurfers+ 
 +Produce ​plot of execution time T versus M for each of your three pagerank functions using Matlab'​s ''​tic''​ and ''​toc''​. From the plot, estimate the explicit function that relates T to M for each of the three algorithms. 
 + 
 +Which algorithm is fastest? Why? How long do you estimate each algorithm would take to rank M=10^6 web pages? Provide your answer in easily understandable units of time (days, weeks, years, etc.
 + 
 + 
 + 
  
gibson/teaching/fall-2014/math445/lab5.1415825657.txt.gz · Last modified: 2014/11/12 12:54 by gibson