Friday, December 10, 2010

Fiddler

I was having the toughest time trying to figure out why my HTTP GET request kept coming back with the page at http://www.networksolutions.com/whois/index.jsp when I was trying to do a whois lookup.

It turned out that I was doing the search correctly, but I needed to send a referrer page; otherwise the website knew that it was an automated request.  How did I figure all this out?  Fiddler.

Fiddler is a program that monitors all of the HTTP traffic on your computer.  It stores all of the information in each request: the headers sent, the cookies, and so on.

In order to get past the automatic robot-checking software on most sites, all you need to do is make sure your request is formed properly.  To do this, open up Fiddler.  You can download it at http://www.fiddler2.com/fiddler2/.  Once it's open, you can clear out all the traffic by selecting all of the items in the first window and deleting them.

Open up a browser, go to the website where you want to make your specific request, clear out your Fiddler list, and then make your request.  When I search for google.com on the whois lookup at http://www.networksolutions.com, here is the information that Fiddler shows me in the Raw tab.


GET http://www.networksolutions.com/whois-search/google.com HTTP/1.1
Host: www.networksolutions.com
Connection: keep-alive
Referer: http://www.networksolutions.com/whois/index.jsp
Cache-Control: max-age=0
Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.41 Safari/534.7
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
Cookie: siteId=527-10; _csuid=X162083227e04de; randomstring=; vrsnsf=161b3bac4c69008e3c39e201757c; landing=P99C8S570N0B9A1D670E0000V101; cart="|time=Fri Dec 03 17:19:22 EST 2010|sessionId=e31843b13a3d9c28b992aaedf8e7|cookie=161b3bac4c69008e3c39e201757c|cart={[H(3y)<lavineflowers.biz][DOM_BIZ(5y)<lavineflowers.biz]}{[H(3y)<lavineflowers.com][DOM_COM(5y)<lavineflowers.com]}"; JSESSIONID=2dc0cc11539eb8a2f5c74ee4e538; JROUTE=GMxg; loginSelectorDestination=; __utmz=82970249.1292027830.9.6.utmgclid=CIOI6Nv64qUCFQGe7Qod4WH9VA|utmccn=(not%20set)|utmcmd=(not%20set)|utmctr=network%20solutions; RVTFN=1-877-357-7586; RVRF=nsgooglebrand-network_solutions{night}-exact-101; RVID=1514294; RVNS_SESSID=2dc0cc11539eb8a2f5c74ee4e538; vertigo=false; s_cc=true; __utmv=; __utma=82970249.1039984786.1290123989.1291786510.1292027830.9; __utmc=82970249; __utmb=82970249.3.10.1292027830; test=none; s_sq=netsolglobal%3D%2526pid%253Dnet%25257C%252520whois%25253Eindex.jsp%2526pidt%253D1%2526oid%253Dhttp%25253A%25252F%25252Fwww.networksolutions.com%25252Fimg%25252Fbuttons%25252Fsearch-blue.png%2526ot%253DIMAGE; currency=USD

Pretty cool, huh?

You will notice that it shows the Host header -- very important.  Most important for this query, it shows the Referer page in the GET request itself.  So if the live site is sending itself that data, the automated request probably needs to contain it too.

This is done by setting the request's Referer property with something like:

request.Referer = referer;
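In full, a request for the whois search might look something like this.  This is just a minimal sketch; the class name and the header value are my own illustration pulled from the Fiddler capture above, not my original code.

using System;
using System.IO;
using System.Net;

class WhoisLookup
{
    static void Main()
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
            "http://www.networksolutions.com/whois-search/google.com");

        // Send the same Referer the live page sends, so the site
        // doesn't flag the request as automated.
        request.Referer = "http://www.networksolutions.com/whois/index.jsp";

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}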

Fiddler also shows the information stored in the cookies.  While I didn't need to store the cookie information for this request, or for most requests, some requests will require it.

I ran into that problem when I was doing a whois lookup at domaintools.com.

I had to save the sessionId after I had logged on to the site, otherwise it would only allow a limited number of whois requests.  I also had to save the other cookie information that apparently contained the id for the website I was looking up.  I did this with the following:

1.  First, I created a cookie container in my class called _cookieContainer.

        private CookieContainer _cookieContainer = new CookieContainer();

2.  Then in my web GET method I make sure to add any cookies coming back from that URI into my cookie container.

                // Issue the request and grab the response.
                response = (HttpWebResponse)request.GetResponse();

                // Copy the cookies the server set for this URI...
                response.Cookies = request.CookieContainer.GetCookies(request.RequestUri);

                // ...and store them in the class-level container for later requests.
                _cookieContainer.Add(request.RequestUri, response.Cookies);


3.  Whenever I submit a request, I make sure to set its cookie container to the one I got from the last web request.


                request.CookieContainer = _cookieContainer;

4.  This forms a loop: it keeps storing the relevant cookie information and sends it out with each subsequent request.  The response cookies get set correctly, and the next request carries them along.
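Putting those steps together, the whole thing looks roughly like this.  Again, a sketch only; the class and method names are my own additions, not the original code.

using System.IO;
using System.Net;

class WhoisScraper
{
    // Step 1: one cookie container shared across all requests.
    private CookieContainer _cookieContainer = new CookieContainer();

    public string GetPage(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

        // Step 3: send along every cookie collected so far.
        request.CookieContainer = _cookieContainer;

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            // Step 2: fold any new cookies back into the shared container.
            response.Cookies = request.CookieContainer.GetCookies(request.RequestUri);
            _cookieContainer.Add(request.RequestUri, response.Cookies);

            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }
    }
}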

In order to add cookies manually, the following lines of code work:


cookieContainer.Add(new Cookie("dtsession", "6bfcc307d2b0bf52f78597ba1f1e50da", "/",                       "domaintools.com"));
cookieContainer.Add(new Cookie("SessionToken", "8ea23c8f35383f75a0fa6c23356337e74f8bd7c4", "/", "domaintools.com"));



Happy web scraping.
