Error 403: Request Disallowed by robots.txt
Python, Mechanize - request disallowed by robots.txt even after set_handle_robots and add_headers (3 votes)
http://stackoverflow.com/questions/18096885/python-mechanize-request-disallowed-by-robots-txt-even-after-set-handle-robot

I have made a web crawler which gets all links down to the first level of a page, and from those it gets all links and text, plus image links and alt text. Here is the whole code:

    import urllib
    import re
    import time
    from threading import Thread
    import MySQLdb
    import mechanize
    import readability
    from bs4 import BeautifulSoup
    from readability.readability import Document
    import urlparse

    url = ["http://sparkbrowser.com"]

    i = 0
    while i < len(url):
        # ... (rest of the crawler body missing in the source)

Python Mechanize HTTP Error 403: request disallowed by robots.txt [duplicate] (1 vote)
http://stackoverflow.com/questions/18821305/python-mechanize-http-error-403-request-disallowed-by-robots-txt
This question already has an answer here: Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt" (7 answers)

So, I created a Django website to web-scrape news webpages for articles. Even though I use mechanize, the sites still tell me: HTTP Error 403: request disallowed by robots.txt. I tried everything; look at my code (just the scraping part):

    br = mechanize.Browser()
    page = br.open(web)
    br.set_handle_robots(False)
    br.set_handle_equiv(False)
    br.addheaders = [('User-agent',
                      'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) '
                      'Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

    # BeautifulSoup
    htmlcontent = page.read()
    soup = BeautifulSoup(htmlcontent)

I also tried using br.open before set_handle_robots(False), etc. That didn't work either. Is there any way to get through to these sites?

python django beautifulsoup mechanize robots.txt
asked Sep 16 '13 at 6:02 by Julian Slonim; marked as duplicate by Rotwang, Lorenz Meyer, WATTO Studios, Aaron Hall, and Jan Doggen on Mar 10 '14 at 12:53.

Comment: They are disallowed because those sites don't want any bot to access their resources. There might be legal terms. You should stay away from them. – Bibhas, Sep 16 '13 at 6:33

1 Answer (3 votes): You're setting set_handle_robots(False) only after the page has already been opened. Mechanize checks robots.txt when br.open() is called, so by the time the flag is changed the request has already been refused. Configure the browser first, then open the page.
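The answer's fix is purely a matter of call order. A minimal sketch of the question's snippet with the configuration moved ahead of the first open() call; the `web` variable is assumed to hold the target URL, as in the question, and the value shown here is a hypothetical stand-in:

    import mechanize
    from bs4 import BeautifulSoup

    web = "http://example.com/article"  # hypothetical URL; the question sets it elsewhere

    br = mechanize.Browser()
    # Configure the browser *before* the first open(): mechanize consults
    # robots.txt during open(), so flags flipped afterwards come too late.
    br.set_handle_robots(False)
    br.set_handle_equiv(False)
    br.addheaders = [('User-agent',
                      'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) '
                      'Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

    page = br.open(web)  # the robots.txt check is now skipped
    htmlcontent = page.read()
    soup = BeautifulSoup(htmlcontent)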
HTTP Error 403: request disallowed by robots.txt generated? [duplicate] (1 vote)
http://stackoverflow.com/questions/12193701/http-error-403-request-disallowed-by-robots-txt-generated
Possible duplicate: Ethics of Robots.txt

I am trying out Mechanize to automate some work on a site. I have managed to bypass the error above by using br.set_handle_robots(False). How ethical is it to use this? If it is not ethical, then I thought of obeying robots.txt instead, but the site I am trying to mechanize blocks me from viewing its robots.txt. Does that mean no bots are allowed on it? What should my next steps be? Thanks in advance.

web html-parsing web-crawler robots.txt mechanize-python
asked Aug 30 '12 at 9:22 by Avi, edited Aug 30 '12 at 13:36; marked as duplicate by Jürgen Thelen, KingCrunch, Clyde Lobo, Celada, and rene on Aug 31 '12 at 13:03.

1 Answer (1 vote): For your first question, see Ethics of Robots.txt. You need to keep in mind the purpose of robots.txt: robots that crawl a site can potentially wreak havoc on it and essentially mount a DoS attack. So if your "automation" is crawling at all, or is downloading more than just a few pages every day or so, AND the site has a robots.txt file that excludes you, then you should honor it. Personally, I find this a bit of a grey area. If my script works at the same pace as a human using a browser and only grabs a few pages, then, in the spirit of the robots exclusion standard, I have no problem scraping the pages, so long as it doesn't access the site more than once a day. Please read that last sentence carefully before judging me; I feel it is perfectly logical, though many people may disagree. For your second question, web se…
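The answerer's rule of thumb (crawl at a human pace and honor an exclusion that names you) can be checked in code with the standard library's robotparser module, and its error handling also bears on the asker's second question about a blocked robots.txt. A minimal sketch under assumed names: example.com and MyCrawler/0.1 are hypothetical stand-ins, not anything from the thread.

    import time
    import robotparser  # Python 2; available as urllib.robotparser in Python 3

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")  # hypothetical target site
    rp.read()  # a 401/403 on robots.txt makes this parser treat the whole site as disallowed

    USER_AGENT = "MyCrawler/0.1"  # hypothetical bot name
    pages = ["http://example.com/", "http://example.com/news/latest"]

    for page_url in pages:
        if rp.can_fetch(USER_AGENT, page_url):
            # fetch and parse the page here, e.g. with mechanize or urllib
            time.sleep(5)  # throttle: a human-paced crawl, not a DoS
        else:
            print "robots.txt disallows", page_url

Notably, robotparser's 403 handling mirrors the conservative reading the asker suggests: if a site refuses to serve its robots.txt, the safe assumption is that bots are not welcome there.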