public class HttpRobotRulesParser extends RobotRulesParser
Extends the generic RobotRulesParser class and contains the HTTP protocol specific implementation for obtaining the robots.txt file.

Field Summary

| Modifier and Type | Field and Description |
|---|---|
| protected boolean | allowForbidden |
Fields inherited from class RobotRulesParser: agentNames, CACHE, EMPTY_RULES, FORBID_ALL_RULES

Constructor Summary

| Constructor and Description |
|---|
| HttpRobotRulesParser(Configuration conf) |
Method Summary

| Modifier and Type | Method and Description |
|---|---|
| protected static java.lang.String | getCacheKey(java.net.URL url) Compose unique key to store and access robot rules in cache for given URL |
| crawlercommons.robots.BaseRobotRules | getRobotRulesSet(Protocol http, java.net.URL url) Get the rules from robots.txt which apply for the given url. |
Methods inherited from class RobotRulesParser: getConf, getRobotRulesSet, main, parseRules, setConf

Constructor Detail

public HttpRobotRulesParser(Configuration conf)
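A brief construction sketch, not taken from the Nutch sources: it assumes the standard Nutch package layout and the NutchConfiguration helper, and the `http.robots.403.allow` property shown is an assumption about which setting backs the allowForbidden field; verify it against the actual configuration defaults.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

public class RobotsParserSetup {
  public static void main(String[] args) {
    // Build a Nutch configuration (any populated Hadoop Configuration works).
    Configuration conf = NutchConfiguration.create();

    // Assumed property name: whether a 403 on robots.txt means "allow all".
    // Check nutch-default.xml before relying on it.
    conf.setBoolean("http.robots.403.allow", true);

    HttpRobotRulesParser robotsParser = new HttpRobotRulesParser(conf);
    System.out.println("Created parser: " + robotsParser.getClass().getName());
  }
}
```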
Method Detail

protected static java.lang.String getCacheKey(java.net.URL url)

Compose unique key to store and access robot rules in cache for given URL.
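For illustration only, a hypothetical helper showing the kind of protocol/host/port combination such a key covers; `composeCacheKey` below is not the actual Nutch implementation.

```java
import java.net.URL;

public class CacheKeyExample {

  // Hypothetical sketch of a key built from protocol, host, and port,
  // mirroring the "unique combination of host, protocol, and port"
  // described for the rules cache. The real getCacheKey may differ.
  static String composeCacheKey(URL url) {
    String protocol = url.getProtocol().toLowerCase();
    String host = url.getHost().toLowerCase();
    int port = (url.getPort() == -1) ? url.getDefaultPort() : url.getPort();
    return protocol + ":" + host + ":" + port;
  }

  public static void main(String[] args) throws Exception {
    // Prints "https:example.org:443" for the default HTTPS port.
    System.out.println(composeCacheKey(new URL("https://example.org/a/b.html")));
  }
}
```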
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol http, java.net.URL url)

Get the rules from robots.txt which apply for the given url.

Robot rules are cached for a unique combination of host, protocol, and port. If no rules are found in the cache, an HTTP request is sent to fetch `protocol://host:port/robots.txt`. The robots.txt is then parsed and the rules are cached to avoid re-fetching and re-parsing.

Overrides: getRobotRulesSet in class RobotRulesParser

Parameters:
http - The Protocol object
url - URL robots.txt applies to

Returns: BaseRobotRules holding the rules from robots.txt
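A minimal usage sketch, assuming a populated Configuration and an HTTP Protocol instance are already available (for example via Nutch's ProtocolFactory); the variable names and the surrounding class are illustrative, not part of the documented API.

```java
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;

public class RobotRulesCheck {

  /**
   * Returns true if robots.txt permits fetching pageUrl. The first call for
   * a given protocol://host:port combination fetches and parses robots.txt;
   * later calls for the same combination are served from the cache.
   */
  public static boolean isFetchAllowed(Configuration conf, Protocol httpProtocol,
      String pageUrl) throws Exception {
    HttpRobotRulesParser robotsParser = new HttpRobotRulesParser(conf);
    BaseRobotRules rules = robotsParser.getRobotRulesSet(httpProtocol, new URL(pageUrl));
    return rules.isAllowed(pageUrl);
  }
}
```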