public abstract class AbstractFetchSchedule extends Configured implements FetchSchedule
This class provides common methods for implementations of FetchSchedule.

| Modifier and Type | Field and Description |
|---|---|
| protected int | defaultInterval |
| protected int | maxInterval |

Fields inherited from interface FetchSchedule: SECONDS_PER_DAY, STATUS_MODIFIED, STATUS_NOTMODIFIED, STATUS_UNKNOWN

| Constructor and Description |
|---|
| AbstractFetchSchedule() |
| AbstractFetchSchedule(Configuration conf) |

| Modifier and Type | Method | Description |
|---|---|---|
| long | calculateLastFetchTime(WebPage page) | Returns the last fetch time of the page. |
| void | forceRefetch(java.lang.String url, WebPage page, boolean asap) | Resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and the page signature, so that the page is forced to be refetched. |
| java.util.Set<WebPage.Field> | getFields() | |
| void | initializeSchedule(java.lang.String url, WebPage page) | Initializes fetch schedule related data. |
| void | setConf(Configuration conf) | |
| void | setFetchSchedule(java.lang.String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) | Sets the fetchInterval and fetchTime on a successfully fetched page. |
| void | setPageGoneSchedule(java.lang.String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime) | Specifies how to schedule refetching of pages marked as GONE. |
| void | setPageRetrySchedule(java.lang.String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime) | Adjusts the fetch schedule if fetching needs to be retried due to transient errors. |
| boolean | shouldFetch(java.lang.String url, WebPage page, long curTime) | Indicates whether the page is suitable for selection in the current fetchlist. |
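
The shouldFetch contract documented for this class (return false while fetchTime is still in the future, true otherwise, and pull in a fetchTime that lies more than maxInterval away) can be illustrated with a small standalone sketch. This is not the Nutch source: Page is a hypothetical stand-in for WebPage, the units (milliseconds for times, seconds for intervals) are assumptions, and "lowers the interval" is modeled here as capping it at maxInterval.

```java
/**
 * Sketch of the documented shouldFetch contract. NOT the Nutch source:
 * Page is a hypothetical stand-in for WebPage, and the units
 * (milliseconds for times, seconds for intervals) are assumptions.
 */
final class ShouldFetchSketch {

    /** Hypothetical, minimal stand-in for the scheduling fields of a page. */
    static final class Page {
        long fetchTime;     // next scheduled fetch, epoch milliseconds (assumed unit)
        int fetchInterval;  // seconds between fetches (assumed unit)

        Page(long fetchTime, int fetchInterval) {
            this.fetchTime = fetchTime;
            this.fetchInterval = fetchInterval;
        }
    }

    /**
     * Returns false while the page is scheduled in the future, true once it is due.
     * A fetchTime further away than maxInterval is treated as "too remote":
     * the interval is capped and the page becomes eligible immediately.
     */
    static boolean shouldFetch(Page page, long curTime, int maxIntervalSeconds) {
        long maxIntervalMillis = maxIntervalSeconds * 1000L;
        if (page.fetchTime - curTime > maxIntervalMillis) {
            // Assumption: "lowers the interval" is modeled as capping it at maxInterval.
            page.fetchInterval = Math.min(page.fetchInterval, maxIntervalSeconds);
            page.fetchTime = curTime;
            return true;
        }
        return page.fetchTime <= curTime;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        int maxInterval = 90 * 24 * 3600; // 90 days, an arbitrary example value

        Page due = new Page(now - 1_000L, 3600);
        Page notYet = new Page(now + 60_000L, 3600);
        Page tooRemote = new Page(now + 365L * 24 * 3600 * 1000, 365 * 24 * 3600);

        System.out.println(shouldFetch(due, now, maxInterval));        // true
        System.out.println(shouldFetch(notYet, now, maxInterval));     // false
        System.out.println(shouldFetch(tooRemote, now, maxInterval));  // true
    }
}
```

Passing curTime in explicitly, rather than reading the clock inside the method, mirrors the documented signature and keeps the decision reproducible for a whole fetchlist generation run.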
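
Methods inherited from class Configured: getConf
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Configurable: getConf

public AbstractFetchSchedule()

public AbstractFetchSchedule(Configuration conf)

public void setConf(Configuration conf)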

Specified by: setConf in interface Configurable
Overrides: setConf in class Configured

public void initializeSchedule(java.lang.String url,
WebPage page)

Initialize fetch schedule related data. Implementations should at least set the
fetchTime and fetchInterval. The default
implementation sets the fetchTime to now, using the default
fetchInterval.

Specified by: initializeSchedule in interface FetchSchedule
Parameters:
url - URL of the page.
page - WebPage object relative to the URL.

public void setFetchSchedule(java.lang.String url,
WebPage page,
long prevFetchTime,
long prevModifiedTime,
long fetchTime,
long modifiedTime,
int state)

Sets the fetchInterval and fetchTime on a
successfully fetched page. NOTE: this implementation resets the retry
counter; extending classes should call super.setFetchSchedule() to
preserve this behavior.

Specified by: setFetchSchedule in interface FetchSchedule
Parameters:
url - URL of the page.
page - WebPage object relative to the URL.
prevFetchTime - previous value of fetch time, or -1 if not available.
prevModifiedTime - previous value of modifiedTime, or -1 if not available.
fetchTime - the time at which the page was most recently fetched.
modifiedTime - last time the content was modified. This information comes from
the protocol implementations, or is set to < 0 if not available.
state - if FetchSchedule.STATUS_MODIFIED, the content is considered to have
changed before the fetchTime; if
FetchSchedule.STATUS_NOTMODIFIED, the content is known to be
unchanged. This information may be obtained by comparing page
signatures before and after fetching. If this is set to
FetchSchedule.STATUS_UNKNOWN, it is unknown whether the page
changed; implementations are free to follow a sensible default
behavior.

public void setPageGoneSchedule(java.lang.String url,
WebPage page,
long prevFetchTime,
long prevModifiedTime,
long fetchTime)
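
This method specifies how to schedule refetching of pages marked as GONE. [...] maxInterval.

Specified by: setPageGoneSchedule in interface FetchSchedule
Parameters:
url - URL of the page.
page - WebPage object relative to the URL.

public void setPageRetrySchedule(java.lang.String url,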
WebPage page,
long prevFetchTime,
long prevModifiedTime,
long fetchTime)

Adjusts the fetch schedule if fetching needs to be retried due to transient errors.

Specified by: setPageRetrySchedule in interface FetchSchedule
Parameters:
url - URL of the page.
page - WebPage object relative to the URL.
prevFetchTime - previous fetch time.
prevModifiedTime - previous modified time.
fetchTime - current fetch time.

public long calculateLastFetchTime(WebPage page)

Returns the last fetch time of the page.

Specified by: calculateLastFetchTime in interface FetchSchedule

public boolean shouldFetch(java.lang.String url,
WebPage page,
long curTime)

Indicates whether the page is suitable for selection in the current fetchlist.
The default implementation checks fetchTime: if it is higher than the current time
it returns false, and true otherwise. It also checks that
fetchTime is not too remote (more than maxInterval),
in which case it lowers the interval and returns true.

Specified by: shouldFetch in interface FetchSchedule
Parameters:
url - URL of the page.
page - WebPage object relative to the URL.
curTime - reference time (usually set to the time when the fetchlist
generation process was started).

public void forceRefetch(java.lang.String url,
WebPage page,
boolean asap)

Resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and the page
signature, so that the page is forced to be refetched.

Specified by: forceRefetch in interface FetchSchedule
Parameters:
url - URL of the page.
page - WebPage object relative to the URL.
asap - if true, force refetch as soon as possible; this sets the
fetchTime to now. If false, force refetch whenever the next fetch
time is set.

public java.util.Set<WebPage.Field> getFields()

Specified by: getFields in interface FetchSchedule

Copyright © 2019 The Apache Software Foundation