public class RegexURLNormalizer extends Configured implements URLNormalizer
This class uses the urlnormalizer.regex.file property, which should be set to the name of an XML file containing the patterns and substitutions to apply to encountered URLs.
This class also supports different rules depending on the scope. Please see
the javadoc in URLNormalizers for more details.
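
As a rough illustration of the kind of file urlnormalizer.regex.file points to, the sketch below parses a small inline XML document into pattern/substitution pairs and prints them. The element names (regex-normalize, regex, pattern, substitution), the two sample rules, and the RuleFileSketch class are assumptions made for this example and are not taken from this page, so check them against the rule file shipped with your Nutch version. The code uses only the JDK's DOM parser and does not call RegexURLNormalizer itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical stand-in for the rule-loading step; not the Nutch implementation.
class RuleFileSketch {

  /** One pattern/substitution pair (hypothetical type for this sketch). */
  static final class Rule {
    final Pattern pattern;
    final String substitution;

    Rule(Pattern pattern, String substitution) {
      this.pattern = pattern;
      this.substitution = substitution;
    }
  }

  // ASSUMPTION: the rule file wraps <regex> entries, each holding one
  // <pattern> and one <substitution>. Both rules below are made up.
  static final String SAMPLE_XML =
      "<regex-normalize>"
          + "<regex><pattern>;jsessionid=[A-Za-z0-9]+</pattern>"
          + "<substitution></substitution></regex>"
          + "<regex><pattern>/index\\.html$</pattern>"
          + "<substitution>/</substitution></regex>"
          + "</regex-normalize>";

  /** Parses the XML text into compiled rules, in document order. */
  static List<Rule> parseRules(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      NodeList entries = doc.getElementsByTagName("regex");
      List<Rule> rules = new ArrayList<>();
      for (int i = 0; i < entries.getLength(); i++) {
        Element entry = (Element) entries.item(i);
        String pattern =
            entry.getElementsByTagName("pattern").item(0).getTextContent();
        String substitution =
            entry.getElementsByTagName("substitution").item(0).getTextContent();
        // Pattern.compile throws PatternSyntaxException for a bad regex.
        rules.add(new Rule(Pattern.compile(pattern), substitution));
      }
      return rules;
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("Could not read rule file", e);
    }
  }

  public static void main(String[] args) {
    // Lists each rule, similar in spirit to what main(String[]) is described as doing.
    for (Rule rule : parseRules(SAMPLE_XML)) {
      System.out.println(rule.pattern.pattern() + " -> '" + rule.substitution + "'");
    }
  }
}
```

An empty substitution element yields an empty string, so the first sample rule simply deletes whatever its pattern matches.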
| Constructor and Description |
|---|
| RegexURLNormalizer()<br>The default constructor which is called from UrlNormalizerFactory (normalizerClass.newInstance()) in method: getNormalizer() |
| RegexURLNormalizer(Configuration conf) |
| RegexURLNormalizer(Configuration conf, java.lang.String filename)<br>Constructor which can be passed the file name, so it doesn't look in the configuration files for it. |
| Modifier and Type | Method and Description |
|---|---|
| java.util.HashMap<java.lang.String,java.util.List<org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.Rule>> | getScopedRules() |
| static void | main(java.lang.String[] args)<br>Spits out patterns and substitutions that are in the configuration file. |
| java.lang.String | normalize(java.lang.String urlString, java.lang.String scope) |
| java.lang.String | regexNormalize(java.lang.String urlString, java.lang.String scope)<br>This function does the replacements by iterating through all the regex patterns. |
| void | setConf(Configuration conf) |
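
The description of regexNormalize (iterate through the regex patterns and apply each replacement) and the map shape returned by getScopedRules can be sketched as below. This is a minimal illustration, not the Nutch source: the ScopedRulesSketch class, its Rule type, the addRule helper, the scope names, and the fallback to a "default" rule list for unknown scopes are all assumptions of the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of scope-keyed regex rules; not the Nutch implementation.
class ScopedRulesSketch {

  /** One compiled pattern with its replacement text (hypothetical type). */
  static final class Rule {
    final Pattern pattern;
    final String substitution;

    Rule(String regex, String substitution) {
      this.pattern = Pattern.compile(regex);
      this.substitution = substitution;
    }
  }

  // ASSUMPTION: unknown scopes fall back to the rules stored under this key.
  static final String DEFAULT_SCOPE = "default";

  // Scope name -> ordered rule list, mirroring the HashMap<String, List<Rule>> shape.
  private final Map<String, List<Rule>> scopedRules = new HashMap<>();

  /** Hypothetical helper that registers a rule under a scope. */
  void addRule(String scope, String regex, String substitution) {
    scopedRules.computeIfAbsent(scope, k -> new ArrayList<>())
        .add(new Rule(regex, substitution));
  }

  /** Applies every rule of the chosen scope, in order, to the URL string. */
  String regexNormalize(String urlString, String scope) {
    List<Rule> rules = scopedRules.get(scope);
    if (rules == null) {
      rules = scopedRules.getOrDefault(DEFAULT_SCOPE, Collections.emptyList());
    }
    for (Rule rule : rules) {
      // Each rule sees the output of the previous one.
      urlString = rule.pattern.matcher(urlString).replaceAll(rule.substitution);
    }
    return urlString;
  }

  public static void main(String[] args) {
    ScopedRulesSketch sketch = new ScopedRulesSketch();
    sketch.addRule("default", ";jsessionid=[A-Za-z0-9]+", "");
    sketch.addRule("outlink", "#.*$", "");

    String url = "http://example.com/a;jsessionid=ABC123#top";
    // "fetcher" has no rules of its own here, so the default list is used.
    System.out.println(sketch.regexNormalize(url, "fetcher")); // http://example.com/a#top
    // "outlink" has its own list, so only the fragment rule runs.
    System.out.println(sketch.regexNormalize(url, "outlink")); // http://example.com/a;jsessionid=ABC123
  }
}
```

Because rules run in order and each one operates on the previous result, the order of entries in a rule list can change the final URL.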
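Methods inherited from class Configured: getConf
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Configurable: getConf
public RegexURLNormalizer()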
public RegexURLNormalizer(Configuration conf)
public RegexURLNormalizer(Configuration conf, java.lang.String filename) throws java.io.IOException, java.util.regex.PatternSyntaxException
Throws: java.io.IOException, java.util.regex.PatternSyntaxException
public java.util.HashMap<java.lang.String,java.util.List<org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.Rule>> getScopedRules()
public void setConf(Configuration conf)
Specified by: setConf in interface Configurable
Overrides: setConf in class Configured
public java.lang.String regexNormalize(java.lang.String urlString, java.lang.String scope)
public java.lang.String normalize(java.lang.String urlString, java.lang.String scope) throws java.net.MalformedURLException
Specified by: normalize in interface URLNormalizer
Throws: java.net.MalformedURLException
public static void main(java.lang.String[] args) throws java.util.regex.PatternSyntaxException, java.io.IOException
Throws: java.util.regex.PatternSyntaxException, java.io.IOException

Copyright © 2019 The Apache Software Foundation