public class DOMContentUtils
extends java.lang.Object
| Modifier and Type | Class and Description |
|---|---|
| static class | DOMContentUtils.LinkParams |
| Constructor and Description |
|---|
| DOMContentUtils(Configuration conf) |
| Modifier and Type | Method and Description |
|---|---|
| java.net.URL | getBase(org.w3c.dom.Node node): If the Node contains a BASE tag, its HREF is returned. |
| void | getOutlinks(java.net.URL base, java.util.ArrayList<Outlink> outlinks, org.w3c.dom.Node node): This method finds all anchors below the supplied DOM node, creates appropriate Outlink records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList. |
| void | getText(java.lang.StringBuilder sb, org.w3c.dom.Node node): This is a convenience method, equivalent to getText(StringBuilder, Node, boolean) with false passed as the third argument. |
| boolean | getText(java.lang.StringBuilder sb, org.w3c.dom.Node node, boolean abortOnNestedAnchors): This method takes a StringBuilder and a DOM Node, and will append all the content text found beneath the DOM node to the StringBuilder. |
| boolean | getTitle(java.lang.StringBuilder sb, org.w3c.dom.Node node): This method takes a StringBuilder and a DOM Node, and will append the content text found beneath the first title node to the StringBuilder. |
| void | setConf(Configuration conf) |
public DOMContentUtils(Configuration conf)
public void setConf(Configuration conf)
public boolean getText(java.lang.StringBuilder sb,
org.w3c.dom.Node node,
boolean abortOnNestedAnchors)
This method takes a StringBuilder and a DOM Node, and will
append all the content text found beneath the DOM node to the
StringBuilder.
If abortOnNestedAnchors is true, DOM traversal will be aborted
and the StringBuilder will not contain any text encountered
after a nested anchor is found.
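The traversal described above can be sketched with the JDK's own DOM classes. This is a minimal illustration, not the DOMContentUtils implementation: the class name TextCollectorSketch, the helper parse, the whitespace handling, and the meaning of the boolean return value (true when traversal was aborted) are all assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only; hypothetical class, not part of Nutch.
class TextCollectorSketch {

  // Parses well-formed XHTML with the JDK XML parser so the example is self-contained.
  static Document parse(String xhtml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
    } catch (Exception e) {
      throw new IllegalStateException("could not parse sample markup", e);
    }
  }

  // Appends all text beneath node to sb.
  // Assumption: returns true if traversal stopped at a nested anchor.
  static boolean collectText(StringBuilder sb, Node node, boolean abortOnNestedAnchors) {
    return walk(sb, node, abortOnNestedAnchors, 0);
  }

  private static boolean walk(StringBuilder sb, Node node, boolean abort, int anchorDepth) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      String text = node.getNodeValue().trim();
      if (!text.isEmpty()) {
        if (sb.length() > 0) {
          sb.append(' '); // separate text runs with a single space
        }
        sb.append(text);
      }
      return false;
    }
    if (node.getNodeType() == Node.ELEMENT_NODE && "a".equalsIgnoreCase(node.getNodeName())) {
      anchorDepth++;
      if (abort && anchorDepth > 1) {
        return true; // an anchor inside an anchor: stop before reading its text
      }
    }
    for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
      if (walk(sb, child, abort, anchorDepth)) {
        return true; // propagate the abort up the recursion
      }
    }
    return false;
  }
}
```

With abortOnNestedAnchors set to true, the sketch leaves in the StringBuilder only the text that was seen before the nested anchor, which matches the behaviour described for getText.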
public void getText(java.lang.StringBuilder sb,
org.w3c.dom.Node node)
This is a convenience method, equivalent to
getText(StringBuilder, Node, boolean) with false passed as the third argument.
public boolean getTitle(java.lang.StringBuilder sb,
org.w3c.dom.Node node)
This method takes a StringBuilder and a DOM Node, and will
append the content text found beneath the first title node to
the StringBuilder.
public java.net.URL getBase(org.w3c.dom.Node node)
If the Node contains a BASE tag, its HREF is returned.
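As a rough illustration of what title and BASE lookups over a DOM tree involve, the following sketch uses only JDK classes. It is not the Nutch code: HeadLookupSketch, findFirst, appendTitle, and findBase are hypothetical names, and returning true when a title is found and null when no usable BASE href exists are assumptions made for the example.

```java
import java.io.StringReader;
import java.net.URI;
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only; hypothetical class, not part of Nutch.
class HeadLookupSketch {

  // Parses well-formed XHTML with the JDK XML parser so the example is self-contained.
  static Document parse(String xhtml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
    } catch (Exception e) {
      throw new IllegalStateException("could not parse sample markup", e);
    }
  }

  // Depth-first search for the first element with the given name (case-insensitive).
  static Element findFirst(Node node, String name) {
    if (node.getNodeType() == Node.ELEMENT_NODE && name.equalsIgnoreCase(node.getNodeName())) {
      return (Element) node;
    }
    for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
      Element hit = findFirst(child, name);
      if (hit != null) {
        return hit;
      }
    }
    return null;
  }

  // Appends the text of the first title element to sb.
  // Assumption: returns true only if a title element was found.
  static boolean appendTitle(StringBuilder sb, Node root) {
    Element title = findFirst(root, "title");
    if (title == null) {
      return false;
    }
    sb.append(title.getTextContent().trim());
    return true;
  }

  // Returns the href of the first base element as a URL.
  // Assumption: null when the element or a valid absolute href is missing.
  static URL findBase(Node root) {
    Element base = findFirst(root, "base");
    if (base == null || !base.hasAttribute("href")) {
      return null;
    }
    try {
      return new URI(base.getAttribute("href")).toURL();
    } catch (Exception e) {
      return null; // malformed or relative href
    }
  }
}
```

A caller would typically try findBase first and fall back to the URL the page was fetched from when it returns null.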
public void getOutlinks(java.net.URL base,
java.util.ArrayList<Outlink> outlinks,
org.w3c.dom.Node node)
This method finds all anchors below the supplied DOM node,
creates appropriate Outlink records for each (relative to the
supplied base URL), and adds them to the outlinks
ArrayList.
Links without inner structure (tags, text, etc.) are discarded, as are links which contain only single nested links and empty text nodes (this is a common DOM-fixup artifact, at least with nekohtml).
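The discard rule is easier to follow in code. The sketch below is a simplified reading of the description above, written against the JDK DOM API only. It is not the Nutch implementation: OutlinkSketch, the Link class (a hypothetical stand-in for Outlink), and the helpers parse, url, and shouldDiscard are assumptions, and only a elements with an href attribute are considered.

```java
import java.io.StringReader;
import java.net.URI;
import java.net.URL;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only; hypothetical class, not part of Nutch.
class OutlinkSketch {

  // Hypothetical stand-in for Nutch's Outlink: target URL plus anchor text.
  static final class Link {
    final String url;
    final String anchor;
    Link(String url, String anchor) { this.url = url; this.anchor = anchor; }
  }

  static Document parse(String xhtml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
    } catch (Exception e) {
      throw new IllegalStateException("could not parse sample markup", e);
    }
  }

  // Small helper so callers do not have to handle checked exceptions.
  static URL url(String spec) {
    try {
      return new URI(spec).toURL();
    } catch (Exception e) {
      throw new IllegalArgumentException(spec, e);
    }
  }

  static boolean isAnchor(Node n) {
    return n.getNodeType() == Node.ELEMENT_NODE && "a".equalsIgnoreCase(n.getNodeName());
  }

  // True for anchors with no children, or with exactly one nested anchor
  // and otherwise only whitespace text.
  static boolean shouldDiscard(Node anchor) {
    if (!anchor.hasChildNodes()) {
      return true;
    }
    int nestedAnchors = 0;
    for (Node c = anchor.getFirstChild(); c != null; c = c.getNextSibling()) {
      if (isAnchor(c)) {
        nestedAnchors++;
      } else if (c.getNodeType() != Node.TEXT_NODE || !c.getNodeValue().trim().isEmpty()) {
        return false; // real content: keep the link
      }
    }
    return nestedAnchors == 1;
  }

  // Walks the tree and records every kept anchor, resolved against base.
  static void collectLinks(URL base, List<Link> out, Node node) {
    if (isAnchor(node) && ((Element) node).hasAttribute("href") && !shouldDiscard(node)) {
      try {
        URL target = base.toURI().resolve(((Element) node).getAttribute("href")).toURL();
        out.add(new Link(target.toString(), node.getTextContent().trim()));
      } catch (Exception e) {
        // unresolvable href: skip this anchor
      }
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      collectLinks(base, out, c);
    }
  }
}
```

Note that a discarded wrapper anchor is still descended into, so the nested link it contains is collected on its own.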
Copyright © 2019 The Apache Software Foundation