public class DOMContentUtils
extends java.lang.Object
| Constructor and Description |
|---|
| DOMContentUtils(Configuration conf) |
| Modifier and Type | Method and Description |
|---|---|
| void | getOutlinks(java.net.URL base, java.util.ArrayList<Outlink> outlinks, org.w3c.dom.Node node) This method finds all anchors below the supplied DOM node, creates appropriate Outlink records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList. |
| void | getText(java.lang.StringBuffer sb, org.w3c.dom.Node node) This is a convenience method, equivalent to getText(sb, node, false). |
| boolean | getTitle(java.lang.StringBuffer sb, org.w3c.dom.Node node) This method takes a StringBuffer and a DOM Node, and will append the content text found beneath the first title node to the StringBuffer. |
| void | setConf(Configuration conf) |
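
The getTitle contract summarized above (append the text beneath the first title node to a StringBuffer) can be illustrated with a minimal sketch built only on the JDK's DOM classes. This is not the Nutch implementation: the class `TitleSketch` and its methods `parse`, `appendTitle`, and `appendText` are hypothetical names, the sample markup is parsed with the JDK XML parser rather than an HTML parser, and the boolean result is assumed here to signal whether a title node was found.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Nutch.
class TitleSketch {

    // Parses well-formed markup into a DOM tree using the JDK XML parser.
    public static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    // Appends the text beneath the first title element (depth-first order).
    // Assumption: returns true when a title element was found.
    public static boolean appendTitle(StringBuffer sb, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "title".equalsIgnoreCase(node.getNodeName())) {
            appendText(sb, node);
            return true;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (appendTitle(sb, children.item(i))) {
                return true;
            }
        }
        return false;
    }

    // Appends every text node found beneath the given node, in document order.
    public static void appendText(StringBuffer sb, Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            appendText(sb, children.item(i));
        }
    }

    public static void main(String[] args) {
        Node doc = parse("<html><head><title>Example page</title></head>"
                + "<body><p>Body text</p></body></html>");
        StringBuffer sb = new StringBuffer();
        boolean found = appendTitle(sb, doc);
        System.out.println(found + ": " + sb);
    }
}
```

The caller owns the StringBuffer, so the same buffer can be reused across several calls; the sketch only appends and never clears it.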
public DOMContentUtils(Configuration conf)
public void setConf(Configuration conf)
public void getText(java.lang.StringBuffer sb,
org.w3c.dom.Node node)
This is a convenience method, equivalent to getText(sb, node, false).
public boolean getTitle(java.lang.StringBuffer sb,
org.w3c.dom.Node node)
This method takes a StringBuffer and a DOM Node, and will append the content text found beneath the first title node to the StringBuffer.
public void getOutlinks(java.net.URL base,
java.util.ArrayList<Outlink> outlinks,
org.w3c.dom.Node node)
This method finds all anchors below the supplied DOM node, creates appropriate Outlink records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.
Links without inner structure (tags, text, etc.) are discarded, as are links that contain only a single nested link and empty text nodes (this is a common DOM-fixup artifact, at least with nekohtml).
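
The link-collection and discard rules described for getOutlinks can be sketched with JDK classes alone. This is an illustration under stated assumptions, not the Nutch code: `OutlinkSketch`, `parse`, `isAnchor`, `shouldDiscard`, and `collectLinks` are hypothetical names, resolved URLs are stored as plain strings instead of Outlink records, a `java.net.URI` base stands in for the `java.net.URL` parameter, and the discard check is a simplified reading of the rule above (it tolerates any number of blank text nodes around a single nested anchor). The `example.com` addresses are placeholders.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Nutch.
class OutlinkSketch {

    // Parses well-formed markup into a DOM tree using the JDK XML parser.
    public static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    static boolean isAnchor(Node n) {
        return n.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(n.getNodeName());
    }

    // True when the anchor has no children at all, or when its only
    // children are one nested anchor plus blank text nodes.
    static boolean shouldDiscard(Node anchor) {
        NodeList children = anchor.getChildNodes();
        if (children.getLength() == 0) {
            return true; // no inner structure
        }
        int nestedAnchors = 0;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (isAnchor(child)) {
                nestedAnchors++;
            } else if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                continue; // blank text does not count as content
            } else {
                return false; // real content: keep the link
            }
        }
        return nestedAnchors == 1;
    }

    // Walks the subtree, resolving each kept href against the base.
    public static void collectLinks(URI base, List<String> out, Node node) {
        if (isAnchor(node)) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty() && !shouldDiscard(node)) {
                out.add(base.resolve(href).toString());
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectLinks(base, out, children.item(i));
        }
    }

    public static void main(String[] args) {
        Node doc = parse("<body><a href=\"page.html\">Next page</a>"
                + "<a href=\"empty.html\"></a>"
                + "<a href=\"/outer\"><a href=\"/inner\">Inner</a></a></body>");
        List<String> links = new ArrayList<>();
        collectLinks(URI.create("http://example.com/docs/index.html"), links, doc);
        System.out.println(links);
    }
}
```

In the sample markup, the empty anchor and the outer wrapper anchor are dropped, while the nested anchor is still visited on its own and kept because it carries text.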
Copyright © 2019 The Apache Software Foundation