lxml.html.clean module
A cleanup tool for HTML.
Removes unwanted tags and content. See the Cleaner class for details.
- class lxml.html.clean.Cleaner(**kw)
- Bases: object
- Instances clean the document of each of the possible offending elements. The cleaning is controlled by attributes; you can override attributes in a subclass, or set them in the constructor.
- scripts:
- Removes any <script> tags.
- javascript:
- Removes any Javascript, like an onclick attribute. Also removes stylesheets as they could contain Javascript.
- comments:
- Removes any comments. 
- style:
- Removes any style tags. 
- inline_style:
- Removes any style attributes. Defaults to the value of the style option.
- links:
- Removes any <link> tags.
- meta:
- Removes any <meta> tags.
- page_structure:
- Structural parts of a page: <head>, <html>, <title>.
- processing_instructions:
- Removes any processing instructions. 
- embedded:
- Removes any embedded objects (flash, iframes) 
- frames:
- Removes any frame-related tags 
- forms:
- Removes any form tags 
- annoying_tags:
- Tags that aren’t wrong, but are annoying: <blink> and <marquee>.
- remove_tags:
- A list of tags to remove. Only the tags will be removed, their content will get pulled up into the parent tag. 
- kill_tags:
- A list of tags to kill. Killing also removes the tag’s content, i.e. the whole subtree, not just the tag itself. 
- allow_tags:
- A list of tags to include (default include all). 
- remove_unknown_tags:
- Remove any tags that aren’t standard parts of HTML. 
- safe_attrs_only:
- If true, only include ‘safe’ attributes (specifically the list from the feedparser HTML sanitisation web site). 
- safe_attrs:
- A set of attribute names to override the default list of attributes considered ‘safe’ (when safe_attrs_only=True). 
- add_nofollow:
- If true, then any <a> tags will have rel="nofollow" added to them.
- host_whitelist:
- A list or set of hosts that you can use for embedded content (for content like <object>, <link rel="stylesheet">, etc). You can also implement/override the method allow_embedded_url(el, url) or allow_element(el) to implement more complex rules for what can be embedded. Anything that passes this test will be shown, regardless of the value of (for instance) embedded.
- Note that this parameter might not work as intended if you do not make the links absolute before doing the cleaning.
- Note that you may also need to set whitelist_tags.
- whitelist_tags:
- A set of tags that can be included with host_whitelist. The default is iframe and embed; you may wish to include other tags like script, or you may want to implement allow_embedded_url for more control. Set to None to include all tags.
 - This modifies the document in place.
 - _has_sneaky_javascript(style)
- Depending on the browser, stuff like e x p r e s s i o n(...) can get interpreted, or expre/* stuff */ssion(...). This checks for attempts to do stuff like this.
- Typically the response will be to kill the entire style; if you have just a bit of Javascript in the style another rule will catch that and remove only the Javascript from the style; this catches more sneaky attempts.
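To illustrate the kind of obfuscation described above, here is a minimal sketch that calls the helper directly on a few style strings. The sample strings are invented for illustration, and because `_has_sneaky_javascript` is a private method, its exact behaviour may differ between releases.

```python
from lxml.html.clean import Cleaner

cleaner = Cleaner()

# Hypothetical style values: two obfuscated CSS expressions and one harmless rule.
spaced = "width: e x p r e s s i o n(alert(1))"
commented = "width: expre/* stuff */ssion(alert(1))"
plain = "color: red"

for style in (spaced, commented, plain):
    # Private helper: truthy when the style looks like it hides Javascript.
    print(repr(style), cleaner._has_sneaky_javascript(style))
```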
 - _kill_elements(doc, condition, iterate=None)
 - _remove_javascript_link(link)
 - _substitute_comments(string, count=0)
- Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. 
 - allow_element(el)
- Decide whether an element is configured to be accepted or rejected.
- Parameters:
- el – an element. 
- Returns:
- true to accept the element or false to reject/discard it. 
 
 - allow_embedded_url(el, url)
- Decide whether a URL that was found in an element’s attributes or text is configured to be accepted or rejected.
- Parameters:
- el – an element. 
- url – a URL found on the element. 
 
- Returns:
- true to accept the URL and false to reject it. 
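As a sketch of how `host_whitelist` and an overridden `allow_embedded_url` might work together, the example below keeps embedded content from one whitelisted host, and only when it is served over https. The subclass name, the host, and the sample markup are assumptions made for illustration; they are not part of lxml.

```python
from urllib.parse import urlsplit

from lxml.html.clean import Cleaner


class HttpsOnlyCleaner(Cleaner):
    """Hypothetical subclass: whitelisted hosts are accepted only over https."""

    def allow_embedded_url(self, el, url):
        # Apply the stock host_whitelist / whitelist_tags check first ...
        if not super().allow_embedded_url(el, url):
            return False
        # ... then add a stricter rule of our own.
        return urlsplit(url).scheme == "https"


cleaner = HttpsOnlyCleaner(host_whitelist=["www.youtube.com"])

# Absolute URLs are used because host matching relies on the URL's host part.
dirty = (
    "<div>"
    '<iframe src="https://www.youtube.com/embed/abc"></iframe>'
    '<iframe src="http://www.youtube.com/embed/def"></iframe>'
    '<iframe src="https://evil.test/x"></iframe>'
    "</div>"
)
cleaned = cleaner.clean_html(dirty)
print(cleaned)
```

Calling `super()` first keeps the built-in whitelist logic in charge, so the override can only tighten the rules, never loosen them.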
 
 - allow_follow(anchor)
- Override to suppress rel="nofollow" on some anchors.
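A minimal sketch of such an override, assuming you want to exempt links to a small set of trusted hosts. `TRUSTED_HOSTS`, the subclass name, and the sample markup are hypothetical.

```python
from urllib.parse import urlsplit

from lxml.html.clean import Cleaner

# Hypothetical set of hosts whose links should stay followable.
TRUSTED_HOSTS = {"docs.python.org"}


class TrustingCleaner(Cleaner):
    def allow_follow(self, anchor):
        # Returning True suppresses rel="nofollow" for this anchor.
        return urlsplit(anchor.get("href", "")).netloc in TRUSTED_HOSTS


cleaner = TrustingCleaner(add_nofollow=True)

dirty = (
    '<p><a href="https://docs.python.org/3/">docs</a> '
    '<a href="https://untrusted.test/">other</a></p>'
)
cleaned = cleaner.clean_html(dirty)
print(cleaned)
```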
 - clean_html(html)
 - kill_conditional_comments(doc)
- IE conditional comments basically embed HTML that the parser doesn’t normally see. We can’t allow anything like that, so we’ll kill any comments that could be conditional. 
 - _tag_link_attrs = {'a': 'href', 'applet': ['code', 'object'], 'embed': 'src', 'iframe': 'src', 'layer': 'src', 'link': 'href', 'script': 'src'}
 - add_nofollow = False
 - allow_tags = None
 - annoying_tags = True
 - comments = True
 - embedded = True
 - forms = True
 - frames = True
 - host_whitelist = ()
 - inline_style = None
 - javascript = True
 - kill_tags = None
 - links = True
 - meta = True
 - page_structure = True
 - processing_instructions = True
 - remove_tags = None
 - remove_unknown_tags = True
 - safe_attrs = frozenset({'abbr', 'accept', 'accept-charset', 'accesskey', 'action', 'align', 'alt', 'axis', 'border', 'cellpadding', 'cellspacing', 'char', 'charoff', 'charset', 'checked', 'cite', 'class', 'clear', 'color', 'cols', 'colspan', 'compact', 'coords', 'datetime', 'dir', 'disabled', 'enctype', 'for', 'frame', 'headers', 'height', 'href', 'hreflang', 'hspace', 'id', 'ismap', 'label', 'lang', 'longdesc', 'maxlength', 'media', 'method', 'multiple', 'name', 'nohref', 'noshade', 'nowrap', 'prompt', 'readonly', 'rel', 'rev', 'rows', 'rowspan', 'rules', 'scope', 'selected', 'shape', 'size', 'span', 'src', 'start', 'summary', 'tabindex', 'target', 'title', 'type', 'usemap', 'valign', 'value', 'vspace', 'width'})
 - safe_attrs_only = True
 - scripts = True
 - style = False
 - whitelist_tags = {'embed', 'iframe'}
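The options above can be combined in the constructor. The following sketch, using made-up markup, contrasts `remove_tags` (the tag goes, its content stays) with `kill_tags` (the whole subtree goes), and shows both the string-based `clean_html` call and calling the cleaner on a parsed tree, which is assumed here to clean that tree in place.

```python
import lxml.html
from lxml.html.clean import Cleaner

cleaner = Cleaner(
    remove_tags=["b"],   # unwrap: drop the tag, keep its content
    kill_tags=["span"],  # kill: drop the tag and everything inside it
)

dirty = (
    "<div><b>bold</b> <span>secret</span> text"
    "<script>alert(1)</script></div>"
)

# String in, cleaned string out.
cleaned = cleaner.clean_html(dirty)
print(cleaned)

# Calling the cleaner on a parsed tree modifies that tree directly.
doc = lxml.html.fromstring(dirty)
cleaner(doc)
in_place = lxml.html.tostring(doc, encoding="unicode")
print(in_place)
```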
 
- lxml.html.clean._break_text(text, max_width, break_character)
- lxml.html.clean._find_image_dataurls(string, pos=0, endpos=9223372036854775807)
- Return a list of all non-overlapping matches of pattern in string. 
- lxml.html.clean._has_javascript_scheme(s)
- lxml.html.clean._insert_break(word, width, break_character)
- lxml.html.clean._is_unsafe_image_type(string, pos=0, endpos=9223372036854775807)
- Scan through string looking for a match, and return a corresponding match object instance. - Return None if no position in the string matches. 
- lxml.html.clean._link_text(text, link_regexes, avoid_hosts, factory)
- lxml.html.clean._looks_like_tag_content(string, pos=0, endpos=9223372036854775807)
- Scan through string looking for a match, and return a corresponding match object instance. - Return None if no position in the string matches. 
- lxml.html.clean._possibly_malicious_schemes(string, pos=0, endpos=9223372036854775807)
- Return a list of all non-overlapping matches of pattern in string. 
- lxml.html.clean._replace_css_import(repl, string, count=0)
- Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. 
- lxml.html.clean._replace_css_javascript(repl, string, count=0)
- Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. 
- lxml.html.clean._substitute_whitespace(repl, string, count=0)
- Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. 
- lxml.html.clean.autolink(el, link_regexes=[re.compile('(?P<body>https?://(?P<host>[a-z0-9._-]+)(?:/[/\\-_.,a-z0-9%&?;=~]*)?(?:\\([/\\-_.,a-z0-9%&?;=~]*\\))?)', re.IGNORECASE), re.compile('mailto:(?P<body>[a-z0-9._-]+@(?P<host>[a-z0-9_.-]+[a-z]))', re.IGNORECASE)], avoid_elements=['textarea', 'pre', 'code', 'head', 'select', 'a'], avoid_hosts=[re.compile('^localhost', re.IGNORECASE), re.compile('\\bexample\\.(?:com|org|net)$', re.IGNORECASE), re.compile('^127\\.0\\.0\\.1$')], avoid_classes=['nolink'])
- Turn any URLs into links.
- It will search for links identified by the given regular expressions (by default mailto and http(s) links).
- It won’t link text in an element in avoid_elements, or an element with a class in avoid_classes. It won’t link to anything with a host that matches one of the regular expressions in avoid_hosts (default localhost, 127.0.0.1, and the example.com, example.org, and example.net domains).
- If you pass in an element, the element’s tail will not be substituted, only the contents of the element.
- lxml.html.clean.autolink_html(html, *args, **kw)
- Turn any URLs into links.
- It will search for links identified by the given regular expressions (by default mailto and http(s) links).
- It won’t link text in an element in avoid_elements, or an element with a class in avoid_classes. It won’t link to anything with a host that matches one of the regular expressions in avoid_hosts (default localhost, 127.0.0.1, and the example.com, example.org, and example.net domains).
- If you pass in an element, the element’s tail will not be substituted, only the contents of the element.
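A short sketch of the string-based variant, using invented sample text. It relies on the default `avoid_hosts` shown in the `autolink` signature, so the `example.com` URL is expected to stay as plain text while the other URL becomes a link.

```python
from lxml.html.clean import autolink_html

# Sample fragment: one ordinary URL and one on a host in the default avoid_hosts.
html = "<p>Docs at http://python.org/ and a sample at http://example.com/ here</p>"

linked = autolink_html(html)
print(linked)
```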
- lxml.html.clean.clean_html(html)
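The module-level function can be used without configuring anything. This sketch assumes it applies a `Cleaner` with the default attribute values listed above; the input markup is invented.

```python
from lxml.html.clean import clean_html

# Sample fragment with an event-handler attribute and an inline script.
dirty = '<p onclick="steal()">Hello <script>alert(1)</script>world</p>'

cleaned = clean_html(dirty)
print(cleaned)
```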
- lxml.html.clean.word_break(el, max_width=40, avoid_elements=['pre', 'textarea', 'code'], avoid_classes=['nobreak'], break_character='\u200b')
- Breaks any long words found in the body of the text (not attributes).
- Doesn’t affect any of the tags in avoid_elements, by default <pre>, <textarea>, and <code>.
- Breaks words by inserting &#8203; (U+200B), the Unicode Zero Width Space character. This generally takes up no space in rendering, but does copy as a space, and in monospace contexts usually takes up space.
- See http://www.cs.tut.fi/~jkorpela/html/nobr.html for a discussion.
- lxml.html.clean.word_break_html(html, *args, **kw)
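As a rough illustration of the string-based variant, the sketch below passes `max_width` through to `word_break` as a keyword argument and checks for the inserted zero width spaces. The input text and the width of 10 are arbitrary choices for the example.

```python
from lxml.html.clean import word_break_html

long_word = "a" * 50
html = "<p>short words and " + long_word + "</p>"

# Words longer than max_width get U+200B inserted; short words are left alone.
broken = word_break_html(html, max_width=10)

# The break character is invisible, so print a repr to see it as \u200b.
print(repr(broken))
```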