FAQ: Untrusted users and HTML

Jacob Kaplan-Moss

February 24, 2009

An input form that takes raw HTML. It’s a pretty common thing to see in web apps these days: many comment forms allow HTML, or some subset thereof; many social-network-style applications allow end-users to enter HTML in their profiles; etc. Unfortunately, allowing untrusted users to enter raw HTML is incredibly dangerous; read up on XSS if you don’t know why.

So a common question that comes up in web developer circles deals with how best to “escape” user-entered HTML so that it is safe for presentation. Though this seems easy, it’s actually incredibly difficult; see Whitelist, Don’t Blacklist for an introduction. I’ve literally seen hundreds of recipes for stripping unsafe HTML that are about as effective as a screen door on a submarine.

I’d like to answer the question once and for all:

No method of displaying untrusted HTML is 100% safe.

Really. Given the bewildering array of browsers and their bugs, as soon as you open up HTML input you’ve exposed yourself to an arms race against XSS (and related) attacks.

Put another way, the only 100% safe form of HTML protection is abstinence: if you can avoid allowing raw HTML input, do so.
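
In practice, abstinence means treating everything a user types as plain text and escaping it on output. Here is a minimal sketch using the standard library's html.escape; the render_comment name and the wrapping paragraph tag are made up for illustration and don't come from any particular framework.

```python
from html import escape


def render_comment(text):
    """Return user-supplied text as an HTML paragraph with all markup neutralized."""
    # escape() replaces &, < and > with character references; with the default
    # quote=True it also replaces " and ', so the browser displays any tags
    # the user typed instead of interpreting them.
    return "<p>" + escape(text) + "</p>"


print(render_comment('<script>alert("xss")</script>'))
# prints: <p>&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;</p>
```

Note that this only covers text dropped into element content or a quoted attribute value. Text that ends up inside a script block, a style block, or a URL needs different handling.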

One of the great features of alternative markup like Markdown, reStructuredText, bbCode, and their ilk is that they can be transformed into safe HTML. For example, python-markdown has a safe_mode argument that prevents anything dangerous from appearing in your output.

Now, I’ve always thought that abstinence-only education is a crock. I’ve always felt that consenting adults who know the risks and want to proceed anyway should be taught about the most effective forms of protection.

In this case, the most effective protection comes in the form of html5lib, specifically html5lib.filters.sanitizer. This uses a well-tested, centrally-maintained whitelist of safe HTML elements. Because of the quality of that whitelist, html5lib is the safest form of protection against malicious HTML.
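
For the curious, here is a rough sketch of running a user-supplied fragment through that sanitizer and getting a string back. It assumes the third-party html5lib package (1.x API) is installed, and the sanitize_html name is my own invention. As far as I know, recent html5lib releases emit a deprecation warning for this filter, so check the version you have before leaning on it.

```python
import html5lib
from html5lib.filters import sanitizer
from html5lib.serializer import HTMLSerializer


def sanitize_html(fragment):
    """Parse an untrusted HTML fragment, filter it through html5lib's whitelist,
    and serialize the result back to a string."""
    # Parse with the HTML5 algorithm so broken markup is repaired roughly
    # the way a browser would repair it.
    tree = html5lib.parseFragment(fragment, treebuilder="etree")
    # Walk the parsed tree as a token stream and pass it through the
    # whitelist filter; disallowed tags come out as escaped text.
    walker = html5lib.getTreeWalker("etree")
    clean_stream = sanitizer.Filter(walker(tree))
    # omit_optional_tags=False keeps closing tags such as </p> in the output.
    return HTMLSerializer(omit_optional_tags=False).render(clean_stream)


print(sanitize_html("<b>bold</b><script>alert(1)</script>"))
```

The allowed <b> element should pass through untouched, while the script tags should come out as harmless escaped text rather than live markup.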

Just remember that abstinence is the only 100% effective method of protection, and non-HTML markup is more fun, anyway.

Comments:

Dougal Matthews:

The link to python-markdown doesn't work for whatever reason if there isn't a / at the end.

David Moss:

Here is a project I came across recently that looks interesting as one potential solution to evaluate WRT problems raised in this post :-

http://devsuki.com/pottymouth/

Jeremy Dunck:

David,
Maybe I'm missing something.
Try this on the demo for PottyMouth:
document.write("&#003c;scr" + "ipt>alert('nope');&#003c;/scr" + "ipt>");

Brian Neal:

I'd like to use html5lib in python to sanitize user input from say an HTML form. Is there an example somewhere of how to do this? I can see how to parse it, but what do I do with the tree? I want to get it back into a string. Thanks!

Brian Neal:

I think I figured something out, see my link. Comments welcome. Thanks for the blog post.

Nate Abele:

In the PHP world we have HTML Purifier. And to answer the obvious question, yes, it is 100% safe (unless of course you configure it retardedly).

Of course, it goes through a painstaking tokenization process to verify that every single character of the input won't do anything evil.
