Strip HTML tags from text with JSoup while preserving line breaks and paragraph breaks.
Optionally, you can also unescape HTML entities with StringEscapeUtils, remove all URLs with a regex, and remove unwanted characters.
My thought process is described here: http://cindyxiaoxiaoli.wordpress.com/2014/02/05/html-to-plain-text-with-java/
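
The optional URL and character cleanup steps could look roughly like the sketch below. It uses only the JDK and is not the original code: the class name PlainTextCleanup, the method names, the URL pattern, and the definition of "unwanted characters" (ASCII control characters other than newline and tab) are all assumptions made for illustration. The URL pattern is deliberately simple and will also swallow punctuation that directly follows a URL.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names and patterns are illustrative assumptions,
// not taken from the original code.
final class PlainTextCleanup {

    // Assumed URL shape: an http/https/ftp scheme or a leading "www.",
    // followed by any run of non-whitespace characters.
    private static final Pattern URL =
            Pattern.compile("(?i)\\b(?:(?:https?|ftp)://|www\\.)\\S+");

    // Assumed meaning of "unwanted": ASCII control characters except
    // newline and tab, so line and paragraph breaks survive.
    // Note that carriage returns are removed as well.
    private static final Pattern UNWANTED =
            Pattern.compile("[\\p{Cntrl}&&[^\\n\\t]]");

    // Runs of spaces or tabs, such as the gap left after a removed URL.
    private static final Pattern SPACES = Pattern.compile("[ \\t]{2,}");

    // Deletes every substring that matches the assumed URL pattern.
    static String removeUrls(String text) {
        return URL.matcher(text).replaceAll("");
    }

    // Deletes control characters but keeps '\n' and '\t'.
    static String removeUnwantedChars(String text) {
        return UNWANTED.matcher(text).replaceAll("");
    }

    // Applies both steps, then collapses leftover runs of spaces/tabs
    // into a single space. Newlines are never touched.
    static String clean(String text) {
        String result = removeUnwantedChars(removeUrls(text));
        return SPACES.matcher(result).replaceAll(" ");
    }

    public static void main(String[] args) {
        String input = "First paragraph, see http://example.com/page for more.\n\nSecond paragraph.";
        System.out.println(clean(input));
    }
}
```

These steps are meant to run on the plain text that comes out of the tag-stripping step, so that newlines inserted for line breaks and paragraphs are already in place and are left alone by the cleanup.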