A Wordcloud in Python

Last week I was at PyCon DE, the German Python conference. After hacking on scikit-learn a lot last week, I decided to do something different on my way back, something I had planned for quite a while:
doing a Wordle-like word cloud.

I know, word clouds are a bit out of style but I kind of like them anyway. My motivation to think about word clouds was that I thought these could be combined with topic models to give somewhat more interesting visualizations.

So I looked around to find a nice open-source implementation of word-clouds ... only to find none. (This has been a while, maybe it has changed since).

While I was bored on the train last week, I came up with this code.
A little today-themed taste:





The first step is to get some document. I used the Constitution of the United States for the above.
    with open("constitution.txt") as f:
        lines = f.readlines()
    text = "".join(lines)

The next step is to extract words and give the words some weighting - for example how often they occur in the document. I used scikit-learn's CountVectorizer for that as it is convenient and fast, but you could also use nltk or just some regexp.
I get the counts of the 200 most common non-stopwords and normalize by the maximum count (to be somewhat invariant to document size).

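    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    # keep the 200 most frequent non-stop-words
    # (decode_error replaced charset_error in scikit-learn 0.14)
    cv = CountVectorizer(min_df=0, decode_error="ignore",
                         stop_words="english", max_features=200)
    counts = cv.fit_transform([text]).toarray().ravel()
    words = np.array(cv.get_feature_names())
    # normalize by the largest count
    counts = counts / float(counts.max())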
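
For anyone who would rather skip scikit-learn, the "just some regexp" route mentioned above could look roughly like the sketch below. It is not the code from the repository: the function name top_word_weights and the tiny stop-word set are made up for illustration, and a real stop-word list would be much longer.

```python
import re
from collections import Counter

# Deliberately tiny stop-word set, for illustration only.
STOP_WORDS = {"the", "of", "and", "to", "in", "a", "be", "shall", "or"}


def top_word_weights(text, max_words=200, stop_words=STOP_WORDS):
    """Return (words, weights) for the most common non-stop-words.

    Weights are counts divided by the largest count, so the top word gets 1.0.
    """
    # Tokens are runs of at least two ASCII letters (an assumption that only
    # suits English-like text; other languages need a different pattern).
    tokens = re.findall(r"[a-z]{2,}", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    most_common = counts.most_common(max_words)
    if not most_common:
        return [], []
    top = float(most_common[0][1])
    words = [word for word, _ in most_common]
    weights = [count / top for _, count in most_common]
    return words, weights


words, weights = top_word_weights(
    "We the People of the United States, in Order to form "
    "a more perfect Union, people!")
```

The normalization by the largest count mirrors the step above, so the weights can be fed into the same drawing code.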


Now the real work starts. The basic idea is to randomly sample a place on the canvas and draw a word with a size related to its importance (frequency).
We have to take care not to make the words overlap, though.

There seems to be no good alternative to the Python Imaging Library (PIL), which is really, really horrible. There are no docstrings. You specify colors using strings. There is a weird module structure. There are no docstrings.

Anyway, we can get a canvas and a drawing object like this:
    img_grey = Image.new("L", (width, height))
    draw = ImageDraw.Draw(img_grey)
We can then write into the image using
    font = ImageFont.truetype(font_path, font_size)
    draw.setfont(font)
    draw.text((y, x), "Text that will appear in white", fill="white")
The font_path here is an absolute path to a TrueType font on your system. I found no way to get around this (didn't look very hard, though).

Ok, now we could draw random positions and see if we could draw there without touching any other words.
There is a handy function, ImageDraw.textsize, which tells you how large a piece of text will be once rendered. We can use that to test if there is any overlap.

Unfortunately, randomly sampling places in the image turns out to be very inefficient: if a lot of the room is already taken, we have to try quite often to find some space.
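
To make the naive approach concrete, here is a minimal sketch of sampling positions until one fits. It simplifies things by treating every placed word as its bounding rectangle (the sizes would come from something like textsize), and the names overlaps and try_place are hypothetical, not part of the repository.

```python
import random


def overlaps(a, b):
    """True if rectangles a and b, each (x, y, width, height), share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def try_place(box_size, placed, canvas_size, max_tries=1000, rng=random):
    """Sample random top-left corners until the box hits no placed rectangle.

    Returns the new rectangle, or None if all max_tries samples collided.
    """
    box_w, box_h = box_size
    canvas_w, canvas_h = canvas_size
    if box_w > canvas_w or box_h > canvas_h:
        return None
    for _ in range(max_tries):
        x = rng.randint(0, canvas_w - box_w)
        y = rng.randint(0, canvas_h - box_h)
        candidate = (x, y, box_w, box_h)
        if not any(overlaps(candidate, other) for other in placed):
            return candidate
    return None


placed = []
rng = random.Random(0)
for size in [(120, 40), (80, 30), (60, 20)]:
    rect = try_place(size, placed, (400, 200), rng=rng)
    if rect is not None:
        placed.append(rect)
```

The more of the canvas is covered, the more samples are wasted on collisions, which is the inefficiency described above.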

My next idea was first to find out all possible free places in the image and then sample randomly from those. The easiest way to find free positions is to convolve the current image with a box of size ImageDraw.textsize(next_word). The places where the result is zero are exactly the places that have enough room for the text.
Using scipy.ndimage.uniform_filter that worked quite nicely.

But what do we do if there is not enough room to draw a word in the size we want?
Then we have to make the font smaller and try again. Which means convolving the image again, this time with a somewhat smaller box.

The code wasn't very fast and this seemed pretty wasteful, so I wanted to use another approach: integral images! Integral images are a way to pre-compute a simple 2d structure from which it is possible to extract the sum over arbitrary rectangles in the image in constant time.
The integral image is basically a 2d cumulative sum and can be computed as integral_image = np.cumsum(np.cumsum(image, axis=0), axis=1). This can be done once, and then we can look up rectangles of any size very fast. If we are interested in windows of size (w, h) we can find the sum over all possible windows of this size via
    area = (integral_image[w:, h:] + integral_image[:-w, :-h]
            - integral_image[w:, :-h] - integral_image[:-w, h:])
This is a combination of the integral image query (see Wikipedia) and my favorite numpy trick to query all positions simultaneously.
So basically this does the same as the convolution above, only it precomputes a structure so that we can query for all possible window sizes.
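
A self-contained version of this query might look like the sketch below. It makes one choice that is an assumption rather than the post's code: the integral image is padded with a leading row and column of zeros, so that windows touching the top or left edge are covered too. The function name free_positions is hypothetical.

```python
import numpy as np


def free_positions(occupied, box_h, box_w):
    """Top-left corners (rows, cols) of all box_h x box_w windows that
    contain no occupied pixel. box_h and box_w must be at least 1."""
    n_rows, n_cols = occupied.shape
    # Padded integral image: ii[r, c] is the sum of occupied[:r, :c].
    ii = np.zeros((n_rows + 1, n_cols + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(occupied, axis=0), axis=1)
    # Sum over every window at once, using four shifted views of ii.
    area = (ii[box_h:, box_w:] + ii[:-box_h, :-box_w]
            - ii[box_h:, :-box_w] - ii[:-box_h, box_w:])
    return np.nonzero(area == 0)


occupied = np.zeros((4, 5), dtype=np.uint8)
occupied[1, 2] = 1  # a single drawn pixel
rows, cols = free_positions(occupied, 2, 2)
```

Once ii exists, calling the function again with a smaller box costs no new cumulative sums, which is the advantage over re-running the convolution for every font size.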

After drawing a word, we have to compute the integral image again.
Unfortunately, the fancy indexing with the integral image was a bit sluggish.

On the other hand, that was a great opportunity to try out typed memory views in Cython, which I learned about from Stefan Behnel at PyCon DE :)
    def query_integral_image(unsigned int[:,:] integral_image, int size_x, int size_y):
        cdef int x = integral_image.shape[0]
        cdef int y = integral_image.shape[1]
        cdef int area, i, j
        # coordinates of all windows that contain no drawn pixels
        x_pos, y_pos = [], []
        for i in xrange(x - size_x):
            for j in xrange(y - size_y):
                area = integral_image[i, j] + integral_image[i + size_x, j + size_y]
                area -= integral_image[i + size_x, j] + integral_image[i, j + size_y]
                if not area:
                    x_pos.append(i)
                    y_pos.append(j)
Awesome! Easy to write down and straight to C-Speed.

Except for the last two lines ... lists are not fast.
I couldn't get that much faster (the array module doesn't have a C API afaik).

I wanted to sample from all possible positions anyway, so I just ran the above code twice: once counting how many possible positions there are, then sampling, then going to the position that I sampled.
Using C++ lists would probably be easier but I was too lazy to try...
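
In plain Python, the count-then-sample idea could be sketched as follows. This is an illustration of the two passes, not the Cython from the repository; the helper integral_image and the function name sample_free_position are hypothetical. The window test uses the same unpadded indexing as the snippet above.

```python
import random


def integral_image(grid):
    """2D cumulative sum of a list-of-lists grid (no padding)."""
    n_rows, n_cols = len(grid), len(grid[0])
    integral = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        row_sum = 0
        for j in range(n_cols):
            row_sum += grid[i][j]
            integral[i][j] = row_sum + (integral[i - 1][j] if i > 0 else 0)
    return integral


def sample_free_position(integral, size_x, size_y, rng=random):
    """Pick one free window uniformly at random without storing any list."""
    n_rows, n_cols = len(integral), len(integral[0])

    def is_free(i, j):
        area = integral[i][j] + integral[i + size_x][j + size_y]
        area -= integral[i + size_x][j] + integral[i][j + size_y]
        return area == 0

    # Pass 1: only count the free windows.
    hits = 0
    for i in range(n_rows - size_x):
        for j in range(n_cols - size_y):
            if is_free(i, j):
                hits += 1
    if hits == 0:
        return None
    goal = rng.randrange(hits)
    # Pass 2: walk the windows again and stop at the sampled one.
    seen = 0
    for i in range(n_rows - size_x):
        for j in range(n_cols - size_y):
            if is_free(i, j):
                if seen == goal:
                    return i, j
                seen += 1
    return None
```

The price is scanning the image twice; the gain is that the inner loop only touches integers, which is what lets the Cython version stay at C speed.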

Anyhow, now I had pretty decent integral images :)
The building still took some time, though... so I lazily recomputed only the part that is changed after I draw a new word.
Check out the full code on github.
It is not very pretty but I think it should be quite readable.
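
The partial recomputation mentioned above can be sketched like this, assuming a word was drawn with its top-left corner at row x and column y, so that only integral image entries at rows >= x and columns >= y can change. The function names are hypothetical and the repository's code may differ in its details.

```python
import numpy as np


def full_integral(img):
    """Integral image of the whole array."""
    return np.cumsum(np.cumsum(img, axis=1), axis=0)


def update_integral(integral, img, x, y):
    """Refresh integral in place after img changed only at rows >= x and
    columns >= y. Entries outside that block are still valid and are reused."""
    # Cumulative sums restricted to the block that may have changed.
    partial = np.cumsum(np.cumsum(img[x:, y:], axis=1), axis=0)
    if x > 0:
        # Add everything above the block, column by column.
        partial += integral[x - 1, y:]
        if y > 0:
            # The top-left quadrant would otherwise be counted twice.
            partial -= integral[x - 1, y - 1]
    if y > 0:
        # Add everything to the left of the block, row by row.
        partial += integral[x:, y - 1][:, np.newaxis]
    integral[x:, y:] = partial
    return integral


img = np.zeros((6, 8), dtype=np.int64)
img[0:2, 0:3] = 1                      # first "word"
integral = full_integral(img)
img[3:5, 4:7] = 1                      # second "word" at row 3, column 4
update_integral(integral, img, 3, 4)
```

How much this saves depends on where the word lands: a word near the bottom-right corner touches a small block, while one at the top left forces a full recomputation.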

Less talk more pictures:

  

To scale the fonts I used some arbitrary logarithmic dependency on the frequency that I felt looked decent.
It is also possible to just make the font smaller whenever there is no more room.
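
The exact scaling is not spelled out here, so the following is only one possible shape of it. The formula, the default sizes, and the names log_font_size and shrink_to_fit are all assumptions for illustration; fits stands for whatever check decides whether a word of a given size still has room.

```python
import math


def log_font_size(weight, max_font_size=200, min_font_size=8):
    """Map a normalized weight in (0, 1] to a font size, logarithmically.

    log1p(weight) / log(2) runs from near 0 up to 1, so the most frequent
    word (weight 1.0) gets max_font_size.
    """
    scaled = math.log1p(weight) / math.log(2.0)
    return max(min_font_size, int(round(max_font_size * scaled)))


def shrink_to_fit(start_size, fits, min_font_size=8, step=1):
    """Lower the font size until fits(size) is true; None if nothing fits."""
    size = start_size
    while size >= min_font_size:
        if fits(size):
            return size
        size -= step
    return None


size = shrink_to_fit(log_font_size(0.5), lambda s: s <= 30)
```

If the words are processed from most to least frequent and each word starts no larger than the size the previous one ended up with, shrinking never reorders them, even though the sizes stop being proportional to the counts.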

Oh and of course I allowed flipping of the words :) I also played with using arbitrary colors. I didn't see anything like colormaps in PIL, so I just used the HSL space and just sampled the hue. More elaborate schemes are obviously possible.
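
Sampling only the hue could look like the sketch below. PIL-style color strings of the form "hsl(hue, saturation%, lightness%)" can be passed as the fill argument when drawing text; the fixed saturation and lightness values and the function names are assumptions for illustration. The conversion helper uses the standard library's colorsys and is only there to show what such a string corresponds to in RGB.

```python
import colorsys
import random


def random_hsl_color(saturation=80, lightness=50, rng=random):
    """Color string with a random hue and fixed saturation and lightness."""
    hue = rng.randint(0, 359)
    return "hsl(%d, %d%%, %d%%)" % (hue, saturation, lightness)


def hsl_to_rgb(hue, saturation, lightness):
    """Convert hue in degrees and saturation/lightness in percent to 0-255 RGB."""
    # Note the argument order of colorsys: hue, lightness, saturation.
    r, g, b = colorsys.hls_to_rgb(hue / 360.0, lightness / 100.0,
                                  saturation / 100.0)
    return tuple(int(round(255 * channel)) for channel in (r, g, b))


color = random_hsl_color(rng=random.Random(42))
```

Lowering the saturation argument gives more muted clouds without touching the sampling.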

Again, I used a slight trick for a bit more speed: I first computed everything in grey-scale, saved all the positions and then re-did it in color.

One more, this time a bit more with the theme of the blog (can you guess what this is?)

And with less saturation:



There is definitely some room for improvement w.r.t. the look of it, but I feel this is already a nice start if you want to play around.

One last comment: I thought about improving performance (apparently the only thing on my mind during this little project) by doing the whole thing at a lower resolution and then recreating it at a higher one.
This has two problems: if you use too small a resolution, some text might actually become invisible as it is too small. The other problem is that PIL's font sizes don't scale linearly. So it is not possible to say "I want this font 4 times larger".
You can work around that but it's not pretty.
So I went with the cython / integral image way, which I think is kind of cool :)

If you scrolled down for the code, it is here.

PS: yes, this doesn't generate css / html4. But as you get the text sizes and positions, it should be easy to use this as a backend to generate an HTML page. PR welcome ;)
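
As a starting point for such a backend, here is a rough sketch that turns a layout into absolutely positioned spans. The layout format, a list of (word, font_size, (x, y), color) tuples with x as the left offset, is an assumption and not what the code currently returns, and a browser will not render fonts exactly like PIL does, so overlaps may reappear.

```python
from html import escape


def layout_to_html(layout, width, height, background="black"):
    """Build an HTML fragment from (word, font_size, (x, y), color) tuples."""
    spans = []
    for word, font_size, (x, y), color in layout:
        style = ("position:absolute; left:%dpx; top:%dpx; font-size:%dpx; "
                 "line-height:1; white-space:nowrap; color:%s"
                 % (x, y, font_size, color))
        spans.append('<span style="%s">%s</span>'
                     % (escape(style, quote=True), escape(word)))
    container = ("position:relative; width:%dpx; height:%dpx; background:%s"
                 % (width, height, background))
    return '<div style="%s">\n%s\n</div>' % (escape(container, quote=True),
                                             "\n".join(spans))


page = layout_to_html(
    [("People", 40, (10, 20), "hsl(10, 80%, 50%)"),
     ("States", 24, (60, 5), "white")],
    400, 200)
```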

Comments

  1. Very nice!

    As an alternative to PIL, what about using PyQt / PySide and paint into a QPixmap? It may need a bit more code but I guess more people have PyQt / PySide than PIL.

    Thomas

    Replies
    1. Thanks.
      I'm not really familiar with PyQt and I wanted a short simple piece of code (sort of).

      The real work is done in numpy and as long as you can easily get the data out of the QPixmap into a numpy array, replacing PIL should be easy.

  2. Great job Andreas ... I did an implementation of a Wordle-style cloud in Python years back using PyQt and it was great fun ... Your output is much better than mine. It's truly a fun exercise, as far as I can recall. http://uptosomething.in

    Replies
    1. Thanks :) Did you use rectangles to model the place where a word is or the rendered word, as I do it?

  3. Hi Andreas, Thanks for the Python based word-cloud. Looks indeed nice :)
    http://pycloud.blogspot.com/2012/11/worldcloud-for-ccnworks.html

  4. Hi Andreas,

    Really cool one. I tried it with non-English text and it also works. Earlier I used PyTagCloud, but it lacks multilingual word-cloud support. https://github.com/atizo/PyTagCloud

    Replies
    1. Hi,

      I am trying to generate wordCloud for unicode text, but I only get the boxes when I use text=text.decode("utf-8").
      Can you give a tip how to get my unicode text to the image?

      Thanks!!

    2. Check if the font that is used supports the symbols you want to show.

    3. Hi Andreas,

      Thanks for the package and for your reply. I am trying to generate a word cloud for Nepali text. I installed the font 'preeti' and gave the path to the font in the wordcloud function. Now the characters in the image are similar to those in the text. But the words in the image are random; they are not present in the text I supplied. Can you suggest something for this? I can send you my text and the output I got if it is not clear to you.
      Thanks!

    4. Please open an issue on the github issue tracker: https://github.com/amueller/word_cloud
      Include your code and maybe a short snippet of the text.

  5. Very nice. I unfortunately once, in 1987, had to implement a PostScript word cloud. Now I'm using Jason Davies' d3 version.

  6. I have a JavaScript version of a WordCloud at https://github.com/indyarmy/jQuery.awesomeCloud.plugin - not directly comparable, but it will do clouds in shapes other than a parallelogram.

    Replies
    1. Great Russ! So simple and so beautiful and powerful! Thanks a lot!!

    2. Pretty cool :) My code can now (well nearly now) do other shapes, too! https://github.com/amueller/word_cloud/pull/24

  7. Instead of choosing a random place on the image and then drawing a specific word, why don't you start filling up the image in an orderly fashion with random words?

    Replies
    1. It is not clear to me how to do that. The words have different sizes and shapes, so if you start from, say, the top right, the shape will become "unorderly" very soon and the collision detection will be as hard as it is with random assignments, I would guess.

  8. Hi Andres!
    Thank you for the great post!
    I tried your script and I got this error message, I tried to google it but no luck.
    any idea?
    def query_integral_image(unsigned int[:,:] integral_image, int size_x, int size_y):
    ^
    SyntaxError: invalid syntax

    the arrow was under int[:,:]

    Thanks a lot!

    Replies
    1. Hi Karin. I would guess that your cython is too old. Try "pip install --user --upgrade cython" to get a newer version.

    2. Thank you Andreas for the quick reply! I ran the command you suggested and it upgraded Cython.
      When I run the word cloud script I get the same error.
      Any suggestion?

      Thank you very much!!!!

    3. How did you run the file? Compile using "make" or "python setup.py build_ext -i" as stated in the readme, and then call "python wordcloud.py".

  9. I ran "python setup.py build_ext -i" and got the message "running build_ext". Then I ran "python wordcloud.py" and still got the error. Maybe it has something to do with the configuration of my Ubuntu system?

    Replies
    1. That is pretty odd. Can you give the exact error? The error is in the cython file, which should not be called by python. Having a syntax error in cython during runtime is ... weird..

  10. sure!
    when I run "python wordcloud.py" I get this below:

    Traceback (most recent call last):
    File "wordcloud.py", line 13, in <module>
    from query_integral_image import query_integral_image
    File "/var/www/word_cloud-master/query_integral_image.py", line 7
    def query_integral_image(unsigned int[:,:] integral_image, int size_x, int size_y):
    ^
    SyntaxError: invalid syntax

    Replies
    1. There should be no file query_integral_image.py, only query_integral_image.pyx.

  11. my bad :( I copied the file to my server and reran it.
    Now I get a different error message when I run make or "python setup.py build_ext -i":

    python setup.py build_ext -i
    Compiling query_integral_image.pyx because it changed.
    Cythonizing query_integral_image.pyx

    Error compiling Cython file:
    ------------------------------------------------------------
    ...
    # cython: wraparound=False
    import array
    import numpy as np


    def query_integral_image(unsigned int[:,:] integral_image, int size_x, int size_y):
    ^
    ------------------------------------------------------------

    query_integral_image.pyx:7:38: Expected an identifier or literal
    Traceback (most recent call last):
    File "setup.py", line 7, in <module>
    ext_modules=cythonize("*.pyx"),
    File "/usr/lib/pymodules/python2.7/Cython/Build/Dependencies.py", line 517, in cythonize
    cythonize_one(pyx_file, c_file, quiet, options)
    File "/usr/lib/pymodules/python2.7/Cython/Build/Dependencies.py", line 540, in cythonize_one
    raise CompileError(None, pyx_file)
    Cython.Compiler.Errors.CompileError: query_integral_image.pyx
    make: *** [all] Error 1

    Replies
    1. And which version of Cython are you calling there? Can you try ``cython --version`` and ``python -c "import Cython; print(Cython.__version__)"``? I would guess you have an older Cython somewhere in your path.

  12. Hi Andreas,

    Great work.
    I tried running your code and I get error message that I don't know where it comes from.
    In the CMD window on Windows, here is what I run and get:

    ...\wordcloudPython\trunk>python setup.py build_ext -i

    running build_ext

    ...\wordcloudPython\trunk>python wordcloud.py
    C:\Python33\lib\site-packages\sklearn\feature_extraction\text.py:615: DeprecationWarning: The charset_error parameter is deprecated as of version 0.14 and will be removed in 0.16. Use decode_error instead.
    DeprecationWarning)
    Traceback (most recent call last):
    File "wordcloud.py", line 183, in <module>
    counts = make_wordcloud(words, counts, output_filename)
    File "wordcloud.py", line 102, in make_wordcloud
    box_size = draw.textsize(word)
    File "C:\Python33\lib\site-packages\PIL\ImageDraw.py", line 281, in textsize
    return font.getsize(text)
    File "C:\Python33\lib\site-packages\PIL\ImageFont.py", line 189, in getsize
    w, h = self.font.getsize(text)[0]
    TypeError: 'int' object is not iterable

    what is the reason for the error? And how should I run the code so it gets the constitution.txt as input? (sorry I am new in Python).

    Replies
    1. That error is weird as it is inside PIL. Did you change the font path in the file? You need to set "FONT_PATH" to a true-type font that exists on your system. The default will only work under Linux. The code uses the constitution by default but you can just pass another text file as command line argument.
      Hth,
      Andy

    2. Thanks Andy. After a lot of Google search I found this that resolved the error:
      To get it to work, change line 189 in C:\Python33\Lib\site-packages\PIL\ImageFont.py from:
      w, h = self.font.getsize(text)[0]
      to:
      w, h = self.font.getsize(text)

      Do you know if your code works with Persian (Farsi language) as well?

    3. So that is a bug in PIL under Python3? For Persian: basically yes. if:
      1) you pick a font that supports the signs,
      2) your text is properly encoded (utf8 and hopefully my code reads that correctly)
      3) the regular expression in the scikit-learn Vectorizer makes sense for the language (which is probably fine). The vectorizer tokenizes the text into words based on a simple regular expression that basically separates words at whitespaces and punctuation iirc. For languages where that is not meaningful you would need to adjust the regular expression (an optional argument to the Vectorizer).

    4. Yes, that is a bug in PIL for Python3.

      Thanks for the explanation for Persian language.
      I used a Persian font, and I debugged the code. It reads Persian text fine and creates correct "words" and "counts", but at the end the generated image is just a bunch of rectangles! Do you know what I should do to create an image with Persian words in it? Thanks again for all your help.

    5. So do the extracted "words" make sense? And what is their encoding? The code just renders the words using PIL. I am not very familiar with PIL, sorry. You could try writing a stand-alone script that tries to render some word using PIL and see if the problem persists.

  13. What does it mean to "make" this file? The install and use instructions could be improved. I'm on windows.

    Replies
    1. It means running the program "make", the way most software is built on most operating systems. You can just run "python setup.py build_ext -i" as I said above. Feel free to send a PR improving the Readme.

  14. Hi Andy,

    I tried to re-install Python 3.3 and while your code was working before, now I get this error:

    from query_integral_image import query_integral_image
    ImportError: DLL load failed: %1 is not a valid Win32 application.

    Do you know what could be the reason?
    Thanks for the help.

  15. This is really interesting, though my brain can't comprehend the stuff about integral images. I've been playing with making word clouds using bash scripting and ImageMagick, starting from a state of pretty much total ignorance on how to do it. Rather than randomly selecting points in the canvas and trying to put a word there I've been starting off by putting the most common word in the centre of the canvas and then checking for free space spiralling out from the centre.

    Your post provides an answer to a question I've been wondering about which is how do people get clouds to fit a specified shape, even just a simple rectangle:

    "But what do we do if there is not enough room to draw a word in the size we want? Then we have to make the font smaller and try again."

    However, this seems to conflict with the premise of a word cloud. As you put it:

    "…draw a word with a size related to its importance (frequency)."

    If you're fitting words into spaces by way of shrinking their size then aren't you destroying the relationship between the size of the word and its frequency? Especially because as I read it if a word won't fit in a space you just shrink it until it fits. Doesn't this approach mean that you can potentially end up with a word of frequency N being drawn larger than one with frequency 2N? Or have I misunderstood something?

    Replies
    1. Hey. I think my approach to word clouds is very non-standard. I also started from ignorance and tried something out. There is a paper about the Wordle way, which I can't find at the moment. I think this JavaScript implementation uses the same algorithm: it also relies on a spiral and a dynamic that moves the words apart if they overlap.

      Actually, the way I present the algorithm here (and the way it is implemented) it is true that the size does not correspond to the frequency. BUT the ranking of the words is preserved. I sort the words by frequency before I start drawing, and the size will only decrease. Maybe that wasn't clear from my description.

      Hth,
      Andy

  16. How difficult would it be to create an image where the background is white? I've tried playing around in the code- specifically adding color="white" parameter when all of the images are created, but was unsuccessful.

    Replies
    1. it works for me with background_color="white"

  20. I have found out that the color names from this list http://www.w3schools.com/HTML/html_colorgroups.asp are accepted as background_color="AntiqueWhite" values.

  21. Hi Andreas,

    I've been using word cloud and enjoying the results. I was curious if it's possible to enable a HD mode that would support zooming without a loss of detail? Or if this isn't a current feature would it be possible to add it?

    Thanks,
    Scott

    Replies
    1. Hi. Currently it only produces a bitmap, not a vector graphic, so there is no lossless zooming. You can set "scale" to a higher number to get out a higher resolution image at no extra computational cost (setting width and height to different values makes the computation slower). I'm planning to rewrite the code to create vector graphics and HTML but don't hold your breath.

      Cheers,
      Andy

  23. hi,
    when i am trying to install the package on redhat machine i encounter the following gcc error:

    ]$ python setup.py install

    running install

    running bdist_egg

    running egg_info

    writing requirements to wordcloud.egg-info/requires.txt

    writing top-level names to wordcloud.egg-info/top_level.txt

    writing wordcloud.egg-info/PKG-INFO

    writing dependency_links to wordcloud.egg-info/dependency_links.txt

    reading manifest file 'wordcloud.egg-info/SOURCES.txt'

    writing manifest file 'wordcloud.egg-info/SOURCES.txt'

    installing library code to build/bdist.linux-x86_64/egg

    running install_lib

    running build_py

    running build_ext

    building 'wordcloud.query_integral_image' extension

    gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/eitano/anaconda3/envs/condamain/include/python3.5m -c wordcloud/query_integral_image.c -o build/temp.linux-x86_64-3.5/wordcloud/query_integral_image.o

    unable to execute 'gcc': No such file or directory

    error: command 'gcc' failed with exit status 1


    thanks
    Eitan

    Replies
    1. Please report issues with the package on the github issue tracker. It seems you don't have gcc installed.

  24. This is very nice. Is there a way to provide the code a list of words and its associated weights? If you can provide a simple example, it would be great.
    Thanks

    Replies
    1. Yeah, check out the docs. You want generate_from_frequencies.

  25. Hi Andreas,

    Thanks for the wonderful package! I want to use it to display topic model results for an academic paper (i.e., the LDA and Dynamic Topic Model of the Gensim package), but unfortunately that's not ideal with the current wordcloud package. Specifically, I would like 1 wordcloud with the top 30 words of each of the 3 topics in a different color. The 'color by group' example on your website would be great for that type of thing, were it not that topic models like LDA allow the words to occur in all the topics. Hence, there's overlap between the top 30 words of the 3 topics. As the words and frequencies are included as dictionary items, it is not possible to include the same word (with a different probability) twice. The only way to work around it now is to omit words that appear in the top 30 of more than 1 of the topics before computing the dictionary. I was wondering whether you might know how to work around this issue. If you could adjust the code to make this possible I think many people will use it to display topic model results this way.

    Thank you very much!

    Replies
    1. Hi Myrthe. Feel free to send a PR to allow a word to appear multiple times. Personally, to visualize LDA I would either color a word according to the topic it is most strongly associated with, or color it using a mixture of the topic colors. I think showing a word multiple times will make it hard to see the correspondences.

  26. Hello Andrew,

    I am trying to create custom colors for my word cloud and found the below function in the documentation. But this has hsl values for the single color we choose. As in the official documentation (for grey).

    I tried changing the colormap parameter, but the colors were too bright.

    wordcloud = WordCloud(width=400, height=175,colormap = "plasma",scale = 2.0, max_words=150,normalize_plurals = False)

    How do we define our own choice of colors for the word cloud package, or alter the color brightness for the colormap parameter?

    Replies
    1. You can use a more muted colormap. Check out https://matplotlib.org/examples/color/colormaps_reference.html
      Or you can write your own function to return rgb values.

  27. Hi, could you please tell me how to do this with Arabic text? I tried this code but it doesn't work with any language but English.

  28. To everyone who's been having transformation issues, using wand via ImageMagick makes things a lot easier.

    from PIL import Image

    image = Image.open("file.png")
    image = image.convert("RGBA") # convert() returns a new image, so keep the result

    canvas = Image.new('RGBA', image.size, (255,255,255,255)) # Empty canvas colour (r,g,b,a)
    canvas.paste(image, mask=image) # Paste the image onto the canvas, using its alpha channel as mask
    #canvas.thumbnail([width, height], Image.ANTIALIAS)
    canvas.save('file.png')

    from wand.image import Image

    with Image(filename='file.png') as img:
        img.format = 'jpeg'
        img.save(filename='file.jpg')

    The first half provides a white background (a lot of png images are transparent, so when you convert to jpg, or try to do the transformations using the previous script, you'll come across some problems). The 2nd half is converting the png to a jpeg (getting around all the transformations above).

  29. Hello and thank you for the great program,

    I had a quick question. I've noticed my word cloud clips the very edges and top of my mask. The word cloud itself is fine, but the outline of the mask is gone (almost looks like it's out of view). I initially thought it might just be a size thing, and thus I changed the height and width to 500, but it looked the exact same way with the sides still clipped. I've also tried different shapes and sizes, and they are all clipped at the very ends (i.e. whichever is the longest point is always clipped at its end). The jpgs that I am using are not clipped, so it's not an image issue. So I don't quite know what the issue is.

    Any help would be appreciated!

    Replies
    1. Please open an issue with a reproducible example on the issue tracker.
