PhantomJS with GhostDriver on OpenShift

tl;dr

PhantomJS’ GhostDriver still binds to localhost only, which makes it unusable on OpenShift (there are many complaints about it, like this). The patch is very simple and already exists, but it still hasn’t been merged into the PhantomJS main source tree. I’ve patched and compiled PhantomJS 1.9.8 on CentOS 6.6 x86_64. The executable runs smoothly on OpenShift’s machines.

To run GhostDriver on OpenShift you must specify both an IP address and a port. For example, on my machine the variable OPENSHIFT_JBOSSEWS_IP contains the IP address (the variable’s name may vary with your type of machine, but you can easily find it with an “env | grep IP”).

$ ./phantomjs --webdriver=$OPENSHIFT_JBOSSEWS_IP:15002

That’s all… Enjoy!

The long, boring story

From time to time I love to write some web scraping stuff, both for fun and profit. I used to host all my scrapers on my Raspberry Pi, but the frequent power failures and SD card corruptions made me rethink that strategy: now they’re all running happily on OpenShift.

Since “consolidation through virtualization” (yo! cloud!) wasn’t enterprise enough, I decided to rewrite my scrapers in Java: my old C/Python/JS/Go/whatever little helpers became part of a single Java web application (ah the horror, the pain!).

Most of the scraping is done quite handily by jsoup, but a few nasty .NET based websites required something more extreme, like a fully fledged headless browser: PhantomJS. I use PhantomJS from my Java app via Selenium’s RemoteWebDriver.

The RemoteWebDriver requires a running instance of PhantomJS with the GhostDriver module listening on a specified IP address and port. You can copy a phantomjs executable (Linux x86_64) to your OpenShift machine (the app-root/data directory is writable) and start it via SSH, for example. The canonical example is this:

$ ./phantomjs --webdriver=15002

After issuing this command, GhostDriver listens on http://localhost:15002, ready to be used via RemoteWebDriver from your Java app.

This doesn’t work on OpenShift! You cannot bind anything to localhost; you need to specify the machine’s IP address. For example, the IP address of my OpenShift machine is contained in the environment variable OPENSHIFT_JBOSSEWS_IP (depending on your type of OpenShift instance the variable may have another name, but it’s easy to find out with an “env | grep IP”).
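Since the variable’s name depends on the type of instance, a Java app could also look it up at runtime instead of hardcoding it. Here is a minimal sketch, assuming the variable follows the OPENSHIFT_..._IP naming seen in OPENSHIFT_JBOSSEWS_IP; the class and method names are hypothetical, made up for this example:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: lists the environment variables that look like
// OpenShift IP address variables.
final class OpenShiftIpFinder {

    // Assumption: the name starts with OPENSHIFT_ and ends with _IP,
    // as OPENSHIFT_JBOSSEWS_IP does. Other setups may use different names.
    static List<String> findIpVariables(Map<String, String> env) {
        return env.keySet().stream()
                .filter(name -> name.startsWith("OPENSHIFT_") && name.endsWith("_IP"))
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        // Print every candidate so you can pick the right one by eye
        for (String name : findIpVariables(env)) {
            System.out.println(name + "=" + env.get(name));
        }
    }
}
```

If more than one variable matches (for example with several cartridges installed), you still have to choose the one that belongs to the cartridge running your app.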

In order to run GhostDriver on OpenShift you should be able to issue a command like this one, which should bind to the correct IP:

$ ./phantomjs --webdriver=$OPENSHIFT_JBOSSEWS_IP:15002

Unfortunately this still binds to localhost. There is a patch that implements this behaviour correctly, but it still hasn’t made its way into the official PhantomJS source tree, let alone the precompiled binaries. The good news is that I’ve found a bit of time to recompile PhantomJS with that patch:

It’s an x86_64 build, compiled on CentOS 6.6, that runs flawlessly on OpenShift (you may download it using wget… also, make sure to “chmod +x” it before running it).

For completeness, if you want to start GhostDriver whenever you deploy your app, you can use OpenShift’s post-deploy hook. Just write in your .openshift/action_hooks/post_deploy something like this:

#!/bin/bash
# Start GhostDriver in the background, bound to the machine's IP address
nohup "${OPENSHIFT_DATA_DIR}/phantomjs/bin/phantomjs" --webdriver="${OPENSHIFT_JBOSSEWS_IP}:15002" &

(that’s my location of choice for the phantomjs binary, but it’s not mandatory)

Now I can finally use RemoteWebDriver:

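import java.net.URL;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;

DesiredCapabilities caps = new DesiredCapabilities();
caps.setJavascriptEnabled(true);

// GhostDriver listens on the machine's IP address, not on localhost
String ip = System.getenv("OPENSHIFT_JBOSSEWS_IP");
String url = String.format("http://%s:15002", ip);
WebDriver driver = new RemoteWebDriver(new URL(url), caps);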
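If the RemoteWebDriver constructor fails, it helps to know whether GhostDriver is reachable at all. Below is a rough sketch using only the JDK, assuming GhostDriver answers GET /status with HTTP 200 as the WebDriver wire protocol describes; the class and method names are hypothetical, and the port is the 15002 used above:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper: checks whether something answers on GhostDriver's /status URL.
final class GhostDriverCheck {

    // Builds the status URL for the given IP address and port
    static String statusUrl(String ip, int port) {
        return String.format("http://%s:%d/status", ip, port);
    }

    // Returns true only if the URL answers with HTTP 200 within the timeout;
    // any I/O problem (refused connection, timeout, bad URL) counts as "not up".
    static boolean isUp(String statusUrl, int timeoutMillis) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(statusUrl).openConnection();
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            conn.setRequestMethod("GET");
            try {
                return conn.getResponseCode() == 200;
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String ip = System.getenv("OPENSHIFT_JBOSSEWS_IP");
        String url = statusUrl(ip, 15002);
        System.out.println(url + " reachable: " + isUp(url, 2000));
    }
}
```

A “false” here usually means that phantomjs isn’t running or is bound to a different address than the one your app is using.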

Happy scraping!

12 thoughts on “PhantomJS with GhostDriver on OpenShift”

  1. Lex Tsang

    Hi, thanks a lot for your build of phantomjs. I tried to download your binary into my openshift server. However, every time when I run it, it always gives me a “Segmentation Fault” error. Do you know how I can fix it? Thanks.

  2. Jonny

    Hi Paolo,

    Nice post. while I found “Segmentation fault” in my DIY cartridge…
    compile issue?

  3. Chris Maillard

    Hi Paolo,
    Nice post! I’m really interested in this solution, but I have not been able to make it work.
    When running “./phantomjs --webdriver=$OPENSHIFT_JBOSSEWS_IP:23456”, I also get a segmentation fault. My cartridges: Tomcat 7 (JBoss EWS 2.0), MongoDB 2.4 and RockMongo 1.1.
    $ uname -a
    Linux ex-std-node431.prod.rhcloud.com 2.6.32-504.3.3.el6.x86_64 #1 SMP Fri Dec 12 16:05:43 EST 2014 x86_64 x86_64 x86_64 GNU/Linux
    $ ldd phantomjs
    statically linked
    Any idea why this error occurs and how to solve it?
    Thanks 🙂

  4. wl

    I get a segmentation fault too, so I want to compile it myself, but I don’t know which project to compile or how. Could you shed some light?

    • Paolo Bernardi (post author)

      I think you can just follow the instructions on their build page (I did too, patching excluded): http://phantomjs.org/build.html

      Be careful about the external libraries, at least libfreetype, libfontconfig and libexpat: the phantomjs binary requires these, and they must be binary compatible with the environment that you will run phantomjs on. You’re probably getting a segfault because of binary library incompatibility.
