PhantomJS with GhostDriver on OpenShift

Posted on 25 February 2015 by Paolo Bernardi

tl;dr

PhantomJS’ GhostDriver still binds to localhost only, which makes it unusable on OpenShift (there are many complaints about it, like this). The patch is very simple and already exists, but it still hasn’t been merged into the main PhantomJS source tree. I’ve patched and compiled PhantomJS 1.9.8 on CentOS 6.6 x86_64. The executable runs smoothly on OpenShift’s machines.

phantomjs 1.9.8 with the GhostDriver patch (33 MB)

To run GhostDriver on OpenShift you must specify both an IP address and a port. For example, on my machine the variable OPENSHIFT_JBOSSEWS_IP contains the IP address (depending on your type of machine, the variable’s name may vary, but you can easily find it with an “env | grep IP”).

$ ./phantomjs --webdriver=$OPENSHIFT_JBOSSEWS_IP:15002

That’s all… Enjoy!

The long, boring story

From time to time I love to write some web scraping stuff, both for fun and profit. I used to host all my scrapers on my Raspberry Pi, but the frequent power failures and SD card corruptions made me rethink that strategy: now they’re all running happily on OpenShift.

Since “consolidation through virtualization” (yo! cloud!) wasn’t enterprise enough, I decided to rewrite my scrapers in Java: my old C/Python/JS/Go/whatever little helpers became part of a single Java web application (ah the horror, the pain!).

Most of the scraping is done quite handily by jsoup, but a few nasty .NET-based websites required something more extreme, like a fully fledged headless browser: PhantomJS. I use PhantomJS from my Java app via Selenium’s RemoteWebDriver.

The RemoteWebDriver requires a running instance of PhantomJS with the GhostDriver module listening on a specified IP address and port. You can, for example, copy a phantomjs executable (Linux x86_64) to your OpenShift machine (the app-root/data directory is writable) and start it via SSH. The canonical example is this:

$ ./phantomjs --webdriver=15002

After issuing this command, GhostDriver is listening on http://localhost:15002, ready to be used via RemoteWebDriver from your Java app.

This doesn’t work on OpenShift! You cannot bind anything to localhost; you need to specify the machine’s IP address. For example, the IP address on my OpenShift machine is contained in the environment variable OPENSHIFT_JBOSSEWS_IP (depending on your type of OpenShift instance the variable may have another name, but it’s easy to find out with an “env | grep IP”).
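If you'd rather not eyeball the output of env, the same lookup can be sketched in Java. This is only an illustration: the class name OpenShiftIp and the OPENSHIFT_..._IP naming pattern are assumptions, generalised from the single OPENSHIFT_JBOSSEWS_IP example above, so check what your own instance actually exports.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical helper, not part of OpenShift or PhantomJS.
class OpenShiftIp {

    // Returns the first variable (in alphabetical order of its name) whose
    // name starts with OPENSHIFT_ and ends with _IP. The pattern is an
    // assumption based on OPENSHIFT_JBOSSEWS_IP.
    static Optional<Map.Entry<String, String>> find(Map<String, String> env) {
        return new TreeMap<String, String>(env).entrySet().stream()
                .filter(e -> e.getKey().startsWith("OPENSHIFT_")
                        && e.getKey().endsWith("_IP"))
                .findFirst();
    }

    public static void main(String[] args) {
        Optional<Map.Entry<String, String>> hit = find(System.getenv());
        if (hit.isPresent()) {
            System.out.println(hit.get().getKey() + "=" + hit.get().getValue());
        } else {
            System.out.println("No OPENSHIFT_*_IP variable found");
        }
    }
}
```

If more than one variable matches, this picks the alphabetically first one, which may not be the one you want; in that case it is safer to name the variable explicitly.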

In order to run GhostDriver on OpenShift you should be able to issue a command similar to this, which should bind to the correct IP:

$ ./phantomjs --webdriver=$OPENSHIFT_JBOSSEWS_IP:15002

Unfortunately, with the official binaries this still binds to localhost. There is a patch that implements this behaviour correctly, but it still hasn’t made its way into the official PhantomJS source tree, let alone the precompiled binaries. The good news is that I’ve found a bit of time to recompile PhantomJS with that patch:

It’s an x86_64 build, compiled on CentOS 6.6, that runs flawlessly on OpenShift (you may download it using wget… also, make sure to “chmod +x” it before running it).

For completeness, if you want to start GhostDriver whenever you deploy your app, you can use OpenShift’s post-deploy hook. Just write in your .openshift/action_hooks/post_deploy something like this:

#!/bin/bash
nohup ${OPENSHIFT_DATA_DIR}/phantomjs/bin/phantomjs --webdriver=$OPENSHIFT_JBOSSEWS_IP:15002 &

(that’s my location of choice for the phantomjs binary, but it’s not mandatory)

Now I can finally use RemoteWebDriver:

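import java.net.URL;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;

DesiredCapabilities caps = new DesiredCapabilities();
caps.setJavascriptEnabled(true);
String ip = System.getenv("OPENSHIFT_JBOSSEWS_IP"); // same IP that phantomjs was bound to
String url = String.format("http://%s:15002", ip); // …
WebDriver driver = new RemoteWebDriver(new URL(url), caps); // new URL(...) throws MalformedURLException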
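Since the post_deploy hook starts PhantomJS in the background, the Java app may come up before GhostDriver is accepting connections. Here is a minimal sketch of a readiness check, assuming a plain TCP connect is a good enough signal. The PortProbe class, its method name and the timeouts are hypothetical, not part of Selenium or PhantomJS.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: polls a TCP port until something accepts a connection.
class PortProbe {

    // Returns true as soon as a connection to host:port succeeds,
    // false if none succeeds before timeoutMillis has elapsed.
    static boolean waitForPort(String host, int port, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        do {
            try (Socket socket = new Socket()) {
                // Give each individual connection attempt at most 500 ms.
                socket.connect(new InetSocketAddress(host, port), 500);
                return true;
            } catch (IOException notReadyYet) {
                try {
                    Thread.sleep(250); // short pause before the next attempt
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        } while (System.currentTimeMillis() < deadline);
        return false;
    }
}
```

Calling PortProbe.waitForPort(ip, 15002, 10000) before creating the RemoteWebDriver would give PhantomJS up to about ten seconds to start. Note that an open port only proves that something is listening there, not that it is GhostDriver.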

Happy scraping!
