Blogs

Geolocating tweets using Free/Open Source Components and Data

In this post, I describe a sample application built entirely from free/open source software and free/open data sources. It is a completely client-side application that searches Twitter for a keyphrase within a geographic boundary and clusters the matching tweets by lat/long coordinate. Along the way, I demonstrate some HTML5 capabilities, namely the Local Storage API.

To start, I am using several free/open components:

  1. jQuery -- the immensely popular cross-browser Javascript library.
  2. OpenLayers -- an open source Javascript library for displaying map data in browsers.
  3. OpenStreetMap -- a collaborative project to create a free editable map of the world.
  4. Nominatim -- a free service for converting addresses to coordinates.
  5. Twitter Search API -- Twitter's API for searching recent tweets.

You likely recognize a number of these components, though Nominatim is probably the most obscure. Nominatim is actually a very important part of this application because most tweets are not geolocated, so we need to infer the location by using their "Location" tag. This is why you will see a large number of tweets clustered in one area...the users likely shared a common location name (e.g., Rochester, N.Y.).

If you would like to just see the application now and/or view the source code, you can do so at http://code.brianmokeefe.com/TweetMap.html. I will explain key parts of this application below.

HTML Local Storage

	if(typeof(Storage)!=="undefined") {
		// in IE9, you cannot use local storage for a file system webpage (file://),
		// so this is a workaround to keep it from crashing
		var storage = {};
		if (localStorage) {
			storage = localStorage;
		}

		_cache = storage;
	}

Basically, this section asks (1) whether the browser supports HTML5 Local Storage; (2) if yes, it stores the geolocation lookups there for future use (e.g., Rochester, N.Y. will probably always be in the same place); (3) if no, it just uses a local object (called _cache, declared outside this snippet) to store the geolocation lookups. There is a little workaround here...IE9 does not support Local Storage for file:// URLs. The var storage={}; if (localStorage)... block takes care of this.
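The fallback logic described above can be sketched as a small helper. This is a minimal sketch rather than the code from TweetMap.html: createCache and its scope parameter are hypothetical names, and the try/catch is an added assumption to cover browsers that throw when localStorage is touched.

```javascript
// Hypothetical helper: returns a key/value store for geocode lookups.
// Uses scope.localStorage when it is available, otherwise a plain object.
function createCache(scope) {
	try {
		if (scope && scope.localStorage) {
			return scope.localStorage;
		}
	} catch (e) {
		// Assumption: some browsers throw on access; fall through to the object.
	}
	return {};
}

// In a browser you would pass window; elsewhere this yields a plain object.
var _cache = createCache(typeof window !== 'undefined' ? window : {});
```

Either way, the rest of the code can read and write _cache[address] without caring which kind of store it received.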

Twitter Search API

	$.ajax({
		url: 'http://search.twitter.com/search.json',
		type: 'GET',
		dataType: 'jsonp',
		data: {
			q: query,
			lang: 'en',
			result_type: 'recent',
			rpp: 100,
			geocode: _lat + ',' + _lon + ',' + _distance + 'mi'
		},
		success: function(data, textStatus, xhr) {
			_results = data.results;

			// if data came back, then start processing the tweets
			if (data.results.length > 0) {
				setTimeout(processTweet, 10);
			}
		}
	});
}

This block uses the jQuery ajax function to call the Twitter Search API. A few variables are defined outside this snippet: query (the keyphrase) and _lat, _lon, and _distance, which together form a geofence around the search. The geofence isn't foolproof...for instance Twitter commonly thinks posts from York, England fall within this area. In essence, we are finding all recent English-language tweets (100 at most) in the geofenced area. If the call succeeds, we store the results in the _results variable and schedule the "processTweet()" function to run asynchronously 10 ms later.
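To make the geofence parameter concrete, here is a minimal sketch of how the data object for the call above could be assembled. buildSearchParams is a hypothetical helper (the post builds the object inline), and the keyphrase and coordinates in the usage line are illustrative values only.

```javascript
// Hypothetical helper that assembles the "data" object for the search call.
function buildSearchParams(query, lat, lon, distanceMiles) {
	return {
		q: query,
		lang: 'en',
		result_type: 'recent',
		rpp: 100,
		// "lat,lon,radius" with the radius expressed in miles
		geocode: lat + ',' + lon + ',' + distanceMiles + 'mi'
	};
}

// Illustrative values: a 25 mile radius around a point near Rochester, N.Y.
var params = buildSearchParams('lilac festival', 43.1548, -77.6156, 25);
// params.geocode is "43.1548,-77.6156,25mi"
```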

Geolocation

For each tweet, I encoded within this code block the rules for geolocating it...

	// if the tweet was geotagged, then nothing to do but queue the tweet
	if (tweet.geo) {
		lat = tweet.geo.coordinates[0];
		lon = tweet.geo.coordinates[1];
		geoCode = lat + ',' + lon;
		clusterTweet(geoCode, tweet.text);
	} else if (tweet.location) {
		// if the location field was really geocoordinates, then parse them and queue the tweet
		var geocoords = /-?\d{1,2}\.\d+,-?\d{1,3}\.\d+/.exec(tweet.location);
		if (geocoords) {
			lat = geocoords[0].split(',')[0];
			lon = geocoords[0].split(',')[1];
			geoCode = lat + ',' + lon;
			clusterTweet(geoCode, tweet.text);
		} else {
			// otherwise, we are going to look up the coordinates from the location name
			geoCodeLookup(tweet, _boundingBox);
		}
	}

Basically, if the tweet was geotagged, then use those coordinates. Otherwise, if the location property of the user contains coordinates, then use those coordinates. Finally, if the location is a text name for a place, then we translate that name into geocoordinates (see below). In any case, the tweet is clustered by coordinates via the clusterTweet call.
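The first two rules can also be written as a pure function, which makes them easy to try against sample data. This is a sketch under assumptions: resolveCoordinates is a hypothetical name, the tweet objects are hand-made stand-ins for API results, and a null return means "fall through to the name lookup".

```javascript
// Same pattern as in the post: "lat,lon" with decimal fractions and no spaces.
var COORD_PATTERN = /-?\d{1,2}\.\d+,-?\d{1,3}\.\d+/;

// Returns "lat,lon", or null when the location still needs a name lookup.
function resolveCoordinates(tweet) {
	if (tweet.geo && tweet.geo.coordinates) {
		return tweet.geo.coordinates[0] + ',' + tweet.geo.coordinates[1];
	}
	if (tweet.location) {
		var match = COORD_PATTERN.exec(tweet.location);
		if (match) {
			return match[0];
		}
	}
	return null;
}

// Hand-made sample tweets (not real API output).
var geotagged = resolveCoordinates({ geo: { coordinates: [43.1, -77.6] } });
var embedded = resolveCoordinates({ location: 'ÜT: 43.1548,-77.6156' });
var named = resolveCoordinates({ location: 'Rochester, N.Y.' });
```

Note that the pattern only matches coordinates written without a space after the comma, so a location such as "43.15, -77.61" would be treated as a place name.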

function clusterTweet(geoCode, text) {
	
	// if we haven't seen this geocode yet, then create an array for it
	if (!_tweetClusters[geoCode])
		_tweetClusters[geoCode] = [];
	// add the tweet text to the array of tweets for the geocoordinate
	_tweetClusters[geoCode].push(text);
	// process the next tweet "asynchronously"
	setTimeout(processTweet, 10);
}

This function says (1) if we do not have any tweets for these coordinates, then create a bucket (cluster) for those coordinates, (2) add the tweet to the bucket, (3) process the next tweet "asynchronously". Why asynchronously, you ask? The reason is that if the exact coordinates could not be determined from the geo or location properties, then we need to go to the Nominatim service to look them up...which we do asynchronously. Scheduling the next tweet with a timer also helps prevent overflowing the stack, which could happen if you called "processTweet()" directly each time the web service returned.
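The stack argument is easier to see in isolation. The sketch below is not the post's processTweet(); processQueue, handleItem, and the injectable schedule parameter are hypothetical names used to show the pattern of handling one item per timer tick instead of recursing.

```javascript
// Hypothetical driver: handles one item per tick instead of recursing.
function processQueue(items, handleItem, schedule) {
	// Default to a 10 ms timer, matching the delay used in the post.
	schedule = schedule || function (fn) { setTimeout(fn, 10); };
	var index = 0;

	function next() {
		if (index >= items.length) {
			return;
		}
		handleItem(items[index]);
		index += 1;
		// Each step starts from the event loop, so the call stack stays shallow.
		schedule(next);
	}

	schedule(next);
}
```

Because every step is started by the scheduler rather than by the previous step, the stack depth stays constant no matter how many tweets are in the queue.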

Translating addresses to coordinates using Nominatim

Translating addresses to coordinates using Nominatim is a little tricky even if you are used to using jQuery to make AJAX calls. First, it does not use the standard way of supporting JSONP. To handle this, we specify the property jsonp: false to tell jQuery not to add the "?callback=" string to the URL (the standard way of doing JSONP). We also specify the callback name to use with jsonpCallback: 'json_callback' + tweet.id_str. This generates a unique callback function name (via the tweet's unique identifier) and allows us to share the callback name with Nominatim, as required, using the querystring parameter "json_callback" as seen under the "data" block below.

		$.ajax({
			url: 'http://nominatim.openstreetmap.org/search',
			type: 'GET',
			dataType: 'jsonp',
			jsonp: false,
			jsonpCallback: 'json_callback' + tweet.id_str,
			data: {
				format: 'json',
				q: address,
				limit: 1,
				viewbox: boundingBox[1] + ',' + boundingBox[2] + ',' + boundingBox[3] + ',' + boundingBox[0],
				json_callback: 'json_callback' + tweet.id_str			
			},
			success: function(data, textStatus, xhr) {
				var coords = null;
				// if there was a response, then cache the [location name,geocoordinate] pair
				if (data && data.length && data.length > 0) {
					coords = data[0].lat + ',' + data[0].lon;
					addGeocode(address, coords);
				}
				clusterTweet(coords, tweet.text);
			}
		});

You may notice the function "addGeocode()". Another step omitted in this example occurs before the Nominatim call is even made...basically, if we have already looked up this exact address, then we use the locally cached coordinates rather than overload the web service unnecessarily. This cache uses the HTML5 Local Storage mentioned above. These functions are very simple:

// is the location name in the geocode cache?
function lookupGeocode(address) {
	return (!_cache[address]) ? null : _cache[address];
}

// add the [location, geocoordinate] pair to the cache
function addGeocode(address, value) {
	_cache[address] = value;
}
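How the cache check fits in front of the web service call is not shown in the post, so the following is only a sketch of one way to wire it. geocodeWithCache is a hypothetical name, and lookupRemote stands in for the Nominatim request above.

```javascript
// Hypothetical wrapper: consult the cache before calling the remote service.
// lookupRemote(address, callback) stands in for the Nominatim request.
function geocodeWithCache(cache, address, lookupRemote, done) {
	var cached = cache[address];
	if (cached) {
		done(cached);
		return;
	}
	lookupRemote(address, function (coords) {
		if (coords) {
			cache[address] = coords; // remember only successful lookups
		}
		done(coords);
	});
}
```

One caveat with property-style access on a localStorage object: an address that happens to match one of its own members (for example "length") would collide with that member, which the getItem() and setItem() methods avoid.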

Map the Tweets

Once we have clustered all the tweets, we then need to place them on the map. In between these steps, we iterate through each geocoordinate, consolidate the text into one giant string (headed by the count of the number of tweets in the cluster), and send it all to the "addMarker()" function.

	if (geoCode) {
		var lonLat = new OpenLayers.LonLat(geoCode.split(',')[1], geoCode.split(',')[0]).transform(
			new OpenLayers.Projection("EPSG:4326"), // transform from WGS 1984
			map.getProjectionObject() // to Spherical Mercator Projection
		);
		var feature = new OpenLayers.Feature(_markerLayer, lonLat);
		feature.closeBox = true;
		feature.popupClass = AutoSizeFramedCloud;
		feature.data.popupContentHTML = tweetText;
		feature.data.overflow = 'auto';
		var marker = feature.createMarker();
		var markerClick = function (evt) {
			if (this.popup == null) {
				this.popup = this.createPopup(this.closeBox);
				map.addPopup(this.popup);
				this.popup.show();
			} else {
				this.popup.toggle();
			}
			OpenLayers.Event.stop(evt);
		};
		marker.events.register("mousedown", feature, markerClick);

		_markerLayer.addMarker(marker);
	}
}

This function creates an OpenLayers Feature (a marker and popup combination), assigns the consolidated tweet text as the popup content, and registers a handler that shows the popup when the marker is clicked (on the "mousedown" event, to be specific). It then adds the marker to the OpenLayers marker layer (referenced as _markerLayer). I won't get into the OpenLayers mapping API too much here, but it should be pretty easy to follow if you look at the source code.
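The consolidation step mentioned earlier (one string per cluster, headed by the tweet count) is not shown in the post. A minimal sketch of it might look like the following; buildPopupHtml, escapeHtml, and the exact markup are hypothetical, and the escaping assumes the tweet text is raw, unescaped text (drop it if the API already returns HTML-safe text).

```javascript
// Assumption: tweet text is raw text and must be escaped before use as HTML.
function escapeHtml(text) {
	return String(text)
		.replace(/&/g, '&amp;')
		.replace(/</g, '&lt;')
		.replace(/>/g, '&gt;');
}

// Hypothetical formatter: the count first, then one line per tweet.
function buildPopupHtml(tweets) {
	var lines = [];
	for (var i = 0; i < tweets.length; i++) {
		lines.push(escapeHtml(tweets[i]));
	}
	return '<b>' + tweets.length + ' tweet(s)</b><br/>' + lines.join('<br/>');
}
```

The returned string is what would be handed to the marker code as tweetText for a given geocoordinate.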

Conclusion

I hope this example is useful to you. I licensed the code under the Creative Commons license to make it easy to use while still recognizing my effort. The example at http://code.brianmokeefe.com/TweetMap.html contains some more code that I didn't cover here, primarily OpenLayers configuration and helper functions for converting miles to kilometers and calculating distance to figure out the bounding box for mapping and geolocation. I know there are a lot of details I omitted, so feel free to contact me via @brianmokeefe or using the email below and I'll be glad to help.
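For readers curious about the bounding box helpers, here is a rough sketch of the idea, not the code from TweetMap.html. milesToKm and computeBoundingBox are hypothetical names, the Earth radius is the usual mean-radius approximation, and the math is a simple approximation that is only reasonable for small radii away from the poles and the antimeridian.

```javascript
var KM_PER_MILE = 1.609344;
var EARTH_RADIUS_KM = 6371; // mean radius; an approximation

function milesToKm(miles) {
	return miles * KM_PER_MILE;
}

// Hypothetical helper: a box that roughly encloses a circle of the given radius.
// Returns { north, south, east, west } in degrees.
function computeBoundingBox(lat, lon, radiusMiles) {
	var angular = milesToKm(radiusMiles) / EARTH_RADIUS_KM; // radians
	var latDelta = angular * 180 / Math.PI;
	// A degree of longitude shrinks with the cosine of the latitude.
	var lonDelta = latDelta / Math.cos(lat * Math.PI / 180);
	return {
		north: lat + latDelta,
		south: lat - latDelta,
		east: lon + lonDelta,
		west: lon - lonDelta
	};
}
```

The same box can serve both purposes mentioned above: setting the map extent and restricting the geocoder to the search area.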

HTML5 Canvas Demo -- Graph Viewer/Editor

I do a lot of work with graphs on a day to day basis, and with my interest in HTML5 and the platform independence around it, I decided to take a bit of my free time to try to design a very simple graph viewer/editor that would work cross platform in HTML5. The code below was tested in Chrome 16, IE 9, and Firefox 10.
Goals:

  • Be able to display 100 nodes connected with edges using the HTML5 Canvas element
  • Be able to drag and move nodes
  • Be able to add nodes
  • Be able to add edges
  • Concentrate on key concepts and not worry so much about tertiary functionality or looks (mostly z-indexing and where the edges connect to the node)

The results of this experiment can be viewed on my code page at http://code.brianmokeefe.com/Graph.html. Most of the functionality is encapsulated in the Javascript file at http://code.brianmokeefe.com/Graph.js. Note: for the latter file, you probably want to right-click and save target, as it is a scripting file. The JS file is licensed under the Apache License, Version 2.0.

Some key components to note:

  1. There are two main objects used here: (a) the StateMachine to manage the states of the page (Select, Add Node, Add Edge) as controlled by the toggle buttons at the top, and (b) the Graph.Manager object that controls the graph functionality, including events received from the DOM.
  2. The HTML5 canvas object is created dynamically in the Graph.Manager constructor.
  3. The Graph.Manager.paint() call manages drawing the nodes and edges (via the draw() function in the Graph.Node and Graph.Edge objects).
  4. It doesn't work on my Android phone yet, except for adding a node (probably because I didn't do anything with touch).

NOTE: In the code snippets below, anything denoted as "this" is scoped by the object that contains it. It is not intended as cut-and-paste code without understanding the entire class from the Javascript file above.

In the Graph.Manager constructor, this section creates the HTML5 canvas object, and its resulting 2D context.

	// Set the HTML5 Canvas object
	this.canvas = document.createElement('canvas');
	this.canvas.height = height;
	this.canvas.width = width;
	this.canvas.style.position = 'relative';
	this.context = this.canvas.getContext('2d');

	// Add the canvas to the container element
	this.container.appendChild(this.canvas);

The Context object is the important piece for drawing in the Canvas object. It is used here to draw a node: in this case, a circle centered at x,y with a given radius. The arguments 0 and 2*Math.PI are the start and end angles in radians, so together they describe a full circle (an arc from 0 to 360 degrees).

	context.save();
	
	// Draw node
	context.strokeStyle = 'rgb(0,0,0)';
	context.fillStyle = this.fillColor;
	context.beginPath();
	context.arc(this.position.x, this.position.y, this.radius, 0, 2 * Math.PI, false);
	context.closePath();
	context.stroke();
	context.fill();
		
	context.restore();

This code draws an edge between the two nodes. It uses very simple logic to connect the edge to each node at 0, 90, 180, or 270 degrees depending upon the position of the nodes in relation to each other: if the two nodes are further apart horizontally, connect them on the left and right sides; if they are further apart vertically, connect them on the top or bottom, as appropriate.

Graph.Edge.prototype.draw = function(context) {
	context.save();

	var deltaX = this.parentNode.getPosition().x - this.childNode.getPosition().x;
	var deltaY = this.parentNode.getPosition().y - this.childNode.getPosition().y;
	var coordStart, coordStop;
	
	if (Math.abs(deltaX) >= Math.abs(deltaY)) {
		coordStart = (deltaX < 0) ? this.parentNode.getRight() : this.parentNode.getLeft();
		coordStop = (deltaX < 0) ? this.childNode.getLeft() : this.childNode.getRight();
	} else {
		coordStart = (deltaY < 0) ? this.parentNode.getTop() : this.parentNode.getBottom();
		coordStop = (deltaY < 0) ? this.childNode.getBottom() : this.childNode.getTop();		
	}
	
	// Draw line and arrow from parent to child
	context.strokeStyle = 'rgb(0,0,0)';
	context.fillStyle = 'rgb(0,0,0)';
	
	// Draw the line 
	context.beginPath();
	context.moveTo(coordStart.x,coordStart.y);
	context.lineTo(coordStop.x,coordStop.y);
	context.closePath();
	context.stroke();
	context.fill();
	
	context.restore();
};
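The side-selection rule can be separated from the drawing so it is easier to reason about. The sketch below is a hypothetical pure version, not part of Graph.js. It assumes canvas coordinates (x grows to the right, y grows downward) and that "top" means the smaller y value; the vertical branch follows that assumption and may not match how getTop() and getBottom() are defined in Graph.js.

```javascript
// Hypothetical pure version of the side-selection rule.
// Assumes canvas coordinates: x grows right, y grows down, "top" = smaller y.
function chooseAnchorSides(parentPos, childPos) {
	var deltaX = parentPos.x - childPos.x;
	var deltaY = parentPos.y - childPos.y;

	if (Math.abs(deltaX) >= Math.abs(deltaY)) {
		// Further apart horizontally: use the facing left/right sides.
		return deltaX < 0
			? { start: 'right', stop: 'left' }
			: { start: 'left', stop: 'right' };
	}
	// Further apart vertically: use the facing top/bottom sides.
	return deltaY < 0
		? { start: 'bottom', stop: 'top' }
		: { start: 'top', stop: 'bottom' };
}

// Parent to the left of the child: leave from its right side.
var sides = chooseAnchorSides({ x: 0, y: 0 }, { x: 100, y: 10 });
// sides.start is 'right' and sides.stop is 'left'
```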

There is also code in the HTML page to visualize the edge as two nodes are being connected. The paintWithCallback function does the same as the paint() function above, except it calls the function passed as its parameter immediately after the paint. In this case, it allows us to draw the edge while the mouse is being dragged from the start node toward the end node. The code is very similar to the Edge code above, except it is drawing from the node to the mouse pointer, instead of from one node to the next.

manager.paintWithCallback(function(context) {
	var deltaX = _stateMachine.data.getNode().getPosition().x - evt.offsetX;
	var deltaY = _stateMachine.data.getNode().getPosition().y - evt.offsetY;
	var coordStart;
	var coordStop = new Graph.Coordinates(evt.offsetX, evt.offsetY);
			
	if (Math.abs(deltaX) >= Math.abs(deltaY)) {
		coordStart = (deltaX < 0) ? _stateMachine.data.getNode().getRight() : _stateMachine.data.getNode().getLeft();
	} else {
		coordStart = (deltaY < 0) ? _stateMachine.data.getNode().getTop() : _stateMachine.data.getNode().getBottom();
	}

	context.save();

	// Draw line and arrow from parent to child
	context.strokeStyle = 'rgb(0,0,0)';
	context.fillStyle = 'rgb(0,0,0)';
	context.beginPath();
	context.moveTo(coordStart.x,coordStart.y);
	context.lineTo(coordStop.x,coordStop.y);
	context.closePath();
	context.stroke();
	context.fill();
			
	context.restore();		
});
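The relationship between paint() and paintWithCallback() can be sketched with a stripped-down stand-in for Graph.Manager. SketchManager and its drawables array are hypothetical; the real class in Graph.js does considerably more (event handling, canvas creation, and so on).

```javascript
// Hypothetical stand-in for Graph.Manager, reduced to the painting logic.
function SketchManager(context, drawables) {
	this.context = context;
	this.drawables = drawables; // objects exposing draw(context)
}

SketchManager.prototype.paint = function () {
	for (var i = 0; i < this.drawables.length; i++) {
		this.drawables[i].draw(this.context);
	}
};

// Repaint everything, then let the caller draw an overlay on top.
SketchManager.prototype.paintWithCallback = function (callback) {
	this.paint();
	callback(this.context);
};
```

Because the callback runs after the regular paint, whatever it draws (here, the in-progress edge) appears on top of the nodes and existing edges.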

Finally, because Javascript implementations differ in the way that they create the DOM event object, this class normalizes them for simplicity. Basically, if the event object doesn't have offsetX and offsetY, then equivalent values are computed by subtracting the event target's left and top offsets from clientX and clientY.

Graph.BrowserEvent = function(evt) {
	this.offsetX = evt.offsetX ? evt.offsetX : (evt.clientX - evt.currentTarget.offsetLeft);
	this.offsetY = evt.offsetY ? evt.offsetY : (evt.clientY - evt.currentTarget.offsetTop);
	this.type = evt.type;
	this.whoAmI = "Graph.BrowserEvent";
};
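One subtlety in the constructor above: the truthiness test (evt.offsetX ? ...) also takes the fallback path when offsetX is exactly 0. A variant that checks the type instead might look like this; normalizeOffsets is a hypothetical name, and the event objects in the usage lines are hand-made stand-ins for real DOM events.

```javascript
// Hypothetical variant that treats an offset of 0 as a valid value.
function normalizeOffsets(evt) {
	var hasOffsets = typeof evt.offsetX === 'number' && typeof evt.offsetY === 'number';
	return {
		offsetX: hasOffsets ? evt.offsetX : evt.clientX - evt.currentTarget.offsetLeft,
		offsetY: hasOffsets ? evt.offsetY : evt.clientY - evt.currentTarget.offsetTop,
		type: evt.type
	};
}

// Hand-made event objects (not real DOM events).
var withOffsets = normalizeOffsets({ offsetX: 0, offsetY: 12, clientX: 50, clientY: 60, type: 'mousedown' });
var withoutOffsets = normalizeOffsets({
	clientX: 50,
	clientY: 60,
	currentTarget: { offsetLeft: 10, offsetTop: 20 },
	type: 'mousemove'
});
```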

I hope this is useful. If you have any questions or comments, please drop me an email at the address below, or via Twitter (@brianmokeefe).

Installing Cassandra as a Windows Service

Recently, I have been using Cassandra on some big data processing projects. While running it in the foreground was fine for early development, I needed to get it running as a Windows service for more mature development use. Luckily, this is easier than it first seems if you look closely. Here is a list of steps for Cassandra 1.0.X:

  1. I assume you have Cassandra working fine on Windows, but if you do not, it probably has something to do with the Unix-style paths in conf\Cassandra.yaml.
  2. Download Apache Commons Daemon -- Procrun. I suggest getting the Native Binaries, which are a little tricky to find from the website, but can be found directly at http://www.apache.org/dist/commons/daemon/binaries/windows/.
  3. Unzip the commons-daemon...-windows.zip file on your hard drive.
  4. Create an environment variable called PRUNSRV and point it to the full path of prunsrv.exe from the previous step. IMPORTANT NOTE: If you are running Windows 64 bit, make sure you reference the prunsrv.exe in the amd64 subdirectory! Otherwise, the service will fail to start.
  5. Navigate to the bin\ directory in your Cassandra installation.
  6. At a command prompt, type cassandra.bat install
  7. Open the Services application under Administrative Tools. You should see an entry called cassandra. Start that service.
  8. Use the cassandra-cli to test that Cassandra is working.

Troubleshooting:

  • If the service fails to start and you are running 64 bit Windows, see the note above about using the amd64\prunsrv.exe. If you look in Cassandra\bin\daemon\commons-daemon.log and see the line Commons Daemon procrun (1.0.8.0 32-bit) started, you are using the wrong one.
  • Make sure you are using the proper 32 or 64 bit JVM.
  • I'll add more as I encounter them.

References:

  1. http://svn.apache.org/repos/asf/cassandra/trunk/README.txt

HTML5 Reaction Time Test

As I try to update my knowledge with more about the HTML5 spec and apply the knowledge to a cognitive readiness assessment project I am working on, I have created a simple HTML5/AJAX/CSS Reaction Time Test (not verified as a scientific instrument) for demonstration purposes (link below). It is amazing what you can do now in a browser with HTML5, as this could be developed without creating any static images or using a third party programming tool or middleware server. I am posting this work under the Creative Commons Attribution 3.0 Unported License (CC BY 3.0) to share the knowledge.

NOTE: You will need a recent HTML5 compatible browser such as Firefox, Chrome, or Internet Explorer 9 to view this.
http://code.brianmokeefe.com/reactiontime.htm

Meeting Announcement for my presentation at IEEE-CS Dayton Chapter Meeting

Dayton Chapter Meeting
Tuesday, June 21, 2011
What is a Bayesian Network and why would a Computer Scientist care?

TOPIC: Bayesian networks
SPEAKER: Brian O’Keefe, InfoSciTex Corp
DATE: Tuesday, June 21, 2011
TIME: 11:30-12:30 PM
PLACE: InfoSciTex (Different location from 2010)
4027 Colonel Glenn Hwy, Suite 210
Dayton, OH 45431-1672
Just West of intersection with National Rd/Grange Hall Rd, in Signal Hill TechneCenter (building 4027 is the left-most of the 3 buildings)
RSVP: Dave Perez, david.c.perez@computer.org or (937) 904-5486. Pizza and soft drinks provided for $3/person - RSVP (Name, e-mail, phone, citizenship, IEEE member?, CompSoc?, pizza?) required. Meeting open to all.

ABSTRACT: Artificial Intelligence, Data Mining, Cognitive Systems, Simulation…all terms that you come across in the computing world often and all heavily reliant upon statistical reasoning. Beyond computing, University of Michigan Psychologist Richard E. Nisbett wrote an article in Science magazine demonstrating that even relatively brief training on statistical reasoning may improve the quality of everyday decision making.

Bayes’ Theorem is widely accepted as the theoretically sound approach to represent and reason under uncertainty. An application of Bayes’ Theorem, Bayesian Networks, are probabilistic graphical models used across a wide variety of applications such as bioinformatics, medicine, information retrieval and classification, data fusion, decision support systems, and gaming.

The goal of this presentation is to show that Bayesian Networks and “directed acyclical graphs” are much simpler to understand than you might believe. A Bayesian Network’s graphical presentation is very intuitive and even a limited knowledge of statistics provides enough foundation to use them for your own decision analysis and planning.

This presentation will begin with a very understandable demonstration of Bayes’ theorem, will follow with an intuitive Bayesian Network example to demonstrate the theory, and conclude with real world Bayesian examples, both from the speaker’s past work, and other common applications you may not even realize touch your life every day, such as spam filtering.

BIO: Brian O’Keefe is a Team Leader of Software Systems and IT for the National Security Systems division of Infoscitex Corporation. Mr. O’Keefe received a B.S. in Computer Science in 1999 and taught himself the world of Bayesian Networks despite having no Mathematics degree, demonstrating that it isn’t as hard as it looks! Having researched the application of Bayesian Networks in a wide assortment of applications ranging from route planning to biosurveillance to counterinsurgency, and having spoken and trained others in a variety of topics including data warehousing, product SDKs, object-oriented programming, and Bayesian Networks, he hopes not to disappoint!

You reap what you sow: An observation of policy gone awry

In a social modeling and public policy state of mind due to traveling to the OSD HSCB Focus 2011 conference, I had an interesting experience of policy gone awry. As everyone is aware, most airlines have gone to a la carte baggage fees, which seems to have led to significantly more (and larger) carry-on bags and significantly fewer checked bags. My flight the morning of the conference was full, and just before boarding, they announced that there would be no gate check service on the flight, as was their policy with the larger commuter jets (not sure if it was the regional or parent airline that had this policy).

At this point, I'm sure you can imagine where this was going to go...especially if you look around the terminal and see all the giant carry-ons. Sure enough, the plane was chaotic with people trying to fit carry-on bags everywhere. At one point, the flight attendants were asking people to put all smaller carry-ons under the seats -- problematic with the minimal leg room as it is -- and to wear their coats to make room, which is uncomfortable on a plane given the heavy coats that winter in Rochester involves. Not surprisingly, people didn't seem eager to comply with these requests...I know I wasn't jumping up to comply as I (ok, my company) paid the fee to check my bag, so I thought I did my part. The situation itself was uncomfortable, but upon hearing the flight attendant walking up and down the aisle complaining that everyone was only looking out for themselves, I was really annoyed. I know she probably had no say in these policies, but it was her employer that (in my opinion) created this situation, whether she liked it or not. I just wonder if anyone in the airline management was even aware of the situations they were causing. When I got off the flight at Reagan National though, I can tell you that there were only two bags in the carousel from a full regional jet!

So do I have a better solution or would I have paid X more dollars to allow the airline to avoid this situation? I don't have an answer -- the beauty of calling it merely an observation. But I do know that I notice this situation happening more often, whether due to the economy, a greater awareness, or some other factor. After all, it will be interesting to see what happens in two years with my mobile service now that my unnamed carrier is getting rid of a token discount on a new phone (with new contract) that was enough incentive to keep me from switching.

My ideal developer machine running RHEL 5.5 and Windows 7

UPDATED 12/9/2010: As I suspected, I do see some file corruption when writing large files and folders to the NTFS partition via the Windows 7 virtual machine. I added the details below.

Translating VMAP data

Recently, I was tasked with utilizing vector data supplied by NIMA in VMAP format for a project. If you have ever dealt with VMAP data (vector product format), as used by NIMA, you know that the format is not for the faint of heart. As a result, I decided to import the data into a PostgreSQL database and utilize PostGIS to easily query the data. As I could not find a simple tutorial online, and since there were some gotchas to look out for, I decided to put one here.

Sony earns another chance by doing right...eventually.

It was a long, drawn out, and frustrating process, but I am finally satisfied with my response from Sony regarding my LCD TV. In my previous post, I detailed my problems with my Sony Bravia KDL46V3000 television. As the problem seemed to be widespread among many models of Sony TVs, although I did not get the extended warranty, I thought I had a strong case under the "Implied Warranty of Merchantability" as this seemed to be a design flaw. With this information in hand and a visit and problem summary from a Sony Platinum support technician, I started out to state my case.

I contacted Sony's support and customer relations, sent them my information, and later received an unacceptable token offer of a hundred dollars or so off retail on a model they were phasing out. I called again and pleaded my case to no avail. I wrote a letter to their Executive Review Committee that they acknowledged receiving but stated that their offer did not change. At this point, I filed a complaint with the New York State Consumer Protection Board and began to look at other options, including filing a lawsuit in small claims court.

After a week or so passed with no response to the Consumer Protection Board's letter to Sony, my preliminary investigation into small claims court suggested that New York State is not consumer friendly: you need to file against a corporation in the county in which it is registered, which in Sony's case is Albany County...a 3.5 hour drive from my home. My anger grew when I saw an article on The Consumerist about a person with the same problems as me who had blogged and Twittered about it (as I had), and Sony agreed to replace his television for free with a brand new television! Obviously followers and daily readers matter a lot when it comes to your web presence (his being significantly better than mine).

Given this information, I began to give up hope for a speedy resolution and started shopping for a new television (I eventually bought the Insignia NS-L46X-10A), as the Super Bowl and Olympics were just around the corner. Soon after, I received a welcome surprise: a letter from Sony, delivered via the New York State Consumer Protection Board. While my initial subjective response was not one of joy given the process I had been through and the greater value of a cash judgment in small claims court (given I had already purchased a replacement television), an objective review of the offer convinced me that it was a fair attempt by Sony to resolve my dissatisfaction and the issue with the television. As such, I graciously accepted a settlement and I look forward to giving the replacement Sony television a chance to redeem the Sony brand name and make me consider Sony products again in the future.

ASP.NET AJAX Library: When they say exact signature, they mean EXACT SIGNATURE

Here is a little gotcha to look out for when using ASP.NET AJAX Library controls, such as the AutoComplete control. For any control that makes a call to a web service, the web service must have the EXACT signature listed for the call...to the point that even the parameter names and the case of those parameter names must match EXACTLY. This tripped me up, as my controls were not calling my web service (with no error message or feedback either) because I called the first parameter "String prefix" instead of "String prefixText".
