29 03/11
13:08

Local public transportation in my pocket

I’ve recently spent some time developing a web app for mobile devices that works with some of the data published by the Gijón City Council, specifically the schedules and live arrivals of local public land transportation. The way that information is currently presented to mobile devices is very heavy and slow, so I thought it might be useful to build something simpler for personal use.

Basically, it is a simple web service that caches data aggressively (to avoid stressing the data origin with many requests) plus an AJAX-powered frontend whose CSS is written with mobile browsers in mind (it works flawlessly on Android’s browser and Mobile Safari). Additionally, if you add it as a bookmark to your iPhone’s home screen it behaves like a native application (you know, splash screen, custom icon, taskbar and so on).
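To give an idea of what caching on the service side can look like, here is a minimal sketch in Python. It is not the code behind the app: the class name TTLCache, the fetch callback and the 30-second lifetime are all assumptions made for illustration.

```python
import time


class TTLCache:
    """Caches the values produced by 'fetch' for 'ttl' seconds (illustrative sketch)."""

    def __init__(self, fetch, ttl=30.0, clock=time.monotonic):
        self._fetch = fetch      # hypothetical callable that queries the data origin
        self._ttl = ttl          # assumed lifetime of an entry, in seconds
        self._clock = clock      # injectable clock, handy for testing
        self._entries = {}       # key -> (expiry time, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # still fresh: no request reaches the origin
        value = self._fetch(key)  # missing or expired: ask the origin once
        self._entries[key] = (now + self._ttl, value)
        return value
```

Every request for the same key within the lifetime is answered from memory, so the origin sees at most one request per key and period.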

I’m now working on client-side caching, using HTML5 caching for offline use. This way the application will start much faster. It’s almost done, but it still needs some debugging.

I don’t intend to make it public for now. However, if you find it useful feel free to drop me a line. Beta testers are always welcome (but unfortunately won’t be rewarded).

This is what it looks like at the moment. The source will be released soon.

Update (23:26): Android screenshots provided by Javier Pozueco. Thanks buddy!

24 03/11
12:31

Search term completion using a search tree

Google search box completion

*lol*

Nowadays it’s very common to find websites offering hints while you’re typing in a search box. Google is a pretty good example. But how could it be implemented?

This feature could be implemented either on the client side or on the server side. If the word list is big (it usually is), it’s better to keep the lookup logic on the server side: the page sent to the client stays smaller, and server-side caches save computing power (handy when you plan to serve many requests).
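As a rough illustration of such a server-side cache, the sketch below memoizes lookups per prefix with functools.lru_cache from Python’s standard library. WORD_LIST and get_hints are made-up names for this example, and lru_cache never expires entries by time, so a real service would need its own invalidation policy.

```python
from functools import lru_cache

# Hypothetical word list; a real service would load it from a file or database.
WORD_LIST = ("house", "housing", "hound", "horse", "mouse")


@lru_cache(maxsize=1024)
def get_hints(prefix):
    """Return the words starting with 'prefix' as a tuple, cached per prefix."""
    return tuple(word for word in WORD_LIST if word.startswith(prefix))
```

The first call for a given prefix scans the list; repeated calls with the same prefix are served from the cache, which get_hints.cache_info() lets you verify.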

Either way, there should be a data structure somewhere containing the word list and an algorithm to do the lookup. The simplest approach may be to use a list to store the words and issue something like this when you want to get a list of hints for a given prefix:

filter(lambda x: x.startswith(prefix), word_list)

That’s Python’s filter, which works the same way as Haskell’s well-known higher-order function filter. It builds a new list with the elements of the original list (word_list) that satisfy the predicate (the lambda function). In Python 3 it returns a lazy iterator instead, so wrap it in list() if you need an actual list.

Although the results can (and should) be cached, the very first lookup (or any lookup after the cache expires) would be inefficient because the entire list must be traversed, which takes linear time. Not bad, but as the problem grows (i.e. more and more words in the database) the lookup may become too slow, especially if you’re serving several users at the same time. If the list were sorted, the execution time could be improved by writing a more sophisticated algorithm, but let’s keep it that way for now.
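For the curious, this is one way the sorted-list variant could look. It is only a sketch (the function name hints_from_sorted is made up) and it relies on the fact that, in a sorted list, all the words sharing a prefix sit next to each other, so a binary search finds the first one and a short scan collects the rest.

```python
from bisect import bisect_left


def hints_from_sorted(sorted_words, prefix):
    """Return the words starting with 'prefix' from a list sorted in ascending order."""
    # Binary search: index of the first word that is >= prefix
    i = bisect_left(sorted_words, prefix)
    hints = []
    # Matching words are contiguous, so stop at the first mismatch
    while i < len(sorted_words) and sorted_words[i].startswith(prefix):
        hints.append(sorted_words[i])
        i += 1
    return hints
```

The search costs a logarithmic number of comparisons, plus one step per hint returned, instead of a pass over the whole list.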

Fortunately, there are better and faster ways to tackle the problem. If you don’t want to write code (usually the best choice) you may use a high-performance indexing engine such as Apache Lucene. But if you prefer the ‘do-it-yourself’ way (for learning purposes), a search tree (more specifically, a trie or prefix tree) is a good approach.

I’ve roughly benchmarked both alternatives (the list and the tree) and, as expected, the tree is much quicker at generating hints. I fed both data structures with the contents of an American English word list holding ~640k words (Debian package wamerican-insane).

So, assuming four is a reasonable minimum prefix length, I measured the time it would take to get a list of words prefixed by hous (yes, just one, remember I said this was a poor benchmark? ;). Unsurprisingly, it took around 230 times longer for the list alternative to generate the hints (438.96 ms vs 1.92 ms). Wow.
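If you want to repeat a comparison like this yourself, the sketch below shows one possible setup with timeit. It is not the benchmark I ran: it uses a compact dict-based trie instead of the class shown in this post, a synthetic word list instead of wamerican-insane, and every name in it (build_trie, trie_hints, list_hints, benchmark) is made up for the example.

```python
import itertools
import string
import timeit


def build_trie(words):
    """Build a dict-of-dicts trie; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def trie_hints(root, prefix):
    """Collect every word below the node reached by following 'prefix'."""
    node = root
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return []
    hints, stack = [], [(node, prefix)]
    while stack:
        node, path = stack.pop()
        for key, child in node.items():
            if key is None:
                hints.append(path)          # end-of-word marker
            else:
                stack.append((child, path + key))
    return hints


def list_hints(words, prefix):
    """Linear scan over the whole list."""
    return [w for w in words if w.startswith(prefix)]


# Synthetic data: every 4-letter string over the letters a-p (16**4 = 65,536 "words")
WORDS = ["".join(p) for p in itertools.product(string.ascii_lowercase[:16], repeat=4)]
TRIE = build_trie(WORDS)


def benchmark(prefix="hop", repeat=20):
    """Return the average seconds per lookup as (list, trie)."""
    t_list = timeit.timeit(lambda: list_hints(WORDS, prefix), number=repeat)
    t_trie = timeit.timeit(lambda: trie_hints(TRIE, prefix), number=repeat)
    return t_list / repeat, t_trie / repeat
```

Calling benchmark() returns both averages; the absolute numbers depend on the machine and the data, so don’t expect them to match the figures above.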

My implementation of the tree is as follows. The API is quite straightforward: the “hot” methods are put and get_hints. I’ve stripped out the test suite for space reasons.

Usage example:

>>> tree = HintSearchTree()
>>> tree.put("nacho")
>>> tree.put("nachos")
>>> tree.put("nachete")
>>> tree.get_hints("nach")
['nachete', 'nacho', 'nachos']
>>> tree.get_hints("nacho")
['nacho', 'nachos']
>>> tree.delete("nacho")
>>> tree.get_hints("nacho")
['nachos']
>>> tree.count_words()
2
>>> tree.get_hints("n")
['nachete', 'nachos']
>>> tree.is_indexed("nachete")
True
>>> tree.is_indexed("nach")
False
>>> tree.empty()
False

class HintSearchTreeNode(object):
  def __init__(self, parent=None, terminal=False):
    self._children = {}
    self._terminal = terminal
    self._parent = parent
 
  @property
  def children(self):
    return self._children
 
  @property
  def terminal(self):
    return self._terminal
 
  @terminal.setter
  def terminal(self, value):
    self._terminal = value
 
  @property
  def parent(self):
    return self._parent
 
class HintSearchTree(object):
  def __init__(self):
    self._root = HintSearchTreeNode()
 
  def put(self, word):
    """Adds a word to the tree."""
    # TODO: Sanitize 'word'
    if len(word) > 0:
      self._put(self._root, word)
 
  def count_words(self):
    """Retrieves the number of indexed words in the tree."""
    return self._count_words(self._root)
 
  def is_indexed(self, word):
    """Returns True if 'word' is indexed."""
    node = self._find(self._root, word)
    return node is not None and node.terminal is True
 
  def get_hints(self, prefix):
    """Returns a list of words prefixed by 'prefix'."""
    return self._match_prefix(self._root, prefix)
 
  def delete(self, word):
    """Deletes 'word' (if exists) from the tree."""
    terminal = self._find(self._root, word)
    # Ignore the empty word and prefixes that are not indexed words
    if word and terminal is not None and terminal.terminal:
      terminal.terminal = False
      self._prune(terminal.parent, word)
 
  def empty(self):
    """Returns True if the tree contains no elements."""
    return len(self._root.children) == 0
 
  def _put(self, node, word, depth=0):
    next_node = node.children.get(word[depth])
    if next_node is None:
      next_node = HintSearchTreeNode(parent=node)
      node.children[word[depth]] = next_node
    if len(word)-1 == depth:
      next_node.terminal = True
    else:
      self._put(next_node, word, depth+1)
 
  def _count_words(self, node):
    words = 1 if node.terminal is True else 0
    for k in node.children:
      words += self._count_words(node.children[k])
    return words
 
  def _match_prefix(self, node, prefix):
    terminal = self._find(node, prefix)
    if terminal is not None:
      return self._harvest_node(terminal, prefix)
    else:
      return []
 
  def _harvest_node(self, node, prefix, path=""):
    hints = []
    if node.terminal is True:
      hints.append(prefix + path)
    # Visit children in sorted order so hints come back alphabetically
    for k in sorted(node.children):
      hints.extend(self._harvest_node(node.children[k], prefix, path+k))
    return hints
 
  def _find(self, node, word, depth=0):
    if depth == len(word):
      return node
    else:
      child = node.children.get(word[depth])
      if child is not None:
        return self._find(child, word, depth+1)
      else:
        return None
 
  def _prune(self, node, word):
    if self._count_words(node.children[word[-1]]) == 0:
      del node.children[word[-1]]
      if len(node.children) == 0 and node.parent is not None \
          and node.terminal is not True:
        self._prune(node.parent, word[:-1])

The code is released in the public domain.

24 02/11
22:55

Some Perl to redirect HTTP requests

After almost a year without publishing a single post, it seems this week I’m going to beat all my records.

A week ago, I wanted to prank my brother for a while. Nothing sophisticated… just some iptables rules, Tinyproxy and HTTP magic. To go ahead with my evil plans, I needed “something” able to redirect an HTTP request. Actually, there are several ways to do that: Apache redirects, Tornado, Netcat* and so on. These alternatives are fast, bulletproof and time-saving, but not fun.

As many of you probably know, I haven’t found a job yet. That necessarily means I’ve got plenty of free time to waste. So… what did I do? I wrote some Perl, and today I’m publishing the source code just in case someone finds it useful. Like the previous entry, it’s released into the public domain.

The script just accepts connections, replies with a 301 (Moved Permanently) and sets Location to the URI given as a command-line argument (option -u). It lacks some security checks (left as an exercise for the reader) but it does what it is supposed to do. You’ll likely spot some silly bugs, as I haven’t spent much time rereading it. Reports are welcome!
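For comparison, the same idea (answer every request with a 301 and a Location header) fits in a few lines of Python using the standard http.server module. This is only a sketch, not a port of the Perl script: it has no thread pool or logging setup, and the names make_redirect_handler and run are made up for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_redirect_handler(target_uri):
    """Build a request handler class that redirects every request to 'target_uri'."""

    class RedirectHandler(BaseHTTPRequestHandler):
        def _redirect(self):
            self.send_response(301)  # status line plus Server and Date headers
            self.send_header("Location", target_uri)
            self.send_header("Connection", "close")
            self.end_headers()

        # Same answer regardless of the request method
        do_GET = do_HEAD = do_POST = _redirect

    return RedirectHandler


def run(port, target_uri):
    """Serve redirects on 'port' until interrupted (blocks the caller)."""
    HTTPServer(("", port), make_redirect_handler(target_uri)).serve_forever()


# Example (blocks): run(7070, "http://www.example.org")
```

Unlike the Perl version, this handles one connection at a time, which is more than enough for a prank.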

For those wondering, the prank was a big success. I’m afraid I can’t share any details for now, but it turns out my bro still thinks his computer has been cracked.

Example invocation:

$ perl redir.pl -p 7070 -v -t 3 -u http://31337.pl
2011/02/24 21:41:54 Listening on port 7070
2011/02/24 21:41:54 Redirecting HTTP requests to: 'http://31337.pl'
2011/02/24 21:41:54 3 thread(s) working under the hood

And finally the source code:

use warnings;
use threads;
 
use Thread::Queue;
use POSIX;
 
use IO::Socket::INET;
use HTTP::Request;
use HTTP::Status qw(:constants status_message);
 
use Getopt::Long;
use DateTime::Format::HTTP;
use Data::Validate::URI qw(is_http_uri);
use Log::Log4perl qw(:easy);
 
use constant MAX_THREADS => 10;
use constant MAX_LEN_HEADERS_BUFFER => 8*1024;
use constant DEFAULT_REDIRECT_URI => "http://www.example.org";
use constant DEFAULT_PORT => 80;
use constant DEFAULT_POOL_SIZE => 3;
 
my $redir_uri = DEFAULT_REDIRECT_URI;
my $server_port = DEFAULT_PORT;
my $thread_pool_size = DEFAULT_POOL_SIZE;
my $verbose;
 
GetOptions('url=s' => \$redir_uri, 
           'port=i' => \$server_port,
           'threads=i' => \$thread_pool_size,
           'verbose'  => \$verbose) or exit -1;
 
die "Invalid redirect URI (e.g. http://www.example.org)\n" unless is_http_uri($redir_uri);
die "Invalid port (e.g. 8080)\n" unless 0 < $server_port && $server_port < 2**16;
die "Invalid pool size (should be in [1..".MAX_THREADS."])\n" 
            unless 0 < $thread_pool_size && $thread_pool_size <= MAX_THREADS;
 
Log::Log4perl->easy_init( level => $verbose? $DEBUG : $INFO );
 
my $pending = Thread::Queue->new(); 
 
my $lsock = IO::Socket::INET->new( LocalPort => $server_port,
                                   Proto => 'tcp',
                                   Listen => 1,
                                   Reuse => 1 ) or die "Couldn't bind listening socket ($!)\n"; 
 
INFO("Listening on port $server_port\n");
INFO("Redirecting HTTP requests to: '$redir_uri'\n");
 
my @workers = ();
for (1..$thread_pool_size) {
    if (my $thread = threads->create("worker")) {
        push(@workers, $thread);
    }
}
 
DEBUG(sprintf("%d thread(s) working under the hood\n", $#workers+1));
 
# Set a tidy shutdown just in case an external agent SIG{INT,TERM}s the process
$SIG{'INT'} = $SIG{'TERM'} = sub {
    # Dirty hack. threads->kill() does not wake up the thread :(
    for (1..@workers) {
        $pending->enqueue(-1);
    }
    for (@workers) {
        DEBUG(sprintf("Worker %d terminated: %d clients served\n", $_->tid, $_->join())); 
    }
    close($lsock); 
    exit 0; 
};
 
while(1) {
    my $csock = $lsock->accept() or next;
    $pending->enqueue(POSIX::dup(fileno $csock));
    DEBUG(sprintf("New client enqueued: %s:%s\n", $csock->peerhost, $csock->peerport));
    close($csock);
}
 
sub worker {
    my $clients_served = 0;
    while(my $fd = $pending->dequeue) { # API promises thread safety :-)
        if ($fd == -1) {
            return $clients_served;
        }
 
        my $sock = IO::Socket::INET->new_from_fd($fd, "r+");
        DEBUG(sprintf("Dequeued client %s:%d by worker %d.\n", $sock->peerhost,
                            $sock->peerport, threads->tid()));
 
        my $buf = "";
        while(<$sock>) {
            # CAUTION: there isn't any self protection against very long lines
            last if /^\r\n$/;
            $buf .= $_;
            goto BYE if length $buf > MAX_LEN_HEADERS_BUFFER;
        }
 
        if (my $request = HTTP::Request->parse($buf)) {
            INFO(sprintf("[%s] %s {%s}\n", $request->method, $request->uri, $sock->peerhost));
        }
 
        printf $sock "HTTP/1.1 %d %s\r\n", 
            HTTP_MOVED_PERMANENTLY, status_message(HTTP_MOVED_PERMANENTLY);
        printf $sock "Date: %s\r\n", DateTime::Format::HTTP->format_datetime;
        print $sock "Location: $redir_uri\r\n";
        print $sock "Server: Simple HTTP Redirection/0.1 ($^O)\r\n";
        print $sock "Connection: close\r\n";
        print $sock "\r\n";
 
BYE:  
        $clients_served++;
        close($sock);
    }
}

(*) Just a rough approach; it may drop connections:

while [ 1 ]; 
 do echo -e "HTTP/1.1 301 Moved Permanently\r\nLocation: http://31337.pl\r\n\r\n" | nc -l 7070; 
done

23 02/11
01:03

Reverse Polish Notation Evaluation in Python

This introduction is followed by some Python code (function evaluate_postfix_expr) to evaluate expressions (only integers, but may be extended with ease) in Reverse Polish Notation (RPN). Some simple tests are also included in the bundle.

I admit it’s not terribly practical, but I thought it might be useful for someone (CS students maybe?). If you want to examine the stack in each iteration you only have to turn debugging on. That can be accomplished by changing logging.INFO to logging.DEBUG in the logging.basicConfig call (line 7).
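To see how the stack evolves without touching the logging setup, here is a small stand-alone sketch. It is separate from the evaluator in this post: the names OPERATORS and trace_postfix are made up for the example, it assumes a well-formed expression (no error handling), and it uses floor division for "/".

```python
import operator

# Assumed operator set for this sketch; "/" is integer (floor) division
OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
             "/": operator.floordiv, "^": operator.pow}


def trace_postfix(expr):
    """Return the stack contents after each token of a well-formed RPN expression."""
    stack, snapshots = [], []
    for token in expr.split():
        if token in OPERATORS:
            right = stack.pop()   # the second operand is on top of the stack
            left = stack.pop()
            stack.append(OPERATORS[token](left, right))
        else:
            stack.append(int(token))
        snapshots.append(list(stack))
    return snapshots


for step in trace_postfix("5 1 2 + 4 * 3 - +"):
    print(step)
```

For that expression the stack goes through [5], [5, 1], [5, 1, 2], [5, 3], [5, 3, 4], [5, 12], [5, 12, 3], [5, 9] and finally [14], which matches 5+((1+2)*4)-3.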

Copy, distribute or do whatever you want with it. It’s released in the public domain.

#!/usr/bin/env python
 
import logging
import re
import unittest
 
logging.basicConfig(level=logging.INFO)
 
# Integer-only operators. '/' is floor division, which is what int.__div__
# did in Python 2 (int.__div__ no longer exists in Python 3).
operators_table = {'+': int.__add__,
                   '-': int.__sub__,
                   '*': int.__mul__,
                   '/': int.__floordiv__,
                   '^': int.__pow__}

class ExpressionError(Exception):
    def __init__(self, message):
        self._message = "Expression error: %s" % message
    def _get_message(self): 
        return self._message
    message = property(_get_message)

class TestEvaluation(unittest.TestCase):
    def test_correct(self):
        self.assertEqual(666, evaluate_postfix_expr("666"))
        self.assertEqual(2+3-6, evaluate_postfix_expr("2 3 + 6 -"))
        self.assertEqual(2*3+4, evaluate_postfix_expr("2 3 * 4 +"))
        self.assertEqual(2*(3+4), evaluate_postfix_expr("2 3 4 + *"))
        self.assertEqual(3**4, evaluate_postfix_expr("3   3  *     3  *      3 *"))
        self.assertEqual((7//2)**4, evaluate_postfix_expr("7 2 / 4 ^"))
        self.assertEqual((2**3)**4, evaluate_postfix_expr("2 3 ^ 4 ^"))
        self.assertEqual(5+((1+2)*4)-3, evaluate_postfix_expr("5 1 2 + 4 * 3 - +"))
 
    def test_malformed(self):
        self.assertRaises(ExpressionError, evaluate_postfix_expr, "+")
        self.assertRaises(ExpressionError, evaluate_postfix_expr, "2 +")
        self.assertRaises(ExpressionError, evaluate_postfix_expr, "+ 2 2")
        self.assertRaises(ExpressionError, evaluate_postfix_expr, "2 2")
        self.assertRaises(ExpressionError, evaluate_postfix_expr, "2 2 + -")
        self.assertRaises(ExpressionError, evaluate_postfix_expr, "a 2 -")
 
def evaluate_postfix_expr(expr):
    atoms = re.split(r"\s+", expr)
    stack = [] 
    for atom in atoms:
        if atom in operators_table:
            try:
                op2 = stack.pop()
                op1 = stack.pop()
            except IndexError:
                raise ExpressionError("Too few operands (unbalanced)")
            logging.debug("Calculating %d %s %d" % (op1, atom, op2))
            atom = operators_table[atom](op1, op2)
        else:
            try:
                atom = int(atom)
            except ValueError:
                raise ExpressionError("Unable to parse '%s' as integer" % atom)
 
        try:
            stack.append(atom)
        except MemoryError:
            raise ExpressionError("Too long expression")
 
        logging.debug("Pushed element %d. Stack status: %s" % (atom, stack))
 
    if len(stack) == 1:
        return stack.pop()
    else:
        raise ExpressionError("Too many operands (unbalanced)")
 
if __name__ == "__main__":
    unittest.main()

29 01/09
01:19

Music spotted!

Doh! A new piece of software I couldn’t live without: Spotify.

28 12/08
14:42

Free 1Password Licenses!

1Password is a very powerful application for OS X that keeps your passwords and identities safe. It works as a standalone application and as a series of plugins for several web browsers. Among other things, it can autofill registration forms, saving you a lot of time otherwise spent typing the same personal information again and again. To autofill a login form, just click the 1P icon in the browser’s menu bar, choose a previously recorded identity and voilà, 1Password logs in automagically with no further interaction.

All the harvested information is stored in a separate keychain, unlocked by a master key kept in the login keychain (see “Keychain Access.app” for more information).

MacHeist, in collaboration with Agile Web Solutions, is offering free licenses as a Christmas present here. You only have to create an account and click on the links below the tree to get your free license.