29 03/11
13:08

Local public transportation in my pocket

I’ve lately spent some time developing a webapp for mobile devices to interact with some of the data published by the Gijón City Council. More specifically, data about local public land transportation schedule and live arrivals. The way they are presenting that information for mobile devices at the moment is very very heavy and slow, so I thought it may be useful to do something simpler for personal usage.

Basically, it is a simple web service that intensively caches data (to avoid stressing the data origin with many requests) and a fancy AJAX-powered frontend with some CSS with mobile browsers in mind (works flawlessly on Android’s browser and Mobile Safari). Additionally, if you add it as a bookmark to your iPhone’s home screen it behaves like a native application (you know, splash screen, custom icon, taskbar and so on).

I’m now working on client-side caching using HTML5 caching for offline usage. This way the application will boot way faster. It’s almost done, but it still needs some debugging.

I don’t intend to make it public for now. However, if you find it useful feel free to drop me a line. Beta testers are always welcome (but unfortunately won’t be rewarded).

This is how it looks like at the moment. The source will be released soon.

Update (23:26): Android screenshots provided by Javier Pozueco. Thanks buddy!

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

24 03/11
12:31

Search term completion using a search tree

Google search box completion

*lol*

Nowadays it’s very usual to find websites offering hints while you’re typing on a search box. Google is a pretty good example of it. But, how could it be implemented?

This feature could be implemented either in the client side or in the server side. If the word list is big (usually it is), it’s recommended to keep the lookup logic in the server side to save some bytes while transferring the page to the client and also to save some computing power using server-side caches (cool when you plan to serve many requests).

Either way, there should be a data structure somewhere containing the word list and an algorithm to do the lookup. The simplest approach may be to use a list to store the words and issue something like this when you want to get a list of hints for a given prefix:

filter(lambda x: x.startsWith(prefix), word_list)

That’s Python’s filter, but it works the same way the well-known Haskell’s first-order function filter does. It builds a new list with the elements of the original list (word_list) that match the predicate (the lambda function).

Although the results can (and should) be cached, the very first lookup (or when the cache expires) would be very inefficient because the entire list must be traversed and that operation will take linear time. Not bad, but when the size of the problem gets bigger (i.e. more and more words in the database) the lookup process may be too slow, especially whether you’re serving several users at the same time. If the list was sorted, the execution time could be improved a little bit by writing a more sophisticated algorithm, but let’s keep it that way for now.

Fortunately, there are better and faster ways to face the problem. If you don’t want to write code (usually the best choice) you may use some high-performance indexing engine such as Apache Lucene. But if you prefer the ‘do-it-yourself’ way (for learning purposes), a search tree (more specifically, a trie or a prefix tree) is a good approach.

I’ve poorly benchmarked both alternatives (the list and the tree) and as expected the tree is pretty quicker generating hints. What I did was to feed both data structures with the content of an American English word list holding ~640k words (debian package wamerican-insane).

So, assuming four is a reasonable minimum prefix length, I measured the time it would take to get a list of words prefixed by hous (yes, just one, remember I said this was a poor benchmark? ;). Unsurprisingly, it took around 230 times longer for the list alternative to generate the hints (438.96 ms vs 1.92 ms). Wow.

My implementation of the tree is as follows. The API is quite straightforward, the “hot” methods are put and get_hints. I’ve stripped off the test suite for space reasons.

Usage example:

>>> tree = HintSearchTree()
>>> tree.put("nacho")
>>> tree.put("nachos")
>>> tree.put("nachete")
>>> tree.get_hints("nach")
['nachete', 'nacho', 'nachos']
>>> tree.get_hints("nacho")
['nacho', 'nachos']
>>> tree.delete("nacho")
>>> tree.get_hints("nacho")
['nachos']
>>> tree.count_words()
2
>>> tree.get_hints("n")
['nachete', 'nachos']
>>> tree.is_indexed("nachete")
True
>>> tree.is_indexed("nach")
False
>>> tree.empty()
False
class HintSearchTreeNode(object):
class HintSearchTreeNode(object):
  def __init__(self, parent=None, terminal=False):
    self._children = {}
    self._terminal = terminal
    self._parent = parent
 
  @property
  def children(self):
    return self._children
 
  @property
  def terminal(self):
    return self._terminal
 
  @terminal.setter
  def terminal(self, value):
    self._terminal = value
 
  @property
  def parent(self):
    return self._parent
 
class HintSearchTree(object):
  def __init__(self):
    self._root = HintSearchTreeNode()
 
  def put(self, word):
    """Adds a word to the tree."""
    # TODO: Sanitize 'word'
    if len(word) > 0:
      self._put(self._root, word)
 
  def count_words(self):
    """Retrieves the number of indexed words in the tree."""
    return self._count_words(self._root)
 
  def is_indexed(self, word):
    """Returns True if 'word' is indexed."""
    node = self._find(self._root, word)
    return node is not None and node.terminal is True
 
  def get_hints(self, prefix):
    """Returns a list of words prefixed by 'prefix'."""
    return self._match_prefix(self._root, prefix)
 
  def delete(self, word):
    """Deletes 'word' (if exists) from the tree."""
    terminal = self._find(self._root, word)
    if terminal is not None:
      terminal.terminal = False
      self._prune(terminal.parent, word)
 
  def empty(self):
    """Returns True if the tree contains no elements."""
    return len(self._root.children) == 0
 
  def _put(self, node, word, depth=0):
    next_node = node.children.get(word[depth])
    if next_node is None:
      next_node = HintSearchTreeNode(parent=node)
      node.children[word[depth]] = next_node
    if len(word)-1 == depth:
      next_node.terminal = True
    else:
      self._put(next_node, word, depth+1)
 
  def _count_words(self, node):
    words = 1 if node.terminal is True else 0
    for k in node.children:
      words += self._count_words(node.children[k])
    return words
 
  def _match_prefix(self, node, prefix):
    terminal = self._find(node, prefix)
    if terminal is not None:
      return self._harvest_node(terminal, prefix)
    else:
      return []
 
  def _harvest_node(self, node, prefix, path=""):
    hints = []
    if node.terminal is True:
      hints.append(prefix + path)
    for k in node.children:
      hints.extend(self._harvest_node(node.children[k], prefix, path+k))
    return hints
 
  def _find(self, node, word, depth=0):
    if depth == len(word):
      return node
    else:
      child = node.children.get(word[depth])
      if child is not None:
        return self._find(child, word, depth+1)
      else:
        return None
 
  def _prune(self, node, word):
    if self._count_words(node.children[word[-1]]) == 0:
      del node.children[word[-1]]
      if len(node.children) == 0 and node.parent is not None \
          and node.terminal is not True:
        self._prune(node.parent, word[:-1])

The code is released in the public domain.

24 02/11
22:55

Some Perl to redirect HTTP requests

After almost a year without publishing a single post, it seems this week I’m going to beat all my records.

A week ago, I wanted to prank my brother for a while. Nothing sophisticated… just some Iptables rules, Tinyproxy and HTTP magic. To go ahead with my evil plans, I needed “something” able to redirect a HTTP request. Actually, there are several ways to do that: Apache redirects, Tornado, Netcat* and so on. These alternatives are fast, bulletproof and time-saving, but not fun.

As many of you probably know, I didn’t get a job yet. That necessary means that I’ve got plenty of free time to waste. So… what did I do? I wrote some Perl and today I’m publishing the source code just in case someone finds it useful somehow. Like the previous entry, it’s published in the public domain.

The script just collects connections, issues 301 back (Moved Permanently) and sets Location to the URI specified as a command line argument (option -u). It lacks some security checks (left as an exercise to the reader) but it does what it is supposed to do. You may likely spot some silly bugs as I haven’t spent much time reading it again. Reports are welcome!

For those wondering, the prank was a big success. I’m afraid I can’t spare any detail by now but it turns out my bro is still thinking that his computer has been cracked.

Example invocation:

$ perl redir.pl -p 7070 -v -t 3 -u http://31337.pl
2011/02/24 21:41:54 Listening on port 7070
2011/02/24 21:41:54 Redirecting HTTP requests to: ‘http://31337.pl’
2011/02/24 21:41:54 3 thread(s) working under the hood

And finally the source code:

use warnings;
use threads;
 
use Thread::Queue;
use POSIX;
 
use IO::Socket::INET;
use HTTP::Request;
use HTTP::Status qw(:constants status_message);
 
use Getopt::Long;
use DateTime::Format::HTTP;
use Data::Validate::URI qw(is_http_uri);
use Log::Log4perl qw(:easy);
 
use constant MAX_THREADS => 10;
use constant MAX_LEN_HEADERS_BUFFER => 8*1024;
use constant DEFAULT_REDIRECT_URI => "http://www.example.org";
use constant DEFAULT_PORT => 80;
use constant DEFAULT_POOL_SIZE => 3;
 
my $redir_uri = DEFAULT_REDIRECT_URI;
my $server_port = DEFAULT_PORT;
my $thread_pool_size = DEFAULT_POOL_SIZE;
my $verbose;
 
GetOptions('url=s' => \$redir_uri, 
           'port=i' => \$server_port,
           'threads=i' => \$thread_pool_size,
           'verbose'  => \$verbose) or exit -1;
 
die "Invalid redirect URI (e.g. http://www.example.org)\n" unless is_http_uri($redir_uri);
die "Invalid port (e.g. 8080)\n" unless 0 < $server_port && $server_port < 2**16;
die "Invalid pool size (should be in [1..".MAX_THREADS."])\n" 
            unless 0 < $thread_pool_size && $thread_pool_size <= MAX_THREADS;
 
Log::Log4perl->easy_init( level => $verbose? $DEBUG : $INFO );
 
my $pending = Thread::Queue->new(); 
 
my $lsock = IO::Socket::INET->new( LocalPort => $server_port,
                                   Proto => 'tcp',
                                   Listen => 1,
                                   Reuse => 1 ) or die "Couldn't bind listening socket ($!)\n"; 
 
INFO("Listening on port $server_port\n");
INFO("Redirecting HTTP requests to: '$redir_uri'\n");
 
my @workers = ();
for (1..$thread_pool_size) {
    if ($thread = threads->create("worker")) {
        push(@workers, $thread);
    }
}
 
DEBUG(sprintf("%d thread(s) working under the hood\n", $#workers+1));
 
# Set a tidy shutdown just in case an external agent SIG{INT,TERM}s the process
$SIG{'INT'} = $SIG{'TERM'} = sub {
    # Dirty hack. threads->kill() does not wake up the thread :(
    for (1..@workers) {
        $pending->enqueue(-1);
    }
    for (@workers) {
        DEBUG(sprintf("Worker %d terminated: %d clients served\n", $_->tid, $_->join())); 
    }
    close($lsock); 
    exit 0; 
};
 
while(1) {
    my $csock = $lsock->accept() or next;
    $pending->enqueue(POSIX::dup(fileno $csock));
    DEBUG(sprintf("New client enqueued: %s:%s\n", $csock->peerhost, $csock->peerport));
    close($csock);
}
 
sub worker {
    my $clients_served = 0;
    while(my $fd = $pending->dequeue) { # API promises thread safety :-)
        if ($fd == -1) {
            return $clients_served;
        }
 
        my $sock = IO::Socket::INET->new_from_fd($fd, "r+");
        DEBUG(sprintf("Dequeued client %s:%d by worker %d.\n", $sock->peerhost,
                            $sock->peerport, threads->tid()));
 
        my $buf = "";
        while(<$sock>) {
            # CAUTION: there isn't any self protection against very long lines
            last if /^\r\n$/;
            $buf .= $_;
            goto BYE if length $buf > MAX_LEN_HEADERS_BUFFER;
        }
 
        if (my $request = HTTP::Request->parse($buf)) {
            INFO(sprintf("[%s] %s {%s}\n", $request->method, $request->uri, $sock->peerhost));
        }
 
        printf $sock "HTTP/1.1 %d %s\r\n", 
            HTTP_MOVED_PERMANENTLY, status_message(HTTP_MOVED_PERMANENTLY);
        printf $sock "Date: %s\r\n", DateTime::Format::HTTP->format_datetime;
        print $sock "Location: $redir_uri\r\n";
        print $sock "Server: Simple HTTP Redirection/0.1 ($^O)\r\n";
        print $sock "Connection: close\r\n";
        print $sock "\r\n";
 
BYE:  
        $clients_served++;
        close($sock);
    }
}

(*) just an approach, may drop connections:

while [ 1 ]; 
 do echo -e "HTTP/1.1 301 Moved Permanently\r\nLocation: http://31337.pl\r\n\r\n" | nc -l 7070; 
done

20 03/08
15:38

2 weeks, 6 days, 16 hours wasted!

Although Dato has recently talked about it, it is worth to drop again a few more bits about MyEpisodes.com as a try to increase its popcon among TV shows addicts.

MyEpisodes.com is an easy way to track your activity watching TV shows. Using a fancy interface it is possible, for instance, to tag episodes as acquired/watched, be aware of upcoming premieres and, taking advantage of the provided RSS channels, stay up to date of new episodes without even visiting the website.

Have you ever asked yourself how many hours have you spent in front of your screen watching TV shows? MyEpisodes.com has got the answer.

Cool, isn’t it? ;)

05 01/08
16:47

Filesharing milestones

In order to wake my blog up, let’s talk about p2p-based filesharing and stuff like that.

RevolutionTT is a well-known invite-only bittorrent tracker with high speed transfers in mind where it’s quite easy getting the maximum transfer speed (here, 10240 kbps) in a few seconds since the download engages. OTOH, if something is brand new and shareable, it’s in there.

Member for almost forty weeks, I’ve just reached a few hours ago the amount of six hundred uploaded gigabytes with an aprox. 4.0 ratio (numerical relationship among uploaded and downloaded bytes). I’ve been user of a bunch of trackers (both public and private) along my life as p2p user and I’m pretty sure this one is the best.

If you’re an intensive p2p user and you’re still stuck in some crappy public tracker or even using ed2k please consider getting an invite somewhere.

RevolutionTT, the revolution has begun

09 08/07
15:00

~nacho v6.0 is out!

Another redesign of my website featuring a completely rewritten content is out!

Please, leave whatever suggestion as an attached comment to this post.

17 07/07
11:03

Hello Slicehost!

As you might know, I’ve been sharing a dedicated server with three more guys. Since several months ago, we have been thinking about change the provider because the plans offered by JVDS were very price-outdated compared with other providers. Dreamhost didn’t fit our requirements (no root access and shared databases., WTF?!), so we had to choose between either Linode or Slicehost.

After waiting for about a month in the prospective customer queue, we finally moved to Slicehost. Slicehost is a fast, made-from-geeks-to-geeks and XEN powered virtual dedicated servers provider.

We’re quite happy with the change, we got a new machine with four times more memory for the same price and we also moved from Apache to Lighttpd. All the references from Slicehost we read were good and, at least by now, I second those thoughts.

Skyhusker made an awesome work migrating the stuff and all the services are already moved. Of course, we’re running Etch.

17 06/07
20:08

The #10000

I’ve just submitted my song number 10,000 to last.fm!.

06 04/07
00:38

OpenSPF ha llegado a España

Tarde, como es habitual con estas cosas en España, pero parece que los ISPs españoles se empiezan a poner las pilas…

Bandaancha nos cuenta

Según explica Mercè Molist en El País, Telefónica, ONO, Orange o Jazztel son algunos de los 26 proveedores de internet que se suman a un pacto para adoptar el SPF (sender policy framework) de manera conjunta.

19 03/07
17:23

Hello again my old dude!

My mail agent is Mutt again! (again mainly because it was my first MUA). Let’s fetchmail and procmail do the hard work while I happily browse my mailboxes!

Good bye Pine, Evolution, Sylpheed Claws… and now, Kmail!. (note Evolution link…)

BTW, xfmail2mbox.sh works fine, my mails moved smoothly to Mutt as well, so thanks to Jörg Reinhardt.

03 01/07
14:14

Relicensing

I’ve just relicensed my website, after some time thinking about it I moved my website contents from a Creative Commons BY-SA 2.0 to a GPLv2 license once and for all.

The main reason is that my concept of free and free software is really closer to the Debian Free Software Guidelines, and, unfortunately, CC licenses terms don’t match with them very well (issue discussed since months ago in debian-legal and summed up in this document).

08 11/06
12:27

Asturian websites: Top 25

If El Comercio says it, it must be true. It could be interesting to know what is the general criterion to judge how much better than other a web is.

Painful.

05 10/06
00:03

Extending my Firefox’s capabilities!

This is just another list of Firefox extensions I’m using at the moment:

  • Google Toolbar for Firefox – Useful Google utils pack (includes search, of course).
  • del.icio.us – All I need to manage my social bookmarks.
  • ColorZilla – Interesting extension to extract colour codes directly from websites.
  • VideoDownloader – Well known add-on.
  • Forecastfox – And if it rains tomorrow?.
  • PDF Download – Just a dialog to choose between open, download, or show as HTML the downloaded PDF files.
  • Download Statusbar – Enhanced and interesting download manager.
  • Colorful Tabs – Put colours in your life!
  • Flashblock – Flash really sucks, prevent it for me!
  • Extended Statusbar – Adds more information about the loaded website, for example spent time or download speed.
  • Tab X – Adds an independent kill button to each tab.
  • Link Alert – Prevents me to download rubbish, like Microsoft Word documents.

Excuse me, I forgot all the links to these extensions, feel free to search them yourself ;)

This ultraextended Firefox makes my life easier!

14 09/06
22:05

X-Facing!

Finally, after two or three attempts, i built a new X-Face (the old one was a bit “noisy”):

X-Face: h;*J`D,fz$pX,X$-{‘bN,@\9[a?`~d343Zg+4gy=gcpE(]“;flj/
XOdFs*)p3y9k[sG9AyN"qp^LYJ!Wu
bUXxXcm?QZ}XqI@C?\~DSUAv/DAajj;wr(|@BZgigOFX< {d=X
TT/GX0`,![]"`iZ$WC

View it in action or verbatimize it skipping line jumps :)