GreggHz

Syft.co and Image Search

Yes. Syft.co. Not Syft.com. Now that we have that out of the way, we can move on to more interesting issues.

Syft is a joint project between Keenan Cummings and me. I handle the implementation and technical aspects of the site. In its current state, Syft is relatively simple by almost any standard. It uses the Google Image Search API to run a user-supplied query against a number of pre-selected sites and returns the results. So right now, nothing all that interesting is going on; it's a pretty standard use of Google's API. Unfortunately, Google recently announced that they are deprecating this API, and they aren't providing an alternative. I believe that gives me three years to come up with something else.

Alternative APIs exist, but I haven't yet found any that provide the flexibility and freedom that Syft will demand (especially as it grows). As a result, I have decided to create my own image search engine from scratch. I have absolutely no experience with this kind of project, so it will be quite the adventure. Below I'm going to outline my plans. This will provide me an opportunity to refine the idea into something tangible and also to receive feedback from anyone who may be interested.

The search engine will be made up of three key components. Since I don't have any experience with this, I'm just making up the names as I go.

  1. Collector
  2. Parser
  3. Ranker

Let's go over each of these one at a time.

1. Collector

The Collector is probably the simplest component of Syft Search. Its job is to take a pre-generated list of URLs and download and store the markup for each of those pages. Initially, this will be the list of hand-selected sites that have a high likelihood of containing high-quality images. Later, the Parser will add newly discovered URLs to the list so that it stays current. Each downloaded page will be stored (likely using some sort of compression) in a datastore for the Parser to process later.
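To make the Collector a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the names collect, fetch, and load are made up, a plain dict stands in for the real datastore, and zlib is just one possible choice of compression.

```python
import zlib
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download the raw bytes of a page (hypothetical default fetcher)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def collect(urls, store, fetcher=fetch):
    """Download each URL and save its compressed markup in `store`.

    `store` is any dict-like object standing in for the real datastore
    (an assumption; the actual storage backend is not decided yet).
    Returns the list of URLs that could not be downloaded.
    """
    failed = []
    for url in urls:
        if url in store:
            # Already collected, so skip it.
            continue
        try:
            markup = fetcher(url)
        except OSError:
            # Network errors and timeouts raised by urllib are OSError subclasses.
            failed.append(url)
            continue
        store[url] = zlib.compress(markup)
    return failed


def load(store, url):
    """Return the original, uncompressed markup for a stored page."""
    return zlib.decompress(store[url])
```

Passing the fetcher in as a parameter keeps the download step swappable, which also makes it possible to exercise the Collector with canned pages instead of real network requests.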

2. Parser

The Parser sifts through the pages stored by the Collector and generates raw data about each page. It finds all anchors that point to relative URLs or to URLs on the same domain and saves them to be processed later by the Collector. This creates a loop: the Parser keeps supplying the Collector with new URLs, while the Collector keeps supplying the Parser with new pages. In addition to finding new URLs, the Parser is also responsible for finding all images on the page and storing information about them, such as their location, in an image info datastore.
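A rough sketch of what the Parser might do with a single page is shown here, using only Python's standard library. The class and function names (PageParser, parse_page) and the fields saved for each image are assumptions for illustration, not a settled design.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class PageParser(HTMLParser):
    """Collects same-domain links and basic image info from one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc
        self.links = set()   # URLs to hand back to the Collector
        self.images = []     # records for the image info datastore

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative URLs against the page and drop any #fragment.
            url = urldefrag(urljoin(self.page_url, attrs["href"])).url
            parsed = urlparse(url)
            if parsed.scheme in ("http", "https") and parsed.netloc == self.host:
                self.links.add(url)
        elif tag == "img" and attrs.get("src"):
            self.images.append({
                "page": self.page_url,
                "src": urljoin(self.page_url, attrs["src"]),
                "alt": attrs.get("alt") or "",
                "title": attrs.get("title") or "",
            })


def parse_page(page_url, markup):
    """Return (links, images) found in `markup`, a string of HTML."""
    parser = PageParser(page_url)
    parser.feed(markup)
    parser.close()
    return parser.links, parser.images
```

The links set would go back into the Collector's URL list, and the images list would go into the image info datastore for the Ranker.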

3. Ranker

The Ranker processes the image information stored by the Parser and uses it to find the best keywords and phrases for each image. The keywords are ranked by their relative importance to the image. This ranking is based mostly on where each keyword or phrase appears in relation to the image. Text in the alt and title attributes would be worth a lot more than text found in a footer or far away from the image. When a user runs a search, this information is used to determine which images are most relevant to the user's query and what the best result set is.
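The position-based weighting described above could be sketched like this. The weights, the source labels (alt, title, nearby, page), and the function names are all made-up assumptions; the real values would need tuning, and this sketch only handles single words, not phrases.

```python
import re
from collections import Counter

# Assumed weights: text closer to the image counts for more.
WEIGHTS = {"alt": 5.0, "title": 4.0, "nearby": 2.0, "page": 0.5}


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_scores(sources):
    """Score keywords for one image.

    `sources` maps a source label (a key of WEIGHTS) to the text found there.
    Text from unknown labels gets a weight of zero.
    """
    scores = Counter()
    for source, text in sources.items():
        weight = WEIGHTS.get(source, 0.0)
        for word in tokenize(text):
            scores[word] += weight
    return scores


def search(query, index, limit=10):
    """Return image URLs ordered by how well their keywords match `query`.

    `index` maps an image URL to the keyword scores for that image.
    """
    terms = tokenize(query)
    ranked = []
    for image, scores in index.items():
        total = sum(scores.get(term, 0.0) for term in terms)
        if total > 0:
            ranked.append((total, image))
    # Highest score first; ties are broken by URL so results are stable.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [image for _, image in ranked[:limit]]
```

With this scheme, an image whose alt text contains "cat" outranks one where "cat" only shows up in distant page text, which is the behavior the Ranker is after.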

This is a very high-level overview of the architecture of the Syft search engine. Many of the implementation details still need to be worked out. The datastore will likely be hosted on Google App Engine, but Amazon provides a possible alternative depending on the specific needs. As far as scalability is concerned, Syft Search has a very specific use case and won't require the type of crawling and indexing that a general-purpose search engine requires. Each site that we'll be crawling is hand-selected, so by the very nature of this process, the index will be extremely small in comparison to that of a general-purpose internet search engine. This should work to my advantage. That said, careful attention will need to be paid to the efficiency of actually running the queries. As much data as possible will need to be cached, and the algorithms used will need to be finely tuned. We don't want users waiting around for results to show up.
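As one example of the kind of caching mentioned above, repeated queries could be served from an in-memory cache. This is only a sketch under assumptions: run_query is a hypothetical stand-in for the real ranking work, and a hosted setup would more likely use a shared cache than a per-process one.

```python
from functools import lru_cache

CALLS = {"count": 0}  # demonstration only: counts how often real work happens


def run_query(normalized_query):
    """Hypothetical stand-in for the expensive ranking step."""
    CALLS["count"] += 1
    return tuple(sorted(normalized_query.split()))


@lru_cache(maxsize=1024)
def _cached_query(normalized_query):
    # Results are kept in memory, keyed by the normalized query string.
    return run_query(normalized_query)


def cached_search(query):
    """Normalize case and whitespace so equivalent queries share a cache entry."""
    normalized = " ".join(query.lower().split())
    return _cached_query(normalized)
```

Here, searching for "Red  Cat" and then "red cat" only triggers the expensive step once, because both normalize to the same cache key.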

I'm very excited about this project. To get a sneak peek of the progress, head over to syft.co and sign up for the beta by entering your email address. We'll be sending out invites shortly.

contact

google+: Greggory Hernandez
twitter: @gregghz
email: greggory.hz@gmail.com