Due to recent changes made by the local real estate boards, I have had the dubious pleasure of re-writing the MLS listing data import and processing routines that we use to feed data to the real estate web sites that are managed by my employer. First, I'll say that I believe it is a miracle that the MLS system is functional at all. Second, if you are ever asked to integrate MLS listing data into a real estate web site, double your budget estimate, and then double it again. You'll thank me in the end.
Unfortunately, this is the third time I have had to integrate essentially the same listing data for our clients. The local real estate board has changed vendors as many times in as many years. Each MLS vendor has their own preferred data formats and distribution methods.
The first vendor provided FTP access to a secure server where we downloaded pipe-delimited text files.
Unfortunately their documentation consisted of the column headers within the data file. We were left to pour
over nearly 250 columns with cryptic names that in some cases defied logic. We managed to figure out which
ones we needed for the local database and weed out the rest upon import. For the images, they provided a
Windows application that would read their data file, then go fetch the images for the listings in that file. We
wanted an automated process, so we ended up using an HTTP proxy to watch the traffic it generated and
reverse-engineered the file naming schema on their image server. The only good thing about this particular
MLS was that we could predict the image names and generate the photo URLs during the import for use later
in display. This prevented us from having to keep a local copy of all the images. It was clunky, but it worked
great for about a year.
Round two came just before Christmas in 2006. The new MLS vendor was hooked up with MLXchange to
provide their IDX service. This meant a completely different system with a new data format to work with. I
should also mention that they gave us almost no notice before the switch, just a couple of days. I will say that
MLXchange was a step up from the previous provider in some areas, but still far from ideal.
To start with, their export interface only provides options for full or incremental, and data only, images only,
or both. The problem with this is that with an incremental data feed, you never know which listings are
deleted, and trying to keep the core listing data clean and accurate is essentially impossible. The only way to
ensure clean data is to clear out the local database and re-import all listings each day. Unfortunately, pulling
the full set of images just isn't practical given their vast numbers. 60,000 listings times an average of five
photos, multiplied by an average photo size of 20KB works out to about 6GB of images. Frankly, we don't
have time to process that every day.
To make it work, we employed two IDX accounts. One was configured to download the full listing data, and
the other set to bring down an incremental image set. We did have to bring down an initial full image set to
get things started, and it took a couple of days to get everything in place. As previously mentioned, with an
incremental feed there is no way to know when something is removed, so over time the image store just keeps
growing, and when images are deleted from MLS we have no way of knowing and they continue to show up
on the web site. Unfortunately the online versions of the images at MLXchange are distributed throughout a
seemingly random folder structure that make predicting image location there impossible, making this daily
dance with the images necessary.
Another annoyance with MLXchange is their delivery method. Previously we could pull the data file and
images directly over FTP and HTTP, respectively. With MLXchange, the delivery is all FTP, though their
FTP server offers no external security. Anyone can connect and download whatever they want. To protect
the data, they bundle the listings file and images into password protected ZIP files. Unfortunately,
ColdFusion did not have native ZIP file processing at the time, so we had to purchase a third-party utility that
could handle protected ZIP files to get at the data once it had downloaded. To make matters worse,
occasionally one of their ZIP files would be corrupted and therefore unable to be opened. This wasn't a
problem with the listing data as we could just wait until the next day, but for incremental images we got one
chance to get them. Once they were exported we would not see them again until an agent uploaded a
replacement photo. Their system also had a nasty habit of running the daily incremental export twice in a
row. The first run would bring out the 1,000 or so new images for the day, and then it would run again a few
minutes later, so we would end up with a ZIP file that included maybe three images, and the other 1,000 new
and updated photos would be lost. That's not easy to explain to a broker when their agent is wondering why
the latest and greatest photos are not displayed on the web site.
The actual listing data in MLXchange was more organized, and includes schema files with the data. We still
have to figure out what some of the fields are, but it was relatively simple to create a parser for the schema
file that would in turn generate a SQL script that creates the necessary tables in our SQL Server. We then
created a parser to bring in the listing data. An image indexing script would crawl through the image folders
and keep the database updated with the locations of the images for each listing. Due to the way the images
are named, displaying images on a listing also requires additional effort. When looking at the database for the
images we just tell it to sort them by the filename. Unfortunately, this puts the "front exterior" photo, the
main image for the listing, second or third in line. To get around this we have to do some coding magic to
re-sort the images using a custom algorithm to get them in the right positions. This is way more work than any
person should have to do just to get the images in the right order.
The good news is that once the data has been downloaded, parsed, and processed into the local SQL Server,
we can do all sorts of nice things with it without a lot of additional effort. We can create search interfaces
and display the listings to the public. With the images being local on the server we don't have to rely on the
MLS vendor's servers being online to see photos (the first vendor had a couple of outages that prompted calls
from our clients wondering where their images went).
Despite a couple of minor issues here and there, this was all very happy until July 2008. The local real estate
board once again changed MLS vendors. This time they gave us a month of lead time about the change.
Unfortunately, they are just as busy as we are and we did not start getting details about the change in data for
a couple of weeks.
The new data was to be provided through a RETS server, and they suggested an application that would handle
all of the downloading of the data and images. At first glance, it looked very promising. The Windows GUI
application would allow us to define searches and then download the listings and images to the server on a
schedule. Great! It didn't take long before we started having problems. First, RETS operates on metadata
and can download the data into an XML file, or any one of several formats you request. It takes a while to
download and process the metadata for every action in the program which makes things take far longer than
they should.
Second, you can select which fields you want in the export. Our first task was to assemble a list of all of the
fields we needed, and to create a search and download for each property type. The first issue arose when the
system repeatedly gave us "field access denied [xxxx]" errors with a numerical code. To find out which field
was denied, we would have to use the metadata browser to crawl through the records looking for that ID
code. For each instance, we would have to contact IT support at the local real estate board and ask them to
open up access for those fields. Many phone calls later, we had that part working.
Next, we opened up the searches to download everything, and another problem reared its ugly head. The
RETS server had a limit of 5,000 results per search. This affected the listings and the images. With tens of
thousands of listings, we would have had to create dozens of intricate searches to cover everything, and even
then there was no guarantee that a particular search would not ever have more than 5,000 results. Each
search would have to be scheduled to avoid overlap, and we would then have to process dozens of different
files that would contain the data. There had to be a better way.
After more calls to IT support, they finally told us that the new MLS vendor also works with our old friends
MLXchange to get access to IDX data. I almost cried. Here we had just spend nearly 20 hours in
development time trying to integrate a new system that we know little to nothing about when the vendor we're
already connecting with and have code written for is an available option. Later that day I was configuring my
full daily data feed just as we had used before.
The following day I nearly cried again when I realized that the data format was the same, but the fields and
schema were not. This meant another run through the schema parser to generate the database tables (and the
hours of tweaking it to normalize the things in this schema that the other one didn't have), more hours to
update the actual data parser and import routine, and yet more time to schedule the full image dump and
process nearly 200,000 images into the system. All told, the latest update took nearly a month to implement.
Most of that time was spent waiting for support from outside vendors. We rely on the local IT support at the
real estate board for nearly any change that is needed, then another day for the change to be reflected in the
data files. To start with, basic information such as the listing broker office name, which is required to be
displayed with the listing information, was not included. To make matters worse, there are five different data
tables, and each has their own set of information that has to be reviewed and updated. Where a field is
present on one it was missing on another, and so on. It took nearly a week to get the data files to include
everything we needed. There are actually more changes we will be asking for now that it's up and running,
and we will continue to have the problems with the images as we have had in the past (we don't know when
images are deleted from a listing, and some can go missing if their export ZIP file is corrupted).
So, my point is that the real estate industry can do better, and must do better, to remedy these issues. I would
call on the National Association of Realtors to implement the following:
- Develop and impose a standard data schema for IDX. Gather up all of the possible fields that cover the
vast majority of listings and develop a standard list of fields. They should use a standard naming
convention that is compatible with all major database systems. Each field should be well documented
as to what data it can contain and how that data can be used subject to additional limitations by local
boards (remember, we’re developers, not brokers, so we need some guidance on what your language
means). There should only be one field for each piece of data. There is no reason we should have 12
different “public remarks” fields that have to be reassembled in the proper order. A reasonable number
of "extra" fields should be included for localized use, and some "reserved" fields should be included for
future growth without having to change anything. The RETS protocol does a good job of standardizing
the format (XML), but uses metadata to define the fields and their contents. Frankly, from a
development standpoint, all this does is create more work for already overworked developers to sort
through the mess to get at the data. Once the schema is defined, it should be standardized over all of
the IDX vendors so that they essentially become interchangeable from a programming standpoint.
They will have to find other ways to provide value.
- Insist that IDX vendors create a central image repository accessible over HTTP. The images for real
estate listings are vast in number and change often. Why should each broker have to download tens or
hundreds of thousands of image files to their web servers for local hosting? Of course I believe that
they should be available for download if the broker chooses as they are now, but for those that only
need them for their web site, there is no reason the IDX vendor cannot put them online and publish
details about how to predict and format the image URLs for each listing. This would solve the "deleted
image" problem that incremental updates create. The biggest argument against this that I can see would
be that it would use significant bandwidth to serve the images. I would argue that it should actually
result in less bandwidth overall. Presently, many brokers are downloading full image sets on a weekly
basis, or incremental image updates every day. These include thousands of images that may or may not
ever be requested by a web site user. The broker has to download ALL of them, just in case they are
needed. If they were hosted locally, only images for listings people actually want to see are served up.
Through proper caching and HTTP compression, the overall bandwidth can be reduced or stay the same
since brokers would no longer need such large exports all the time.
- Standardize the delivery format and method. Personally, I prefer pipe-delimited text files available
directly over FTP with a username and password on the FTP account itself. What good is a password
protected ZIP file when anyone can download it and then run a cracking utility against it offline, where
nobody can monitor the attempts. Eventually an attacker would find the password and then download
whatever they wanted from a folder without detection. Trying to attack a password on the FTP server
itself is far more difficult and can be monitored, detected, and blocked.
- Require listing addresses to be normalized upon input. Whenever a listing is entered into the MLS it
should be checked against the postal service database (or some other source) and normalized according
to a standard. Zip codes would be required to be valid and match a known city name before being
accepted, etc. All of the rules for IDX say that the data cannot be modified before it is displayed, but
when the IDX feed includes the city names as entered by the agents, it borders on negligence not to
correct and normalize them before display. With the current IDX data I am working with there are no
less than eight different ways that "St. Petersburg" is indicated as a city. You have "St. Pete", "Saint
Pete", "St Petersburg", and on and on. For every misspelling and variation on every city we have to
pick which one is "right" and then create an exception to normalize every other option so the search
functions will work properly. If this were done at the source, where the data is entered, it would
normalize the data everywhere at once and ensure more accurate listing information.
Given these changes, real estate listing data could be more easily distributed to brokers for use on their
custom web sites, and it would save countless development hours and frustration in dealing with dozens of
schemas, formats, and delivery methods. Developers could write software to handle the nuts and bolts of
downloading and processing the data and make it easy to implement no matter which MLS or IDX vendor is
being used.
Given the data hoarding history of real estate agents and brokers, and the glacial pace at which the industry
seems to move, I do not expect that these suggestions will be implemented any time soon. In the mean time, I
will continue to monitor the daily feeds, explain to brokers why some of their images are missing, and keep
my fingers crossed that there will not be another MLS vendor change any time in the near future.
For those that are using ColdFusion to import IDX data from MLXchange, please feel free to ask any
questions you might have in the comments.
Recent Comments