Due to recent changes made by the local real estate boards, I have had the dubious pleasure of re-writing the MLS listing data import and processing routines that we use to feed data to the real estate web sites that are managed by my employer. First, I'll say that I believe it is a miracle that the MLS system is functional at all. Second, if you are ever asked to integrate MLS listing data into a real estate web site, double your budget estimate, and then double it again. You'll thank me in the end.
Unfortunately, this is the third time I have had to integrate essentially the same listing data for our clients. The local real estate board has changed vendors as many times in as many years. Each MLS vendor has their own preferred data formats and distribution methods.
The first vendor provided FTP access to a secure server where we downloaded pipe-delimited text files. Unfortunately their documentation consisted of the column headers within the data file. We were left to pour over nearly 250 columns with cryptic names that in some cases defied logic. We managed to figure out which ones we needed for the local database and weed out the rest upon import. For the images, they provided a Windows application that would read their data file, then go fetch the images for the listings in that file. We wanted an automated process, so we ended up using an HTTP proxy to watch the traffic it generated and reverse-engineered the file naming schema on their image server. The only good thing about this particular MLS was that we could predict the image names and generate the photo URLs during the import for use later in display. This prevented us from having to keep a local copy of all the images. It was clunky, but it worked great for about a year.
Round two came just before Christmas in 2006. The new MLS vendor was hooked up with MLXchange to provide their IDX service. This meant a completely different system with a new data format to work with. I should also mention that they gave us almost no notice before the switch, just a couple of days. I will say that MLXchange was a step up from the previous provider in some areas, but still far from ideal.
To start with, their export interface only provides options for full or incremental, and data only, images only, or both. The problem with this is that with an incremental data feed, you never know which listings are deleted, and trying to keep the core listing data clean and accurate is essentially impossible. The only way to ensure clean data is to clear out the local database and re-import all listings each day. Unfortunately, pulling the full set of images just isn't practical given their vast numbers. 60,000 listings times an average of five photos, multiplied by an average photo size of 20KB works out to about 6GB of images. Frankly, we don't have time to process that every day.
To make it work, we employed two IDX accounts. One was configured to download the full listing data, and the other set to bring down an incremental image set. We did have to bring down an initial full image set to get things started, and it took a couple of days to get everything in place. As previously mentioned, with an incremental feed there is no way to know when something is removed, so over time the image store just keeps growing, and when images are deleted from MLS we have no way of knowing and they continue to show up on the web site. Unfortunately the online versions of the images at MLXchange are distributed throughout a seemingly random folder structure that make predicting image location there impossible, making this daily dance with the images necessary.
Another annoyance with MLXchange is their delivery method. Previously we could pull the data file and images directly over FTP and HTTP, respectively. With MLXchange, the delivery is all FTP, though their FTP server offers no external security. Anyone can connect and download whatever they want. To protect the data, they bundle the listings file and images into password protected ZIP files. Unfortunately, ColdFusion did not have native ZIP file processing at the time, so we had to purchase a third-party utility that could handle protected ZIP files to get at the data once it had downloaded. To make matters worse, occasionally one of their ZIP files would be corrupted and therefore unable to be opened. This wasn't a problem with the listing data as we could just wait until the next day, but for incremental images we got one chance to get them. Once they were exported we would not see them again until an agent uploaded a replacement photo. Their system also had a nasty habit of running the daily incremental export twice in a row. The first run would bring out the 1,000 or so new images for the day, and then it would run again a few minutes later, so we would end up with a ZIP file that included maybe three images, and the other 1,000 new and updated photos would be lost. That's not easy to explain to a broker when their agent is wondering why the latest and greatest photos are not displayed on the web site.
The actual listing data in MLXchange was more organized, and includes schema files with the data. We still have to figure out what some of the fields are, but it was relatively simple to create a parser for the schema file that would in turn generate a SQL script that creates the necessary tables in our SQL Server. We then created a parser to bring in the listing data. An image indexing script would crawl through the image folders and keep the database updated with the locations of the images for each listing. Due to the way the images are named, displaying images on a listing also requires additional effort. When looking at the database for the images we just tell it to sort them by the filename. Unfortunately, this puts the "front exterior" photo, the main image for the listing, second or third in line. To get around this we have to do some coding magic to re-sort the images using a custom algorithm to get them in the right positions. This is way more work than any person should have to do just to get the images in the right order.
The good news is that once the data has been downloaded, parsed, and processed into the local SQL Server, we can do all sorts of nice things with it without a lot of additional effort. We can create search interfaces and display the listings to the public. With the images being local on the server we don't have to rely on the MLS vendor's servers being online to see photos (the first vendor had a couple of outages that prompted calls from our clients wondering where their images went).
Despite a couple of minor issues here and there, this was all very happy until July 2008. The local real estate board once again changed MLS vendors. This time they gave us a month of lead time about the change. Unfortunately, they are just as busy as we are and we did not start getting details about the change in data for a couple of weeks.
The new data was to be provided through a RETS server, and they suggested an application that would handle all of the downloading of the data and images. At first glance, it looked very promising. The Windows GUI application would allow us to define searches and then download the listings and images to the server on a schedule. Great! It didn't take long before we started having problems. First, RETS operates on metadata and can download the data into an XML file, or any one of several formats you request. It takes a while to download and process the metadata for every action in the program which makes things take far longer than they should.
Second, you can select which fields you want in the export. Our first task was to assemble a list of all of the fields we needed, and to create a search and download for each property type. The first issue arose when the system repeatedly gave us "field access denied [xxxx]" errors with a numerical code. To find out which field was denied, we would have to use the metadata browser to crawl through the records looking for that ID code. For each instance, we would have to contact IT support at the local real estate board and ask them to open up access for those fields. Many phone calls later, we had that part working.
Next, we opened up the searches to download everything, and another problem reared its ugly head. The RETS server had a limit of 5,000 results per search. This affected the listings and the images. With tens of thousands of listings, we would have had to create dozens of intricate searches to cover everything, and even then there was no guarantee that a particular search would not ever have more than 5,000 results. Each search would have to be scheduled to avoid overlap, and we would then have to process dozens of different files that would contain the data. There had to be a better way.
After more calls to IT support, they finally told us that the new MLS vendor also works with our old friends MLXchange to get access to IDX data. I almost cried. Here we had just spend nearly 20 hours in development time trying to integrate a new system that we know little to nothing about when the vendor we're already connecting with and have code written for is an available option. Later that day I was configuring my full daily data feed just as we had used before.
The following day I nearly cried again when I realized that the data format was the same, but the fields and schema were not. This meant another run through the schema parser to generate the database tables (and the hours of tweaking it to normalize the things in this schema that the other one didn't have), more hours to update the actual data parser and import routine, and yet more time to schedule the full image dump and process nearly 200,000 images into the system. All told, the latest update took nearly a month to implement. Most of that time was spent waiting for support from outside vendors. We rely on the local IT support at the real estate board for nearly any change that is needed, then another day for the change to be reflected in the data files. To start with, basic information such as the listing broker office name, which is required to be displayed with the listing information, was not included. To make matters worse, there are five different data tables, and each has their own set of information that has to be reviewed and updated. Where a field is present on one it was missing on another, and so on. It took nearly a week to get the data files to include everything we needed. There are actually more changes we will be asking for now that it's up and running, and we will continue to have the problems with the images as we have had in the past (we don't know when images are deleted from a listing, and some can go missing if their export ZIP file is corrupted).
So, my point is that the real estate industry can do better, and must do better, to remedy these issues. I would call on the National Association of Realtors to implement the following:
-
Develop and impose a standard data schema for IDX. Gather up all of the possible fields that cover the vast majority of listings and develop a standard list of fields. They should use a standard naming convention that is compatible with all major database systems. Each field should be well documented as to what data it can contain and how that data can be used subject to additional limitations by local boards (remember, we’re developers, not brokers, so we need some guidance on what your language means). There should only be one field for each piece of data. There is no reason we should have 12 different “public remarks” fields that have to be reassembled in the proper order. A reasonable number of "extra" fields should be included for localized use, and some "reserved" fields should be included for future growth without having to change anything. The RETS protocol does a good job of standardizing the format (XML), but uses metadata to define the fields and their contents. Frankly, from a development standpoint, all this does is create more work for already overworked developers to sort through the mess to get at the data. Once the schema is defined, it should be standardized over all of the IDX vendors so that they essentially become interchangeable from a programming standpoint. They will have to find other ways to provide value.
-
Insist that IDX vendors create a central image repository accessible over HTTP. The images for real estate listings are vast in number and change often. Why should each broker have to download tens or hundreds of thousands of image files to their web servers for local hosting? Of course I believe that they should be available for download if the broker chooses as they are now, but for those that only need them for their web site, there is no reason the IDX vendor cannot put them online and publish details about how to predict and format the image URLs for each listing. This would solve the "deleted image" problem that incremental updates create. The biggest argument against this that I can see would be that it would use significant bandwidth to serve the images. I would argue that it should actually result in less bandwidth overall. Presently, many brokers are downloading full image sets on a weekly basis, or incremental image updates every day. These include thousands of images that may or may not ever be requested by a web site user. The broker has to download ALL of them, just in case they are needed. If they were hosted locally, only images for listings people actually want to see are served up. Through proper caching and HTTP compression, the overall bandwidth can be reduced or stay the same since brokers would no longer need such large exports all the time.
-
Standardize the delivery format and method. Personally, I prefer pipe-delimited text files available directly over FTP with a username and password on the FTP account itself. What good is a password protected ZIP file when anyone can download it and then run a cracking utility against it offline, where nobody can monitor the attempts. Eventually an attacker would find the password and then download whatever they wanted from a folder without detection. Trying to attack a password on the FTP server itself is far more difficult and can be monitored, detected, and blocked.
-
Require listing addresses to be normalized upon input. Whenever a listing is entered into the MLS it should be checked against the postal service database (or some other source) and normalized according to a standard. Zip codes would be required to be valid and match a known city name before being accepted, etc. All of the rules for IDX say that the data cannot be modified before it is displayed, but when the IDX feed includes the city names as entered by the agents, it borders on negligence not to correct and normalize them before display. With the current IDX data I am working with there are no less than eight different ways that "St. Petersburg" is indicated as a city. You have "St. Pete", "Saint Pete", "St Petersburg", and on and on. For every misspelling and variation on every city we have to pick which one is "right" and then create an exception to normalize every other option so the search functions will work properly. If this were done at the source, where the data is entered, it would normalize the data everywhere at once and ensure more accurate listing information.
Given these changes, real estate listing data could be more easily distributed to brokers for use on their custom web sites, and it would save countless development hours and frustration in dealing with dozens of schemas, formats, and delivery methods. Developers could write software to handle the nuts and bolts of downloading and processing the data and make it easy to implement no matter which MLS or IDX vendor is being used.
Given the data hoarding history of real estate agents and brokers, and the glacial pace at which the industry seems to move, I do not expect that these suggestions will be implemented any time soon. In the mean time, I will continue to monitor the daily feeds, explain to brokers why some of their images are missing, and keep my fingers crossed that there will not be another MLS vendor change any time in the near future.
For those that are using ColdFusion to import IDX data from MLXchange, please feel free to ask any questions you might have in the comments.
2 comments - Posted by Justin Scott at 3:59 AM - Categories: ColdFusion