Multi-brand new/used car inventory search: an evolution story of python script to a web scraper to an API driven webapp

Forgive me for poking my head in here, this stuff is way above my pay grade.
Could someone explain in layman’s terms how these scrapers work differently from the engines on cars.com, cargurus, etc.?
Also, if these return better results, why would one of those massive companies not be using similar methods?

The way I understand it, CarGurus etc. aren’t scrapers; they are platforms that dealers willingly do business with to widen their exposure to the public. This requires some sort of inventory management on the dealer’s side, which basically has to supply information about their stock to those platforms explicitly.

A scraper is just a tool that programmatically connects to a public-facing dealer website to grab the information that is already available to the public there, without any need for dealer participation or explicit approval.
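To make that concrete, here is a minimal sketch of the idea in Python using only the standard library. The markup and the “vehicle-title” class name are hypothetical placeholders, not anything from the actual project; a real scraper would first fetch the dealer page over HTTP and then parse it like this:

```python
from html.parser import HTMLParser

# Minimal sketch: a scraper pulls structured fields out of a public
# inventory page's HTML. The "vehicle-title" class is a made-up example;
# each dealer platform marks its listings with its own known markup.
class InventoryParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "vehicle-title") in attrs:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())
            self.in_title = False

# Stand-in for HTML fetched from a live dealer site:
html = '<h2 class="vehicle-title">2020 BMW 330i xDrive</h2>'
parser = InventoryParser()
parser.feed(html)
print(parser.titles)  # ['2020 BMW 330i xDrive']
```

That’s the whole trick: fetch public HTML, recognize the platform’s markup, extract the fields.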

On a general note, I tuned the DealerCom and DealerInspire detection for better parsing of the Body Style and Drive Type fields. You should see fewer empty/undetected values in those now.

I’m late to the party, but I’m glad to see that there are some other computer hackers here too :+1:

1 Like


RustyDaemon. Coding CarScraper, 2020, colorized
(and finally learning that Scraper is with one P lol)

On a serious note, why late? The party is just getting started. Feel free to contribute if you’d like :+1: :+1:

3 Likes

How common are the three dealer platforms outside of BMW?

From what I’ve seen, very common. For example, every Genesis dealer website I’ve seen is DealerInspire, identical to each other. Other makes might have slight variations of the same platforms, or some regional dealers on another platform, but it’s very common.

Cool. I will definitely check it out!

I know nothing about UIs, but I’d be interested in helping to port to a serverless cloud setup for the backend.

1 Like

That sounds awesome, you can pull the code from here: https://github.com/clasys/CarScrapper

… SolrNet? :thinking: GitHub - SolrNet/SolrNet: Solr client for .Net

Below is just some junk (that apparently loses formatting) for the moment, but I can look into this more if desired…

public class HomeController : Controller
{
    // ISolrOperations<T> needs the indexed document type;
    // "Car" is a placeholder POCO here, substitute the project's model.
    private readonly ISolrOperations<Car> solr;

    public HomeController(ISolrOperations<Car> solr)
    {
        this.solr = solr;
    }
}

I’ve been thinking more along the lines of async request/response using REST services… Something that works like this:

We can kill 2 birds with one stone here:

  1. If I abstract the core functionality into RESTful API services that can be consumed by anything that speaks REST, then we’re not bound to Azure/.NET/Microsoft for the UI.
  2. After that’s done, I could start building an asynchronous client that consumes the search service in MVC, and someone else could start building another client in something else. That gives us flexibility in choosing platforms for the UI.

Let me know your thoughts.

1 Like

So you’re telling me there’s a chance I get to venture into the land of the unknown (at least for me) and finally put that Swift tutorial to practical use? You mad lad, I’m in. :+1::grin:

1 Like

All right! I’m gonna start wrapping the core search into a REST service, and we’ll take it from there. It’ll probably take a decent amount of time, but we’ll see as it goes.

1 Like

First service is up and running in azure:
https://api-carscraper.azurewebsites.net

API specification:
https://api-carscraper.azurewebsites.net/swagger/index.html

Invoking the CarSearch method from Swagger doesn’t work; apparently Swagger doesn’t like complex types as parameters in the request body, but you can use Postman (free) to test it.

The search is still synchronous, so it will take a long time to respond for large searches, but I wanted to boot up the 1st revision so people can access it and see the definitions.

Feel free to start consuming search method, it’s up there just for that.

Next step is to make this asynchronous. The code was branched for the async work, if anyone wants to look at it: https://github.com/clasys/CarScrapper/tree/AsyncREST

1 Like

I haven’t used swagger before. The error message I get is:

TypeError: Failed to execute 'fetch' on 'Window': Request with GET/HEAD method cannot have body.

Normally GET requests carry their arguments as query parameters. Does changing the API to POST fix Swagger?

The curl commands generated by swagger work fine in either case. What database of dealers is being used?

Swagger doesn’t seem to work for invoking methods with parameters in the request body, but it’s there just for the API schema. Use Postman, or plain cURL, for invoking.

Agreed, normally I would do it that way, but there is a relatively complex type I need as a parameter, so I decided to implement it this way.
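As an illustration of why a complex type ends up in the request body rather than the query string, here is a hypothetical search payload; the field names below are made up for the example, the real ones come from the Swagger schema:

```python
import json

# Hypothetical shape of the complex search parameter -- these field
# names are illustrative only, not the API's actual contract.
search_request = {
    "make": "BMW",
    "model": "330i",
    "newOrUsed": "New",
    "zipCode": "10001",
    "radiusMiles": 50,
}

# A nested/structured type like this travels naturally as a JSON
# request body (POST), not as flat GET query-string parameters.
body = json.dumps(search_request)
print(body)
```

This is the same body you would paste into Postman or pass to cURL with `-d`.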

There is no database of dealers. It’s a scraping tool that hits live dealer websites for information. Dealer info is stored in a config file. Any dealer whose website is compatible with one of the 3 implementations could be added to the config.

Yeah, that is true for GET requests. It sounds like it can do it for POST requests.

There is no database of dealers. It’s a scraping tool that hits live dealer websites for information. Dealer info is stored in a config file. Any dealer whose website is compatible with one of the 3 implementations could be added to the config.

Yeah, I meant which dealer sites are in the config file for the API?

1 Like

Async REST services are here! :drum: :drum: :drum:

This is how to consume:

You call the service to start a search and get back a result key plus endpoint information telling you where results will be available when ready; then you poll that result endpoint periodically until the results are ready.

Details:

  1. We now have 2 operations:
    (screenshot of the two operations)

  2. First you invoke the “StartSearch” operation. It synchronously returns a “202 Accepted” status with ticket information, while kicking off the actual search asynchronously in the background.


    You will get back a searchKey for results retrieval, a RetryAfter value telling you how often you should poll the result endpoint, and the result endpoint Uri.

  3. You start periodically polling the results endpoint (but no more often than retryAfter suggests), supplying the searchKey as a parameter. If the results aren’t ready yet, you will get a “202 Accepted” code with a payload indicating that the search is still in progress.

  4. Keep polling that endpoint until you get the search results. You get either a payload with a success status and an array of results, or one with a failure status and an error message. In both cases the code will be 200 OK.

  5. Use the results to display in your favorite UI using your favorite UI framework.
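The polling flow above can be sketched as a small client. The field names (searchKey, retryAfter) and status codes follow the description above, but the transport is stubbed out here, so treat the payload shapes as assumptions until checked against the live Swagger schema:

```python
import time

def poll_for_results(start_search, get_results, max_attempts=60):
    """Generic async-REST polling loop.

    start_search() -> (status, ticket), where the ticket carries
        searchKey, retryAfter (seconds) and the result endpoint URI.
    get_results(search_key) -> (status, payload); 202 means the search
        is still running, 200 carries the final success/failure payload.
    """
    status, ticket = start_search()
    if status != 202:
        raise RuntimeError(f"StartSearch failed with HTTP {status}")

    for _ in range(max_attempts):
        status, payload = get_results(ticket["searchKey"])
        if status == 200:
            return payload  # success or failure is inside the payload
        time.sleep(ticket["retryAfter"])  # never poll faster than advised
    raise TimeoutError("search did not finish in time")

# Stub transport standing in for real HTTP calls, to show the contract:
def fake_start():
    return 202, {"searchKey": "abc123", "retryAfter": 0,
                 "resultUri": "/results"}

attempts = {"n": 0}
def fake_results(key):
    attempts["n"] += 1
    if attempts["n"] < 3:
        return 202, {"status": "InProgress"}
    return 200, {"status": "Success", "results": ["2020 BMW 330i xDrive"]}

result = poll_for_results(fake_start, fake_results)
print(result)
```

In a real client the two stubs would be HTTP calls to StartSearch and the result endpoint, with the JSON payloads decoded into those dicts.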

Please feel free to test it out, and report any problems that you find. I’ll push latest code to Git shortly.

Edit: changed the search service to POST, for better compatibility with clients built on JS frameworks.

3 Likes

Cool. Is this implemented with some type of async Azure primitive (functions)?

No, nothing Azure-specific in the implementation, just an ASP.NET Web API controller on .NET Core 3.1.

The async behavior comes from the usage pattern, just without redirecting to additional resource URIs.