Preview: MF/RV scraper

RustyDaemon · November 8, 2020, 5:22am

A little preview of something I’ve been working on… Posted deals in “Share deals and tips” contain RV and MF values. It would be nice to have this info scraped into one place, searchable by make and model. Then possibly integrate scraping the Edmunds forums for this info as well…
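As a rough illustration of the idea, pulling MF and RV values out of a post body might look like the sketch below. The regular expressions and the `extract_mf_rv` name are assumptions made up for this example, not the actual scraper; real posts vary a lot in wording, so patterns like these would miss many cases.

```python
import re

# Hypothetical patterns: "MF" followed shortly by a value like .00125,
# and "RV" followed shortly by a percentage like 58%.
MF_RE = re.compile(r"\bMF\b[^\d.]{0,15}(0?\.\d{4,5})", re.IGNORECASE)
RV_RE = re.compile(r"\bRV\b[^\d]{0,15}(\d{2}(?:\.\d+)?)\s*%", re.IGNORECASE)


def extract_mf_rv(text):
    """Return the first MF and RV found in text, or None for each if absent."""
    mf = MF_RE.search(text)
    rv = RV_RE.search(text)
    return {
        "mf": float(mf.group(1)) if mf else None,
        "rv": float(rv.group(1)) if rv else None,
    }


if __name__ == "__main__":
    print(extract_mf_rv("MF: .00125, RV: 58% (36/10k)"))
```

Posts that state the values in a sentence without the "MF"/"RV" labels, or that quote several trims at once, would need more than this.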

Ursus · November 8, 2020, 6:04am

Not a criticism
Numbers without a model are pretty much useless. You may need to search the URLs for the model (ux250h in this case), but not all URLs will have it.
RV is also based on mileage. And some Share Deals have a marked-up MF
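A minimal sketch of the URL idea, assuming the topic URL contains a hyphenated slug and that a list of known model names is available. The `KNOWN_MODELS` set, the `model_from_url` name, and the example URL are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Illustrative only; a real list would have to come from somewhere else.
KNOWN_MODELS = {"ux250h", "rx350", "x5"}


def model_from_url(url):
    """Return the first known model name found in the URL path, else None."""
    path = urlparse(url).path.lower()
    # Split the path on anything that is not a letter or digit.
    for token in re.split(r"[^a-z0-9]+", path):
        if token in KNOWN_MODELS:
            return token
    return None


if __name__ == "__main__":
    print(model_from_url("https://example.com/t/signed-2021-lexus-ux250h-premium/12345"))
```

As noted, this returns None whenever the slug leaves the model out, so it can only be one signal among several.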

edmcman · November 8, 2020, 12:19pm

Scraping from Edmunds Deals (which is not the same as the forums) would probably be easier, more accurate, and more complete. This has been on my todo list for a while, but I’ve been busy.

That being said, scraping LH deal summaries like this is also incredibly valuable. I just don’t think it’s the best source for MF/RV.

RustyDaemon · November 8, 2020, 12:48pm

All valid points, keep them coming. I just started, and I was planning from the beginning to add Edmunds as a source to scrape. I just had to get past the hurdle of scraping dynamic sites that load data via JS (this forum). After that, it’s a scraping extravaganza!

mattevan · November 8, 2020, 3:04pm

Wonder what the Edmunds TOS would say about it. Why risk messing up a good thing? People are that lazy?

RustyDaemon · November 8, 2020, 5:47pm

1st API service operation is up. Search results since Oct 1:

https://pastebin.com/embed_iframe/zHi535UE

RustyDaemon · November 8, 2020, 6:03pm

That is actually a good point. Wouldn’t want to mess that up.

I skimmed through the visitor agreement, and for the website there seems to be a restriction on copying/gathering info. The forums, though, have a separate section. I didn’t see anything there, but I’ll let someone more knowledgeable chime in.

https://www.edmunds.com/about/visitor-agreement.html

ajgraham · November 8, 2020, 7:34pm

I salute the effort, but it reminds me of the old saying: garbage in, garbage out.

I did my only bit of scraping for my degree honours project (in the U.K.) ~15 years ago to analyse language usage on del.icio.us, back in the quaint old days of social bookmarking. Nowhere near this complex; just shunting data into the db, and then another script would analyse it.

I think scraping data from sentences is an uphill battle. There is very little structural uniformity; even the moderators don’t reply in a consistent way, and you have so many pieces of data you’re trying to scrape. I hate going to Edmunds for this info, but without at least some crowdsourcing to sanity-check it, I can’t see it being consistent enough to be reliable.
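One simple form a sanity check could take, sketched under the assumption that several scraped MF values exist for the same model and term: compare each value to the median and flag the ones that sit far from it (for example a marked-up MF). The `flag_outliers` name and the 15% tolerance are arbitrary choices for illustration.

```python
from statistics import median


def flag_outliers(values, tolerance=0.15):
    """Return (consensus, suspects) for a list of reported values.

    consensus is the median; suspects are the values that deviate from
    the median by more than the given fraction.
    """
    if not values:
        return None, []
    mid = median(values)
    suspects = [v for v in values if mid and abs(v - mid) / mid > tolerance]
    return mid, suspects


if __name__ == "__main__":
    reported = [0.00125, 0.00125, 0.00130, 0.00190]
    print(flag_outliers(reported))
```

This only helps when there are enough reports per model to form a consensus; with one or two data points it cannot tell a markup from a correct value.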