
Canadian AVM training data: repeat-sale pairs, lifecycle coverage, and the property identification problem

An automated valuation model is only as good as the data it learns from. In Canada, the binding constraint on AVM quality is rarely the algorithm. It is whether the training set contains enough verified repeat-sale pairs, tied to a persistent property identifier, with lifecycle history attached. BrightCat's Canadian Home Price Index dataset provides 194,167 such pairs, drawn from a weekly pipeline that has operated continuously since 2014.

What an AVM actually needs

An AVM produces a point estimate of a property's current value. To do that well, it needs four inputs: observed transaction prices, property-level attributes, temporal coverage sufficient to model market movement, and geographic coverage sufficient to model local variation. The first of these is where most Canadian AVM training data breaks down.

Individual sale prices are useful but limited. A single transaction tells the model what one property was worth on one date. The richer signal comes from watching the same property sell more than once. The price change between two sales of the same property, net of general market movement, is the cleanest observable evidence of what that specific property is worth relative to its peers.

That is a repeat-sale pair. It is the unit of learning for the time dimension of any serious valuation model.

A repeat-sale pair is a single property sold twice, with both sale prices verified, both transaction dates confirmed, and a time gap between them long enough that the second sale reflects genuine market exposure rather than a clerical re-recording of the first.
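One way to make "price change net of general market movement" concrete is to subtract the log change in a local market index from the log change in the property's own price. The sketch below assumes such an index is available for both sale dates; the function name, the index levels, and the prices are all hypothetical.

```python
import math

def market_adjusted_log_return(first_price: float, second_price: float,
                               index_at_first: float, index_at_second: float) -> float:
    """Log price change of one property minus the log change in a market index."""
    if min(first_price, second_price, index_at_first, index_at_second) <= 0:
        raise ValueError("prices and index levels must be positive")
    property_change = math.log(second_price / first_price)
    market_change = math.log(index_at_second / index_at_first)
    return property_change - market_change

# Hypothetical pair: the property rose 32% while the local index rose 20%.
excess = market_adjusted_log_return(500_000, 660_000, 100.0, 120.0)
print(f"property-specific log change: {excess:.4f}")
```

A positive result means the property appreciated faster than its market between the two sales; a negative result means it lagged.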

The property identification problem

Building repeat-sale pairs in Canadian data is harder than it sounds. The core difficulty is that a single property may carry different identifiers across its listing history. Listing numbers are reassigned when a property relists. Addresses are recorded with variations in punctuation, unit formatting, and directional suffixes. A property that sold in 2015, relisted in 2019 under a new listing number, and sold again in 2021 may appear as three unrelated records in any system that relies on the listing number as the join key.

To produce accurate repeat-sale pairs, the pipeline needs a persistent property identifier: a stable reference that links every record touching the same physical property, regardless of listing number, agent change, relist, or cosmetic address variation. That identifier is not something an AVM can generate from the data it sees at training time. It has to be produced upstream, in the pipeline that assembles the training set.

BrightCat's pipeline produces that identifier as part of weekly processing. Every residential and commercial record flowing through the pipeline since 2014 has been assigned a persistent property identifier, reconciled across relists, address variations, and agent transitions. That work is what makes 194,167 verified pairs possible across the Canadian dataset. Without it, the same underlying transactions would produce a far smaller, noisier pair set.
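To illustrate why the listing number fails as a join key, here is a minimal sketch that collapses cosmetic address variation into a comparison key and assigns one identifier per key. It is not BrightCat's reconciliation method, which the text does not describe. The lookup tables, function names, listing numbers, and ID format are assumptions, and address text alone would not be enough in a real pipeline.

```python
import re
from typing import Dict, Iterable, Tuple

# Illustrative lookup tables only; real address data needs far more cases.
DIRECTIONS = {"north": "n", "south": "s", "east": "e", "west": "w"}
STREET_TYPES = {"street": "st", "avenue": "ave", "road": "rd", "drive": "dr"}
UNIT_WORDS = {"unit", "apt", "apartment", "suite", "ste"}

def normalize_address(raw: str) -> str:
    """Reduce punctuation, unit-format, and directional variation to one key."""
    text = raw.lower().replace("#", " unit ")
    unit = None
    # "204-123 Main St" style: a leading unit number joined by a hyphen.
    # (A hyphenated civic range such as "123-125 Main St" would be misread.)
    match = re.match(r"^\s*(\w+)\s*-\s*(\d+\s.*)$", text)
    if match:
        unit, text = match.group(1), match.group(2)
    tokens = re.sub(r"[^\w\s]", " ", text).split()
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in UNIT_WORDS and i + 1 < len(tokens):
            unit = tokens[i + 1]   # keep the unit, drop the unit keyword
            i += 2
            continue
        out.append(DIRECTIONS.get(tok, STREET_TYPES.get(tok, tok)))
        i += 1
    key = " ".join(out)
    return f"{key} unit {unit}" if unit else key

def assign_property_ids(listings: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each listing number to an ID shared by every listing at one address."""
    ids_by_key: Dict[str, str] = {}
    result: Dict[str, str] = {}
    for listing_number, raw_address in listings:
        key = normalize_address(raw_address)
        result[listing_number] = ids_by_key.setdefault(key, f"P{len(ids_by_key) + 1:06d}")
    return result

# Hypothetical listings: the first two are the same condo unit, relisted.
ids = assign_property_ids([
    ("X1001", "123 Main St. W, Unit 204"),
    ("X2002", "204-123 Main Street West"),
    ("X3003", "125 Main St W"),
])
print(ids)
```

Joining on the listing number would treat X1001 and X2002 as unrelated properties. Joining on the assigned identifier puts them in one history, which is the precondition for detecting a repeat sale.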

What 194,167 pairs looks like in practice

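Every pair in the BrightCat Canadian Home Price Index dataset meets four conditions: both sales belong to the same property under the persistent property identifier, both sale prices are verified, both transaction dates are confirmed, and at least ninety days separate the two sales.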

Pairs that cannot satisfy all four conditions are excluded from the published series. The result is a training set where the price signal is as clean as the underlying transaction record permits. For AVM teams building or retraining models on Canadian data, that filtering is not a detail. It is the difference between a pair set that trains a stable model and a pair set that introduces coincidental noise the model ends up memorising.
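The filtering can be sketched as follows, using the pair definition given in this article: the same property, verified prices, confirmed dates, and a minimum gap between sales (the FAQ cites ninety days). The record layout and field names are hypothetical, and pairing only consecutive sales is a design choice of this sketch, not a documented rule.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional, Tuple

MIN_GAP_DAYS = 90  # minimum gap cited in the FAQ

@dataclass(frozen=True)
class Sale:
    property_id: str            # persistent property identifier
    sale_date: Optional[date]
    price: Optional[float]
    price_verified: bool
    date_confirmed: bool

def build_pairs(sales: List[Sale], min_gap_days: int = MIN_GAP_DAYS) -> List[Tuple[Sale, Sale]]:
    """Return (first, second) tuples of consecutive qualifying sales per property."""
    by_property: Dict[str, List[Sale]] = {}
    for sale in sales:
        # Drop records with a missing or unverified price or date.
        if sale.price is None or sale.sale_date is None:
            continue
        if not (sale.price_verified and sale.date_confirmed):
            continue
        by_property.setdefault(sale.property_id, []).append(sale)

    pairs = []
    for history in by_property.values():
        history.sort(key=lambda s: s.sale_date)
        for first, second in zip(history, history[1:]):
            # Reject gaps too short to reflect genuine market exposure.
            if (second.sale_date - first.sale_date).days >= min_gap_days:
                pairs.append((first, second))
    return pairs

# Hypothetical records: only P1 yields a qualifying pair.
sales = [
    Sale("P1", date(2015, 6, 1), 400_000, True, True),
    Sale("P1", date(2021, 3, 15), 650_000, True, True),
    Sale("P2", date(2019, 1, 10), 300_000, True, True),
    Sale("P2", date(2019, 2, 20), 305_000, True, True),   # 41-day gap
    Sale("P3", date(2018, 5, 5), 500_000, False, True),   # unverified price
    Sale("P3", date(2022, 5, 5), 700_000, True, True),
]
pairs = build_pairs(sales)
print(len(pairs), pairs[0][0].property_id)
```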

Why temporal depth matters for AVM training

Canadian housing markets moved through several distinct regimes in the past decade: the long run-up from 2014 through early 2022, the rate-shock correction that followed, and the regional divergence that emerged afterward. A pair set drawn only from recent years captures only part of that sequence. A pair set drawn from a pipeline running since 2014 lets the model see how the same property behaved across different market conditions.

BrightCat's underlying lifecycle dataset covers 5.8 million residential properties and 297,000 commercial properties, with listing and transaction activity tracked weekly over twelve years. The repeat-sale pair set is the subset of that history where the conditions above are all satisfied. As the pipeline adds weekly data, the pair set grows, existing pairs gain additional context from subsequent listing activity, and the geographic and temporal distribution thickens.

Why lifecycle context matters alongside the pairs

A repeat-sale pair shows the first and last sale. The useful context lives between them. A property that sold in 2017, relisted four times over the next five years at declining prices before selling again in 2022, is a different data point than a property that sold cleanly in 2017 and again in 2022 with no activity in between. The final prices may be identical. The signal to an AVM is not.

BrightCat's pair set retains the link to the underlying lifecycle record: every listing event, every price change, every drop, every relist, every status transition between the two sales. AVM teams that need this context can join it back through the persistent property identifier. Teams that just want the pair prices can use the pair set directly.
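As a rough illustration of that join, the sketch below counts lifecycle events that fall strictly between the two sale dates of a pair, keyed on the persistent property identifier. The event tuple layout and the event type names ("relist", "price_drop") are assumptions, not the dataset's actual schema.

```python
from collections import Counter
from datetime import date
from typing import Iterable, Tuple

Event = Tuple[str, date, str]  # (property_id, event_date, event_type), hypothetical layout

def lifecycle_between(events: Iterable[Event], property_id: str,
                      first_sale: date, second_sale: date) -> Counter:
    """Count event types for one property strictly between two sale dates."""
    return Counter(
        event_type
        for pid, event_date, event_type in events
        if pid == property_id and first_sale < event_date < second_sale
    )

# Hypothetical history: sold in 2017, relisted four times, sold again in 2022.
events = [
    ("P1", date(2018, 3, 1), "relist"),
    ("P1", date(2018, 9, 1), "price_drop"),
    ("P1", date(2019, 4, 1), "relist"),
    ("P1", date(2020, 5, 1), "relist"),
    ("P1", date(2021, 6, 1), "relist"),
    ("P2", date(2019, 1, 1), "relist"),   # different property, ignored
    ("P1", date(2023, 1, 1), "relist"),   # after the second sale, ignored
]
context = lifecycle_between(events, "P1", date(2017, 5, 1), date(2022, 8, 1))
print(context)
```

Counts like these could be attached to each pair as extra features, so a model can tell a clean resale from one that followed repeated failed listings.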

Coverage across the ten provinces

The Canadian Home Price Index dataset spans all ten provinces, with residential coverage anchored by the weekly listing and sold pipeline. Pair density varies by province, driven by underlying transaction volume. Ontario, British Columbia, and Alberta together account for the largest share of pairs, consistent with their share of national residential transaction volume. Smaller provinces are represented in proportion to their market size. The full provincial breakdown is available on request.

How the dataset is delivered

The Canadian Home Price Index dataset and the underlying repeat-sale pair table are part of BrightCat Core. Delivery options include Snowflake Marketplace, the MCP Connector, a Developer API, and weekly flat-file delivery.

All four channels draw from the same weekly pipeline. Pair counts, lifecycle history, and property-level attributes are consistent across channels.

What this dataset is not

It is not a published house price index series. Teams that need a single national or provincial index number for reporting should look at official statistical publications. BrightCat's strength is the underlying pair table: the raw inputs an AVM team, a portfolio analyst, or a quantitative research group would use to produce their own index or train their own model. The 194,167 pairs are the substrate, not the final aggregate.
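For teams that want to derive their own index from a pair table, one common starting point is an ordinary least squares repeat-sales regression: the log price ratio of each pair is modelled as the difference between the log index levels at the two sale periods. The sketch below is a minimal, unweighted version in pure Python. The input layout, function names, and numbers are hypothetical, and refinements such as weighting by holding period are omitted.

```python
import math
from typing import Iterable, List, Tuple

Pair = Tuple[int, float, int, float]  # (period_first, price_first, period_second, price_second)

def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("index not identified: some period lacks linking pairs")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def repeat_sales_index(pairs: Iterable[Pair], n_periods: int) -> List[float]:
    """Estimate index levels per period (base period 0 = 100) by least squares."""
    k = n_periods - 1                     # unknown log levels; period 0 is fixed at 0
    xtx = [[0.0] * k for _ in range(k)]   # accumulates X'X
    xty = [0.0] * k                       # accumulates X'y
    for t1, p1, t2, p2 in pairs:
        if t1 == t2:
            continue                      # same-period pairs carry no index information
        y = math.log(p2 / p1)
        row = {}                          # sparse row: -1 at first sale, +1 at second
        if t1 > 0:
            row[t1 - 1] = -1.0
        if t2 > 0:
            row[t2 - 1] = 1.0
        for i, xi in row.items():
            xty[i] += xi * y
            for j, xj in row.items():
                xtx[i][j] += xi * xj
    return [100.0] + [100.0 * math.exp(b) for b in solve(xtx, xty)]

# Hypothetical pairs generated from a market that rose 10% per period.
pairs = [
    (0, 400_000, 1, 440_000),
    (1, 300_000, 2, 330_000),
    (0, 500_000, 2, 605_000),
]
index = repeat_sales_index(pairs, n_periods=3)
print([round(level, 2) for level in index])
```

Because every period must be connected to the base period through at least one chain of pairs, thin pair coverage in a region or time window shows up directly as an unidentified or unstable index level.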

AVM quality is a function of training data quality. Training data quality, in Canadian property data, is a function of whether the pipeline can link the same property across time. BrightCat's dataset exists because that linkage has run weekly across twelve years of residential transaction history.
BrightCat Core · Canadian Home Price Index dataset · Updated weekly

Frequently asked questions

What is AVM training data?
AVM training data is the set of historical property transactions a valuation model learns from. The core input is repeat-sale pairs: two verified sales of the same property at different points in time, which let the model separate general market movement from property-specific value change.

How many repeat-sale pairs does BrightCat provide?
194,167 verified pairs across all ten provinces. A pair is a single property sold twice, with both sale prices and both transaction dates confirmed, and a minimum ninety-day gap between sales.

Why do AVMs need repeat-sale pairs specifically?
Single-sale records tell you what a property sold for once. Repeat-sale pairs tell you how value changed at the same property across time. That is the unit of learning for the time dimension of an AVM.

What is the property identification problem in Canadian data?
A single property may carry different identifiers across its listing history. A listing is assigned a new number on relist, and addresses are sometimes recorded inconsistently. Accurate repeat-sale pairs require a persistent property identifier that survives these changes.

Does BrightCat cover all Canadian provinces?
Yes. The dataset spans all ten provinces. Residential coverage is based on a weekly pipeline operating continuously since 2014.

How is the data delivered?
Through Snowflake Marketplace, the MCP Connector, a Developer API, or weekly flat-file delivery. The full dataset and enrichment layer are part of BrightCat Core.

How often does the data refresh?
Weekly. New sale events, new pairs, and updated lifecycle states land in the pipeline every week. Contact us for a sample.
