Semantics at this point: "data-generating process" is another term for the same idea. A model seeks to mimic or match the real process to a reasonable approximation; hence a "true model" would be the scoring engine or data relationship that represents reality completely.
Unless you have data, that is. The existence of the data is itself the basis for assuming that some process generated it. The only alternative is that the data has existed for all eternity and therefore could never have been collected.
Suppose I take the function y = log(x) and add random white noise, so that y = \log(x) + \epsilon. The function \log() together with the parameters of the white-noise process is the data-generating process. We could then fit a model y = \beta X + \epsilon and compare the "true" (first) model to our second model. When the natural world generates our data, the idea is the same: there is a process that generates the data, and the data reveals information about that process to an approximate degree.
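As a minimal sketch of that comparison (the noise level, sample size, and range of x here are arbitrary illustrative choices, not anything prescribed above), one could simulate data from y = \log(x) + \epsilon and fit the linear model to it:

```python
import numpy as np

rng = np.random.default_rng(0)

# "True" data-generating process: y = log(x) + white noise
n = 200
x = rng.uniform(1, 10, size=n)
y = np.log(x) + rng.normal(scale=0.2, size=n)

# Candidate model: y = beta0 + beta1 * x + epsilon, fit by least squares
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Compare the residual error of the fitted linear model
# against the error of the true process itself
rss_linear = np.sum((y - X @ beta) ** 2)
rss_true = np.sum((y - np.log(x)) ** 2)
print(f"linear fit RSS: {rss_linear:.2f}, true process RSS: {rss_true:.2f}")
```

With choices like these, the linear fit's residual error typically sits above that of the true process, because a straight line can only approximate the curvature of log(x); that gap is the sense in which the data reveals the generating process only approximately.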
Sure! Most of the equations you find in your nearest physics or chemistry book are validated experimentally/empirically, and that validation amounts to approximating the data-generating process better and better.
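In the same spirit, here is a hypothetical sketch (the experiment, numbers, and noise level are invented for illustration): treat t = \sqrt{2h/g} as the data-generating process behind a free-fall experiment and recover g from noisy timing measurements.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical free-fall experiment: t = sqrt(2h / g) plus measurement noise
g_true = 9.81
heights = np.linspace(1.0, 20.0, 40)  # drop heights in metres
times = np.sqrt(2 * heights / g_true) + rng.normal(scale=0.02, size=heights.size)

# Estimate g by least squares through the origin on t^2 = (2/g) * h
slope = np.sum(heights * times**2) / np.sum(heights**2)
g_hat = 2 / slope
print(f"estimated g: {g_hat:.2f} m/s^2")
```

The estimate lands close to 9.81 because the assumed equation approximates the (simulated) generating process well; a badly mis-specified equation would leave systematic residuals no matter how much data were collected.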