We see much discussion of late about the importance of not only preserving scientific data but also making that data freely accessible. It is surely the case that much valuable data, produced at great cost, is subsequently discarded. But making data accessible incurs a cost (in some cases, a substantial one). Thus, any systematic program aimed at preserving more data requires a systematic process for deciding which data is worth keeping.
I suggest that we can usefully approach the question of data preservation from an economic perspective. Data costs money to collect or create, a cost that may depend on both the data d involved and the time t at which it is created: thus, C(d, t). (For some data, this cost may decline over time, as instrumentation becomes cheaper; for data pertaining to a natural phenomenon that does not recur, the future cost may be viewed as infinite.) Similarly, it costs money to preserve data from time t1 to time t2: P(d, t1, t2), a cost that encompasses both data preparation and storage. Preservation costs will vary widely according to how the data is to be used: data stored on tape is cheap but inaccessible; data stored on a data-intensive supercomputing (DISC) system is expensive, but easily computed on.
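To make the notation concrete, here is a minimal sketch of the two cost functions, C(d, t) and P(d, t1, t2). All of the dollar figures, the 5-year cost-halving assumption, and the storage tiers' prices are invented for illustration; they are not drawn from any real pricing data.

```python
# Illustrative model of C(d, t) and P(d, t1, t2).
# Every number below is a made-up assumption for the sake of the example.

def creation_cost(size_gb: float, year: int, repeatable: bool = True) -> float:
    """C(d, t): cost to (re)create data of a given size at time t.
    Assumes instrumentation costs halve every 5 years; an observation
    of a phenomenon that does not recur cannot be recreated at any price."""
    if not repeatable:
        return float("inf")
    base_cost_per_gb = 100.0         # assumed cost per GB in the reference year
    halvings = (year - 2012) / 5.0   # assumed 5-year halving period
    return size_gb * base_cost_per_gb * 0.5 ** halvings

# Annual storage price varies widely with the access model, per the text:
# tape is cheap but inaccessible; a DISC system is costly but computable-on.
STORAGE_COST_PER_GB_YEAR = {
    "tape": 0.01,
    "disk": 0.10,
    "disc_system": 1.00,
}

def preservation_cost(size_gb: float, years: float, tier: str,
                      prep_cost: float = 0.0) -> float:
    """P(d, t1, t2): one-time preparation cost plus recurring storage."""
    return prep_cost + size_gb * STORAGE_COST_PER_GB_YEAR[tier] * years
```

For example, under these assumptions, keeping 1 TB on tape for a decade costs about $100, while keeping it on a DISC system costs about $10,000, a two-orders-of-magnitude spread from the access model alone.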
Data also has value, and this value plays a role in determining whether data should be preserved: we may only want to preserve data whose future value meets or exceeds the cost of its preservation. However, the value of a piece of data, V(d), like the value of anything, is hard to pin down. (Generally accepted accounting principles prohibit treating data as an asset. Yet there are many examples of apparently useless data, preserved by happenstance, that turn out to have great value: for example, the observations of 19th-century naturalists.) That is not to say that we cannot attempt an estimate. We can use the cost of generating the data in the first place as a lower bound on its value. We can use market mechanisms to assign prices, by requiring data consumers to pay for access. We can measure data usage, and use that information to infer value. We can observe how much people spend on preserving data, and use that number as another estimate. We can ask communities to assign values to data: in effect, crowdsourcing the value-assignment problem.
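One simple way to combine these imperfect signals, sketched below, is to treat each available estimate as a lower bound on V(d) and take the largest one. This aggregation rule is my own assumption for illustration, not something the estimates above strictly guarantee.

```python
from typing import Optional

def estimated_value(generation_cost: float,
                    market_revenue: Optional[float] = None,
                    observed_preservation_spend: Optional[float] = None,
                    community_estimate: Optional[float] = None) -> float:
    """Estimate V(d) by treating each available signal as a lower bound
    and returning the largest. Generation cost is always available as a
    floor; the other signals are used only if observed."""
    signals = [generation_cost, market_revenue,
               observed_preservation_spend, community_estimate]
    return max(s for s in signals if s is not None)
```

So a dataset that cost $100 to generate but for which consumers have paid $250 in access fees would be valued at no less than $250.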
I expect that a systematic investigation of data costs and values will be tremendously illuminating. We will likely find that the costs and values associated with different data, in different fields, and for different researchers vary by orders of magnitude. I also expect that once even rough numbers are assigned, they will motivate changes in behavior and policy that reduce costs and/or accelerate discovery. For example, we may determine that some data is best regenerated when it is required, rather than stored indefinitely, because P(d, now, then) > C(d, then) for the expected future access time then. We can encourage researchers to write realistic long-term data preservation costs into budgets, and reviewers can evaluate those costs relative to the expected value of the data to be preserved.
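The regenerate-versus-store rule P(d, now, then) > C(d, then) can be written down directly. The dollar figures in the example are again hypothetical:

```python
def should_regenerate(preservation_cost_until_then: float,
                      recreation_cost_then: float) -> bool:
    """Regenerate on demand rather than store indefinitely when
    P(d, now, then) > C(d, then). Data that cannot be recreated
    (infinite C) is never a candidate for regeneration."""
    return preservation_cost_until_then > recreation_cost_then

# Hypothetical example: storing a dataset until its expected access time
# costs $8,000, while falling instrument costs mean recreating it at that
# time would cost only $5,000, so regeneration wins.
decision = should_regenerate(8000.0, 5000.0)
```

Note that for unrepeatable observations, C(d, then) is effectively infinite, so the inequality never holds and preservation is the only option, whatever it costs.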
Please let me know your thoughts.