© Tom Bäckström

Is my data any good?

Trustworthiness of research databases

Tom Bäckström
Feb 28, 2020

--

Evidence-based research (often known as “science”) and the development of new technologies are based on collecting, processing and analysing data. Researchers themselves, and the scientific method in general, therefore take many precautions to make sure that results reliably reflect the evidence present in the data. For example, to make sure that researchers from another group can verify the results of my group, it is good practice to openly publish the data as well as the analysis software. Independent researchers can then reproduce my results by running their own analysis with the same data and software.

Machine learning methods are a category of research tools which is particularly hungry for data. Even more than conventional research tools, they are highly dependent on the data and sensitive to its quality. In particular, machine learning methods only give answers, and do not usually tell you how they reached those conclusions. It is therefore hard to verify the trustworthiness of a machine learning method directly; instead, we must rely on the trustworthiness of the database from which it was created.

For example, does a face image database represent a balanced sample of the population, such that a face-recognition algorithm developed from that database is equally accurate for all ethnic groups? How do we know, in general, if a database fulfils such requirements of trustworthiness?
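A first, crude way to probe the face-database question is to count how large a share of the samples each group has, and to check whether accuracy differs between groups. The following is only a minimal sketch under stated assumptions: the group labels, the toy predictions and the helper names `group_shares` and `accuracy_by_group` are hypothetical and do not come from any real database or library.

```python
from collections import Counter, defaultdict


def group_shares(groups):
    """Return the fraction of samples that belongs to each group."""
    counts = Counter(groups)
    total = len(groups)
    return {g: n / total for g, n in counts.items()}


def accuracy_by_group(groups, y_true, y_pred):
    """Return classification accuracy computed separately for each group."""
    correct = defaultdict(int)
    seen = defaultdict(int)
    for g, t, p in zip(groups, y_true, y_pred):
        seen[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / seen[g] for g in seen}


# Hypothetical toy data: one group label, true identity and
# predicted identity per face image.
groups = ["A", "A", "A", "A", "A", "A", "B", "B"]
y_true = [1, 2, 3, 4, 5, 6, 7, 8]
y_pred = [1, 2, 3, 4, 5, 0, 7, 0]

shares = group_shares(groups)
accuracy = accuracy_by_group(groups, y_true, y_pred)
print(shares)
print(accuracy)
```

In this toy case group B makes up only a quarter of the samples and is also recognised less accurately than group A. A real audit would need far more data, carefully defined groups and a measure of uncertainty, so numbers like these are a prompt for further questions rather than a verdict.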

This raises the question: how do we determine the trustworthiness of a research database? Every field of science has its own vocabulary, constraints and objectives, so it might seem futile to look for rules which cover all possible cases. However, Sir David Spiegelhalter has recently published a paper about the trustworthiness of algorithms, which provides a list of 7 questions to gauge trustworthiness. A beautiful aspect of this list is that the questions are human-readable, in the sense that I could show the list to my uncle and he would immediately understand it. The simplicity of the language also makes the list independent of the field of application.

I will therefore use Spiegelhalter’s list as a starting point to create a list of questions for determining the trustworthiness of databases. As with Spiegelhalter’s list, my objective is to use simple language such that the questions are easily understandable, but also such that they are not specific to any particular field of science. For the list of questions, my first draft is:

  1. Is the data any good when used in new parts of the real world or at a later point in time?
  2. Would a simpler or smaller database, one that is more transparent and robust, be just as good?
  3. Could I explain the purpose (in general) of the data to anyone who is interested?
  4. Could I explain how the data is used in a particular case?
  5. Does the database clearly state when it is on shaky ground, and can it acknowledge uncertainty?
  6. Do people use the data appropriately, with the right level of scepticism?
  7. Does the data actually help in practice?

I threw this list together on a whim, so it is still a draft, and I intend to update it when improvements come up. Still, I think it is already pretty good.

Note that this list is purely about the trustworthiness of data and entirely omits the aspect of ethics. In particular, even if it is accurate and trustworthy, a database can violate the privacy of test subjects, or the data collection process can in itself be unethical. Such problems are left for future work.

--

Tom Bäckström

An excited researcher of life and everything. Associate Professor in Speech and Language Technology at Aalto University, Finland.