Re-imagining the laboratory record

1 Leave a comment on paragraph 1 0 Automated experimentation brings the promise of a much improved record of the research process. Where experiments are sufficiently well defined that they can be carried out by automated instrumentation or computational resources it is to be expected that an excellent record of process can and will be created. In “Big Science” projects from particle physics [1] to genome sequencing [2] the sharing of records about samples and objects, experimental conditions and outputs, and the processing of data is a central part of planning and infrastructure, and often a central part of justifying the investment of resources. As some segments biological science have become industrialized with greater emphasis on high throughput analysis and the generation of large quantities of data sophisticated systems have been developed to track the experimental process and to describe and codify the results of experiments through controlled vocabularies, minimal description standards [3], and ontologies [4].

2 Leave a comment on paragraph 2 0 None of this has had a major impact on the recording process applied to the vast majority of research experiments, which are still carried out by single people or small teams in relative isolation from other research groups. The vast majority of academic research is still recorded in paper notebooks and even in industry the adoption of electronic recording systems is relatively recent and remains patchy. A paper notebook remains a means of planning and recording experiments that is both flexible, comfortable to use, and has a long history of successful use. However, it is starting to fail as an effective means of recording, collating, and sharing data due to the increasing volume and changing nature of the data that researchers are generating, The majority of data generated today is born digital. The proportion of global data generated in 2002 that was recorded on hard disks was estimated at over 90% of a total of around five exabytes with print accounting for less than 0.05% of the total [5]. In the case of small laboratory data some printouts make it into bound notebooks. In most cases however, data remains distributed on a collection of laboratory and personal hard disks. The record of data analysis, the conversion of that digital data into new digital objects and finally into scientific conclusions is, in most cases, poorly recorded.

3 Leave a comment on paragraph 3 0 The question of reproducibility lies at the heart of scientific method and there are serious concerns that much currently published science is of limited value due to poor record keeping. Data sharing mandates from research funders are driven, at least in part, by a concern about reproducibility. Opposition to those mandates is driven to a significant extent by concerns of the value of sharing data that cannot be placed in context due to inadequate recording of its production. With digital instrumentation, more complex experiments, and data volumes increasing a paper based record is now longer capable of providing the necessary context.

4 Leave a comment on paragraph 4 0 It is noteworthy in this context that a number of groups have felt it necessary to take an active advocacy position in trying to encourage the wider community that the reproducibility of data analysis is a requirement, and not an added bonus [6,7]. The promise of digital recording of the research process is that it can create a reliable record that would support automated reproduction and critical analysis of research results. The challenge is that the tools for generating these digital records must outperform a paper notebook while simultaneously providing enough advanced and novel functionality to convince users of the value of switching

5 Leave a comment on paragraph 5 0 At the same time the current low level of adoption means that that field is wide open for a radical re-imagining of how the record of research can be created and used. It lets us think deeply about what value the different elements of that record have for use and re-use and to take inspiration from the wide variety of web-based data and object management tools that have been developed for the mass consumer market. This paper will describe a new way of thinking about the research record that is rooted in the way that the World Wide Web works and consider the design patterns that will most effectively utilize existing and future infrastructure to provide a useful and effective record.

The distinction between capturing process and describing an experiment

6 Leave a comment on paragraph 6 0 In discussing tools and services for recording the process of research there is a crucial distinction to be made between capturing a record of as it happens and describing an experiment after the event. There is an important distinction between data, the raw material produced by an experiment, including the record of that experiment; information, which places that data in a context that allows inferences and deductions to be made; creating knowledge. A large part of the tension between researchers who develop systems for describing knowledge in structured form and research scientists who need a record of the processes carried out in the laboratory derives from a misunderstanding about whether data, information, or knowledge is being recorded. The best way to maximise success in recording the important details of a research process is to capture the data and record as they are generated, or in the case of plans, before they are generated. However, most controlled vocabularies and description systems are built, whether explicitly or implicitly, with the intention of describing the knowledge that is inferred from a set of experiments, after the results have been considered. This is seen mostly clearly in ontologies that place a hypothesis at the core of the descriptive structure or assume that the “experiment” is a clearly defined entity before it has been carried out.

7 Leave a comment on paragraph 7 0 These approaches work well for the highly controlled, indeed, industrialised studies that they were generally designed around. However they tend to fail when applied to small scale and individual research, and particularly in the situations where someone is “trying something out”. Most of the efforts to provide structured descriptions of the research process start with the concept of an “experiment” that is designed to test a “hypothesis” (see e.g [8,9]). However in the laboratory the concept of “the hypothesis” very often doesn’t usefully apply to the detail of the experimental steps that need to be recorded. And the details of where a specific experiment starts and finishes are often dependent on the viewer, the state of the research, or the choices made in how to publish and present that research after the fact. Products or processes may be part of multiple projects or may be later used in multiple projects. A story will be constructed later, out of these elements, to write a paper or submit a database entry but at the time the elements of this story are captured the framework may be vague or non-existent. Unexpected results clearly do not fit into an existing framework but can be the launching point for a whole new programme. The challenge therefore is to capture the elements of the research process in such a way that the sophisticated and powerful tools developed for structured description of knowledge can be readily applied once the story starts to take form. That is the data and record should be captured in such a way that it can easily be placed in context to provide information, which in turn can be structured and presented as new knowledge.

The Web Native Lab Notebook

8 Leave a comment on paragraph 8 0 If we are to consider a web-native approach to capturing the scientific record we need first to consider the laboratory notebook. The lab notebook is, at its core, a journal of events, an episodic record containing dates, times, bits and pieces of often disparate material, cut and pasted into a paper notebook. There are strong analogies between this view of the lab notebook as a journal and the functionality of Web logs or “Blogs”. Blogs contain posts which are dated, usually linked to a single author, and may contain embedded digital objects such as images or videos, or indeed graphs and charts generated from online datasets as well as free or structured text. Each post has an address on the web, given by a post number or title (or both). Thus a Blog provides much of the functionality of a laboratory notebook: it feels like a journal and it can contain and present both free text and structured text such as tables. While it may not have page or volume numbers each post will have its own address on the web, a URL that points uniquely at that one piece of the record, and can be passed to collaborators to share specific objects or be used to index specific protocols, experiments, samples, or pieces of data.

A “semantic web ready” laboratory record

9 Leave a comment on paragraph 9 0 The creation of individually addressable objects is crucial because it enables these objects, whether they are datasets, protocols, or pointers to physical objects such as samples, to play a part in the semantic web [10]. The root concept of the semantic web is that the relationships between objects can be encoded and described. For this to be possible those objects must be uniquely addressable resources on the web. By creating individual posts or pages the researcher is creating these individual resources; and again these can represent physical objects, processes, or data. It is possible to describe the relationships between these resources via sophisticated semantic tools such as the Resource Description Framework (RDF) or locally via statements within the posts. However it not necessary to take these approaches as it is also possible to simply express relationships that directly leverages the existing toolset on the web is by linking posts together.

Feeds change the lab notebook from a personal record to a collaborative document

10 Leave a comment on paragraph 10 0 The other key functionality of the web to focus on is that of the “feed”. Feeds, whether they are RSS or Atom are XML documents that are regularly updated providing a stream of “events” which can then be consumed by various readers, Google Reader being one of the most popular. Along with the idea of hyperlinks between objects the feed provides the crucial difference between the paper based and web-native lab notebook. A paper notebook (whether it is a physical object or “electronic paper”) is a personal record. The web-native lab notebook is a collaborative notification tool that announces when something has happened, when a sample has been created, or a piece of data analysed.

11 Leave a comment on paragraph 11 0 Despite of the historical tendency to isolated research groups discussed above, these independent groups are banding together as research funders demand larger coordinated projects. Tasks are divided up by expertise and in many cases also divided geographically between groups that have in the past probably not even had good internal communication systems. Rapid and effective communication between groups on the details of ongoing projects is becoming more and more important and is increasingly a serious deficiency in the management of these collaborations. In addition reporting back to sponsors via formal reports is an increasing burden. The notification systems enabled via the generation of feeds go a significant way towards providing a means of dealing with these issues. Within a group the use of feeds and feed readers can provide an extremely effective means of pushing information to those who need to either track or interact with it. In addition the idea of selectively pushing specific elements of the record of interest to a specific group could also be adopted to push either raw productivity data or a much smaller subset of items, including summaries, of interest to funding agencies (Figure 1). The web native lab notebook should bring the collaborative authoring and discussion tools provided by the read-write web to bear on the problem of communicating research results.

12 Leave a comment on paragraph 12 0 Using feeds and feed readers to aggregate and push laboratory records. A) A screenshot of Google Reader showing an aggregated feed of laboratory notebook entries from http://biolab.isis.rl.ac.uk. Two buttons are highlighted which enable "sharing" to anybody who follows the user's feed or adding a tag. B) Sharing can also include annotating the entry with further information or tagging the entry to place it in a specific category. C) A tag can create a new feed. In this panel a feed has been created that contains all items tagged "neutron reflection". The feed for appriate tags can be consumed by readers with a specific interest such as collaborators on a specific project, regulatory agencies wishing to monitor specific classes of experiment, or funders who require access to research summaries.Using feeds and feed readers to aggregate and push laboratory records. A) A screenshot of Google Reader showing an aggregated feed of laboratory notebook entries from http://biolab.isis.rl.ac.uk. Two buttons are highlighted which enable “sharing” to anybody who follows the user’s feed or adding a tag. B) Sharing can also include annotating the entry with further information or tagging the entry to place it in a specific category. C) A tag can create a new feed. In this panel a feed has been created that contains all items tagged “neutron reflection”. The feed for appriate tags can be consumed by readers with a specific interest such as collaborators on a specific project, regulatory agencies wishing to monitor specific classes of experiment, or funders who require access to research summaries.

Integrating tools and services

13 Leave a comment on paragraph 13 0 With the general concept of the record as a blog in place, enabling us to create a set of individually addressable objects, and link them together, as well as providing feeds describing the creation of these objects, we can consider what tools and services we need to author and to interact with these objects. Again blogs provide a good model here as many widely used authoring tools can be used directly to create documents and publish them to blog systems. Tools based on the Atom Publishing Protocol [11,12] can hide the complications of publishing documents to the web from the user and made it easy to develop sophisticated services that can push content from one place to another online. Recent versions of Microsoft Office include the option of publishing documents to online services and a wide range of web services now make it easy to push content from wordprocessors, mobile phones, email, or any connected source to virtually any other.

14 Leave a comment on paragraph 14 0 A wide variety of web based tools and plugins are available to make the creation and linking of blog posts easy. Particularly noteworthy are tools such as Zemanta, a plugin which automatically suggests appropriate links for concepts within a post [13]. Zemanta scans the text of a post and identifies company names, concepts that are described in Wikipedia and other online information sources, using an online database that is built up from the links created by other users of the plugin. The service suggests possible links and tags to the users, and then exploits the response of the user to those suggestions to refine the model for future suggestions.

15 Leave a comment on paragraph 15 0 Sophisticated semantic authoring tools such as the Integrated Content Environment (ICE) developed at the University of Southern Queensland [14,15] provide a means of directly authoring semantic documents that can then be published to the web. ICE can also be configured to incorporate domain specific semantic objects that generate rich media representations such as three dimensional molecular models. These tools are rapidly become very powerful and highly useable, and will play an important role in the future by making rich document authoring straightforward.

Where do we put the data?

16 Leave a comment on paragraph 16 0 With the authoring of documents in hand we can consider the appropriate way of handling data files. At first sight it may seem simplest to upload data files and embed them directly in blog posts. However, the model of the blog points us in a different direction here again. On a blog images and video are not generally uploaded directly, they are hosted on an appropriate, specialised, external service and then embedded them on the blog page. Issues about managing the content and providing a highly user-friendly viewer are handled by the external data service. Hosting services are optimized for handling specific types of conten; Flickr for photos, YouTube (or Viddler or Bioscreencast) for video, Slideshare for presentations, Scribd for documents (Figure 2).

17 Leave a comment on paragraph 17 0 Researchers, instruments, or computers may create digital objects such as data, workflows, descriptions, and presentations as well as references to physical objects such as samples and materials. These objects can be stored in a wide range of services that provide specific functionality. Using feeds the record of these depositions can be re-aggregated, tracked, and processed using standard tools such as Feed readers.Researchers, instruments, or computers may create digital objects such as data, workflows, descriptions, and presentations as well as references to physical objects such as samples and materials. These objects can be stored in a wide range of services that provide specific functionality. Using feeds the record of these depositions can be re-aggregated, tracked, and processed using standard tools such as Feed readers.

18 Leave a comment on paragraph 18 0 While it might be argued that the development of these specialist services and embedding capabilities grew out of deficiencies in generic hosting platforms it is also true that these specialist platforms have exploited the economies of scale that arise from handling similar content together. Scientific infrastructure is resource limited and there is a strong argument that rather than building specialist publishing platforms it is more effective to use generic platforms for publishing. Specialist datahandling services can then grow up around specific data types and benefit from the economies of scale that arise from aggregating types together. In an ideal world there would be a trustworthy data hosting service, optimized for your specific type of data, that would provide cut and paste embed codes providing the appropriate visualizations in the same way that videos from YouTube can easily be embedded.

19 Leave a comment on paragraph 19 0 Some elements of these services exist for research data. Trusted repositories exist for structural data, for gene and protein sequences, and for chemical information. Large-scale projects are often required to put a specific repository infrastructure in place to make the data they generate available. And in most cases it is possible to provide a stable URL which points at a specific data item or dataset. It is therefore possible in many cases to provide a link directly to a dataservice that places a specific dataset in context and can be relied on to have some level of curation or quality control and provide additional functionality appropriate to the datatype. Currently many of these URLs encode database queries rather than providing a direct link. To be most effective and reliable such URLs need to be “Cool” [16]. That is they should be stable, human readable, and direct addresses rather than queries. Query engines may change, and database schemas may be modified, but the address of the underlying objects needs to stay constant for the linked data web to be stable enough to form.

20 Leave a comment on paragraph 20 0 What is less prevalent is the type of embedding functionality provided by many consumer data repository services. ChemSpider http://www.chemspider.com webcite is one example of a service that does enable the embedding of both molecules and spectra into external web pages. This is still clearly an area for development and there are discussions to be had about both the behind the scenes implementation of these services as well as the user experience but it is clear that this kind of functionality could play a useful role in helping researchers to connect information on the web up. If multiple researchers use the ChemSpider molecule embedding service to reference a specific molecule then all of those separate documents can be unambiguously assigned as describing the same molecule. This linking up of individual objects through shared identifiers is precisely what gives the semantic web its potential power.

21 Leave a comment on paragraph 21 0 A more general question is the extent to which such repositories can or will be provided and supported for less common data types. The long term funding of such data repositories is at best uncertain and at worst non-existent. Institutional repositories are starting to play a role in data archiving and some research funders are showing an interest. However there is currently little or no coordinated response to the problem of how to deal with archiving data in general. Piecemeal solutions and local archiving are likely to play a significant role. This does not necessarily make the vision of linked data impossible, all that is required is that the data be placed somewhere where it can be referenced via a URL. However, to enable rich functionality to manipulate and visualize that data it will be necessary to find funding sources and business models that can support and enable the development of high quality data repositories. In our model of the Blog as a lab notebook a piece of data can be uploaded directly to a post within the blog. This provides the URL for the data, but will not in and of itself enable visualization or manipulation. Nonetheless the data will remain accessible and addressable in this form. We can take a step forward by simply putting it on the web but to enable other researchers to use those objects most effectively it will be important to provide rich functionality. This will be best supported via provision in centralised services where economies of scale can be found.

22 Leave a comment on paragraph 22 0 A key benefit of this way of thinking about the laboratory record is that items can be distributed in many places depending on what is appropriate. It also means is that the search mechanisms we use to find objects and information on the web to index and search our own laboratory material. Web search relies primarily on mechanisms like Page Rank that prioritise how specific addresses are linked in to the wider web. By linking our record into that wide web we enable Google and other search engines to identify our most important datasets, based on how they are connected to the rest of our research record as well as to the wider research effort.

23 Leave a comment on paragraph 23 0 This is a lightweight way of starting to build up a web of data. It doesn’t provide the full semantic power of the linked data web as envisioned by Tim Berners-Lee and others but it also doesn’t hold the same challenges and fears for users. If we can get data up on the web and identify relationships between them it doesn’t matter so much to start with whether these relationships are fully described as long as there is enough contextual data to make it useful. Tagging or key-value pairs using the tools that are already available and more widely adopted by the general user community would enable us to make a good start on improving data availability and discoverability while the tools to provide more detailed semantic markup are developed.

24 Leave a comment on paragraph 24 0 However while distribution has benefits, it also poses significant risks. Services can fail, links can and do break, and interoperability is made more complex and can easily be compromised by developments of one service that are not mirrored on another. It would also seem at first sight to be opposed to integrative approaches that aggregate related objects together. However, such approaches, inspired by the “Datument” concept of Rzepa and Murray-Rust [17], can be more properly seen as providing the opportunity to aggregate, contain, and represent knowledge once it has been generated from the raw material. Our aim in distributing the elements of the record, the raw data, is therefore to provide the contextual information either through links, or through metadata, to make it straightforward to aggregate those elements into datuments for the presentation, publishing, and archival of knowledge.

Distributed sample logging systems

25 Leave a comment on paragraph 25 0 The same logic of distributing data according to where it is most appropriate to store it can also be applied to the recording of samples. In many cases, tools such as Laboratory Information Management System (LIMS) or sample databases will already be in place. In most cases these are likely to applied to a specific subset of the physical objects being handled; a LIMS for analytical samples, a spreadsheet for oligonucleotides, and a local database, often derived from a card index, for lab chemicals? As long as it is possible to point to the record for each physical object independently with the required precision you need then these systems can be used directly. Although a local spreadsheet may not be addressable at the level of individual rows Google Spreadsheets can be addressed in this way. Individual cells can be addressed via a URL for each cell and there is a powerful API that makes it possible to build services to make the creation of links easy. Web interfaces can provide the means of addressing databases via URL through any web browser or http capable tool.

26 Leave a comment on paragraph 26 0 Samples and chemical can also be represented by a post within a Blog. This provides the key functionality that we desired; a URL endpoint that represents that object. This can provide a flexible approach which may be more suited to small laboratories than heayweight, database backed systems, designed for industry. When samples involve a wide variety of different materials put to different uses, the flexibility of using an open system of posts rather than a database with a defined schema can be helpful.

27 Leave a comment on paragraph 27 0 In many cases it may be appropriate to use multiple different systems, a database for recording oligonucleotides, a spreadsheet for tracking environmental samples, and a full blown LIMS to enable barcoding and monitoring samples through preparation for sequencing. Similar to the data case, it is best to use a system that is designed for or best suited to creating a record for the specific set of samples. These systems are better developed than they are for data – but many of the existing systems don’t allow a good way of pointing at the record for specific samples from an external document – and very few make it possible to do this via a simple and cool URI.

Full distribution of materials, data, and process: The lab notebook as a feed of relationships

28 Leave a comment on paragraph 28 0 At this point it may seem that the core remaining component of the lab notebook is the description of the actions that link material objects and data files the record of process. However even these records could be passed to external services that might be better suited to the job. Procedures are also just documents. Maybe they are text documents, but perhaps they are better expressed as spreadsheets or workflows (or rather the record of running a workflow). These may well be better handled by external services, be they word processors, spreadsheets, or specialist services. They just need to be somewhere where, once again, it is possible to unambiguously point at them.

29 Leave a comment on paragraph 29 0 What we are left with is the links that describe the relationship between materials, data, and process, arranged along a timeline. The laboratory record, the web-native laboratory notebook, is reduced to a feed that describes these relationships; that notifies users when a new relationship is created or captured (Figure 3). This could be a simple feed containing plain hyperlinks or it might be a sophisticated and rich feed that uses one or more formal vocabularies to describe the semantic relationship between items. In principle it is possible to mix both, gaining the best of detailed formal information where it is available but linking in relationships that are less clearly described where possible. That is, this approach can provide a way of building up a linked web of data and objects piece by piece, even when the details of vocabularies are not yet agreed or in place.

30 Leave a comment on paragraph 30 0 A) A series of research objects collected as part of an experiment. A raw material is converted into a sample that is then subjected to analysis. A photo is taken of the sample. The analysis generates data that is in turn is described as consistent with a previously published model. The connection of the new data with the model, and onto the literature that describes the generation of that model wires the new, unpublished experiment, into the wider web. The dotted green lines show previously existing relationships between the model data and published papers. B) The same set of objects and relationships shown as a set of time dependent feeds generated by a set of services. Each object is accessible on a feed generated by the service where it is hosted. The "laboratory notebook" is the feed shown at the bottom which records and describes relationships between the research objects.A) A series of research objects collected as part of an experiment. A raw material is converted into a sample that is then subjected to analysis. A photo is taken of the sample. The analysis generates data that is in turn is described as consistent with a previously published model. The connection of the new data with the model, and onto the literature that describes the generation of that model wires the new, unpublished experiment, into the wider web. The dotted green lines show previously existing relationships between the model data and published papers. B) The same set of objects and relationships shown as a set of time dependent feeds generated by a set of services. Each object is accessible on a feed generated by the service where it is hosted. The “laboratory notebook” is the feed shown at the bottom which records and describes relationships between the research objects.

Implementation: Tools and services

31 Leave a comment on paragraph 31 0 At one level this framework of links and objects can be put together out of existing pieces from online tools and services but on another the existing tools are totally inadequate. What is lacking is an integrated framework for managing the wide range of research objects that we create. While there are a wide range of freely accessible sites for hosting data of different or arbitrary types, documents, and bookmarks, and these can be linked together in various ways there are very few tools and services that provide the functionality of managing links within the kind of user friendly environment that would be like to encourage adoption. Most existing web tools have also been built to be “sticky” so as to keep users on the site. This means that they are often not good at providing functionality to link out to objects on other services. A central problem is therefore the lack of good tools to integrate the management of this disparate set of objects.

32 Leave a comment on paragraph 32 0 The linked data web-native notebook described above could potentially be implemented using existing tools if an integration tool or framework could be implemented. A full implementation would involve a variety of structured documents placed on various services that used specific controlled vocabularies to describe them. The relationships between these documents would then be specified in a formal semantic feed such as for example a structured RSS feed generated via a separate integrating service or be available by querying a RDF data store via SPARQL. Either of these approaches would then require a controlled vocabulary or vocabularies to be used in description of relationships and therefore the feed.

33 Leave a comment on paragraph 33 0 In practice, while this is technically feasible, for the average researcher the vocabularies are often not available or not appropriate. The tools for generating semantic documents, whether XML or RDF based are not, where they exist at all, designed with the general user in mind. The average lab is therefore restricted to a piecemeal approach based on existing, often general consumer web services. This approach can go some distance, using wikis, online documents, and data visualization extensions. An example of this approach is described in [18] where a combination of Wikis, GoogleDoc Spreadsheets and visualization tools based on the GoogleChart API were used in a distributed set of pages that linked data representations to basic data to the procedures used to generate it through simple links. However, because these are just simple hypertext links in free text documents this approach currently can’t exploit the full potential of a semantic approach. A RDF description of the experiment was created in this case by developing a specific parser to convert from the spreadsheet and experimental records to RDF and deposit into a triple store.

34 Leave a comment on paragraph 34 0 Clearly there is a gap in the tool set for delivering a linked experimental record. But what is needed to fill that gap? High quality dataservices are required and are starting to appear in specific areas with a range of business models. Many of those that exist in the consumer space already provide the type of functionality that would be required including RSS feeds, visualization and management tools, tagging and categorization, and embedding capabilities. Slideshare http://www.slideshare.net webcite and Flickr (flickr.com) are excellent models for scientific data repositories in many ways.

35 Leave a comment on paragraph 35 0 Sophisticated collaborative online and offline document authoring tools are available. Online tools include blogs and wikis, and increasingly offline tools including Microsoft Word and Open Office, provide a rich and useable user experience that is well integrated with online publishing systems. Tools such as ICE can provide a sophisticated semantic authoring environment making it possible to effectively link structured information together.

The missing link

36 Leave a comment on paragraph 36 0 What are missing are the tools that will make it easy to connect items together. Semantic authoring systems solve part of the problem by enabling the creation of structured documents and in some cases by assisting in the creation of links between objects. However these are usually inward looking. The key to the web-native record is that it is integrated into the wider web, monitoring sites and feeds for new objects that may need to be incorporated into the record. These monitoring services would then present these objects to the user within a semantic authoring tool, ideally in a contextual manner.

37 Leave a comment on paragraph 37 0 The conceptually simplest system would monitor appropriate feeds for new objects and present these to the user as possible inputs and outputs. The user would then select an appropriate set of inputs and outputs and select the relationship between them from a limited set of possible relationships (is an input to, is an output of, generated data). This could be implemented as three drop down menus in its simplest form (Figure 4) but this would only apply after the event. Such a tool would not be particularly useful in planning experiments or as a first pass recording tool and would therefore add another step to the recording process.

38 Leave a comment on paragraph 38 0 A simple menu based tool for identifying the connections between research objects. A range of web services are monitored to identify new objects, data, processes, documents, or controlled vocabulary terms that may be relevant to the user. In this simple tool these are presented as potential subjects and objects in drop down menus. The relationship can be selected from a central menu. The output of the tool is a feed of relationships between web accessible objects.A simple menu based tool for identifying the connections between research objects. A range of web services are monitored to identify new objects, data, processes, documents, or controlled vocabulary terms that may be relevant to the user. In this simple tool these are presented as potential subjects and objects in drop down menus. The relationship can be selected from a central menu. The output of the tool is a feed of relationships between web accessible objects.

Capturing in silico process: A new type of log file

39 Leave a comment on paragraph 39 0 In an ideal world, the whole process of creating the linked record would be done behind the scenes, without requiring the intervention of the user. This is likely to be challenging in the experimental laboratory but is entirely feasible for in silico work and data analysis within defined tools. Many data analysis tools already generate log files, although as data analysis tools have become more GUI driven these have become less obvious to the user and more focussed on aiding in technical support. Within a given data analysis or computational tool objects will be created and operated on by procedures hard coded into the system. The relationships are therefore absolute and defined.

40 Leave a comment on paragraph 40 0 The extensive work of the reproducible research movement on developing approaches and standards to recording and communicating computational procedures has largely focussed on the production of log files and command records (or scripts) that can be used to reproduce an analysis procedure as well as arguing for the necessity to provide running code and the input and intermediate data files. In the linked record world it will be necessary to create one more “logfile” that describes the relationships between all the objects created by reference to some agreed vocabulary. This “relationships logfile” which would ideally be RDF or a similar framework is implicit in a traditional log file or script but by making it explicit it will be possible to wire these computational processes into a wider web of data automatically. Thus the area of computational analysis is where the most rapid gains can be expected to be made as well as where the highest standards are likely to be possible. The challenge of creating similar log files for experimental research is greater, but will benefit significantly from building up experience in the in silico world.

Systems for capturing physical processes

41 Leave a comment on paragraph 41 0 For physical experimentation, where it is more difficult to automatically capture the objects being used and the role they are playing, we need to develop tools that assist the scientist in capturing these connections. We will have feeds of potential materials, samples, and data but require tools that assist the user in authoring the description of process in a way that captures which objects have been used or created and the connections between them. Such an authoring tool would automatically tracks feeds of possible input and output objects and provide these as options as the researcher describes what they are planning or what they are doing. For example the description of an experiment that generates an image would trigger the system to look at recent items from the feed (or feeds) of the appropriate image service(s). Recent options would be presented to the user in a drop down menu for selection. The selection of the correct item would add a link from the document to the image. As the input samples have already been registered, either by logging commercial samples into the lab system, or having been generated automatically as outputs for a previous process typing “sample for treatment A” would trigger the system to link to that record and incorporate a machine readable statement within the document that “sample for treatment A” was the input for the process “an image was taken” which generated data (the image). The objects, including the description of the process, are then linked together, via the outgoing feed by their appearance in one process, making it possible to describe a composite object that is the overall experiment.

42 Leave a comment on paragraph 42 0 Such a system would consist of two parts; first an intelligent feed reader that monitors all the relevant feeds, laboratory management systems for samples, data repositories, or laboratory instruments, for new data. In a world in which it seems natural that the Mars Phoenix lander should have a Twitter account, and indeed the high throughput sequencers at the Sanger Centre send status updates via Twitter, the notion of an instrument automatically providing a status update on the production of data seems natural. What are less prevalent are similar feeds generated by sample and laboratory information management systems although the principles are precisely equivalent; when an object (data or sample or material) is created or added to the system a feed item is created and pushed out for notification.

43 Leave a comment on paragraph 43 0 The second part of the system is more challenging. Providing a means of linking inputs to outputs, via for example drop down menus is relatively straightforward. Even the natural language processing required to automatically recognise where links should be created is feasible with today’s technology. But capturing what those links mean is more difficult. There is a balance that needs to be found between providing sufficient flexibility to describe a very wide range of possible connections (“is an input to”, “is the data generated from”) and providing enough information, via a rigorously designed controlled vocabulary, to enable detailed automated parsing by machines. Where existing vocabularies and ontologies exist and are directly applicable to the problem in hand these should clearly be used and strongly supported by the authoring tools. In the many situations where they are not ideal then more flexibility needs to be supported allowing the researcher to go off the planned track in an unexpected direction. However at the same time ambiguity should be avoided. Thus such a system needs to “nudge” the user into using the most appropriate available structured description of their experiment, while allowing them the choice to use none, but at the same time helping them avoid using terms or descriptions that could be misinterpreted. Not a trivial system to build.

44 Leave a comment on paragraph 44 0 IMPLEMENTATION IN PRACTICE

45 Leave a comment on paragraph 45 0 Many efforts at building data infrastructures for the “average researcher” have been funded, designed and in some cases even built. Most of them have limited success. Part of the problem has always been building systems that solve problems that the “average researcher” doesn’t know that they have. Issues of curation and metadata are so far beyond the day to day issues that an experimental researcher is focussed on as to be incomprehensible. We clearly need better tools, but they need to be built to deal with the problems that researchers face. This post is my current thinking on a proposal to create a solution that directly faces the researcher, but offers the opportunity to address the broader needs of the community. What is more it is designed to allow that average researcher to gradually realise the potential of better practice and to create interfaces that will allow technical systems to build out better systems.

46 Leave a comment on paragraph 46 0 Solve the immediate problem – better backups

47 Leave a comment on paragraph 47 0 The average experimental lab consists of lab benches where “wet work” is done and instruments that are run off computers. Sometimes the instruments are in different rooms, sometimes they are shared. Sometimes they are connected to networks and backed up, often they are not. There is a general pattern of work – samples are created through some form of physical manipulation and then placed into instruments which generate digital data. That data is generally stored on a local hard disk. This is by no means comprehensive but it captures a large proportion of a lot of the work.

48 Leave a comment on paragraph 48 0 The problem a data manager or curator sees here is one of cataloguing the data created, creating a schema that represents where it came from and what it is. We build ontologies and data models and repositories to support them to solve the problem of how all these digital objects relate to each other.

49 Leave a comment on paragraph 49 0 The problem a researcher sees is that the data isn’t backed up. More than that, its hard to back up because institutional systems and charges make it hard to use the central provision (“it doesn’t fit our unique workflows/datatypes”) and block what appears to be the easiest solution (“why won’t central IT just let me buy a bunch of hard drives and keep them in my office?”). An additional problem is data transfer – the researcher wants the data in the right place, a problem generally solved with a USB drive. Networks are often flakey, or not under the control of the researcher so they use what is to hand to transfer data from instrument to their working computer.

50 Leave a comment on paragraph 50 0 The challenge therefore is to build systems under group/researcher control that the needs for backup and easy file transfer. At the same time they should at least start to solve the metadata capture problem and satisfy the requirements of institutional IT providers.

51 Leave a comment on paragraph 51 0 The Lab Box

52 Leave a comment on paragraph 52 0 The PI wants to know that data is being collected, backed up and is findable. They generally want this to be local. Just as researchers still like PDFs because they can keep them locally, researchers are happiest if data is on a drive they can physically locate and control. The PI does not trust their research group to manage these backups – but central provision doesn’t serve their needs. The ideal is a local system under their control that automates data backups from the relevant control computers.

53 Leave a comment on paragraph 53 0 The obvious solution is a substantial hard drive with some form of simple server that “magically” sucks data up from the relevant computers. In the best case scenario appropriate drives on the instrument computers are accessible on a local network. In practice life is rarely this simple and individually creating appropriate folder permissions to allow backups is enough of a problem that it rarely gets done. One alternate solution is to use the USB drive approach – add an appropriate USB fob to the instrument computer that grabs relevant data and transmits it to the server, probably over a dedicated WiFi network. There are a bunch of security issues on how best to design this but one option is a combined drive/Wifi fob where data can be stored and then transmitted to the server.

54 Leave a comment on paragraph 54 0 Once on the server the data can be accessed and if necessary access controls applied. The server system need not be complex but it probably does at least need to be on the local network. This would require some support from institutional IT. Alternately a separate WiFi network could be run, isolating the system entirely from both the web and the local network.

55 Leave a comment on paragraph 55 0 The data directory of instrument computers if replicated to a server via a private WiFi network. The server then provides access to those files through a web interface.

56 Leave a comment on paragraph 56 0 Collecting Metadata

57 Leave a comment on paragraph 57 0 The beauty of capturing data files at the source is that a lot of metadata can be captured automatically. The core metadata of relevance is “What”, “Who”, and “When”. What kind of data has been collected, who was the researcher who collected it and when was it collected. For the primary researcher use cases (finding data after the student has left, recovering lost data, finding your own data six months later) this metadata is sufficient. The What is easily dealt with as is the When. We can collect the original source location of the data (and that tells us what instrument it came from) and the original file creation date. While these are not up to standards of data curators who might want a structured description of what the data is it is enough for the user, it provides enough context for a user to apply their local knowledge of how source and filetype relate to data type.

58 Leave a comment on paragraph 58 0 “Who” is a little harder, but it can be solved with some local knowledge. Every instrument control computer I have ever seen has a folder, usually on the desktop helpfully called “data”. Inside that folder are a set of folders with lab members names on them. This convention is universal enough that with a little nudging it can be relied on to deliver reasonable metadata. If the system allows user registration and automatically creates the relevant folders then saving files to the right folder, and thus providing that critical metadata will be quite reliable.

59 Leave a comment on paragraph 59 0 The desired behaviour can be encouraged even further if files dropped into the correct folder are automatically replicated to a “home computer” thus removing the need for transfer via  USB stick. Again a “convention over configuration” approach can be taken in which the directory structure found on the instrument computers is simply inverted. A data folder is created in which a folder is provided for each instrument. As an additional bonus other folders could be added which would then be treated as if they were an instrument and therefore replicated back to the main server.

60 Leave a comment on paragraph 60 0

61 Leave a comment on paragraph 61 0 How good is this?

62 Leave a comment on paragraph 62 0 If such a system can be made reliable (and that’s not easy – finding a secure way to ensure data gets transferred to the server and managing the dropbox style functionality suggested above is not trivial) then it can solve for a remarkable number of the use cases faced by the small scale laboratory on a regular basis. It doesn’t work for massive files or for instruments that write to a database. That said in research labs, researchers are often “saving as” some form of CSV, text or Excel file type even when the instrument does have a database. It isn’t trivial to integrate into existing shared data systems for instance for departmental instruments. Although adaptors could be easily built they would likely need to be bespoke developments working with local systems. Again though frequently what happens in practice is the users make a local copy of the data in their own directory system.

63 Leave a comment on paragraph 63 0 The major limitations are that there is no real information on what the files really are, just an unstructured record of the instrument that they came from. This is actually sufficient for most local use cases (the users know what the instruments are and the file types that they generate) but isn’t sufficient to support downstream re-use or processing. However, as we’ve argued in some previous papers, this can be seen as a feature not a bug. Many systems attempt to enforce a structured view of what a piece of data is early in creation process. This works in some contexts but often fails in the small lab setting. The lack of structure, while preserving enough contextual information to be locally useful, can be seen as a strength – any data type can be collected and stored without it needing to be available in some sort of schema. That doesn’t mean we can’t offer some of that structure, if and when there is functionality that gives an immediate benefit back to the user, but where there isn’t an immediate benefit we don’t need to force the user to do anything extra.

64 Leave a comment on paragraph 64 0 Offering a route towards more

65 Leave a comment on paragraph 65 0 At this point the seasoned data management people are almost certainly seething if not shouting at their computers. This system does not actually solve many of the core issues we have in data management. That said, it does solve the problem that the community of adopters actually recognises. But it also has the potential to guide them to better practice. One of the points that was made in the LabTrove paper that described work from the Frey group that I was involved in was how setting up a virtuous cycle has the potential to encourage good metadata practice. If good metadata drives functionality that is available to the user then the user will put that metadata in. But more than that, if the schema is flexible enough they will also actively engage in improving it if that improves the functionality.

66 Leave a comment on paragraph 66 0 The system I’ve described has two weaknesses – limited structured metadata on what the digital objects themselves actually are and as a result very limited possibilities for capturing the relationships between them. Because we’ve focussed on “merely” backing up our provenance information beyond that intrinsic to the objects themselves is limited. It is reasonably easy to offer the opportunity to collect more structured information on the objects themselves – when configuring an instrument offer a selection of known types. If a known type of instrument is selected then this can be flagged to the system which will then know to post-process the data file to extract more metadata, perhaps just enriching the record but perhaps also converting it to some form of standard file type. In turn automating this process means that the provenance of the transformation is easily captured. In standard form the system might offer in situ visualisation or other services direct to the users providing an immediate benefit. Such a library of transformations and instrument types could be offered as a community resource, ideally allowing users to contribute to build the library up.

67 Leave a comment on paragraph 67 0 Another way to introduce this “single step workflow” concept to the user is suggested above. Let us say that by creating an “analysis” folder on their local system this gets replicated to the backup server. The next logical step is to create a few different folders, that receive the results of different types of analysis. These then get indexed by the server as separate types but without further metadata. So they might separate mass spectrometry analyses from UV-Vis but not have the facility to search either of these for peaks in specific ranges. If a “peak processing” module is available that provides a search interface or visualisation then the user has an interest in registering those folders as holding data that should be submitted to the module. In doing this they are saying a number of things about the data files themselves – firstly that they should have peaks but also possibly getting them converted to a standard format, perhaps saying what the x and y coordinates of the data are, perhaps providing a link to calibration files.

68 Leave a comment on paragraph 68 0 The transformed files themselves can be captured by the same data model as the main database, but because they are automatically generated the full provenance can be captured. As these full provenance files populate the database and become easier to find and use the user in turn will look at those raw files and ask what can be done to make them more accessible and findable. It will encourage them to think more carefully about the metadata collection for their local analyses. Overall it provides a path that at each step offers a clear return for capturing a little more metadata in a little more structured form. It provides the opportunity to take the user by the hand, solve their immediate problem and carry them further along that path.

69 Leave a comment on paragraph 69 0 Building the full provenance graph

70 Leave a comment on paragraph 70 0 Once single step transformations are available then chaining them together to create workflows is an obvious next step. The temptation at this point is to try to build a complete system in which the researcher can work but in my view this is doomed to failure. One of the reasons workflow systems tend to fail is that they are complex and fragile. They work well where the effort involved in a large scale data analysis justifies their testing and development but aren’t very helpful for ad hoc analyses. What is more workflow systems generally require that all of the relevant information and transformation be available. This is rarely the case in an experimental lab where some crucial element of the process is likely to be offline or physical.

71 Leave a comment on paragraph 71 0 Therefore in building up workflow capabilities within this system it is crucial to build in a way that creates value when only a partial workflow is possible. Again this is surprisingly common – but relatively unexplored. Collecting a set of data together, applying a single transformation, checking for known types of data artefacts. All of these can be useful without needing an end to end analysis pipeline. In many cases implied objects can be created and tracked without bothering the user until they interested. For instance most instruments require samples. The existence of a dataset implies a sample. Therefore the system can create a record of a sample – perhaps of a type appropriate to the instrument, although that’s not necessary. If a given user does runs on three instruments, each with five samples over the course of a day its a reasonable guess that those are the same five samples. The system could then offer to link those records together so a record is made that those datasets are related.

72 Leave a comment on paragraph 72 0 From here its not a big mental leap to recording the creation of those samples in another digital object, perhaps dropping it into a directory labelled “samples”. The user might then choose to link that record with the sample records. As the links propagate from record of generation, to sample record, to raw data to analysis, parts of the provenance graphs are recorded. The beauty of the approach is that if there is a gap in the graph it doesn’t reduce the base functionality the user enjoys, but if they make the effort to link it up then they can suddenly release all the additional functionality that is possible. This is essentially the data model that we proposed in the LabTrove paper but without the need to record the connections up front.

73 Leave a comment on paragraph 73 0 Challenges and complications

74 Leave a comment on paragraph 74 0 None of the pieces of this system are terribly difficult to build. There’s an interesting hardware project – how best to build the box itself and interact with the instrument controller computers. All while keeping the price of the system as low as possible. There’s a software build to manage the server system and this has many interesting user experience issues to be worked through. The two clearly need to work well together and at the same time support an ecosystem of plugins that are ideally contributed by user communities.

75 Leave a comment on paragraph 75 0 The challenge lies in building something where all these pieces work well together and reliably. Actually delivering a system that works requires rapid iteration on all the components while working with (probably impatient and busy) researchers who want to see immediate benefits. The opportunity if it can be got right is immense. While I’m often sceptical of the big infrastructure systems in this space that are being proposed and built, they do serve specific communities well. The bit that is lacking is often the interface onto the day to day workings of an experimental laboratory. Something like this system could be the glue-ware that brings those much more powerful systems into play for the average researcher.

Provenance and Security

76 Leave a comment on paragraph 76 0 A major concern with a distributed model of the research record is security and provenance. Security issues are raised both because valuable content is distributed amongst systems that are not under the institution’s control and therefore may be lost due to technical or financial failure, but also because valuable intellectual property is potentially wandering free on the open web. On top of this there are concerns about the reliability of provenance records; who deposited a specific piece of data or document, and when did they do so? Third party date stamps and identifiers might, at a naive level, seem a good record of priority but the standards and reliability of these third party services is rarely validated to any specific standard. In terms of provenance trails, how a specific statement or datafile has been arrived at, existing consumer focused tools are generally very poor at assisting the user in describing what derivative material they have used. Slideshare, while supporting Creative Commons licensing of slidesets does not provide any easy means of describing where a given slide or image derives from. Given that it is only recently that an intellectual property case has been concluded with a contribution from an electronic laboratory notebook, and the lack of widely accepted standards for maintaining and describing security and provenance on the web it will not surprising if corporate users in particular are unconvinced by the idea of a distributed and openly available model of their research records.

77 Leave a comment on paragraph 77 0 There are three main answers to such a criticism; the most water tight from a security perspective is that the whole network can be maintained behind a corporate firewall. Thus an internal community has access to all the functionality provided by the networked record but the content is “safe” from outside eyes. Such an arrangement loses the value of being wired into the wider information infrastructure of the public web but for large enough companies this may be acceptable. Hybrid inside/outside systems may be feasible but are likely to be porous, especially given the weak point in any security system is almost always the people who use it. Nonetheless there is value in creating shareable references to objects via URLs even if those objects themselves are not shared or accessible. There are a range of partial models that may be valuable in specific contexts.

78 Leave a comment on paragraph 78 0 The second possible answer to criticism is that we need to develop the tool set and standards that lets us describe and validate security and provenance of objects on the web. Distributed access and authentication control is an active area of research development with tools such as Shibboleth [21], OpenID [22], and OAuth [23] providing mechanisms for authentication and access control targeted at the consumer web. A wide range of developing offerings to support identity management for researchers is also developing utilising various underlying protocols. At one level distribution should make the description of resources more reliable. Where multiple services and multiple people point at resources the level of possible trust should be much greater than when one or two, often internal, signing authorities are being relied on. This distributed network of date stamps and comments while more diffuse is actually much more difficult to fake as it requires multiple different records, often created by different people, on different services and servers to be changed. It is also however more difficult to aggregate and document such a distributed record in a way that makes sense in the context of today’s record keeping systems and legal records.

79 Leave a comment on paragraph 79 0 The final answer is to say that if you’re attempting to build a collaborative research record that leverages networks and the web and you are focussed on protecting your content from other people then you are missing the point. The entire purpose of the exercise is to get the most out the information and the people manipulating it on the web by providing content that they can re-use, and in the process, pass useful information and connections back to you. If you do this in a partial way then you only get a partial benefit. It is only by allowing people outside your organization to wire your samples, data, and ideas into the wider linked open web that you gain benefits over and above what you can gain by simply using your own internal expertise more effectively. For very large companies an entirely internal web may be a viable strategy but for most small and medium companies or public research organizations the gains are potentially much greater than the losses. This doesn’t mean there is a free for all. Provenance and identity are also important in this view, but here it is less about demanding that credit be assigned and more about gaining the most from connections that other people make by understanding who they are and where they are coming from.

80 Leave a comment on paragraph 80 0 The problems of attribution, provenance, and access rights are general ones that are facing the consumer web as users gradually gain a more sophisticated understanding of what privacy means in a networked world. If the view is accepted, that in wiring the research record into the wider web that the benefits gained outweigh the losses, then it follows that issues of access control with respect to viewing objects are less important. However being able to reliably identify the author of a dataset or other object becomes much more important in determining how you will respond to the way they have interacted with your materials. Therefore provenance becomes a key issue and reliable authentication mechanisms that provide strong identities are crucial. There will no doubt be much development over the next few years of mechanisms for tracking the network of citations between objects on the web and using this to assign priority and precedence of ideas and statements. All of these will improve the toolset available to work with distributed research records of the type being discussed here.


81 Leave a comment on paragraph 81 0 The web and its organization fundamentally challenges the idea of the research record as a single document or record and removes the constraints created by a physical record or document to enable a multifaceted, multifunctional, and multimedia laboratory notebook. The web itself has evolved from its original form of linked static documents to dynamic services and sites populated by objects created by their users and content automatically aggregated and re-published in new contexts and new forms. The vision of the web as a dynamic network of addressable objects and their relationships can be traced right to its earliest origins in the early 1990s but it is only now being realized.

82 Leave a comment on paragraph 82 0 Fully exploiting the infrastructure and functionality of the web to create a web-native laboratory record requires re-thinking the traditional view of the laboratory notebook as a linear narrative of events. By creating individual addressable objects that refer to physical samples and materials, laboratory or computational processes, and data files, it is possible to create a dynamic record that can be directly linked into the growing semantic and linked data web. The web native laboratory notebook is a combination of the web of data and the web of things. If the best available services are used for each kind of object then the actual laboratory record can be reduced to a feed describing the relationships between these objects and functionality that has been specifically designed around those objects can be fully exploited.

83 Leave a comment on paragraph 83 0 Such an approach recognizes that the processes that occur in science, particularly in an experimental laboratory, are often haphazard. The key to capturing the research record, the issue we are focussed on here, is to ensure that the objects created during the research process are recognized, stored and indexed, and to make it as easy as possible for the researcher to record the relationships between these objects. The use of purpose built services for specific object types is intended to make their capture and “wiring up” as straightforward as possible. This distribution does not need to be seen as antithetical to integrative and aggregation approaches to capturing and describing knowledge [17]. Rather it can be thought of as an efficient way of generating and indexing the raw material, the data and information, that will make up these “datuments” to facilitate their production.

84 Leave a comment on paragraph 84 0 The beauty of this approach is that it doesn’t require users to shift from the applications and services that they are already using, like, and understand. What it does require is intelligent and specific repositories for the objects they generate that know enough about the object type to provide useful information and context. What it also requires is good plugins, applications, and services to help people generate the lab record feed. It also requires a minimal and arbitrarily extensible way of describing the relationships. This could be as simple html links with tagging of the objects (once you know an object is a sample and it is linked to a procedure you know a lot about what is going on) but there is a logic in having a minimal vocabulary that describes relationships (what you don’t know explicitly in the tagging version is whether the sample is an input or an output). But it can also be fully semantic if that is what people want. And while the loosely tagged material won’t be easily and tightly coupled to the fully semantic material the connections will at least be there. A combination of both is not perfect, but it’s a step on the way towards the global data graph.

85 Leave a comment on paragraph 85 0 The technical challenges of implementing this vision are formidable. Authoring tools are needed, along with well designed repositories; a whole infrastructure of services and tools for pushing information between them. In this context the announcement of the Google Wave protocol is a very interesting development. The functionality that is described has an enormous potential to make the implementation of many of these tools much easier. The proof of this will be in the development of useful functionality within this platform. Google has the brand awareness and expertise to make such a revolution in online communication technology possible. They have set the agenda and it will be a very interesting story to follow.

86 Leave a comment on paragraph 86 0 The view of the laboratory record as a distributed set of objects on the open web is closely linked with the agenda of the Open Research Movement. The presumption is that the gains made by wiring your own ideas, samples, and results into the wider information graph, far outweigh the losses. Only limited gains could be made by adopting this architecture but keeping the content closed off from the wider web. Thus its adoption and development depends closely on the users view of the future of scientific communication. If you accept the vision of open communication and the idea that it will make your own research more competitive then this is a path to follow. The web was built for the sharing of data amongst research scientists. We are only just learning how to do that effectively and efficiently.


  1. Girone M: Cern database services for the lhc computing grid.J Phys: Conf Ser 2008, 119:052017. Publisher Full Text OpenURL
  2. Batley J, Edwards D: Genome sequence data: management, storage, and visualization.BioTechniques 2009, 46:333-336. PubMed Abstract | Publisher Full Text OpenURL
  3. Taylor CF, Field D, Sansone S-A, Aerts J, Apweiler R, Ashburner M, Ball CA, Binz P-A, Bogue M, Booth T, Brazma A, Brinkman RR, Michael Deutsch EW, Fiehn O, Fostel J, Ghazal P, Gibson F, Gray T, Grimes G, Hancock JM, Hardy NW, Hermjakob H, Julian RK, Kane M, Kettner C, Kinsinger C, Kolker E, Kuiper M, Novere NL, Leebens-Mack J, Lewis SE, Lord P, Mallon A-M, Marthandan N, Masuya H, Mcnally R, Mehrle A, Morrison N, Orchard S, Quackenbush J, Reecy JM, Robertson DG, Rocca-Serra P, Rodriguez H, Rosenfelder H, Santoyo-Lopez J, Scheuermann RH, Schober D, Smith B, Snape J, Stoeckert CJ, Tipton K, Sterk P, Untergasser A, Vandesompele J, Wiemann S: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project.Nat Biotech 2008, 26:889-896. Publisher Full Text OpenURL
  4. Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, Goldberg LJ, Eilbeck K, Ireland A, Mungall CJ, Leontis N, Rocca-Serra P, Ruttenberg A, Sansone S-A, Scheuermann RH, Shah N, Whetzel PL, Lewis S: The OBO foundry: coordinated evolution of ontologies to support biomedical data integration.Nat Biotech 2007, 25:1251-1255. Publisher Full Text OpenURL
  5. Lyman P, Varian HL: How Much Information. [http://www.sims.berkeley.edu/how-much-info-2003] webcite2003.

    on 2 September 2009

  6. Reproducible Research [http://reproducibleresearch.net] webcite
  7. Pedersen T: Empiricism is not a matter of faith.Computational Linguistics 2008, 34:465-470. Publisher Full Text OpenURL
  8. Jones AR, Miller M, Aebersold R, Apweiler R, Ball CA, Brazma A, Degreef J, Hardy N, Hermjakob H, Hubbard SJ, Hussey P, Igra M, Jenkins H, Julian RK, Laursen K, Oliver SG, Paton NW, Sansone S-A, Sarkans U, Stoeckert CJ, Taylor CF, Whetzel PL, White JA, Spellman P, Pizarro A: The functional genomics experiment model (FUGE): an extensible framework for standards in functional genomics.Nat Biotech 2007, 25:1127-1133. Publisher Full Text OpenURL
  9. King RD, Rowland J, Oliver SG, Young M, Aubrey W, Byrne E, Liakata M, Markham M, Pir P, Soldatova LN, Sparkes A, Whelan KE, Clare A: The automation of science.Science 2009, 324:85-89. PubMed Abstract | Publisher Full Text OpenURL
  10. Shadbolt N, Berners-Lee T, Hall W, May : The semantic web revisited.IEEE Intelligent Systems 2000, 21(3):96-101. Publisher Full Text OpenURL
  11. The atom publishing protocol [http://bitworking.org/projects/atom/rfc5023.html] webcite
  12. Atom (standard) [http://en.wikipedia.org/wiki/Atom_(standard)] webcite
  13. Zemanta [http://www.zemanta.com] webcite
  14. Sefton P: The integrated content environment. [http://eprints.usq.edu.au/697/] webciteAusWeb06: The 12th Australasian World Wide Web Conference: 1-5 July 2006 OpenURL
  15. Downing J, Sefton P: ICE-TheOREM – End to End Semantically Aware eResearch Infrastructure for Theses. [http://eprints.usq.edu.au/5248/] webciteOpen Repositories 2009. OpenURL
  16. Cool URIs don’t change [http://www.w3.org/Provider/Style/URI] webcite
  17. Murray-Rust P, Rzepa HS: The next big thing: From hypermedia to datuments.[http://journals.tdl.org/jodi/article/view/130/128] webciteJournal of Digital Information 2004., 5(1) OpenURL
  18. Bradley JC, Guha R, Lang A, Lindenbaum P, Neylon C, Williams A, Willighagen E: Beautifying Data in the Real World.In Beautiful Data Edited by Segaran T, Hammerbacher J. Sebastopol, O’Reilly. 2009. OpenURL
  19. Google Wave [http://wave.google.com/] webcite
  20. Google Wave API Overview [http://code.google.com/apis/wave/guide.html] webcite
  21. Shibboleth [http://shibboleth.internet2.edu/] webcite
  22. OpenID Foundation Website [http://openid.net/] webcite
  23. OAuth – An open protocol to allow secure API authorization in a simple and standard method from desktop and web applications [http://oauth.net/] webcite
  24. Murata M, Tozawa A, Kudo M, Hada S: XML access control using static analysis.Proceedings of the 10th ACM conference on Computer and communications security 2003.Publisher Full Text OpenURL
  25. Fundulaki I, Marx M: Specifying access control policies for XML documents with XPath.Proceedings of the ninth ACM symposium on Access control models and technologies 2004.Publisher Full Text OpenURL

88 Leave a comment on paragraph 88 0  

89 Leave a comment on paragraph 89 0 Notes

90 Leave a comment on paragraph 90 0 This piece combines two pieces of writing. The first is derived from a paper published in the journal Automated Experimentation in 2009 entitled “Head in the Clouds: Re-imagining the experimental laboratory record for the web-based, networked world“. The second is a blog post from 2014 entitled “Data capture for the real world“. This is inserted as the implementation section. It replaces a section from the original paper in which Google Wave was proposed as a promising solution to the technical issues. My excitement about Google Wave as a technology is well captured by another blog post

Page 46

Source: http://book-shaped-object.cameronneylon.net/wp/discarded/futures-of-scholarly-communication/re-imagining-the-laboratory-record/