Leave a comment on paragraph 1 0
For the 2009 Open Science workshop at the Pacific Symposium on Biocomputing I wrote a very long essay as an introductory paper. It turned out that this was far too long for the space available so an extremely shortened version was submitted for the symposium proceedings. I originally posted the full length essay in instalments in October 2008.
¶ 3 Leave a comment on paragraph 3 0 Openness is arguably the great strength of the scientific method. At its core is the principle that claims and the data that support them are placed before the community for examination and critique. Through open examination and critical analysis models can be refined, improved, or rejected. Conflicting data can be compared and the underlying experiments and methodology investigated to identify which, if any, is more reliable. While individuals may not always adhere to the highest standards, the community mechanisms of review, criticism, and integration have proved effective in developing coherent and useful models of the physical world around us. As Lee Smolin of the Perimeter Institute for Theoretical Physics recently put it, “we argue in good faith from shared evidence to shared conclusions“. It is an open approach that drives science towards an understanding which, while never perfect, nevertheless enables the development of sophisticated technologies with practical applications.
¶ 4 Leave a comment on paragraph 4 0 The Internet and the World Wide Web provide the technical ability to share a much wider range of both the evidence and the argument and conclusions that drive modern research. Data, methodology, and interpretation can also be made available online at lower costs and with lower barriers to access than has traditionally been the case. Along with the ability to share and distribute traditional scientific literature, these new technologies also offer the potential for new approaches. Wikis and blogs enable geographically and temporally widespread collaborations, the traditional journal club can now span continents with online book marking tools such as Connotea and CiteULike, and the smallest details of what is happening in a laboratory (or on Mars ) can be shared via instant messaging applications such as Twitter.
¶ 5 Leave a comment on paragraph 5 0 The potential of online tools to revolutionise scientific communication and their ability to open up the details of the scientific enterprise so that a wider range of people can participate is clear. In practice, however, the reality has fallen far behind the potential. This is in part due to a need for tools that are specifically designed with scientific workflows in mind, partly due to the inertia of infrastructure providers with pre-Internet business models such as the traditional “subscriber pays” print literature and, to some extent, research funders. However it is predominantly due to cultural and social barriers within the scientific community. The prevailing culture of academic scientific research is one of possession – where control over data, methodological secrets, and exploitation of results are paramount. The tradition of Mertonian Science has receded, in some cases, so far that principled attempts to reframe an ethical view of modern science can seem charmingly naive.
¶ 6 Leave a comment on paragraph 6 0 It is in the context of these challenges that the movement advocating more openness in science must be seen. There will always be places where complete openness is not appropriate, such as where personal patient records may be identifiable, where research is likely to lead to patentable (and patent-worthy) results, or where the safety or privacy of environments, study subjects, or researchers might be compromised. These, however are special instances for which exceptional cases can be made, and not the general case across the whole of global research effort. Significant steps forward such as funder and institutional pre-print deposition mandates and the adoption of data sharing policies by UK Research Councils must be balanced against the legal and legislative attempts to overturn the NIH mandate and widespread confusion over what standards of data sharing are actually required and how they will be judged and enforced. Nonetheless there is a growing community interested in adopting more open practices in their research, and increasingly this community is developing as a strong voice in discussions of science policy, funding, and publication. The aim of this workshop is to strengthen this voice by focusing the attention of the community on areas requiring technical development, the development and implementation of standards, both technical and social, and identification and celebration of success.
¶ 8 Leave a comment on paragraph 8 0 The case for taxpayer access to the taxpayer funded peer reviewed literature was made personally and directly in Jonathon Eisen’s first editorial for PLoS Biology .
¶ 9 Leave a comment on paragraph 9 0 […describing the submission of a paper to PLoS Biology as an ‘experiment’…] But then, while finalizing the paper, a two-month-long medical nightmare ensued that eventually ended in the stillbirth of my first child. While my wife and I struggled with medical mistakes and negligence, we felt the need to take charge and figure out for ourselves what the right medical care should be. And this is when I experienced the horror of closed-access publishing. For unlike my colleagues at major research universities that have subscriptions to all journals, I worked at a 300-person nonprofit research institute with a small library. So there I was—a scientist and a taxpayer—desperate to read the results of work that I helped pay for and work that might give me more knowledge than possessed by our doctors. And yet either I could not get the papers or I had to pay to read them without knowing if they would be helpful. After we lost our son, I vowed to never publish in non-OA journals if I was in control. […]
¶ 11 Leave a comment on paragraph 11 0 As a scientist in a small institution he was unable to access the general medical literature. More generally, as a US taxpayer he was unable to access the outputs of US government funded research or indeed of research funded by the governments of other countries. The general case for enabling access of both the general public, scientists in less well funded institutions, and in the developing world has been accepted by most in principle. While there are continuing actions being taken to limit the action of the NIH mandate by US publishers a wide range of research institutions have adopted deposition mandates. There remains much discussion about routes to open access with the debate over ‘Green’ and ‘Gold’ routes continuing as well as an energetic ongoing debate about the stability and viability of the business models of various open access journals. However it seems unlikely that the gradual increase in number and impact of open access journals is likely to slow or stop soon. The principle that the scientific literature should be available to all has been won. The question of how best to achieve that remains a matter of debate.
¶ 12 Leave a comment on paragraph 12 0 A similar case to that for access to the published literature can also be made for research data. At the extremes, withholding data could lead to preventable deaths or severely reduced quality of life for patients. Andrew Vickers, in a hard hitting New York Times essay  dissected the reasons that medical scientists give for not making data from clinical cancer trials available; data that could, in aggregate, provide valuable insights into enhancing patient survival time and quality of life. He quotes work by John Kirwan (Bristol University) showing that three quarters of researchers in one survey opposed sharing data from clinical trials. While there may be specific reasons for retaining specific types of data from clinical trials, particularly in small specialised cases where maintaining the privacy of participants is difficult or impossible, it seems unarguable that the interests of patients and the public demand that such data be available for re-use and analysis. This is particularly the case where the taxpayer has funded these trials, but for other funders, including industrial funders, there is a public interest argument for making clinical trial data public in particular.
¶ 13 Leave a comment on paragraph 13 0 In other fields the case for data sharing may seem less clear cut. There is little obvious damage done to the general public by not making the details of research available. However, while the argument is more subtle, it is similar to that for clinical data. There the argument is that reanalysis and aggregation can lead to new insights with an impact on patient care. In non-clinical sciences this aggregation and re-analysis leads to new insights, more effective analysis, and indeed new types of analysis. The massive expansion in the scale and ambition of biological sciences over the past twenty years is largely due to the availability of biological sequence, structural, and functional data in international and freely available archives. Indeed the entire field of bioinformatics is predicated on the availability of this data. There is a strong argument to be made that the failure of the chemical sciences to achieve a similar revolution is due to the lack of such publicly available data. Bioinformatics is a highly active and widely practiced field of science. By comparison, chemoinformatics is marginalised, and, what is most galling to those who care for the future of chemistry, primarily driven by the needs and desires of biological scientists. Chemists for the most part haven’t grasped the need because the availability of data is not part of their culture.
¶ 15 Leave a comment on paragraph 15 0 The final piece of the puzzle, and in many ways the most socially and technically challenging is the sharing of research procedures. Data has no value in and of itself unless the process used to generate it is appropriate and reliable. Disputes over the validity of claims are rarely based on the data themselves but on the procedures used either to collect them or those used to process and analyse them. A widely reported recent case turned on the details of how a protein was purified; whether with a step or gradual gradient elution. This detail of procedure led laboratories to differing results, a year of wasted time for one researcher, and ultimately the retraction of several high profile papers [refs – nature feature, retractions, original paper etc]. Experimental scientists generally imagine that in the computational sciences where a much higher level of reproducibility and the ready availability of code and subversion repositories makes sharing and documenting material relatively straightforward, would have much higher standards. However, a recent paper  by Ted Pedersen (University of Minnesota, Duluth) – with the wonderful title ‘Empiricism is not a matter of faith’ – criticized the standards of both code documentation and availability. He makes the case that working with the assumption that you will make the tools available to others not only allows you to develop better tools, and makes you popular in the community, but also improves the quality of your own work.
¶ 16 Leave a comment on paragraph 16 0 And this really is the crux of the matter. If the central principle of the scientific method is open analysis and criticism of claims then making the data and process and conclusions avalable and accessible is just doing good science. While we may argue about the timing of release or the details of ‘how raw’ available data needs to be or the file formats or ontologies used to describe it there can be no argument that if the scientific record is to have value it must rest on an accessible body of relevant evidence. Scientists were doing mashups long before the term was invented; mixing data from more than one source; reprocessing it to provide a different view. The potential of online tools to help to do this better is massive, but the utility of these tools depends on the sharing of data, workflows, ideas, and opinions.
¶ 17 Leave a comment on paragraph 17 0 There are broadly three areas for development that are required to enable the more widespread adoption of open practice by research scientists. The first is the development of tools that are designed for scientists. While many of the general purpose tools and services have been adopted by researchers there are many cases where specialised design or adaptation is required for the specific needs of a research environment. In some cases the needs of research willpush development in specific areas, such as controlled vocabularies, beyond what is being done in the mainstream. The second, and most important area involves the social and cultural barriers within various research communities.These vary widely in type and importance across different fields and understanding and overcoming the fears as well as challenging entrenched interests will be an important part of the open science programme. Finally, there is a value and a need to provide top-down guidance in the form of policies and standards. The vagueness of the term ‘Open Science’ means that while it is a good banner there is a potential for confusion. Standards, policies, and brands can provide clarity for researchers, a clear articulation of aspirations (and a guide to the technical steps required to achieve them), and the support required to help people actually make this happen in their own research.
¶ 19 Leave a comment on paragraph 19 0 It is the rapid expansion and development of tools that are loosely categorised under the banner of ‘Web2.0′ or ‘Read-write web’ that makes the sharing of research material available. Many of the generic tools, particularly those that provide general document authoring capabilities, have been adopted and used by a wide range of researchers. Online office tools can enable collaborative development of papers and proposals without the need for emailing documents to multiple recipients and the resultant headaches associated with which version is which. Storing spreadsheets, databases, or data online means that collaborators have easy access to the most recent versions and can see how these are changing. More generally the use of RSS feed readers and bookmarking sites to share papers of interest and, to some extent, to distribute the task of triaging the literature are catching in in some communities. The use of microblogging platforms such as Twitter and aggregation and conversational tools such as Friendfeed have recently been used very effectively to provide coverage of conferences in progress, including collaborative note-taking. In combination with streamed or recorded video as well as screencasts and sharing of presentations online the idea of a dstributed conference, while not an everyday reality, is becoming feasible.
¶ 20 Leave a comment on paragraph 20 0 However it is often the case that,while useful, generic web based services do not provide desired functionality or do not fit well into the existing workflows of researchers. Here there is the opportunity, and sometime necessity, to build specialised or adapated tools. Collaborative preparation of papers is a good example of this. Conventional web bookmarking services, such as del.icio.us provide a great way of sharing the literature or resources that a paper builds on with other authors but they do not automatically capture and recognise the necessary metadata associated with published papers (journal, date, author, volume, page numbers). Specialised services such as citeulke and Connotea have been developed to enable one click bookmarking of published literature and these have been used effectively by for example using a specific tag for references associated with a specific paper in progress. The problem with these services as they exist at the moment is that they don’t provide the crucial element in the workflow that scientists want to aggregate the references for, the formatting of the references in the finalised paper. Indeed the lack of formatting functionality in GoogleDocs, the most widely used collaborative writing tool, means that in practice the finalised document is usually cut and pasted into Word and the references formatted using proprietary software such as Endnote.The available tools do not provide the required functionality.
¶ 21 Leave a comment on paragraph 21 0 A number of groups and organisations have investigated the use of Blogs and Wikis as collaborative and shareable laboratory notebooks. However few of these systems offer good functionality ‘out of the box’. While there are many electronic laboratory notebook systems sold by commercial interests most are actually designed around securing data rather than sharing it so are not of interesthere. While the group of Jean-Claude Bradley has used the freely hosted WikiSpaces as a laboratory notebook without further modification, much of the data and analysis is hosted on other services, including YouTube, FlickR, and GoogleDocs. The OpenWetWare group has made extensive modifications to the MediaWiki system to provide laboratory notebook functionality whereas Garret Lisi has adapted the TiddlyWiki framework as a way of presenting his notebook. The Chemtools collaboration at the University of Southampton has developed a specialised Blog platform . Commercial offerings in the area of web based lab notebooks are also starting to appear. All of these different systems have developed because of the specialised needs of recording the laboratory work of the scientists they were designed for. The different systems make different assumptions about where they fit in the workflow of the research scientist, and what that workflow looks like. They are all, however, built around the idea that they need to satisfy the needs of the user.
¶ 22 Leave a comment on paragraph 22 0 This creates a tension in tool building. General tools, that can be used across a range of disciplines, are extremely challenging to design, because workflows, and the perception of how they work, are different in different disciplines. Specialist tools can be built for specific fields but often struggle to translate into new areas. Because the market is small in any field the natural desire for designers is to make tools as general as possible. However in the process of trying to build for a sufficiently general workflow it is often the case that applicability to specific workflows is lost. There is a strong argument based on this for building interoperable modules, rather than complete systems, that will allow domain specialists to stich together specific solutions for specific fields or even specific experiments. Interoperability of systems and standards that enable it is a criteria that is sometimes lost in the development process, but is absolutely essential to making tools and processes shareable. The use of workflow management tools, such as Taverna, Kepler, and VisTrails have an important role to play here.
¶ 23 Leave a comment on paragraph 23 0 While not yet at a stage where they are widely configurable by end users the vision behind them has the potential both to make data analysis much more straightforward for experimental scientist but also to solve many of the problems involved in sharing process, as opposed to data. The idea of visually wiring up online or local analysis tools to enable data processing pipelines is compelling. The reason most experimental scientists use spreadsheets for data analysis is that they do not wish to learn programming languages. Providing visual programming tools along with services with clearly defined inputs and outputs will make it possible for a much wider range of scientists to use more sophisticated and poweful analysis tools. What is more the ability to share, version, and attribute, workflows will go some significant distance towards solving the problem of sharing process. Services like MyExperiment which provide an environment for sharing and versioning Taverna workflows provide a natural way of sharing the details of exactly how a specific analysis is carried out. Along with an electronic notebook to record each specific use of a given workflow or analysis procedure (which can be achieved automatically though an API) the full details of the raw data, analysis procedure, and any specific parameters used, can be recorded. This combination offers a potential route out of the serious problem of sharing research processes if the appropriate support infrastructure can be built up.
¶ 24 Leave a comment on paragraph 24 0 Also critical to successful sharing is a shared language or vocabulary. The development of ontologies, controlled vocabularies, and design standards are all important in sharing knowledge and crucial to achieving the ulitmate goals of making this knowledge machine readable. While there are divisions in the technical development and user communities over the development and use of controlled vocabularies there is little disagreement over the fact that good vocabularies combined with good tools are useful. The disagreements tend to lie in how they are best developed, when they should be applied, and whether they are superior to or complementary to other approaches such as text mining and social tagging. An integrated and mixed approach to the use of controlled vocabularies and standards is the most likely to be successful. In particular it is important to match the degree of structure in the description to the natural degree of structure in the object or objects being described. Highly structured and consistent data types, such as crystal structures and DNA sequences, can benefit greatly from highly structured descriptions which are relatively straightforward to create, and in many cases are the standard outputs of an analysis process. For large scale experimental efforts the scale of the data and sample management problem makes an investment in detailed and structured desriptions worth while. In a small laboratory doing unique work, however, there may be a strong case for using local descriptions and vocabularies that are less rigorous but easier to apply and able to grow to fit the changing situation on the ground. Ideally designed in such a way that mapping onto an external vocabulary is feasible if it is required or useful in the future.
¶ 25 Leave a comment on paragraph 25 0 Making all of this work requires that researchers adopt these tools and that a community develops that is big enough to provide the added value that these tools might deliver. For a broad enough community to adopt these approaches the tools must fit well in their existing workflow and help to deliver the things that researchers are already motivated to produce. For most researchers, published papers are the measure of their career success and the basis of their reward structures. Therefore tools that make it easier to write papers, or that help researchers to write better papers, are likely to get traction. As the expectations of the quality and completeness of supporting data increase for published papers, tools that make it easier for the researcher to collate and curate the record of their research will become important. It is the process of linking the record of what happened in the laboratory, or study, to the first pass intepretation and analysis of data, through further rounds of analysis until a completed version is submitted for review, that is currently poorly supported by available tools, and it is this need that will drive the development of improved tools. These tools will enable the disparate elements of the record of research, currently scattered between paper notebooks, various data files on multiple hard drives, and unconnected electronic documents, to be chained together. Once this record is primarily electronic, and probably stored online in a web based system, the choice to make the record public at any stage from the moment the record is made to the point of publication, will be available. The reason to link this to publication is to tie it into an existing workflow in the first instance. Once the idea is embedded the steps involved in making the record even more open are easily taken.
¶ 27 Leave a comment on paragraph 27 0 Scientists are inherently rather conservative in their adoption of new approaches and tools. A conservative approach has served the community well in the process of sifting ideas and claims; this approach is well summarised by the aphorism ‘extraordinary claims require extraordinary evidence’. New methodologies and tools often struggle to be accepted until the evidence of their superiority is overwhelming. It is therefore unreasonable to expect the rapid adoption of new web based tools and even more unreasonable to expect scientsits to change their overall approach to their research en masse. The experience of adoption of new Open Access journals is a good example of this.
¶ 28 Leave a comment on paragraph 28 0 Recent studies have shown that scientists are, in principle, in favour of publishing in Open Access journals yet show marked reluctance to publish in such journals in practice . The most obvious reason for this is the perceived cost. Because most operating Open Access publishers charge a publication fee, and until recently such charges were not allowable costs for many research funders, it can be challenging for researchers to obtain the necessary funds. Although most OA publishers will waive these charges there is anecdotally a marked reluctance to ask for such a waiver. Other reasons for not submitting papers to OA journals include the perception that most OA journals are low impact and a lack of OA journals in specific fields. Finally, simple inertia can be a factor where the traditional publication outlets for a specific field are well defined and publishing outside the set of ‘standard’ journals runs the risk of the work simply not being seen by peers. As there is no perception of a reward for publishing in open access journals, and a perception of significant risk, uptake remains relatively small.
¶ 29 Leave a comment on paragraph 29 0 Making data available faces similar challenges but here they are more profound. At least when publishing in an open access journal it can be counted as a paper. Because there is no culture of citing primary data, but rather of citing the papers they are reported in, there is no reward for making data available. If careers are measured in papers published then making data available does not contribute to career development. Data availability to date has generally been driven by strong community norms, usually backed up by journal submission requirements. Again this links data publication to paper publication without necessarily encouraging the release of data that is not explicitly linked to a peer reviewed paper. The large scale DNA sequencing and astronomy facilities stand out as cases where data is automatically made available as it is taken. In both cases this policy is driven largely by the funders, or facility providers, who are in position to make release a condition of funding the data collection. This is not, however a policy that has been adopted by other facilities such as synchrotrons, neutron sources, or high power photon sources.
¶ 30 Leave a comment on paragraph 30 0 In other fields where data is more heterogeneous and particular where competition to publish is fierce, the idea of data availability raises many fears. The primary one is of being ‘scooped’ or data theft where others publish a paper before the data collector has had the ability to fully analyse the data. This again is partly answered by robust data citation standards but this does not prevent another group publishing an analysis quicker, potentially damaging the career or graduation prospects of the data collector. A principle of ‘first right to publish’ is often suggested. Other approaches include timed embargoes for re-use or release. All of these have advantages and disadvantages which depend to a large extent on how well behaved members of a specific field are. Another significant concern is that the release of substandard, non peer-reviewed, or simply innaccurate data into the public domain will lead to further problems of media hype and public misunderstanding. This must be balanced against the potential public good of having relevant research data available.
¶ 31 Leave a comment on paragraph 31 0 The community, or more accurately communities, in general, are waiting for evidence of benefits before adopting either open access publication or open data policies. This actually provides the opportunity for individuals and groups to take first mover advantages. While remaining controversial [8, 9] there is some evidence that publication in open access journals leads to higher citation counts for papers [10, 11] and that papers for which the supporting data is available receive more citations . This advantage is likely to be at its greatest early in the adoption curve and will clearly disappear if these approaches become widespread. There are therefore clear advantages to be had in rapidly adopting more open approaches to research which can be balanced against the risks described above.
¶ 32 Leave a comment on paragraph 32 0 Measuring success in the application of open approaches and particularly quantifying success relative to traditional approaches is a challenge, as is demonstrated by the continuing controversy over the citation advantage of open access articles. However pointing to examples of success is relatively straightforward. In fact Open Science has a clear public relations advantage as the examples are out in the open for anyone to see. This exposure can be both good and bad but it makes publicising best practice easy. In many ways the biggest successes of open practice are the ones that we miss because they are right in front of us, the global databases of freely accessible data in biological databases such as the Protein Data Bank, NCBI, and many others that have driven the massive advances in biological sciences over the past 20 years. The ability to analyse and consider the implications of genome scale DNA sequence data, as it is being generated, is now
¶ 33 Leave a comment on paragraph 33 0 In the physical sciences, the arXiv has long stood as an example to other disciplines of how the research literature can be made available in an effective and rapid manner, and the availability of astronomical data from efforts such as the Sloan Digital Sky Survey make efforts combining public outreach and the crowdsourcing of data analysis such as Galaxy Zoo possible. There is likely to be a massive expansion in the availability of environmental and ecological data globally as the potential to combine millions of data gatherers holding mobile phones, and sophisticated data aggregation and manipulation tools is realised.
¶ 34 Leave a comment on paragraph 34 0 Closer to the bleeding edge of radical sharing there have been less high profile successes, a reflection both of the limited amount of time these approaches have been pursued and the limited financial and personnel resources that have been available. Nonetheless there are examples. Garret Lisi’s high profile preprint on the ArXiv, An exceptionally simple theory of everything,  is supported by a comprehensive online notebook at http://deferentialgeometry.org that contains all the arguments as well as the background detail and definitions that support the paper. The announcement by Jean-Claude Bradley of the successful identification of several compounds with activity against malaria  is an example where the whole research process was carried out in the open, from the decision on what the research target should be, through the design and in silico testing of a library of chemicals, to the synthesis and testing of those compounds. For every step of this process the data is available online and several of the collaborators that made the study possible made contact due to finding that material online. The potential for a coordinated global synthesis and screening effort is currently being investigated.
¶ 35 Leave a comment on paragraph 35 0 There are both benefits and risks associated with open practice in research. Often the discussion with researchers is focussed on the disadvantages and risks. In an inherently conservative pursuit it is perfectly valid to ask whether changes of the type and magnitude offer any benefits given the potential risks they pose. These are not concerns that should be dismissed or ridiculed, but ones that should be taken seriously, and considered. Radical change never comes without casualties, and while some concerns may be misplaced, or overblowm, there are many that have real potential consequences. In a competitive field people will necessarily make diverse decisions on the best way forward for them. What is important is providing as good information to them as is possible to help them balance the risks and benefits of any approach they choose to take.
¶ 37 Leave a comment on paragraph 37 0 A question that needs to be asked when contemplating any major change in practice is the balance and timing of ‘bottom up’ versus ‘top-down’ approaches for achieving that change. Scientists are notoriously un-responsive to decrees and policy initiatives but as has been discussed they are also inherently conservative and generally resistant to change led from within the community as well. For those advocating the widespread, and ideally rapid, adoption of more open practice in science it will be important to strike the right balance between calling for mandates and conditions for funding or journal submission and of simply adopting these practices in their own work. While the motivation behind the adoption of data sharing policies by funders such as the UK research councils is to be applauded it is possible for such intiatives to be counterproductive if the policies are not supported by infrastructure development, appropriate funding, and appropriate enforcement. Equally, standards and policy statements can send a powerful message on the aspirations of funders to make the research they fund more widely available and, for the most part, when funders speak, scientists listen.
¶ 39 Leave a comment on paragraph 39 0 There are two broad approaches to standards that are currently being discussed. The first of these is aimed at mainstream acceptance and uptake and can be described as ‘The fully supported paper’. This is a concept that is simple on the surface but very complex to implement in practice. In essence it is the idea that the claims made in a peer reviewed paper in the conventional literature should be fully supported by a publically accessible record of all the background data, methodology, and data analysis procedures that contribute to those claims. On one level this is only a slightly increased in requirements from the Brussels Declaration made by the Internaional Association of Scientific, Technical, and Medical Publishers in 2007 which states;
¶ 40 Leave a comment on paragraph 40 0 Raw research data should be made freely available to all researchers. Publishers encourage the public posting of the raw data outputs of research. Sets or sub-sets of data that are submitted with a paper to a journal should wherever possible be made freely accessible to other scholars
¶ 42 Leave a comment on paragraph 42 0 The degree to which this declaration is supported by publishers and the level to which different journals require their authors to adhere to it is a matter for debate but the principle of availability of background data has been accepted by a broad range of publishers. It is therefore reasonable to consider the possibility of making the public posting of data as a requirement for submission. At a simple level this is already possible. For specific types of data repositories already exist and in many cases most journals require submission of these data types to recognised respositories. More generally it is possible to host data sets in some institutional repositories and with the expected announcement of a large scale data hosting service from Google the argument that this is not practicable is becoming unsustainable. While such datasets may have limited discoverability and limited metadata, they will at least be discoverable from the papers that reference them. It is reasonable to expect sufficent context to be provided in the published paper to make the data useable.
¶ 43 Leave a comment on paragraph 43 0 However the data itself, except in specific cases, is not enough to be useful to other researchers. The detail of how that data was collected and how it was processed are critical for making a proper analysis of whether the claims made in a paper to be properly judged. Once again we come to the problem of recording the process of research and then presenting that in a form which is both detailed enough to be widely useful but not so dense as to be impenetrable. The technical challenges of delivering a fully supported paper are substantial. However it is difficult to argue that this shouldn’t be available. If claims made in the scientific literature cannot be fully verified can they be regarded as scientific? Once again – while the target is challenging – it is simply a proposal to do good science, properly communicated.
¶ 45 Leave a comment on paragraph 45 0 While the fully supported paper would be a massive social and technical step forward it in many ways is no more open than the current system. It does not deal with the problem of unpublished or unsuccessful studies that may never find a home in a traditional peer reviewed paper. As discussed above the ‘fully supported paper’ is not really ‘open science’; it is just good science. What then are the requirements, or standards for ‘open science’. Does there need to be a certificate or a set of requirements that need to be met before a project, individual, or institution can claim they are doing Open Science. Or is Open Science simply too generic and prone to misinterpretation?
¶ 46 Leave a comment on paragraph 46 0 I would argue that while ‘Open Science’ is a very generic term it has real value as a rallying point or banner. It is a term which generates significant positive reaction amongst the general public, the mainstream media, and large sections of the research community. Its very vagueness also allows some flexibility making it possible to welcome contributions from publishers, scientists, and funders which while not 100% open are nonetheless positive and helpful. Within this broad umbrella it is then possible to look at defining or recomending practices and standards and giving these specific labels for identification.
¶ 47 Leave a comment on paragraph 47 0 The main work in the area of defining relevant practices and standards has been carried out by Science Commons and the Open Knowledge Foundation. Science Commons have published four ‘Principles for Open Science‘ which focus on the availability and accessiblity of published literature, research tools, and data, and the development of cyberinfrastructure to make this possible. These four principles currently do no explicitly include the availability of process, which has been covered in detail above, but provide a clear set of criteria which could form the basis of standards. Broadly speaking research projects, individuals, or institutions that deliver on these principles could be said to be doing Open Science. The Open Knowledge Definiton, developed by the Open Knowledge Foundation, is another useful touchstone here. Another possible defining criterion for Open Science is that all the relevant material is made available under licenses that adhere to the definition.
¶ 48 Leave a comment on paragraph 48 0 The devil, naturally, lies in the details. Are embargoes on data and methodology appropriate, and if so, in what fields and how should they be constructed? For data that cannot be released should specific exceptions be made, or special arrangments made to hold data in secure repositories? Where the same group is doing open and commercial research how should the divisions between these projects be defined and declared? These details are important, and will take time to work out. In the short term it is therefore probably more effective to identify and celebrate examples of open science, define best practice and observe how it works (and does not work) in the real world. This will raise the profile of Open Science without making it immediately an exclusive preserve of those with the luxury of radically changing practice. It enables examples of best practice to be held up as aspirational standards, providing the goals for others to work towards, and the impetus for the tool and infrastructure development that will support them. Many government funders are starting to introduce data sharing mandates, generally with very weak wording, but in most cases these refer to the expectation that funded research will adhere to the standard of ‘best practice’ in the relevant field. At this stage of development it may be more productive to drive adoption throgh the strategic support of improving best practice in a wide range of fields than to attempt to define strict standards.
¶ 49 Leave a comment on paragraph 49 0 The community advocating more open practice in scientific research is growing in size and influence. The major progress made in the past 12-18 months by the Open Access movement and the development of deposition and data sharing mandates by a range of research funders show that real progress is being made in increasing access to both the finished products of research and the materials that support them. While there have been significant successes this remains a delicate moment. There is a risk of over enthusiasm driving expectations which cannot be delivered and of alienating the mainstream community that we wish to draw in. The fears and concerns of researchers in widening access to their work need to be addressed sensitively and seriously, pointing out the benefits but also acknowledging the risks involved in adopting these practices.
¶ 50 Leave a comment on paragraph 50 0 It will not be enough to develop tools and infrastructure that, if adopted, would revolutionize science communication. Those tools must be built with an understanding of how scientists work today, and with the explicit aim of embedding these tools in existing workflows. The need for, and the benefits of, adopting controlled vocabularies needs to be sold much more effectively to the mainstream scientific community. The ontologies community also needs to recognise that there are cases and areas where the use of strict controlled vocabularies is not appropriate. Web 2.0 and Semantic web technologies are not competitors but are complementary approaches that are appropriate in different contexts. Again, the right question to ask is ‘what do scientists do? And what can we do to make that work better?’; not how can we make scientists see they need to do things the ‘right’ way.
¶ 51 Leave a comment on paragraph 51 0 Finally, it is my belief that now is not the time to set out specific and strict standards of what qualifies as Open Science. It is the right time to discuss the details of what these standards might look like. It is the right time to look at examples of best practice; to celebrate these and to see what can be learnt from them, but with our current lack of experience, and lack of knowledge of what the unintended consequences of specific standards might be, it is too early to pin down the details of those standards. It is a good time to be clearly articulating the specific aspirations of the movement, and to provide goals that communities can aggregate around; the fully supported paper, the Science Commons principles, and the Open Knowledge Definition are all useful starting points. Open Science is gathering momentum, and that is a good thing. But equally it is a good time to take stock, identify the best course forward, and make sure that we ar carrying as many people forward with use as we can.
- ¶ 53 Leave a comment on paragraph 53 0
- Smolin L (2008), Science as an ethical community, PIRSA ID#08090035, http://pirsa.org/08090035/
- Mars Phoenix on Twitter, http://twitter.com/MarsPhoenix
- Eisen JA (2008) PLoS Biology 2.0. PLoS Biol 6(2): e48 doi:10.1371/journal.pbio.0060048
- Vickers A (2008), http://www.nytimes.com/2008/01/22/health/views/22essa.html?_r=1
- Pedersen T (2008), Computational Linguistics, Volume 34, Number 3, pp. 465-470, Self archived.
- Warlick S E, Vaughan K T. Factors influencing publication choice: why faculty choose open access.Biomedical Digital Libraries. 2007;4:1-12.
- Bentley D R. Genomic Sequence Information Should Be Released Immediately and Freely. Science. 1996;274(October):533-534.
- Piwowar H A, Day R S, Fridsma D B. Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE. 2007;1(3):e308.
- Davis P M, Lewenstein B V, Simon D H, Booth J G, Connolly M J. Open access publishing, article downloads, and citations: randomised controlled trial. BMJ. 2008;337(October):a568.
- Rapid responses to David et al., http://www.bmj.com/cgi/eletters/337/jul31_1/a568
- Eysenbach G. Citation Advantage of Open Access Articles. PLoS Biology. 2006;4(5):e157.
- Hajjem, C., Harnad, S. and Gingras, Y. (2005) Ten-Year Cross-Disciplinary Comparison of the Growth of Open Access and How it Increases Research Citation Impact. IEEE Data Engineering Bulletin, 28 (4). pp. 39-47. http://eprints.ecs.soton.ac.uk/12906/
- Lisi G, An exceptionally simple theory of everything, arXiv:0711.0770v1 [hep-th], November 2007.
- Bradley J C, We have antimalarial activity!, UsefulChem Blog,http://usefulchem.blogspot.com/2008/01/we-have-anti-malarial-activity.html, January 25 2008.
¶ 55 Leave a comment on paragraph 55 0 As noted at the top this was originally written as the paper to accompany the workshop run by myself and Shirley Wu at the 2009 Pacific Symposium on Biocomputing. The essay was originally posted as a GoogleDoc around August 2008 and published in four parts on the Science in the Open blog (then at OpenWetWare) as “A Personal View of Open Science” [Parts 1, 2, 3 and 4].