Comment by geokon

Comment by geokon 11 hours ago

Working in geology, I find the opposite problem. Field work is so highly valued that we're at a place where we have so much data and not enough people really working on and analyzing it. My general impression is that in some subfields, work that's done exclusively using preexisting data is kind of looked down on. In my opinion, tons and tons of money is essentially wasted collecting new data, and then it's poorly catalogued and hard to access. You typically have to email some author and hope they send you the data. People are fiercely protective of their data b/c it took a lot of effort to collect and they want credit and to be in on any derivative work (and not just a reference at the bottom of a paper).

I would say the main workflow is: collect some new data nobody has collected before, look at it and see if it shows anything interesting, then make up some interesting publishable interpretation.

It feels like it'd be smarter to start by working with existing data and publish that way. If you hit on some specific missing piece, go collect that data, and work from there. But the incentive structures aren't aligned with this.

The AI angle is really shoehorned in, and irrelevant to the larger problem. Sure, it allows you to annotate more data. Obviously it's more fun to go do field work than count pollen grains under a microscope. If anything, AI makes it easier to do more fieldwork and collect even more data b/c now you can, in theory, crunch it faster.

bonsai_spool 2 hours ago

This is largely solved in biomedicine by funders (not journals) and regulatory bodies requiring that human subjects research data be stored with NIH.

I guess there may be a broader and less public-oriented set of funders in geology, and maybe there aren’t as many standardized data types as there are in the world of biology.

willtemperley 8 hours ago

Given the way big tech currently plays fast and loose with other people's data, I don't suppose the siloed nature of geological data is going to get better any time soon.

Perhaps the way forward is creating secure private clouds that scientists can access, kept away from AI scrapers etc., with associated counter-surveillance.

I'm a GIS guy working on cloud native tech, but with a focus on privacy. I have a local-first Mac native product nearing beta. I'm thinking a lot about what data sharing options can be at the moment.

  • geokon 8 hours ago

    I don't see what the problem is. AI is mostly irrelevant. Okay, they scrape your data... but then what? If the data isn't officially published and doesn't have a DOI, anything built on it won't be accepted.

    Some people scrape charts in publications to extract data. This has been done for a while. Maybe AI could automate this step. That'd be useful.

    • willtemperley 8 hours ago

      I understand that publications are the currency of academics but they're largely irrelevant in business. Geological data are valuable and if an oil exploration company finds a nice dataset they can scrape, they're not going to publish it.

      From a pure business perspective, AI is largely about copyright circumvention. The laws are lagging and people are making serious money from data theft.

      • fc417fc802 5 hours ago

        Aren't you describing trade secrets? I don't see how AI makes that any better or worse. If your competitor gets his hands on your proprietary dataset you're sunk regardless of AI, right?

        I don't see how copyright enters into it. I doubt that "oh hey I published this very valuable and proprietary dataset online but it's copyright me so pretty please don't use it to make money" was ever going to get you anywhere to begin with.

      • geokon 7 hours ago

        Am I understanding this correctly? If a company is using a competitor's stolen data directly in-house and anyone finds out, they're in legal trouble. But if they train a model on it and then use the model, they're in the clear?

smeeagain2 8 hours ago

Sounds like a business opportunity for someone to create a web portal for making available such data with licensing terms, indexing and cataloging it with a nice search engine, etc.

  • geokon 8 hours ago

    The databases exist, for instance: https://www.pangaea.de/

    What you need is people uploading data in consistent, well-documented formats. There are all sorts of projects that do this, but there is a strong incentive to not upload things, or to sort of half upload them, in a way where anyone using the data is going to have to reach out to you. I'm not suggesting bad intentions. Maybe you're still working with the data, expect to publish more, and don't want someone swooping in and beating you to the punch. Typically journals require data availability, but it's kind of informal and ad hoc.