They don’t love all of it, just 3/5ths.
There is no problem with ingesting synthetic data. Well, at least none coming from the fact that it is synthetic. If there were a fundamental difference between the 1s and 0s encoding synthetic data and the 1s and 0s encoding any other data, then you could easily filter it. But there isn’t. The ideas that this community has are magical thinking.
How am I supposed to take seriously an article that misuses a basic term like “scraping”?
No. I simply don’t see a plausible scenario for that. The social media comments are quite deplorable. You really have to look for bubbles with educated people. I don’t know why this gets so much traction. Maybe it’s because the copyright industry likes it, or maybe it feeds some psychological need like Intelligent Design.
It depends on what you are looking for. Identifying AI-generated data is generally hard, though it can be done in specific cases. There is no mathematical difference between the 1s and 0s that encode AI-generated data and those that encode any other data. Which is why these model collapse ideas are just fantasy. There is nothing magical about any data that makes it “poisonous” to AI. The kernel of truth behind these ideas is not likely to matter in practice.
hindered.
I doubt that.
Hmm. Per Facebook v. Power Ventures, it could be a (criminal) violation of the CFAA to “circumvent” IP blocks.
https://annas-archive.org/volunteering
Be aware that helping Anna’s Archive may be illegal, or even criminal.
A different, more legal archiving effort is the Archive Team. It focuses on public data on the internet. https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior
In some places without a strong freedom of information tradition (like the EU), this may still be illegal.
The boomers had cars and flexed being able to drive stick or knowing what a carburetor is, unlike those feeble Millennials. They had that greaser subculture. Hmm. I guess that makes the movie Grease the equivalent of War Games or Hackers.
So what is the zoomer thing? What eye-rolling help do they give to doddering old gen-Xers? What will they flex in their old age?
No, you cannot patent an ingredient. What you can do - under Indian law - is get “protection” for a plant variety. In this case, a potato.
That law is called the Protection of Plant Varieties and Farmers’ Rights Act, 2001. The farmer in this case being PepsiCo, which is how they successfully sued these 4 Indian farmers.
Farmers’ Rights for PepsiCo against farmers. Does that seem odd?
I’ve never met an intellectual property freak who didn’t lie through his teeth.
Heh. Funny that this comment is uncontroversial. The Internet Archive supports Fair Use because, of course, it does.
This is from a position paper explicitly endorsed by the IA:
Based on well-established precedent, the ingestion of copyrighted works to create large language models or other AI training databases generally is a fair use.
By
The copyright industry wants money. So, 4 legs good, 2 legs better. It’s depressing to see how easily people are led around by the nose.
But that’s unethical!
Copyright is utterly corrupted. Besides, I believe it is corrosive and outright dangerous in the age of the internet. Every time you open a website or a stream or anything, that is copied to your device. In the age of the printing press, it was about what happened in a few “factories”/printing houses. Libraries were fine because they didn’t copy, but online libraries do. Now, copyright is about all our communications. Total enforcement would mean total surveillance.
So this is not a defense of copyright. It is simply an explanation.
Building products for sale is what US copyright is all about. Think about the copyright clause: To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.
Without copyright, everything would be public domain. Everyone would be free to share any book or movie. That makes it hard to make money, to monetize your product, to recoup your investment. Copyright is supposed to be a way to enable that. It’s supposed to create an incentive to entertain you. If you have to pay for your entertainment, then someone will come along and entertain you to get your money. Piracy is an attack on that system.
If AI companies have to buy licenses, that would not incentivize much of anything. Licensing curated datasets for AI training would be one thing, but paying for individual books or even Reddit posts makes no sense. It would just make development slower and much more expensive. That makes it an unconstitutional use of copyright.
Let’s engage in a little fantasy. Someone invents a magic machine that is able to duplicate apartments, condos, houses, … You want to live in New York? You can copy yourself a penthouse overlooking Central Park for just a few cents. It’s magic. You don’t need space. It’s all in a pocket dimension like the Tardis or whatever. Awesome, right? Of course, not everyone would like that. The owner of that penthouse, for one. Their multi-million dollar investment is suddenly almost worthless. They would certainly demand that you must not copy their property without consent. And so would a lot of people. And what about the poor construction workers, ask the owners of construction companies? And who will pay to have any new house built?
So in this fantasy story, the government goes and bans the magic copy machine. Taxes are raised to create a big new police bureau to monitor the country and to make sure that no one uses such a machine without a license.
That’s turned from magical wish fulfillment into a dystopian story. A society that rejects living in a rent-free wonderland but instead chooses to make itself poor. People work to ensure poverty, not to create wealth.
You get that I’m talking about data, information, knowledge. The first magic machine was the printing press. Now we have computers and the Internet.
I’m not talking about a utopian vision here. Facts, scientific theories, mathematical theorems, … All of that is free for all. Inventors can get patents, but only for 20 years and only if they publish them. They can keep their invention secret and take their chances. But if they want a government-enforced monopoly, they must publish their inventions so that others may learn from them.
In the US, that’s how the Constitution demands it. The copyright clause: [The United States Congress shall have power] To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.
Cutting down on Fair Use makes everyone poorer and only a very few, very rich people richer. Have you ever thought about where the money goes if AI training requires a license?
For example, to Reddit, because Reddit has rights to all those posts. So do Facebook and Xitter. Of course, there’s also old money, like the NYT or Getty. The NYT has the rights to all their old issues going back about a century. If AI training requires a license, they can sell all their old newspapers again. That’s pure profit. Do you think they will give their employees raises out of the pure goodness of their heart if they win their lawsuits? They have no legal or economic reason to do so. The belief that this would happen is trickle-down economics.
This paperwork is required by EU regulation (Digital Services Act - DSA).
It is theoretically possible to be exempted, but I doubt OP has any chance there.
I was just being sarcastic. The article is explicit that there is a copyright organization behind this.
In what country is that?
Under US law, you cannot copyright recipes. You can own a specific text in which you explain the recipe. But anyone can write down the same ingredients and instructions in a different way and own that text.
SMITH created thousands of accounts on the Streaming Platforms (the “Bot Accounts”) that he could use to stream songs. He then used software to cause the Bot Accounts to continuously stream songs that he owned. At a certain point in the charged time period, SMITH estimated that he could use the Bot Accounts to generate approximately 661,440 streams per day, yielding annual royalties of $1,207,128.
From the original press release: https://www.justice.gov/usao-sdny/pr/north-carolina-musician-charged-music-streaming-fraud-aided-artificial-intelligence
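A quick sanity check on the numbers in that quote, as a rough sketch. It assumes the bots streamed every day of a 365-day year, which the excerpt does not actually say, and the function name is just made up for illustration. Under that assumption, the quoted totals work out to about half a cent per stream.

```python
# Figures taken from the press release excerpt quoted above.
STREAMS_PER_DAY = 661_440
ANNUAL_ROYALTIES_USD = 1_207_128

# Assumption (not stated in the excerpt): streaming runs 365 days a year.
DAYS_PER_YEAR = 365


def implied_rate_per_stream(streams_per_day, annual_royalties, days=DAYS_PER_YEAR):
    """Average royalty per stream implied by the quoted daily and annual totals."""
    annual_streams = streams_per_day * days
    return annual_royalties / annual_streams


annual_streams = STREAMS_PER_DAY * DAYS_PER_YEAR
rate = implied_rate_per_stream(STREAMS_PER_DAY, ANNUAL_ROYALTIES_USD)

print(f"{annual_streams:,} streams per year")
print(f"${rate:.4f} per stream")
```

So the headline-grabbing sum is just a tiny per-stream payout multiplied by a very large number of automated plays.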
Kinda funny how the term “AI” drowns out all rational thought and reading comprehension. Of course, that’s why it’s there in the clickbait headline. I avoid news sources that pull that sort of thing. I don’t appreciate being manipulated.
Yes, I shouldn’t bother replying in these threads. In truth, I’ve already given up on this community but sometimes when I’m bored I can’t help a little peek. Maybe in a few years, some of the smarter ones will wonder why nothing ever came of this. Anyway, be careful with those AI detectors. They don’t work and sooner or later someone is going to get in trouble over that.