The Data Transfer Project (datatransferproject.dev)
138 points by l2dy on July 20, 2018 | 50 comments



I was shocked to see the list of partners involved (which is why I assume their logos are big and bold and center).

I'm glad to see them participating, but I wonder what their motivation is? Is it truly genuine?

I know for example that when we made the reddit API people questioned our motives since "the data was everything" and "how can reddit be willing to make the data so accessible", but we knew that in reality "the community is everything".

So I really hope these companies are similarly motivated, knowing that it is the community and their platform that are their true assets.


For a possible motivation, the GDPR confers a right to data portability, and whilst there is guidance[1] that this "does not create an obligation for you to adopt or maintain processing systems which are technically compatible with those of other organisations", this is probably one of the easiest ways of demonstrating compliance.

For the biggest and most widely used services, it definitely would respect the spirit of the regulation more closely than e.g. just giving a json dump, since that's fairly useless to your average user.

[1] https://ico.org.uk/for-organisations/guide-to-the-general-da...


I believe this might be the next big thing in the EU: crafting specific legislation which allows people to "own the data" stored in various services.

The PSD2 directive is already doing this in the banking world. Banks are forced to provide APIs which allow 3rd parties to access customer account information (with permission from the customer).

For companies, it makes sense to be pro-active with these things instead of waiting for the legislator to force them to do stuff. If they are active enough, it might be that no legislation is needed.


> shocked to see the list of partners involved

Google and Microsoft?

Think of these companies' retail successes:

- Google has basically zero retail services that are not designed to support its ads business. Chromebook and Pixel devices are arguable exceptions, but as hardware devices they are irrelevant to this proposal anyway.

- MS has Surface, Windows and Xbox.

Google's moneymaker - Ads - wouldn't be affected by portability since it has no retail customers. MS's products are equally impervious.

Facebook, on the other hand, would be existentially threatened if this achieved ubiquity. If social media becomes federated, the monopoly rates it charges advertisers would fall to competitive levels.

Then there is Microsoft's LinkedIn, and it would be very interesting to be a fly on the wall of the discussions between its execs and the MS execs overseeing this data portability project. Either they don't take the project seriously, or MS thinks they may gain more imports into LinkedIn than exports. Or at least enough to come out even.


LinkedIn portability (NOT delete) is a useful use case to Microsoft because of Teams, VSTS, Outlook, etc.


<snark> I guess this is why Microsoft has a great portable format for word, and moving from/to Word is a piece of cake. </snark>


Word supports ODT.


Their motivation is that they are far, far behind #1 AWS and need to try wild and crazy things to win market share, including acting like they are the more developer-friendly platform because they don't lock you in.


I'm very suspicious of this. The most dangerous thing here IMO would be if it were to allow these companies to share data among themselves, a data cartel so to speak. Currently, from the website (emphasis mine): "enabling a seamless, direct, *user initiated* portability of data". I worry that they might simply remove the "user initiated" part after adoption hits critical mass. I'm now following the development on GitHub [0]...

[0] https://github.com/google/data-transfer-project


Howdy (I work on DTP),

I'd say suspicion is always warranted with things like this. If you know of other services you would like to see data transferred to/from, please let us know; we want this to be open to everyone, big and small, and are looking for suggestions.

FWIW, the team that is building this at Google is the same team that builds Takeout, so we've been trying to give users useful tools for leaving Google for a while now. We think giving users the ability to directly move data to a new service provider is the next evolution of the Takeout ethos of not locking users in.


Thanks for your effort

I'd like to be able to transfer all of my Maps data. That is, the places I have marked, by both name and location, as well as tags, reviews, and photos.

Places I could exchange this data with include OSM, Apple Maps, Yelp, etc.


Thanks for the reply. I've had some downtime today to look over the documentation. It looks pretty solid, though I still have a way to go; I'm actually planning on becoming active on GitHub for your project. The whole Java thing kind of irks me, but hey, I did fine with it in college.

How welcoming of contributors is the project?


Super welcoming :).

Re: Java, yeah, it's in the roadmap that the adapters should be language-agnostic. Forcing them all into one language, regardless of which one it is, is kind of lame.

Check out https://github.com/google/data-transfer-project/blob/master/... for ways to stay in contact with us and start contributing.


Awesome! Thank you for taking the time to respond.

I will try to refrain from criticizing the project until I am more familiar. The next time I interact with this project will be on github, my username there is the same as it is here.


Let me repeat this loudly so people in the back can hear:

TRANSPORT MECHANISMS ARE NOT DATA STANDARDS!

It is great that they want to build on REST and use common authentication standards and what-not: but that's not the hard problem. The hard problem is the data itself (including the critical structure of the data) and getting services to agree on it.

I liken it to someone who wants to build a rail system out of LEGO and then calls it a 'standard system' because the LEGO bricks are all the same... but says nothing about the gauge of the tracks, the overhead clearance, or how the trains should behave at junctions or in the event of a possible collision.

This is yet another 'data transfer standard' that hand-waves away the hard part -- the inter-compatible data model.


Let me shout from the back so you can hear:

BUT THEY HAVE AN INTER-COMPATIBLE DATA MODEL?!

Or at least they're trying to create one. They talk about it in their overview: https://datatransferproject.dev/dtp-overview.pdf

And the actual implementation of the model is on Github as well:

https://github.com/google/data-transfer-project/tree/master/...


They have a couple Data Models:

- Calendars [0] is a simple Event + Attendees model

- Contacts [1] re-uses vCard

- Mail [2] wants RFC 2822-compliant strings

- Photos [3] have just basic metadata with a URL to fetch pixels

- Tasks [4] is a simple to-do model

Overall the data model seems like a significant amount of wheel-reinvention. These data models are all just JSON records... they should be JSON-LD with @context pointing to a shared schema, probably defined at schema.org.

[0] https://github.com/google/data-transfer-project/tree/master/...

[1] https://github.com/google/data-transfer-project/blob/master/...

[2] https://github.com/google/data-transfer-project/tree/master/...

[3] https://github.com/google/data-transfer-project/tree/master/...

[4] https://github.com/google/data-transfer-project/tree/master/...
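
To make the JSON-LD suggestion above concrete, here's a rough sketch (Python; the field names and values are purely illustrative, not the actual DTP model fields) of the difference between a bare JSON record and the same record carrying an @context that maps it onto schema.org's Event type:

    import json

    # A bare JSON record, roughly the shape of the current DTP-style models
    # (field names here are illustrative, not the actual DTP fields).
    plain_event = {
        "title": "Team sync",
        "start": "2018-07-23T10:00:00Z",
        "end": "2018-07-23T10:30:00Z",
        "attendees": ["alice@example.com", "bob@example.com"],
    }

    # The same record as JSON-LD: @context and @type map the fields onto the
    # shared schema.org Event vocabulary, so any consumer that understands
    # schema.org can interpret it without a DTP-specific adapter.
    jsonld_event = {
        "@context": "https://schema.org",
        "@type": "Event",
        "name": "Team sync",
        "startDate": "2018-07-23T10:00:00Z",
        "endDate": "2018-07-23T10:30:00Z",
        "attendee": [
            {"@type": "Person", "email": "alice@example.com"},
            {"@type": "Person", "email": "bob@example.com"},
        ],
    }

    print(json.dumps(jsonld_event, indent=2))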


Honestly, that's more than most projects that come out with their own pet data transfer standards manage.

But, looking at the models, there is a loooong way to go. And once you have more than a handful of these very small, very shallow models, they will themselves have to fit into a meaningful structure (at least at each organization that plans to do something useful with them)...

It will be interesting to see how they handle this when it reaches the next levels of complexity.

I mean, since Leibniz there has been a time-honored tradition of very smart people trying and failing at this omni-model. Who knows, maybe this group will be the first to conquer it.

I'll be happy if they take a small piece, music services for example, and really nail that. But the players have every reason not to get things too compatible, so it's going to take a group of people pushing the services to adopt the data model rather than the other way around.


This is how I find out I've still got *.dev pointing to localhost... somewhere


I think all .dev is owned by Google, right? I guess they are using that since they are a contributor on DTP.


Oh yeah, it is. I just mean that for my local projects I used to use .dev, and I still had *.dev pointing to 127.0.0.1, but not on my local machine.

It turns out I'd also set it up in my network's DNS server at some point
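
For anyone wanting to check their own setup, a quick resolver query (Python here, but dig or nslookup works just as well) shows whether a stale wildcard is still answering; if it prints 127.0.0.1 instead of a Google-hosted address, the old rule is still in effect somewhere:

    import socket

    # Uses the system resolver, so it reflects /etc/hosts, local DNS
    # overrides, and the network's DNS server alike.
    print(socket.gethostbyname("datatransferproject.dev"))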


Yes, Google owns the .dev TLD. Any project you see with .dev has come out of Google in some fashion.


It was such a good idea to preload the entire thing into Chrome's HSTS cache, wasn't it?


Absolutely. It makes it break in very obvious ways that prevent people from continuing to use it as an internal TLD, probably before the domain is actually owned by someone.


The linked PDF has more info on what this is about: https://datatransferproject.dev/dtp-overview.pdf

The use cases and data model sections are particularly interesting. My initial thoughts are that this is wildly unrealistic and will lead to the usual N+1 data formats issue that plagues other multi-company initiatives.


GitHub: https://github.com/google/data-transfer-project

Several other companies are working on it in various forms. Disclaimer: I am working for one of those companies.


What's your company?


The intention of this project is great, but I doubt it can work well. The whole project depends on the assumption that everything to be synchronized can be standardized, and that is almost impossible.

Take "date of birth", for example: some contact providers support a birth date without a year, but some support only the whole date. If companies cannot even agree on this tiny little item, how can they agree on larger deviations, like the maximum number of phone numbers in a contact profile, the size/format of the profile image, the address format, and so on?


To me it doesn't seem like the rocket science you're making it out to be. Most teams store data in non-silly ways that correspond to the common reality the data represent. For example, everybody knows what a date of birth is, and that it has a day, month and year. If the year is missing from the source data, you set it to null in your intermediary data model. If it's required in the destination data... the basic solution would be to choose a default and notify the user; the deluxe solution would be to give them some choices up front about what to do (since you'll already know about the mismatch based on knowing the source and destination providers).

Max number of phone numbers: if the source contains more than the destination's max, stop at the max (basic) or display a list and ask the user to eliminate some (deluxe).
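
A minimal sketch of that mapping logic (the type and field names are hypothetical, not actual DTP models; it just illustrates the intermediary-model idea with the 'basic' strategies described above):

    from dataclasses import dataclass
    from typing import List, Optional

    # Hypothetical intermediary model: the year is optional, so sources
    # that only store day/month can still be represented without loss.
    @dataclass
    class BirthDate:
        day: int
        month: int
        year: Optional[int] = None

    @dataclass
    class Contact:
        name: str
        birth_date: Optional[BirthDate]
        phone_numbers: List[str]

    def export_contact(contact: Contact, default_year: int, max_phones: int) -> dict:
        """Map the intermediary model onto a stricter destination format.

        'Basic' strategy from the comment above: fill missing fields with a
        default and truncate lists that exceed the destination's limits.
        A 'deluxe' implementation would prompt the user instead.
        """
        bd = contact.birth_date
        return {
            "name": contact.name,
            # Destination requires a full date: substitute a default year
            # when the source did not supply one.
            "birth_date": None if bd is None else
                f"{bd.year or default_year:04d}-{bd.month:02d}-{bd.day:02d}",
            # Destination caps the number of phone numbers: stop at the max.
            "phones": contact.phone_numbers[:max_phones],
        }

    # Example: source knows day/month but not year, and has four numbers.
    c = Contact("Alice", BirthDate(day=1, month=4), ["1", "2", "3", "4"])
    print(export_contact(c, default_year=1900, max_phones=2))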


> A user’s decision to move data to another service should not result in any loss of transparency or control over that data.

> It is worth noting that the Data Transfer Project doesn’t include any automated deletion architecture. Once a user has verified that the desired data is migrated, they would have to delete their data from their original service using that service’s deletion tool if they wanted the data deleted.

This project has copy, not move semantics. Therefore, in contrast to the stated purpose of allowing users to control their data, it actually has the opposite consequence of making it simpler to spread users' data around. Without a delete capability, the bias is towards multiple copies of user data.

By establishing this as an open-source effort, the project normalizes web scraping to export data from non-participating APIs, which project partners benefit from asymmetrically. In other words, API providers that do not provide export tools will nonetheless be subject to DTP adapters that exfiltrate data and send it to the (no doubt excellent) DTP importers maintained by DTP partners. This has the effect of creating a partial vacuum, sucking data from non-participants into participants' systems.

The economics of maintaining a high-volume bidirectional synchronization pipeline between DTP partners guarantees that these toy DTP adapters will not be the technology used to copy data between DTP partners, but rather, a dedicated pathway will be established instead. In other words, the public open-source DTP effort could be understood as a facade designed to create a plausible reason for why DTP partners have cross-connected their systems.

TLDR:

- Copy semantics are counterproductive to the goal of providing user control of their data.

- The approach of using existing APIs to scrape data from non-participating vendors is a priori hostile.

- Economics dictate that the lowest-cost option for providing bidirectional synchronization between vendors involves dedicated links and specialized transport schemes that the DTP project itself does not provide equally.

There is some merit to providing abstract representations of common data formats -- look at EDI, for instance. I'd welcome someone from the project stopping by to explain away my concerns.


Howdy, (I work on DTP)

I wanted to provide my thinking on some of these very valid worries.

Re: Copy vs. Move: This was a conscious choice that I think has solid backing in two things: 1) In our user research for Takeout, the majority of users who use Takeout don't do it to leave Google. We suspect that the same will be true for DTP; users will want to try out a new service, or use a complementary service, instead of a replacement. 2) Users should absolutely be able to delete their data once they copy it. However, we think that separating the two is better for the user. For instance, you want to make sure the user has a chance to verify the fidelity of the data at the destination. It would be terrible if a user ported their photos to a new provider, the new provider down-sampled them, and the originals were automatically deleted.

Re: Scraping: It's true that DTP can use the APIs of companies that aren't 'participating' in DTP. But we don't do it by scraping their UIs. We do it like any other app developer, asking for an API key, which that service is free to decline to give. One of the foundational principles we cover in the white paper is that the source service maintains control over who, how, and when to give the data out via their API. So if they aren't interested in their data being used via DTP, that is absolutely their choice.

Re: Economics: As with all forward-looking statements, we'll have to wait and see how it works out. But I'll give one anecdote on why I don't think this will happen. Google Takeout (which I also work on) allows users to export their data to OneDrive, Dropbox, and Box (as well as Google Drive). One of the reasons we wanted to make DTP is that we were tired of dealing with other people's APIs, as it doesn't scale well. Google should build adapters for Google, and Microsoft should build adapters for Microsoft. So with Takeout we tried the specialized transport method, but it was a lot of work, so we went with the DTP approach specifically to try to avoid having specialized transports.

DTP is still in the early phases, and I would encourage you, and everyone else, to get involved (https://github.com/google/data-transfer-project) and help shape the direction of the project.


Hey! Thanks for the response. If you don't mind, I have some questions and comments after reading through your feedback.

> We suspect that [the majority of users who use Takeout don't do it to leave Google] will be true for DTP; users will want to try out a new service, or use a complementary service, instead of a replacement.

Interesting, thanks. I think this sort of worldview makes sense from a certain perspective.

> 2) Users should absolutely be able to delete their data once they copy it.

This is an aspirational statement and not a requirement of DTP, so it's problematic from a public perception standpoint to make the claim that DTP provides the user with more control of their data when the control very much remains at the mercy of the data controller. Indeed, this project directly facilitates the opportunity for more data controllers to obtain copies of the subject's data.

> If they aren't interested in their data being used via DTP, that is absolutely their choice.

Can you clarify whether you are saying that the DTP Project will honor takedown requests from parties targeted by DTP tooling?

> Google should build adapters for Google, and Microsoft should build adapters for Microsoft.

Can you explain the business drivers that incentivize these companies to provide parity between their import and export capabilities? Does the DTP Project require parity between these capabilities?


>This is an aspirational statement and not a requirement of DTP, so it's problematic from a public perception standpoint to make the claim that DTP provides the user with more control of their data when the control very much remains at the mercy of the data controller. Indeed, this project directly facilitates the opportunity for more data controllers to obtain copies of the subject's data.

I don't really disagree with what you say, but I interpret things differently:

Without DTP, if you ask a data controller to delete your data, you have to trust that they do. There is very little way to verify that the deletion actually happened; you more or less need to rely on the reputation of the company. Nowadays they should all have published retention statements which state their deletion practices in more detail, so that helps some, and allows for some recourse if in fact they aren't following them. But in general, for the average user, it comes down mostly to trust.

With DTP, nothing is worse. But users can now get their data into a new service more easily.

If DTP had move semantics you still have the same problem as above, it mostly comes down to trust.

It is true that after a copy there are now two copies of the data, which isn't ideal in terms of data minimization. But for the reasons I outlined previously, I think it is important to keep deletion as a separate action from copy. I do think that after a copy the option to delete the data should be presented to the user prominently, to make that as easy as possible if that is what they want to do.

So DTP isn't trying to solve every problem, but my take is that it makes some things better without making anything else significantly worse, so it's a net win.

> Can you clarify whether you are saying that the DTP Project will honor takedown requests from parties targeted by DTP tooling?

DTP doesn't really store data, so I don't think it is in scope for a traditional takedown request. But more to the spirit of the question: yes, if a service doesn't want to grant a DTP host an API key, or revokes one, we wouldn't condone trying to work around that.

(One super detailed note: DTP is just an open-source project and doesn't operate any production code. A Hosting Entity can download/run the code. A Hosting Entity could be a company letting users transfer data in or out, or a user running DTP locally. Each Hosting Entity is responsible for acquiring API keys for all the services it wants to interact with, including agreeing to and complying with any restrictions that that service might impose for access to their API.)

> Can you explain the business drivers that incentivize these companies to provide parity between their import and export capabilities? Does the DTP Project require parity between these capabilities?

This is a little bit of a bet on our part. I think Google has demonstrated, through its almost decade-long investment in Takeout, that giving users more control over their data leads to greater user trust, and that is good for business.

As for requiring parity, we cover this a bit in the white paper, but as you say, we recognize that reciprocity is key, and we need to incentivize services to invest equally in import and export, otherwise the whole thing falls apart.

Right now the stance we are taking is that reciprocity is strongly encouraged, and we will be collecting stats/metrics to try to measure it so we can name and shame folks who aren't following that. We hope that providing transparency around different services' practices in this area will allow users to make informed decisions about where to store their data.

An interesting thought experiment in this area: if a user wants to transfer data from service A to service B, but service B doesn't allow export back out, what should service A do? Ideally you force service B to support export, but on the other hand the user should be in control, and who is service A to say no? It's almost pitting the good of an individual user against the good of the ecosystem.

We are hoping that as the project, and the larger portability ecosystem, evolve, some kind of neutral governance model emerges that can help mediate some of these issues. It is problematic for service A to decide that question, but a neutral group representing the interests of users will have more legitimacy in answering these tough questions.


Thanks for taking the time to provide these detailed follow ups. I'm still pretty wary of this project, but you've demonstrated that at least one person on the team is thinking through some of this stuff.

> An interesting thought experiment in this area: if a user wants to transfer data from service A to service B, but service B doesn't allow export back out, what should service A do? Ideally you force service B to support export, but on the other hand the user should be in control, and who is service A to say no? It's almost pitting the good of an individual user against the good of the ecosystem.

I'll offer that the European Union's answer to this -- the GDPR -- is to put the data subject first. It would be nice to see the DTP Project align with that position.


Please define 'delete' in this context. I'm afraid that if I transfer my data, the original will never be deleted.

Now I've doubled my problem.


In this context, "delete" should probably be understood to mean "removed from production systems, and retained only to the extent required to meet legal obligations".


Some might argue that web scraping to export data from non-participating services has been normalized for a significant amount of time. This has been a common practice for decades!

Further, it's perhaps possible that copy semantics comprise a very useful operational primitive when coupled with the existing delete primitives. With copy and delete actions available to them, users can choose to share, move, and delete data as they see fit. With only move actions available, users do not get to make their own choices and are limited to the choices prescribed for them.

There is *substantial* merit in calling for more directly useful primitives, but this could perhaps be done in a context that is informed by knowledge of extant primitives.
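
As a trivial illustration of that composition argument (the functions are made up, not actual DTP APIs): with copy and delete exposed as separate primitives, "move" is just one flow a user can choose, and the user gets to verify the copy before anything is destroyed.

    # Hypothetical primitives; not actual DTP APIs, just an illustration.
    def copy_data(source: dict, destination: dict) -> bool:
        destination.update(source)
        return True

    def verify_data(source: dict, destination: dict) -> bool:
        # In practice this is where the user confirms fidelity (e.g. that
        # photos were not down-sampled) before anything is destroyed.
        return all(destination.get(k) == v for k, v in source.items())

    def delete_data(source: dict) -> None:
        source.clear()

    def move_data(source: dict, destination: dict) -> None:
        # "Move" composed from copy + verify + delete. Exposing the
        # primitives separately also lets a user keep both copies,
        # which a built-in move would not allow.
        if copy_data(source, destination) and verify_data(source, destination):
            delete_data(source)

    old_service = {"photo1.jpg": b"...", "photo2.jpg": b"..."}
    new_service = {}
    move_data(old_service, new_service)
    print(old_service)   # {} : original deleted only after a verified copy
    print(new_service)   # now holds the two photos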


What would this actually gain me? Say I transfer my Twitter data to Facebook, what happens? Does it create new posts with all my tweets, or what? Why would I bother? It would be nice to have a simple chart showing what it actually does.


It would be nice to have Apple on board with this project. As an iPhone user I am thinking of using iCloud, BUT if I migrate to an Android phone I will have to migrate all my data manually to Google Drive or OneDrive.


It's an interesting idea, and at the same time strange that some of the companies listed (e.g. Facebook) would be open to contributing to this... especially given that for them, data is their business model.


I don't think it's strange at all; if anything, it gives them access to data they wouldn't have otherwise.


But it also shares the data THEY'VE collected with competing companies for free. Not to mention potentially lowering switching costs for users looking to move away from their platform.


Data portability is interesting for private data (like the example of photos given in the white paper) but I don't think it's useful for social data. It doesn't solve privacy problems; in fact it would give more companies access to your data. It doesn't solve monopoly problems; when you port data from Facebook to Google you still have a Facebook account.


Is this a competitor to Segment.io?


oh look who's once again abusing their monopoly over .dev


Ok, first of all, how is it a monopoly if you just bought the thing? That's like saying facebook has a monopoly over the domain name facebook.com.

Second, how is using it... abusing it?


There's a reasonable argument to be made that ICANN should not have sold an extremely desirable TLD like .dev to a company intending to use it internally rather than permitting open sale.

Not to mention that it was well known to be commonly used for internal domains at other organizations, and they then intentionally acted to break everyone's internal workflows on a global scale. It'd be kind of like if ICANN went and sold .local (which Windows servers default to for internal domains). The other correct option for ICANN was just to refuse to sell .dev and mark it as an internal-use-only TLD.


The way we allocate second level domains is already terrible (see also domain parking/domain squatting), why would I be okay with someone just throwing money at ICANN to get a bunch of generic top level ones too?


ICANN should carefully consider each gTLD application but instead they like money so much they'll sell anything to anyone so long as they pay.

Get ready for .fart.


> Ok, first of all, how is it a monopoly if you just bought the thing?

In general, monopolies are owned, and may be bought and sold.





