Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use SSH to authenticate GitDagBundle #44976

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

ephraimbuddy
Copy link
Contributor

This uses SSH hook to authenticate GitDagBundle when provided.

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Dec 17, 2024
@ephraimbuddy ephraimbuddy force-pushed the auth-git-dagbundle branch 2 times, most recently from 3806221 to 9537677 Compare December 18, 2024 15:18
@ephraimbuddy ephraimbuddy marked this pull request as ready for review December 18, 2024 16:20
repo_url: str,
tracking_ref: str,
subdir: str | None = None,
ssh_conn_kwargs: dict[str, str] | None = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ssh_conn_kwargs: dict[str, str] | None = None,
ssh_hook_kwargs: dict[str, str] | None = None,

These are really kwargs for SSHHook. docstring is wrong too.


def initialize(self) -> None:
if self.ssh_conn_kwargs:
if not self.repo_url.startswith("git@") and not self.repo_url.endswith(".git"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not self.repo_url.startswith("git@") and not self.repo_url.endswith(".git"):
if not self.repo_url.startswith("git@") or not self.repo_url.endswith(".git"):

We need to check that they are there individually.

Comment on lines 308 to 309
def test_repo_url_starts_with_git_when_using_ssh_conn_id(self):
repo_url = "https://github.com/apache/airflow"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_repo_url_starts_with_git_when_using_ssh_conn_id(self):
repo_url = "https://github.com/apache/airflow"
@pytest.mark.parametrize(
"repo_url",
[
pytest.param("https://github.com/apache/airflow", id="https_url"),
pytest.param("airflow@example:apache/airflow.git", id="does_not_start_with_git_at"),
pytest.param("git@example:apache/airflow", id="does_not_end_with_dot_git"),
]
)
def test_repo_url_starts_with_git_when_using_ssh_conn_id(self, repo_url):

Or similar. This better tests that we enforce it both starts with and ends with git.

self.bare_repo.remotes.origin.fetch("+refs/heads/*:refs/heads/*")
self.repo.remotes.origin.pull()
if self.ssh_hook:
with self.ssh_hook.get_conn():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this create a session to the (a?) remote ssh host here? Not sure we want that. Not even sure where this would connect, since the remote host likely wouldn't match the repo_url.

Unless we can somehow tell git to use that socket/session?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I experimented, once the ssh connection is established, it will be used by git to clone the repo. The remote host is where the ssh will establish the connection and not the repo url. While testing in breeze, I set localhost as the remote host and cloned a private repo. Also, in the breeze ssh config, I put it to use private key:

Host * 
 IdentityFile ~/.ssh/id_ed25519

Maybe @potiuk can help here.

Copy link
Member

@potiuk potiuk Dec 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not paid too much attention to those changes so far but happy to help. Not sure exactly what is the problem, but if I understand correctly the question is about the repo_url and host connection established in SSHHook via the SSH conn.

Yeah - generally git can use SSH protocol (And I think it should be generally default for most "production" setup configurations.

https://git-scm.com/book/ms/v2/Git-on-the-Server-The-Protocols

And in this case indeed we need to somehow make repo_url and SSHHook host match. because if they don't then indeed SSHHook.get_conn() uses the host from SSH conn to establish connection and they connection is reused later.

I believe (from what I see and earlier comments) we are not trying to use existing ssh_conn_id - which I find a bit strange, because Connection (and/or secrets) is where we can securely store authentication information, so i think we should pass conn_id - to retrieve the authentication information from the Connection.

I can imagine - for example that:

Connection(
    login = "git"
    host = "github.com"
    extra = {... private_key etc ..)

Where repo_url = :

[email protected]:apache/airflow.git

There is a bit of a problem here indeed because the question is how to make host part of the repo_url and connection host match. We don't have really a place to store the "repo" portion of the github URL in SSH Connection, so in the current setup we have a bit of a strange mixup between connection and repo_url where part of the connection definition is in the hook and part in the repo_url.

IMHO - the cleanest solution would be to create a GItSSHHook and Git SSH Connection extending from SSHHook. We could add repository extra to Git SSH Connection on top of what SSH Connection already has - and use it exclusively for git SSH connectivity/authentication information.

In this case repo_url and ssh_conn_id would be mutually exclusive. This would avoid a lot of confusion. We have very similar case where we mixed configuration of emails sent by Airflow - where part of the email configuration is taken from the email conn id and part from the configuraiton and it ain't pretty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I worked towards a separate hook earlier but later changed to using the ssh conn_id. I will revisit it. I think the issue I had was populating the connection type field with the new hook since the hook is not yet in the providers package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed a change to use githook. Not manually tested for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler
Projects
Development

Successfully merging this pull request may close these issues.

4 participants