docs: fix technical inaccuracies in Local JSON destination documentation #55808

185 changes: 161 additions & 24 deletions docs/integrations/destinations/local-json.md

:::danger

This destination is meant to be used on a local workstation and won't work in Kubernetes production deployments. The destination writes data to the local filesystem of its own container, which is not accessible outside the pod in a Kubernetes environment unless you configure persistent volumes.

:::

## Overview

This destination writes data to a directory on the filesystem within the Airbyte container. All data is written under the `/local` directory inside the container.

### Sync Overview

This integration will be constrained by the speed at which your filesystem accepts writes.

The `destination_path` will always start with `/local`, whether or not the user specifies it. Any directory nesting within `/local` will be mapped onto the local mount.

The connector code enforces that all paths must be under the `/local` directory. If you provide a path that doesn't start with `/local`, it will be automatically prefixed with `/local`. Attempting to write to a location outside the `/local` directory will result in an error.

:::caution

When using abctl to deploy Airbyte locally, the data is stored within the Kubernetes cluster created by abctl. You'll need to use kubectl commands to access the data, as described in the "Using with Kubernetes" section below.

:::
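
For example, if you deployed with abctl, you can point `kubectl` at the cluster that abctl manages. The kubeconfig path, context name, and namespace below are assumptions based on abctl's defaults and may differ in your installation:

```bash
# Use the kubeconfig that abctl writes for its local (kind-based) cluster.
# The path, context, and namespace shown here are abctl defaults and may vary.
export KUBECONFIG=~/.airbyte/abctl/abctl.kubeconfig
kubectl config use-context kind-airbyte-abctl

# Confirm you can reach the cluster and see the Airbyte pods
kubectl get pods -n airbyte-abctl
```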

### Example:

- If `destination_path` is set to `/local/cars/models`
- then all data will be written to the `/local/cars/models` directory inside the container.

:::info
**Understanding Airbyte's Architecture:** In Airbyte's Kubernetes deployment, destination connectors don't run as standalone pods. Instead, they are executed as jobs by the worker pods. This means that to persist data from the Local JSON destination, you must mount volumes to the worker pods, not to the destination connectors directly.
:::

## Using with Kubernetes

When running Airbyte in a Kubernetes environment, follow these steps to configure the destination and access the replicated data:

1. **Create a Persistent Volume Claim**
- First, create a persistent volume claim (PVC) in your Kubernetes cluster:
```
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-json-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF
```
- Note: Adjust the namespace and other parameters according to your Kubernetes setup

2. **Configure the Destination with Volume Mount**
- When setting up your Local JSON destination, set the destination path to `/local/data`
- In the Airbyte UI, create or edit your connection to use this destination
- **Important**: You must configure the worker pods that run the destination connector to mount the PVC during sync
- In Airbyte's Kubernetes deployment, destination connectors run as jobs launched by the worker deployment
- For Helm deployments, modify your values.yaml to include volume mounts for the worker:
```yaml
worker:
  extraVolumes:
    - name: data-volume
      persistentVolumeClaim:
        claimName: local-json-data
  extraVolumeMounts:
    - name: data-volume
      mountPath: /local
```
- Apply this configuration when installing or upgrading Airbyte:
```bash
helm upgrade --install airbyte airbyte/airbyte -n airbyte -f values.yaml
```
- For manual Kubernetes deployments, patch the worker deployment:
```bash
kubectl patch deployment airbyte-worker -n airbyte --patch '
{
  "spec": {
    "template": {
      "spec": {
        "volumes": [
          {
            "name": "data-volume",
            "persistentVolumeClaim": {
              "claimName": "local-json-data"
            }
          }
        ],
        "containers": [
          {
            "name": "airbyte-worker",
            "volumeMounts": [
              {
                "name": "data-volume",
                "mountPath": "/local"
              }
            ]
          }
        ]
      }
    }
  }
}'
```
- This step is critical: without mounting the volume to the worker pods that run the destination, data will not persist.

3. **Access Data After Sync Completion**
- After a sync completes, the data remains in the persistent volume. Create a temporary pod with the volume mounted to browse it:
```
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: file-access
spec:
  containers:
    - name: file-access
      image: busybox
      command: ["sh", "-c", "ls -la /data && sleep 3600"]
      volumeMounts:
        - name: data-volume
          mountPath: /data
  volumes:
    - name: data-volume
      persistentVolumeClaim:
        claimName: local-json-data
EOF
```
- Then access the pod to view files (a `kubectl cp` sketch for copying files to your workstation follows these steps):
```
kubectl exec -it file-access -- sh
```
- To view file contents directly:
```
# First, list all directories to find your stream names
kubectl exec -it file-access -- ls -la /data

# Then view specific files (replace stream_name with actual stream name from above)
kubectl exec -it file-access -- cat /data/stream_name/*.jsonl
```
- When finished, delete the temporary pod:
```
kubectl delete pod file-access
```

4. **Alternative: View File Paths in Logs**
- If you can't mount the volume, you can at least see the file paths in the logs:
```
kubectl logs <pod-name> | grep "File output:"
```

Note: The exact pod name will depend on your specific connection ID and sync attempt. Look for pods with names containing "destination" and your connection ID.
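
If you prefer to copy the replicated files to your workstation instead of viewing them inside the temporary pod, `kubectl cp` can be used against that pod. This is a minimal sketch that assumes the `file-access` pod and `/data` mount from step 3 are still in place; `stream_name` and the file name are placeholders to replace with your own:

```bash
# Copy a single output file from the temporary pod to the current directory
kubectl cp file-access:/data/stream_name/output_file.jsonl ./output_file.jsonl

# Or copy an entire stream directory
kubectl cp file-access:/data/stream_name ./stream_name
```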

If you are running Airbyte on Windows, you may need to adjust these commands accordingly. You can also refer to the [alternative file access methods](/integrations/locating-files-local-destination.md) for other approaches.

## Troubleshooting

### Verifying Volume Mounts

If you're having trouble with data persistence, follow these steps to verify your volume mounting configuration:

1. **Check if the PVC was created successfully:**
```bash
kubectl get pvc local-json-data -n <your-namespace>
```
The status should be "Bound".

2. **Verify that the worker pods have the volume mounted:**
```bash
kubectl describe pod -l app=airbyte-worker -n <your-namespace> | grep -A 10 "Volumes:"
```
You should see your volume listed with the correct PVC.

3. **Check the logs of a recent sync job for file paths:**
```bash
kubectl logs <destination-pod-name> -n <your-namespace> | grep "File output:"
```
This should show paths starting with `/local/`.

4. **Common issues:**
- Volume not mounted to worker pods (most common issue)
- Incorrect mount path (must be `/local`)
- PVC not bound or available
- Insufficient permissions on the mounted volume
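
To rule out the permissions issue above, one quick check is to attempt a write from the temporary `file-access` pod described earlier. This sketch assumes that pod is still running with the volume mounted at `/data`:

```bash
# A "Permission denied" error here points to a permissions problem on the volume
kubectl exec -it file-access -- sh -c 'touch /data/.write-test && ls -la /data/.write-test && rm /data/.write-test'
```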

## Changelog
