docs(understanding-airbyte): Add file sync and permission sync documentation (do not merge) #55783

Draft
wants to merge 10 commits into
base: master
78 changes: 78 additions & 0 deletions docs/using-airbyte/file-transfer.md
@@ -0,0 +1,78 @@
# Airbyte File Sync

Airbyte File Sync moves unstructured data, non-text data, and compressed files between sources and destinations without parsing their contents. This document explains how File Sync works, which connectors support it, and how to use it.

## Overview

Traditional data integration in Airbyte involves extracting structured data as individual records, which are then processed and loaded into a destination. However, many use cases require transferring raw files without parsing their contents:

- Moving binary files (images, videos, PDFs).
- Transferring compressed files (ZIP, GZIP).
- Migrating unstructured text data.
- Preserving file formats for specialized processing.

File Sync addresses these needs by copying files exactly as they appear in the source to the destination, preserving their original format and content.

## How File Sync Works

When using File Sync:

1. The source connector identifies files to be transferred.
2. Airbyte transfers each file as-is instead of parsing its contents into records.
3. The destination connector writes the raw file to the target location.
4. File metadata (name, path, size, etc.) is preserved.

In a standard Airbyte sync, by contrast, files are parsed into individual records.
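
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: local directories stand in for the source and destination, and `sync_files` is a hypothetical helper, not part of any Airbyte connector or CDK. The point is that bytes are copied untouched and basic metadata (name, path, size) is recorded alongside them.

```python
import shutil
import tempfile
from pathlib import Path


def sync_files(source_dir: Path, dest_dir: Path) -> list[dict]:
    """Copy every file under source_dir to dest_dir byte for byte.

    Hypothetical helper: local directories stand in for a real source and
    destination. Returns one metadata entry per copied file.
    """
    transferred = []
    # Step 1: identify the files to transfer.
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = src.relative_to(source_dir)
        dst = dest_dir / relative
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Steps 2 and 3: copy the raw bytes; nothing is parsed into records.
        shutil.copy2(src, dst)
        # Step 4: keep basic file metadata.
        transferred.append(
            {"name": src.name, "path": relative.as_posix(), "size": dst.stat().st_size}
        )
    return transferred


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source, dest = Path(tmp, "source"), Path(tmp, "dest")
        (source / "reports").mkdir(parents=True)
        (source / "reports" / "q1.bin").write_bytes(b"\x00\x01\x02")
        for entry in sync_files(source, dest):
            print(entry)
```

Because the copy never decodes the file, the same code path handles images, archives, and text alike.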

## Supported Connectors

File Sync is currently supported by the following connectors:

### Sources

- [SFTP (Gen 2)](../integrations/sources/sftp-bulk.md)
- [Microsoft SharePoint](../integrations/sources/microsoft-sharepoint.md)
- [S3](../integrations/sources/s3.md)

### Destinations

- [S3](../integrations/destinations/s3.md)

## Using File Sync

To use File Sync:

1. Configure a connection using a source and destination that both support File Sync.
2. Airbyte enables File Sync mode automatically when both connectors support it.
3. Airbyte then transfers the files without parsing their contents.

### Configuration Example

When configuring a connection between SFTP Bulk (listed above as SFTP (Gen 2)) as the source and S3 as the destination:

1. Set up the SFTP Bulk source with your server credentials and file paths.
2. Configure the S3 destination with your bucket information.
3. The connection uses File Sync mode automatically.

## Limitations

- Both the source and destination must support File Sync.
- File Sync is designed for raw file movement, not for transforming data.
- Maximum file size limits may apply depending on the connectors.

## Technical Implementation

File Sync is implemented in two Airbyte CDKs:

- Python Files CDK: Provides file transfer capabilities for Python-based connectors.
- Java/Kotlin Bulk Destination CDK: Supports file transfer for Java-based connectors.

Connectors that support this feature have the `supportsFileTransfer: true` flag in their `metadata.yaml` file.
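
As a rough illustration of how the flag relates to the "both connectors must support File Sync" rule, the sketch below checks already-parsed metadata for a source and a destination. The function names are hypothetical, and the position of the flag inside `metadata.yaml` (top level or under a `data` key) is an assumption, so the check accepts either.

```python
def supports_file_transfer(metadata: dict) -> bool:
    """Return True when parsed connector metadata enables supportsFileTransfer.

    Assumption: the flag sits at the top level or under a "data" mapping.
    """
    nested = metadata.get("data")
    data = nested if isinstance(nested, dict) else metadata
    # Only a real boolean True counts; the string "true" does not.
    return data.get("supportsFileTransfer") is True


def can_use_file_sync(source_metadata: dict, destination_metadata: dict) -> bool:
    """File Sync needs the flag on both sides of the connection."""
    return supports_file_transfer(source_metadata) and supports_file_transfer(
        destination_metadata
    )


if __name__ == "__main__":
    # Hypothetical, hand-written metadata for illustration only.
    source = {"data": {"supportsFileTransfer": True}}
    destination = {"data": {"supportsFileTransfer": True}}
    print(can_use_file_sync(source, destination))
```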

## Future Enhancements

The File Sync capability is being expanded to support more source and destination connectors. Check the documentation of specific connectors to see if they support File Sync.

## Related Topics

- [Permission Sync](./permission-sync.md) - Learn about transferring access control information between systems
- [Parsing Unstructured Documents](./unstructured-documents.md) - Learn about extracting text from unstructured documents
81 changes: 81 additions & 0 deletions docs/using-airbyte/permission-sync.md
@@ -0,0 +1,81 @@
# Airbyte Permission Sync

Permission Sync transfers access control information and permission structures between systems. This document explains how Permission Sync works, which connectors support it, and how to use it.

## Overview

When transferring data between systems, it's often important to maintain not just the data itself but also the permission structures that govern access to that data. Permission Sync addresses this need by:

- Preserving user and group access controls.
- Maintaining role-based permissions.
- Transferring ownership information.
- Replicating sharing settings.

This ensures that when data is moved between systems, the appropriate access controls are maintained.

## How Permission Sync Works

When using Permission Sync:

1. The source connector extracts both data and associated permission metadata.
2. Permission structures are replicated from the source and sent as records.
3. The destination connector receives permission information as incoming records.
4. Downstream applications use these records to reconstruct the access restrictions.

While Permission Sync and File Sync connections can often complement each other, they are distinct and separate features and should be set up as separate connections.
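
To make step 4 concrete, here is a minimal sketch of how a downstream application might rebuild access checks from permission records. The record shape (`document_id`, `principal`) and the helper names are assumptions made for illustration; the actual fields emitted by a connector may differ.

```python
from collections import defaultdict


def build_access_index(permission_records: list[dict]) -> dict[str, set[str]]:
    """Group principals (users or groups) by the document they can access.

    Assumed record shape: {"document_id": str, "principal": str}.
    """
    index: dict[str, set[str]] = defaultdict(set)
    for record in permission_records:
        index[record["document_id"]].add(record["principal"])
    return dict(index)


def can_access(
    index: dict[str, set[str]],
    group_members: dict[str, set[str]],
    document_id: str,
    user: str,
) -> bool:
    """Allow access if the user is listed directly or via a listed group."""
    allowed = index.get(document_id, set())
    if user in allowed:
        return True
    return any(user in group_members.get(principal, set()) for principal in allowed)


if __name__ == "__main__":
    records = [
        {"document_id": "doc-1", "principal": "alice@example.com"},
        {"document_id": "doc-1", "principal": "group:finance"},
    ]
    groups = {"group:finance": {"carol@example.com"}}
    index = build_access_index(records)
    print(can_access(index, groups, "doc-1", "carol@example.com"))
```

Documents with no permission records are treated as inaccessible here, which is a deliberately conservative choice for the sketch.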

## Supported Connectors

Permission Sync is currently in early development with limited connector support. The following source connectors are planned to support Permission Sync:

### Sources

- Microsoft SharePoint (in development)
- Google Drive (planned)
- Box (planned)

### Destinations

Permission Sync uses standard record-type processing, making it compatible with all Airbyte destinations.

## Using Permission Sync

To use Permission Sync:

1. Configure a connection using a source that supports Permission Sync. Any destination works, because permissions arrive as regular records.
2. Enable the Permission Sync option in the connection settings.
3. Configure user/group mapping if needed for cross-system synchronization.

### Configuration Example

When configuring a connection between Microsoft SharePoint (source) and S3 (destination):

1. Set up the SharePoint source with your tenant credentials.
2. Configure the S3 destination with your bucket information and IAM settings.
3. Enable Permission Sync in the advanced options.
4. Configure user mapping between SharePoint users and AWS IAM roles/users.

## Limitations

- Permission structures vary significantly between systems, so perfect mapping is not always possible.
- Some permission types may not have equivalents in destination systems.
- User and group identity mapping may require manual configuration.
- Permission Sync is most effective between systems with similar access control models.
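
The identity-mapping limitation can be illustrated with a small sketch. It assumes a manually maintained lookup table from source principals to destination principals; `map_principals`, the example addresses, and the example ARN are all hypothetical. Principals without an entry are reported rather than silently dropped, since they need manual attention.

```python
def map_principals(
    source_principals: list[str], identity_map: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Translate source identities to destination identities.

    Returns the successful mappings plus the principals that have no
    equivalent in the manually configured identity_map.
    """
    mapped: dict[str, str] = {}
    unmapped: list[str] = []
    for principal in source_principals:
        target = identity_map.get(principal)
        if target is None:
            unmapped.append(principal)
        else:
            mapped[principal] = target
    return mapped, unmapped


if __name__ == "__main__":
    # Hypothetical mapping maintained by an administrator.
    identity_map = {
        "alice@contoso.example": "arn:aws:iam::123456789012:user/alice",
    }
    mapped, unmapped = map_principals(
        ["alice@contoso.example", "dave@contoso.example"], identity_map
    )
    print("mapped:", mapped)
    print("needs manual review:", unmapped)
```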

## Technical Implementation

Permission Sync is implemented as an extension to the Airbyte protocol, allowing connectors to exchange permission metadata alongside regular data records. Connectors that support this feature have the `supportsPermissionSync: true` flag in their `metadata.yaml` file.

## Future Enhancements

The Permission Sync capability is being actively developed with plans to support more source and destination connectors. Future enhancements will include:

- More granular permission mapping options.
- Support for complex role-based access control (RBAC) systems.
- Automated user/group identity mapping.
- Audit logging for permission changes during sync.

## Related Topics

- [File Sync](./file-transfer.md) - Learn about transferring files between systems without parsing
- [Parsing Unstructured Documents](./unstructured-documents.md) - Learn about extracting text from unstructured documents
83 changes: 83 additions & 0 deletions docs/using-airbyte/unstructured-documents.md
@@ -0,0 +1,83 @@
# Parsing Unstructured Documents

Airbyte provides capabilities for extracting and processing unstructured text documents from various sources. This document explains how Airbyte's unstructured document parsing works, which connectors support it, and how to use it.

## Overview

Traditional data integration typically focuses on structured data with well-defined schemas. However, many organizations need to extract value from unstructured documents such as:

- Text documents (Word, PDF, TXT).
- Emails and email attachments.
- Web pages and HTML content.
- Presentations and spreadsheets.
- Scanned documents with OCR text.

Airbyte's unstructured document parsing capabilities address these needs by extracting text content from various document formats and making it available for analysis, search, or AI processing.

## How Unstructured Document Parsing Works

When using unstructured document parsing:

1. The source connector identifies documents to be processed.
2. The document parser extracts text content from the documents.
3. The extracted text is normalized and cleaned.
4. The text is sent as records to the destination.

This process enables you to work with text from documents in the same way you work with other structured data in Airbyte.
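
The four steps can be sketched with the Python standard library alone. This is not the parser used by the Python Files CDK: `html.parser` stands in for a real document parser, only HTML and plain text are handled, and the record fields (`document_key`, `content`) are assumptions chosen for illustration.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def extract_text(filename: str, raw: bytes) -> str:
    """Step 2: pull text out of the raw bytes (HTML or plain text only)."""
    text = raw.decode("utf-8", errors="replace")
    if filename.lower().endswith((".html", ".htm")):
        parser = _TextExtractor()
        parser.feed(text)
        parser.close()
        text = " ".join(parser.chunks)
    return text


def normalize(text: str) -> str:
    """Step 3: collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def parse_documents(documents: dict[str, bytes]) -> list[dict]:
    """Steps 1 and 4: walk the documents and emit one record per document."""
    return [
        {"document_key": name, "content": normalize(extract_text(name, raw))}
        for name, raw in sorted(documents.items())
    ]


if __name__ == "__main__":
    docs = {
        "page.html": b"<html><body><h1>Title</h1><p>Hello   world</p></body></html>",
        "note.txt": b"  plain\n text ",
    }
    for record in parse_documents(docs):
        print(record)
```

Each document becomes a single record, so a destination can treat the extracted text like any other column.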

## Supported Connectors

Unstructured document parsing is currently supported by the following connectors:

### Sources

- Google Drive
- Microsoft SharePoint
- S3
- SFTP (Gen 2)

## Using Unstructured Document Parsing

To use unstructured document parsing:

1. Configure a connection using a source that supports document parsing.
2. Enable the document parsing option in the connection settings.
3. Configure any additional parsing options (e.g., language detection, OCR settings).
4. Airbyte extracts the text and sends it to your destination as records.

### Configuration Example

When configuring a connection between Google Drive (source) and a destination:

1. Set up the Google Drive source with your account credentials.
2. Enable the "Parse Documents" option in the advanced settings.
3. Configure document type filters if needed (e.g., only process PDFs).
4. Complete the connection setup with your desired destination.

## Limitations

- Document parsing may not extract formatting, images, or complex layouts.
- Very large documents may be truncated based on size limits.
- OCR accuracy depends on document quality and language support.
- Some document types may require specific parser configurations.

## Technical Implementation

Unstructured document parsing is implemented using the "Unstructured Text Documents" parser in the Python Files CDK. This parser leverages open-source libraries to extract text from various document formats.

Connectors that support this feature have the `supportsUnstructuredDocumentParsing: true` flag in their `metadata.yaml` file.

## Future Enhancements

The unstructured document parsing capability is being actively developed with plans to support more document types and extraction features. Future enhancements will include:

- Improved layout preservation.
- Better table extraction from documents.
- Enhanced metadata extraction.
- Support for more document formats.
- Integration with AI models for content analysis.

## Related Topics

- [File Sync](./file-transfer.md) - Learn about transferring files between systems without parsing
- [Permission Sync](./permission-sync.md) - Learn about transferring access control information between systems
6 changes: 6 additions & 0 deletions docusaurus/redirects.yml
@@ -37,6 +37,12 @@
  to: /cloud/managing-airbyte-cloud/configuring-connections
- from: /cloud/managing-airbyte-cloud/manage-schema-changes
  to: /using-airbyte/schema-change-management
- from: /file-sync
  to: /using-airbyte/file-transfer
- from: /permission-sync
  to: /using-airbyte/permission-sync
- from: /unstructured-data
  to: /using-airbyte/unstructured-documents
# November 2023 documentation restructure:
- from:
    - /project-overview/product-support-levels