
Commit 6ce28c3
committed: data processing code and instructions
1 parent 44f0e44 commit 6ce28c3

10 files changed: +728 -2 lines changed

README.md (+11 -2)

@@ -35,8 +35,17 @@ We start training from the official SD1.4 model (with the first layer modified t
### Data Processing
The data processing code can be found under the `data_processing` folder. You can simply put all the videos in a directory and pass that directory as the folder name in `data_processing/moments_processing.py`. If your videos are long (e.g. more than ~5 seconds and contain cut scenes), you will want to use PySceneDetect to detect the cut scenes and split the videos accordingly (see the sketch below).
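This repo does not ship a scene-splitting script, but as a rough sketch (assuming `pip install scenedetect`, `ffmpeg` available on your PATH, and a hypothetical input file `my_long_video.mp4`), splitting on detected cuts with PySceneDetect could look like:

```python
# Hedged sketch using PySceneDetect's high-level API; not part of this repo.
from scenedetect import detect, ContentDetector, split_video_ffmpeg

# Detect cut scenes in the (hypothetical) input video.
scene_list = detect('my_long_video.mp4', ContentDetector())

# Write one clip per detected scene via ffmpeg; the resulting clips can then be
# placed in the folder passed to data_processing/moments_processing.py.
split_video_ffmpeg('my_long_video.mp4', scene_list)
```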
For data processing, you also need to download the checkpoint for Segment Anything (SAM) and install softmax-splatting. You can set up softmax-splatting and SAM by following:

```
cd data_processing
git clone https://github.com/sniklaus/softmax-splatting.git
pip install segment_anything
cd sam_model
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
```

For softmax-splatting to run, you also need to install CuPy with `pip install cupy` (or `pip install cupy-cuda11x` / `pip install cupy-cuda12x`, depending on your CUDA version; make sure to load the appropriate CUDA module).
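If you are unsure which CuPy wheel matches your setup, one quick way to check the CUDA version your PyTorch build was compiled against (assuming PyTorch is already installed) is:

```python
import torch

# Prints e.g. '11.8' or '12.1'; pick cupy-cuda11x or cupy-cuda12x accordingly.
# Prints None for a CPU-only PyTorch build.
print(torch.version.cuda)
```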

Then run `python moments_processing.py` to start processing frames from the provided example videos (included under `data_processing/example_videos`). For the version provided, we used the [Moments in Time Dataset](http://moments.csail.mit.edu).

### Running the training script
Make sure that you have downloaded the pretrained SD1.4 model above. Once you have downloaded the training dataset and the pretrained model, you can simply start training the model with
@@ -45,7 +54,7 @@ Make sure that you have downloaded the pretrained SD1.4 model above. Once you d
The training code is in `main.py` and relies mainly on `pytorch_lightning` for training.

Note that you need to modify the train and val paths in the chosen config file to point to the location of your processed data.

Note: we use DeepSpeed to lower the memory requirements, so the saved model weights will be sharded. The script to reconstruct the model weights is created in the checkpoint directory under the name `zero_to_fp32.py`. One bug in that file is that it does not recognize checkpoints saved with DeepSpeed stage 1 (which is the stage we use), so simply find and replace the string `== 2` with the string `<= 2` and it will work.
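That find-and-replace can also be scripted; a minimal sketch (the checkpoint path below is hypothetical, substitute your own):

```python
from pathlib import Path

# Hypothetical path: zero_to_fp32.py is generated inside the DeepSpeed checkpoint directory.
script = Path('logs/my_run/checkpoints/last.ckpt/zero_to_fp32.py')

# Loosen the stage check so the script also accepts DeepSpeed stage-1 checkpoints.
script.write_text(script.read_text().replace('== 2', '<= 2'))
```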

4 binary files changed (not shown)

data_processing/moments_dataset.py (+54)

@@ -0,0 +1,54 @@
# Copyright 2024 Adobe. All rights reserved.

#%%
import glob
import torch
import torchvision
import matplotlib.pyplot as plt
from torch.utils.data import Dataset
import numpy as np


# %%
class MomentsDataset(Dataset):
    def __init__(self, videos_folder, num_frames, samples_per_video, frame_size=512) -> None:
        super().__init__()

        self.videos_paths = glob.glob(f'{videos_folder}/*mp4')
        self.resize = torchvision.transforms.Resize(size=frame_size)
        self.center_crop = torchvision.transforms.CenterCrop(size=frame_size)
        self.num_samples_per_video = samples_per_video
        self.num_frames = num_frames

    def __len__(self):
        # Each video contributes `num_samples_per_video` samples.
        return len(self.videos_paths) * self.num_samples_per_video

    def __getitem__(self, idx):
        video_idx = idx // self.num_samples_per_video
        video_path = self.videos_paths[video_idx]

        try:
            # Randomize the first sampled frame so repeated samples of the same
            # video cover slightly different temporal windows.
            start_idx = np.random.randint(0, 20)

            unsampled_video_frames, audio_frames, info = torchvision.io.read_video(video_path, output_format="TCHW")
            # Pick `num_frames` frames evenly spaced between start_idx and the last frame.
            sampled_indices = torch.tensor(np.linspace(start_idx, len(unsampled_video_frames) - 1, self.num_frames).astype(int))
            sampled_frames = unsampled_video_frames[sampled_indices]
            processed_frames = []

            # Resize the short side to `frame_size`, then center-crop to a square.
            for frame in sampled_frames:
                resized_cropped_frame = self.center_crop(self.resize(frame))
                processed_frames.append(resized_cropped_frame)
            frames = torch.stack(processed_frames, dim=0)
            frames = frames.float() / 255.0
        except Exception as e:
            # On a decoding/sampling failure, fall back to a random other sample.
            print('oops', e)
            rand_idx = np.random.randint(0, len(self))
            return self.__getitem__(rand_idx)

        out_dict = {'frames': frames,
                    'caption': 'none',
                    'keywords': 'none'}

        return out_dict
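For reference, a minimal (hedged) usage sketch of this dataset with a standard PyTorch `DataLoader`; the folder path and hyperparameters below are placeholders, not values prescribed by the repo:

```python
from torch.utils.data import DataLoader

from moments_dataset import MomentsDataset  # assumes you are running inside data_processing/

# Placeholder arguments; point videos_folder at a directory of .mp4 clips.
dataset = MomentsDataset(videos_folder='example_videos', num_frames=16,
                         samples_per_video=4, frame_size=512)
loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=4)

batch = next(iter(loader))
print(batch['frames'].shape)  # expected: torch.Size([2, 16, 3, 512, 512])
```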
