Commit: inspect

e3rd committed Mar 19, 2024
1 parent e0620f9 commit a9d133a
Showing 3 changed files with 67 additions and 24 deletions.
46 changes: 41 additions & 5 deletions README.md
@@ -2,6 +2,19 @@

Yet another file deduplicator.

- [About](#about)
* [What are the use cases?](#what-are-the-use-cases)
* [What is compared?](#what-is-compared)
* [Why not using standard sync tools like meld?](#why-not-using-standard-sync-tools-like-meld)
* [Doubts?](#doubts)
- [Launch](#launch)
- [Examples](#examples)
* [Duplicated files](#duplicated-files)
* [Names shuffled](#names-shuffled)
- [Documentation](#documentation)
* [Parameters](#parameters)
* [Utils](#utils)

# About

## What are the use cases?
@@ -34,15 +47,18 @@ These imply the folders have the same structure. Deduplidog is tolerant towards

## Doubts?

The program does not write anything to the disk unless `execute=True` is set. Feel free to launch it just to inspect the recommended actions. Or set `bashify=True` to output bash commands you may launch after examining them thoroughly.
The program does not write anything to the disk unless `execute=True` is set. Feel free to launch it just to inspect the recommended actions. Or set `inspect=True` to output bash commands you may launch after examining them thoroughly.
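
For instance (a minimal sketch; the folder paths are only illustrative):

```python3
# Safe run: nothing is written, the planned actions are only reported.
Deduplidog("/home/user/duplicates", "/media/disk/origs", rename=True)

# The same run, additionally printing the corresponding bash commands for review.
Deduplidog("/home/user/duplicates", "/media/disk/origs", rename=True, inspect=True)
```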

# Launch

Install with `pip install deduplidog`.

It works as a standalone program with both CLI and TUI interfaces. Just launch the `deduplidog` command.
Moreover, it works best when used from a [Jupyter Notebook](https://jupyter.org/).
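
From a notebook cell, a minimal session might look like this (a sketch; the exact import path is an assumption, the folders are the ones used in the examples below):

```python3
from deduplidog import Deduplidog

# Without execute=True this is a dry run: the work folder is compared against the originals.
Deduplidog("/home/user/duplicates", "/media/disk/origs")
```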

# Examples

## Duplicated files
Let's take a closer look at a use case.

```python3
Expand Down Expand Up @@ -94,14 +110,14 @@ Affectable: 38/38
Affected size: 59.9 kB
```

As you can see, the log is kept as brief as possible while staying transparent. Files to be affected in the work folder are prefixed with the 🔨 icon, whereas those affected in the original folder get the 📄 icon. We might add the `execute=True` parameter to perform the actions, or use `bashify=True` to inspect them first.
As you can see, the log is kept as brief as possible while staying transparent. Files to be affected in the work folder are prefixed with the 🔨 icon, whereas those affected in the original folder get the 📄 icon. We might add the `execute=True` parameter to perform the actions, or use `inspect=True` to inspect them first.

```python3
Deduplidog("/home/user/duplicates", "/media/disk/origs",
ignore_date=True, rename=True, set_both_to_older_date=True, bashify=True)
ignore_date=True, rename=True, set_both_to_older_date=True, inspect=True)
```

Setting `bashify=True` just produces the commands we might use.
Setting `inspect=True` just produces the commands we might subsequently use.

```bash
touch -t 1524754680.0 /media/disk/origs/foo.txt
@@ -110,6 +126,26 @@ mv -n /home/user/duplicates/bar.txt /home/user/duplicates/✓bar.txt
mv -n /home/user/duplicates/third.txt /home/user/duplicates/✓third.txt
```

## Names shuffled

You face a directory that might contain some images twice. Let's analyze it. We turn on `media_magic` so that scaled-down copies of images are found too. We `ignore_name` because the scaled images might have been renamed. We `skip_bigger` files because we examine a single folder, where every file pair would otherwise be matched twice; this way we declare that the original image is the bigger one. Finally, we set the `log_level` verbosity so that we get a list of the affected files.

```
$ deduplidog --work-dir ~/shuffled/ --media-magic --ignore-name --skip-bigger --log-level=20
Only files with media suffixes are taken into consideration. Nor the size nor the date is compared. Nor the name!
Duplicates from the work dir at 'shuffled' (only if smaller than the pair file) would be (if execute were True) left intact (because no action is selected).
Number of originals: 9
Caching image hashes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 16.63it/s]
Caching working files: 9it [00:00, 62497.91it/s]
* /home/user/shuffled/IMG_20230802_shrink.jpg
/home/user/shuffled/IMG_20230802.jpg
Affectable: 1/9
Affected size: 636.4 kB
```

We see there is a single duplicated file, named `IMG_20230802_shrink.jpg`.
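
To then act on the finding, a follow-up run could add an action and `--execute` (a sketch; it assumes the boolean parameters documented below map to CLI flags of the same names):

```
$ deduplidog --work-dir ~/shuffled/ --media-magic --ignore-name --skip-bigger --rename --execute
```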

# Documentation

## Parameters
@@ -130,7 +166,7 @@ Find the duplicates. Normally, the file must have the same size, date and name.
| original_dir | str \| Path | - | Folder of the original files. Normally, these files will not be affected.<br> (However, they might get affected by `treat_bigger_as_original` or `set_both_to_older_date`). |
| **Actions** |
| execute | bool | False | If False, nothing happens, just a safe run is performed. |
| bashify | bool | False | Print bash commands that correspond to the actions that would have been executed if execute were True.<br> You can check and run them yourself. |
| inspect | bool | False | Print bash commands that correspond to the actions that would have been executed if execute were True.<br> You can check and run them yourself. |
| rename | bool | False | If `execute=True`, prepend ✓ to the duplicated work file name (or possibly to the original file name if treat_bigger_as_original).<br>Mutually exclusive with `replace_with_original` and `delete`. |
| delete | bool | False | If `execute=True`, delete the duplicated work file (or possibly the original file if treat_bigger_as_original).<br>Mutually exclusive with replace_with_original and rename. |
| replace_with_original | bool | False | If `execute=True`, replace duplicated work file with the original (or possibly vice versa if treat_bigger_as_original).<br>Mutually exclusive with rename and delete. |
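
To illustrate how the action parameters combine (a sketch only; the paths are made up and at most one of the mutually exclusive actions may be chosen):

```python3
# Dry run: report which duplicated work files would be deleted, touch nothing.
Deduplidog("/home/user/duplicates", "/media/disk/origs", delete=True)

# Repeat with execute=True to actually delete them.
Deduplidog("/home/user/duplicates", "/media/disk/origs", delete=True, execute=True)
```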
43 changes: 25 additions & 18 deletions deduplidog/deduplidog.py
@@ -4,7 +4,7 @@
import re
import shutil
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from dataclasses import dataclass
from datetime import datetime
from functools import cache
@@ -76,7 +76,7 @@ class Deduplidog:
# Action section
execute: Annotated[bool, flag(
"If False, nothing happens, just a safe run is performed.")] = False
bashify: Annotated[bool, flag(
inspect: Annotated[bool, flag(
"""Print bash commands that correspond to the actions that would have been executed if execute were True.
You can check and run them yourself.""")] = False
rename: Annotated[bool, flag(
@@ -255,10 +255,11 @@ def perform(self):
def preload_metadata(self, files: list[Path]):
""" Populate self.metadata with performance-intensive file information """
# Strangely, when I removed cached_properties from FileMetadata in order to be serializable for multiprocessing,
# using ThreadPoolExecutor is just as quick as ProcessPoolExecutor. And it spans multiple processes too.
# I thought ThreadPoolExecutor spans just threads.
# using ThreadPoolExecutor is just as quick as ProcessPoolExecutor
# as it spreads the threads over multiple cores too.
# I thought ThreadPoolExecutor ran on just a single core.
images = [x for x in files if x.suffix.lower() in IMAGE_SUFFIXES]
with ProcessPoolExecutor() as executor:
with ProcessPoolExecutor(max_workers=2) as executor:
for file, *args in tqdm(executor.map(FileMetadata.preload, images),
total=len(images), desc="Caching image hashes"):
self.metadata[file] = FileMetadata(file, *args)
@@ -326,7 +327,7 @@ def check(self):

match self.rename, self.replace_with_original, self.delete:
case False, False, False:
pass
print("left intact (because no action is selected).")
case True, False, False:
print("renamed (prefixed with ✓).")
case False, True, False:
@@ -497,7 +498,7 @@ def _affect(self, work_file: Path, original: Path):

def _rename(self, change: Change, affected_file: Path):
msg = "renamable"
if self.execute or self.bashify:
if self.execute or self.inspect:
# self.queue.put((affected_file, affected_file.with_name("✓" + affected_file.name)))
target_path = affected_file.with_name("✓" + affected_file.name)
if self.execute:
@@ -510,20 +511,20 @@ def _rename(self, change: Change, affected_file: Path):
else:
affected_file.rename(target_path)
msg = "renaming"
if self.bashify:
print(f"mv -n {_qp(affected_file)} {_qp(target_path)}")
if self.inspect:
self._inspect_print(f"mv -n {_qp(affected_file)} {_qp(target_path)}")
self.passed_away.add(affected_file)
self.metadata.pop(affected_file, None)
change[affected_file].append(msg)

def _delete(self, change: Change, affected_file: Path):
msg = "deletable"
if self.execute or self.bashify:
if self.execute or self.inspect:
if self.execute:
affected_file.unlink()
msg = "deleting"
if self.bashify:
print(f"rm {_qp(affected_file)}")
if self.inspect:
self._inspect_print(f"rm {_qp(affected_file)}")
self.passed_away.add(affected_file)
self.metadata.pop(affected_file, None)
change[affected_file].append(msg)
@@ -534,16 +535,16 @@ def _replace_with_original(self, change: Change, affected_file: Path, other_file
if self.execute:
msg = "replacing"
shutil.copy2(other_file, affected_file)
if self.bashify:
print(f"cp --preserve {_qp(other_file)} {_qp(affected_file)}") # TODO check
if self.inspect:
self._inspect_print(f"cp --preserve {_qp(other_file)} {_qp(affected_file)}") # TODO check
else:
if self.execute:
msg = "replacing"
shutil.copy2(other_file, affected_file.parent)
affected_file.unlink()
if self.bashify:
if self.inspect:
# TODO check
print(f"cp --preserve {_qp(other_file)} {_qp(affected_file.parent)} && rm {_qp(affected_file)}")
self._inspect_print(f"cp --preserve {_qp(other_file)} {_qp(affected_file.parent)} && rm {_qp(affected_file)}")
change[affected_file].append(msg)
self.metadata.pop(affected_file, None)

@@ -560,8 +561,8 @@ def _change_file_date(self, path, old_date, new_date, change: Change):
if self.execute:
os.utime(path, (new_date,)*2) # change access time, modification time
self.metadata.pop(path, None)
if self.bashify:
print(f"touch -t {new_date} {_qp(path)}") # TODO check
if self.inspect:
self._inspect_print(f"touch -t {new_date} {_qp(path)}") # TODO check

def _path(self, path):
""" Strips out common prefix that has originals with work_dir for display reasons.
@@ -664,3 +665,9 @@ def _print_change(self, change: Change):
[print(text, *(str(s) for s in changes))
for text, changes in zip((f" {wicon}{wn}:",
f" {oicon}{on}:"), change.values()) if len(changes)]

def _inspect_print(self, text):
if self._output:
self._output.write(text + "\n")
else:
print(text)
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "poetry.core.masonry.api"

[tool.poetry]
name = "deduplidog"
version = "0.6.1"
version = "0.6.2"
description = "Yet another file deduplicator"
authors = ["Edvard Rejthar <[email protected]>"]
license = "GPL-3.0-or-later"
