I’m about to talk a lot, so if you happened upon this blog post because you’re looking for a way to merge all your duplicates in Calibre, head down to the download and installation instructions.
I get big bouts of what I suppose is a form of imposter syndrome. Bootcamp kid, no computer science degree, always struggled with math, lucky charisma roll means I get to “sneak” my way into companies with engineers orders of magnitude better than me, thus leaving me always feeling like a neanderthal in comparison. Now that I’m living in Taiwan and ostensibly using half my time to back-fill my education (and learn Machine Learning), I’m always on the hunt for opportunities to “prove” that I can actually hack it.
This blog post isn’t going to go in any depth into the psychological and sociological issues of imposter syndrome existing, or being a self-imposed gate on an already gate-kept industry. No space. Some other time. What I’ve found helps me overcome the syndrome is, as I said, “proving” to myself that I can “do it.” A great way to “prove it” is by contributing to an open source library. Plus, it feels awesome to do so. Like look at me, I’m a “real engineer!”
Even though I’ve got a couple years under my belt of messing around with civic hack organizations, spending time with g0v folks, winning minor hackathons, and squeezing in some merge requests to open source projects over the years, it’s really daunting to wake up and say “ok, time to do something for an open source project.” I’ve had this inkling since I was coming out of the bootcamp, but back then it seemed literally impossible. For the sake of anybody else that feels this way, I’m writing this post to go line by line through my thought process of how I found an issue that bothered me about an open source project, and then fixed that issue. I hope this can help other people that feel similarly to me, and want to find a way to contribute.
The Problem that Needed Solving
I use Calibre to manage my ebook library. Calibre is a GPL-3 licensed (sort of) program chock full of ebook management tasks, such as tagging, converting formats (e.g. Mobi -> Epub), or mass-editing metadata. It also has a relatively straightforward API for writing plugins.
My ebook library is huge, at nearly 5,000 titles (after de-duping). I’m a data hoarder, and love reading. Before I de-duped that was more like 8,000 titles, and therein lay the problem: as far as I could tell, there was no straightforward way to automatically remove duplicates from a Calibre library. There is a “find-duplicates” plugin, and that’s where this journey began.
I used the “find-duplicates” plugin to list out all my duplicates, which ended up being several thousand “groups” of duplicates. It looks like this:
Each of those groups can be acted upon, typically by merging. Annoyingly, although it appears both books are selected, they aren’t, so if I try to hit “M” to merge, I get an error about needing more than one book selected. So that means all that “find-duplicates” can really do is put my duplicates in alphabetical order for me. Not exactly useless, but not what I need for my specific use case.
I slowly was shift-clicking to highlight all books in a group and pressing “M” to merge. Would have taken a while. Needs automating. Engineers automate things.
Figuring Out What to Do
I knew that the “find-duplicates” plugin had some concept of a “Duplicate Group,” and because there were next/previous buttons for hopping between duplicate groups, I figured there was some way to get all book identifiers, whatever they may look like, and then somehow invoke on those IDs whatever function gets invoked when I hit “M” and try to merge books. First, I needed to find the code.
Unfortunately, I have yet to find a public repository for the “find-duplicates” plugin. In fact, I installed it from the Calibre plugin browser directly. I got to googling, and found this thread on the mobileread forums, which appears to be where the plugin was originally posted back in 2011. Luckily, it had the most up to date version of the plugin available as a ZIP, so I downloaded it.
Even luckier: I guess Calibre plugins are just Python apps, and fairly straightforward ones at that. The ZIP had an __init__.py at the root, alongside a couple straightforward file names.
I poked around a bit just to get a better notion of what was going on, and then pulled my age old super-lazy-frontender-trick: did a project-wide search for a relevant string from a rendered view. In this case, I knew I wanted to do something to all duplicate groups, and there was already some action being performed to all duplicate groups by the “find-duplicates” plugin. In the menu for the plugin, there’s an option to “Mark All Groups as Exempt.” So I just did a project-wide search for that:
(If you’re curious, the IDE is Emacs, and the text search function is helm-project-do-smart-search. I blog a bit about my specific environment.)
Again pulling from my frontend experience, I guessed that I can probably ignore .po files, as they’re usually translation files used to allow for language-picking, evidenced further by them being in a translations/ directory. Two matches in action.py feels good. I pop in and take a look.
find-duplicates/action.py
Based on create_menu_action, I’m guessing that this function, and the ones around it (all nearly identical in structure), are responsible for populating the main menu. I don’t know anything about create_menu_action or how it’s invoked, but triggered= seems straightforward enough, insomuch as it probably means “invoke this function when clicked.” The function in question being self.mark_groups_as_duplicate_exemptions. Full text search brings me to that function’s definition:
find-duplicates/action.py
Out of here, the duplicate_finder.get_current_duplicate_group_ids() seems the most useful. I would expect that variable, duplicate_ids, to then be passed into a function that marks them, which looks to be self.duplicate_finder.mark_groups_as_duplicate_exemptions or self.duplicate_finder.mark_current_group_as_duplicate_exemptions, but nope, those function calls aren’t taking any arguments. So begins a terrifying realization about how things happen in Calibre, when I do a quick search for where duplicate_ids is used (oh, it’s right there on the next line…): the variable gets passed into a select_rows function? And the mark_groups... functions don’t take arguments? So they presumably are working on some form of state set in this function, and that state will likely be… the selection state of the UI?! The method in question is self.gui.library_view.select_rows(duplicate_ids), so it looks like it’s some property of the gui, specifically of the library view, and the action performed looks to be selecting rows. Which, I guess, is the same as what I do when I manually shift-click to select a couple “rows” of books. Well, I guess, follow the logic deeper, see what’s going on inside of mark_groups_as_duplicate_exemptions. Full text search to find where its def is hiding. Oh, annoyingly, that’s the same name as the function I’m in right now. Luckily, looks like there’s another def for this function name inside of duplicates.py:
find-duplicates/duplicates.py
self._mark_group_ids_as_exemptions seems to be the meat and potatoes function that will finally actually mark group IDs as exemptions, and it seems to get its list of group IDs from whatever self._books_for_group_map is. I’m still not exactly sure about that naming convention. But I can guess from the name of the _mark_group_ids function that it returns group IDs, and I need book IDs. I have no way to know how to debug yet, so I just get to shooting from the hip. After looking up that I can indeed iterate through what’s returned by Python’s .keys() method, I end up with:
And now guessing that there’s a list of book IDs against each item in the _books_for_group_map, I end up with:
Totally blindly grasping, I had no idea if it was going to work. In fact, I think I made some syntactical errors in the above before getting it right, but I’ll get to how I debugged that later.
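A minimal sketch of that guess, assuming the map is a plain dict of group ID to a list of book IDs. Only the name _books_for_group_map comes from the plugin; the sample data and the helper book_ids_per_group are made up for illustration.

```python
# Hypothetical stand-in for the plugin's self._books_for_group_map.
# Assumed shape: {group_id: [book_id, ...]}.
books_for_group_map = {
    1: [101, 102],
    2: [201, 202, 203],
}


def book_ids_per_group(group_map):
    """Return one list of book IDs per duplicate group."""
    groups = []
    for group_id in group_map.keys():  # .keys() can be iterated directly
        groups.append(list(group_map[group_id]))
    return groups


all_groups = book_ids_per_group(books_for_group_map)
```

Each inner list is one group of duplicates, which is the unit that later needs to be selected and merged.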
With a list of book IDs in hand, now I need to figure out how to actually merge said books. No way I’m doing that manually: it was time to rest on someone else’s hard work.
Back to the frontend strategy that’s never failed me: full project string search for something I find on the View of an app. In this case, I wanna know how Calibre merges books under the hood, so I simply do a full project text search for what Calibre displays as the menu option to trigger a book merge: “Merge into first selected book”. That brings me to Calibre’s edit_metadata.py:
/gui2/actions/edit_metadata.py
This has the same triggered paradigm that the duplicate finder plugin used for its own menu items, so I search for the merge_books function. This function is kinda large because of big blocks of template strings, so I stripped it down a bit in the following snippet:
src/calibre/gui2/actions/edit_metadata.py
Jumping to the if/elifs, I know I don’t want to safe_merge, or merge_only_formats, I want a full merge where the duplicates are deleted after, so I look at the else block. The first line (after the template string I got rid of) is the first action I want to be taking with my merge: add_formats. That is, if I have multiple books selected, some with formats of MOBI and others in EPUB, the MOBI and EPUB will be stuck onto the merged, final book. So, looks like it takes at least one book ID, the dest_id, which by the naming conventions and menu paradigms I’ve been seeing I take to mean the “first selected book” mentioned in the context menu when trying to merge. But what about the other book IDs? They’re taken as self.formats_for_books(rows). Rows?! Here is where I confirmed that yup, looks like in Calibre-world, the way to get things done is accessing UI state from the code. The rows bit means whatever rows are in a selected state in the UI. Why that book data isn’t stored as some sort of state cache is beyond me. Also weird, the next two actions, merge_metadata and delete_books_after_merge, take src_ids directly, which, looking above, comes from dest_id, src_ids = self.books_to_merge(rows). Not sure why formats_for_books can’t take book IDs directly, but oh well, it doesn’t matter for now. I have what I need: I don’t need to pass book IDs to merge_books after all, I instead need to select rows from the duplicate finder plugin, and then invoke calibre’s merge_books. Seems SUPER weird to need to trigger UI state to pass around data, but whatever.
I need to select rows programmatically somehow, so that when I invoke the merge_books function, said books are merged. I have no idea how to do this, but I guess that it’s done elsewhere in the Calibre code, and I hope that I can access said code from within the plugin somehow. At the end of the merge_books function is a bit of code that seems to trigger selection state in the UI:
src/calibre/gui2/actions/edit_metadata.py
So on a totally wild guess, I hunt through the folder structure of Calibre. It says gui.library_view, well there’s no “gui” folder but there is a “gui2” one (where edit_metadata.py is anyway), and within said folder is a library directory, and in said library directory is a views.py file. I guess that Calibre follows a kind of MVC structure, and that there’s code… somewhere… that maps self.gui to gui2 and something_view to a views.py file within a something directory. Real cowboy shit, but I was determined to get it done, or at least break code until I had nothing else to try.
Lo and behold, on a whim I text search “select rows” within views.py and find a select_rows function:
calibre/src/calibre/gui2/library/views.py
Literally exactly what I need. I just need to figure out how to invoke it from a plugin, and I’m done! So I head to the top of the duplicates.py file within the find-duplicates plugin to see how they’re importing Calibre functionality.
find-duplicates/duplicates.py
Looks straightforward enough. It appears that I can follow the same folder convention to import Calibre functionality that I had guessed at earlier to find the select_rows function. So I add:
find-duplicates/duplicates.py
Ok, and now to also grab the Calibre merge_books function. Again, totally guessing:
find-duplicates/duplicates.py
At this point I feel like I’m at the end of an impossibly thin tree branch just happily sawing away between me and the tree trunk. I’ve made a bunch of wild guesses and haven’t run a single line of code yet. I realize that if I’ve made even one mistake it’s only getting harder to debug where said mistake happened with every line I add, so I decide to get right into it. I add a merge_all_groups function to the find-duplicates plugin.
find-duplicates/duplicates.py
Ok and uhhh how do I actually trigger this function? I go back to how I originally found mark_groups_as_duplicate_exemptions, that bit of View code in action.py:
find-duplicates/action.py
Looks pretty straightforward, first create a menu action with create_menu_action, add a tooltip, and indicate which function will get invoked. After, of course, somehow importing that function, and… how is that done again? I follow back the triggered mark_groups_as_duplicate_exemptions to remember that oh yes, that function is defined in this action.py file, and is named identically to another function that I was looking at in duplicates.py, which the action.py version invokes as follows:
find-duplicates/action.py
I didn’t see mark_groups_as_duplicate_exemptions getting exported in any way out of duplicates.py, so I guess that the functions that are defined are automatically available on self.duplicate_finder. Now, I have everything I need. First, I add one of these action.py functions that I guess can just exist to do nothing other than invoke a duplicates.py function.
find-duplicates/action.py
Now, add a menu item that will trigger my new function:
find-duplicates/action.py
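The wiring described in the last two steps can be sketched like this. Everything here is a toy: ToyPluginAction, ToyDuplicateFinder, the menu dict, and this create_menu_action signature are hypothetical, and the real helper surely takes more parameters. The only things borrowed from the plugin are the triggered= keyword and the thin-wrapper pattern where the action.py method just calls the same-named method on self.duplicate_finder.

```python
class ToyDuplicateFinder:
    """Stand-in for the object that lives in duplicates.py."""

    def __init__(self):
        self.merge_calls = 0

    def merge_all_groups(self):
        self.merge_calls += 1


class ToyPluginAction:
    """Stand-in for the plugin's action class in action.py."""

    def __init__(self):
        self.duplicate_finder = ToyDuplicateFinder()
        self.menu = {}  # menu text -> callback

    def create_menu_action(self, text, triggered):
        # Toy version: remember which function to call when clicked.
        self.menu[text] = triggered

    def rebuild_menus(self):
        self.create_menu_action("Merge all groups",
                                triggered=self.merge_all_groups)

    def merge_all_groups(self):
        # Thin wrapper: delegate to the same-named method on the finder.
        self.duplicate_finder.merge_all_groups()

    def click(self, text):
        """Simulate the user picking a menu item."""
        self.menu[text]()


plugin = ToyPluginAction()
plugin.rebuild_menus()
plugin.click("Merge all groups")
```

Nothing is imported from duplicates.py in the wrapper; the method is reached through the self.duplicate_finder instance, which matches the guess above.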
And for the grand finale… somehow get my code into Calibre! Which I figure to be very hard, because I installed find-duplicates via interaction through the Calibre UI, so I guess there’s some kind of Calibre-managed repository that I have no idea how to update. So I duck duck go for “how to make a Calibre plugin” and am directed to a page that Calibre has helpfully created. The critical bit of information that makes it all very easy is:
That’s all. To add this code to calibre as a plugin, simply run the following in the folder in which you created __init__.py:
calibre-customize -b .
So, I do just that. Calibre launches! And then instantly crashes, with no displayed errors. So now onto debugging! Luckily, Calibre’s website has a section on debugging. Unluckily, the creator offers nothing in the form of stepping through a program, and instead says:
You can insert print statements anywhere in your plugin code, they will be output in debug mode. Remember, this is Python, you really shouldn’t need anything more than print statements to debug ;) I developed all of calibre using just this debugging technique.
Remember friends, only nerds actually step through their code. Cool kids just throw print statements everywhere :)
So anyway I run Calibre in debug mode with calibre-debug -g, launch, and check out what errors I’m getting. Now for an admission: I’m actually writing this blog post from memory, and apparently I failed to record the various errors I was getting, so I unfortunately can’t display them verbatim. However, they were in effect import errors, wherein Python couldn’t find modules I was trying to import. So my little guess at how to import functions from Calibre was wrong. I played with the syntax a bit to see if maybe I had just guessed wrong at how the folder structure worked, but didn’t get any new results. So, I went back to the code, to see if there was something else I could copy from the plugin. Was there any place where something else from Calibre was getting imported into the find-duplicates plugin?
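In the print-statement spirit, one way to poke at a shaky import is to wrap it and print what happens instead of letting the whole app fall over. This is a generic Python sketch, not anything from Calibre or the plugin; try_import is a hypothetical helper and the module names are placeholders.

```python
import importlib
import traceback


def try_import(module_name):
    """Import a module by name, printing the failure instead of crashing."""
    print("trying to import", module_name)
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        traceback.print_exc()  # prints the full traceback to stderr
        return None
    print("imported", module.__name__)
    return module


# A module that should not exist, and one from the standard library.
missing = try_import("module_that_does_not_exist_12345")
present = try_import("json")
```

A None result plus the printed traceback is enough to tell a wrong module path apart from some other kind of failure.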
I poke around the code for a while, and while re-examining mark_groups_as_duplicate_exemptions, I stumble back upon the confusion I had earlier around seeing select_rows there. Right in front of my face was the Calibre function I needed in order to select rows!
find-duplicates/action.py
I guess gui methods are available on self under self.gui! Nice. Is that true in duplicates.py as well? I check to see if self.gui can be found in duplicates.py with a full text search and see that not only is it there, there’s also a place that uses a library_view row selection function:
find-duplicates/duplicates.py
Which I take to mean that the select_rows function is also available. So I modify my merge_all_groups function:
find-duplicates/duplicates.py
Now to invoke Calibre’s merge_books however it wants that function invoked. This turns out to be way harder to find. I start by trying to see if any of the other functions inside of Calibre’s edit_metadata.py file are invoked elsewhere in the find-duplicates plugin, and sadly, they aren’t. So I start doing full project text searches in Calibre itself to see how Calibre accesses those functions when it needs to. I find a hit with the edit_metadata function, which is invoked in calibre/gui2/library/alternate_views.py:
calibre/gui2/library/alternate_views.py
This is a totally new syntax I haven’t seen before, this iactions['Edit Metadata'] bit. Some way of accessing classes and methods? So I poke around the file system until I find the path that plausibly matches what I’m looking at in that structure: /calibre/src/calibre/gui2/actions/edit_metadata.py, within which there’s an edit_metadata function. Well, shit, right in this same file is the merge_books function I want to invoke anyway! So on a total wild guess, I give copying that syntax a shot, but from within the find-duplicates plugin. I already know I can access library_view.select_rows via self.gui, so I guess that self.gui just maps to whatever the “gui” in gui.iactions is from within Calibre. So my code now looks like:
find-duplicates/duplicates.py
I remove the import statements that were throwing errors earlier. Now, I’m accessing stuff, hopefully, just the same way as code elsewhere in the plugin is doing. So, I run Calibre in debug mode again, and lo, no crashing. Good first sign. I pop open a test library with a couple groups of duplicates. I hit my menu item. And, holy shit, they actually merge without issue. (Not totally without issue: the book descriptions append onto each other, even if they’re duplicate ones. Oh well.)
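For anyone who wants the shape of the final approach without opening the diff, here is a sketch that runs outside Calibre. The Recording* classes and FakeGui are hypothetical stand-ins that only log calls, DuplicateFinderSketch is not the plugin's real class, and calling merge_books with no arguments is an assumption. The names select_rows, iactions['Edit Metadata'], merge_books, and _books_for_group_map are the ones found in the walkthrough above.

```python
class RecordingLibraryView:
    """Fake gui.library_view that logs selections instead of touching a UI."""

    def __init__(self, log):
        self.log = log

    def select_rows(self, book_ids):
        self.log.append(("select_rows", list(book_ids)))


class RecordingEditMetadata:
    """Fake 'Edit Metadata' action that logs merge requests."""

    def __init__(self, log):
        self.log = log

    def merge_books(self):
        self.log.append(("merge_books",))


class FakeGui:
    def __init__(self):
        self.log = []
        self.library_view = RecordingLibraryView(self.log)
        self.iactions = {"Edit Metadata": RecordingEditMetadata(self.log)}


class DuplicateFinderSketch:
    def __init__(self, gui, books_for_group_map):
        self.gui = gui
        self._books_for_group_map = books_for_group_map

    def merge_all_groups(self):
        # Copy the keys first in case merging changes the map.
        for group_id in list(self._books_for_group_map.keys()):
            book_ids = self._books_for_group_map[group_id]
            if len(book_ids) < 2:
                continue  # nothing to merge in a one-book group
            # Select the group's rows, then merge whatever is selected.
            self.gui.library_view.select_rows(book_ids)
            self.gui.iactions["Edit Metadata"].merge_books()


gui = FakeGui()
finder = DuplicateFinderSketch(gui, {1: [10, 11], 2: [20, 21, 22], 3: [30]})
finder.merge_all_groups()
```

The log ends up as alternating select and merge calls, one pair per group, which is the automated version of shift-clicking a group and pressing “M”.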
I was probably more hype for this fix than any other in my engineering career. I had done something entirely on my own, no outside input, no stackoverflow questions, no senior engineer to help out. Sure, I lifted 90% of the code, and pretty much just blindly guessed my way through syntax, but I undeniably went from “I don’t like this aspect of this open source software” to “This open source software now works like I want it to” based on my action. If I remember correctly I marched around the apartment for about ten minutes chanting “I am a god.” Gotta celebrate the victories.
The End Result
So, the actual code is on this GitHub page. I couldn’t find a repo from the original creator against which to submit a PR, so for now I’ve just uploaded the code wholesale. The full diff of my changes can be seen in this commit. If you want to use the plugin, just download this code straight up, and install it as per the Calibre instructions for installing a custom plugin:
That’s all. To add this code to calibre as a plugin, simply run the following in the folder in which you created __init__.py:
calibre-customize -b .