Python debugger (pdb); tl;dr: pdb parse issue (invalid characters, even in comments) solved by inverting a search on the basic character set
Question: Why does "'utf-8' codec can't decode byte" often appear when running pdb on some script, either via
pdb.set_trace()
or by running
pdb.pm()
after a failure?
Up until this year I had never seen this issue, and now I see it constantly, and it really interferes with my debugging. In the past I had mostly seen it after some script had failed and raised an exception: I would try my luck with `pdb.pm()` and sometimes it would return something weird like the below. At the time I assumed it was something buggy about the script itself, but now I have seen this many times when I do not expect it, and now it has popped up in a script I own, so I need to understand it. The error looks like this:
    import pdb; pdb.set_trace()
  File "C:\Python38\lib\bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
  File "C:\Python38\lib\bdb.py", line 112, in dispatch_line
    self.user_line(frame)
  File "C:\Python38\lib\pdb.py", line 262, in user_line
    self.interaction(frame, None)
  File "C:\Python38\lib\pdb.py", line 356, in interaction
    self.print_stack_entry(self.stack[self.curindex])
  File "C:\Python38\lib\pdb.py", line 1460, in print_stack_entry
    self.format_stack_entry(frame_lineno, prompt_prefix))
  File "C:\Python38\lib\bdb.py", line 556, in format_stack_entry
    line = linecache.getline(filename, lineno, frame.f_globals)
  File "C:\Python38\lib\linecache.py", line 16, in getline
    lines = getlines(filename, module_globals)
  File "C:\Python38\lib\linecache.py", line 47, in getlines
    return updatecache(filename, module_globals)
  File "C:\Python38\lib\linecache.py", line 137, in updatecache
    lines = fp.readlines()
  File "C:\Python38\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 3603: invalid start byte
I have now seen this in many different code bases. When it happened just now, the code in question was running without failure; I just added a pdb.set_trace() call and then I saw the error.
Maybe there is some invalid character in the file? Or pdb is trying to parse docstrings, and a path inside is creating some kind of backslash escape which pdb is trying to interpret as a character?
First I moved the set_trace call to the first line of execution and got the same error…
I'm familiar with \x012 sorts of escapes, but there was nothing like that in my code…
I added another backslash wherever there was just one (i.e. path=r"C:\test\file.csv" -> path=r"C:\\test\\file.csv")…
When none of this changed anything, I just started deleting huge portions of the file; since the pdb call is now the first line of code, it won’t matter when I delete the last quarter of the file…
And it runs! Ah-hah. Now we’re getting somewhere… One way forward is to binary search the file, deleting smaller and smaller segments until the character in question has been isolated… But that sounds rather tedious…
I sat there very confused, wondering why pdb would not give me a better error message if it was really just a matter of some invalid char hiding somewhere in the file. Why does it not at least show me some of the preceding chars, or give me more detail on how to address the problem?
Oh. Like normal when reading errors the first time, I missed what Python was trying to tell me:
position 3603
I’m using Vim and wondered how I could have gotten here sooner.
I read that we can use
:goto 3603
to go to the correct byte… but it looks like maybe pdb is ignoring certain chars, because the goto doesn’t take me to the correct location.
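If the byte offset and the editor disagree, another option is to let Python report the location itself. The following is only a minimal sketch, not what pdb does internally: it reads the file as raw bytes, tries to decode it as UTF-8, and turns the failing offset into a line and column. The function name `find_first_bad_byte` and its parameters are made up for illustration.

```python
def find_first_bad_byte(path, encoding="utf-8", context=20):
    """Return details about the first undecodable byte in a file, or None.

    Hypothetical helper: the name and the returned dict layout are
    illustrative choices, not part of pdb or linecache.
    """
    with open(path, "rb") as fp:
        data = fp.read()

    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        offset = err.start  # 0-based index of the offending byte
        # Count newlines before the offset to get a 1-based line number.
        line_no = data.count(b"\n", 0, offset) + 1
        # Position just after the previous newline (0 if on the first line).
        line_start = data.rfind(b"\n", 0, offset) + 1
        return {
            "offset": offset,
            "byte": data[offset],
            "line": line_no,
            "column": offset - line_start + 1,  # 1-based, counted in bytes
            "context": data[max(0, offset - context):offset + context],
        }
    return None
```

Calling something like `print(find_first_bad_byte("my_script.py"))` (the file name is a placeholder) should print the line to jump to, plus the raw bytes around the problem so the stray character is easy to spot.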
My favorite way I found was a simple search pattern: search for any non-ASCII characters.
/[^\x00-\x7F]
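Using this basic search in Vim will find any characters outside plain ASCII, which are the likely candidates for the bytes that fail to decode as UTF-8 when pdb (through linecache, as in the traceback above) reads the source file. If you are not familiar with the terms, I can briefly explain: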
/ search
[ set
^ invert set
\x00 character code 0, first char of the ASCII set
- range between the two endpoints
\x7F last char of the ASCII set
] end set
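For anyone not using Vim, roughly the same search can be done from Python. This is a sketch under the assumption that the same character class applied to raw bytes is good enough; `find_non_ascii` is a hypothetical helper name.

```python
import re

# Same character class as the Vim search, applied to raw bytes.
NON_ASCII = re.compile(rb"[^\x00-\x7F]")


def find_non_ascii(path):
    """Return (line, column, byte_value) for every non-ASCII byte in a file.

    Hypothetical helper; line and column are 1-based, and the column is
    counted in bytes rather than characters.
    """
    hits = []
    with open(path, "rb") as fp:
        for line_no, line in enumerate(fp, start=1):
            for match in NON_ASCII.finditer(line):
                hits.append((line_no, match.start() + 1, match.group()[0]))
    return hits
```

Note that this also flags characters that are perfectly valid UTF-8 (an accented letter shows up as two bytes), so the hits are candidates to inspect, not a list of errors.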
Probably this came from copying part of a document, since it was in the comments, but note it could also occur if you had copied code from some modern corporate communications editor (using another character encoding). Byte 0x91, for example, is the left "smart" single quotation mark in Windows-1252. Since the Python interpreter didn’t seem to have any issue with the chars, only pdb's source display, you may not even notice when the stray quotation mark is introduced.
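To see why a pasted quotation mark can trigger exactly this error, here is a small sketch that tries the same bytes under two encodings. That the stray byte came from Windows-1252 (cp1252) is an assumption on my part, and `decode_attempts` is a made-up helper name.

```python
def decode_attempts(raw, encodings=("utf-8", "cp1252")):
    """Try to decode raw bytes with each encoding and collect the outcome.

    Hypothetical helper: maps each encoding name to the decoded text,
    or to a short error string if decoding fails.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError as err:
            results[name] = f"error: {err.reason}"
    return results


# The byte from the traceback: invalid as UTF-8, a curly quote in cp1252.
print(decode_attempts(b"\x91"))
```

If cp1252 gives a sensible character where UTF-8 fails, replacing that character with a plain ASCII quote (or re-saving the file as UTF-8) is probably the fix.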
Thanks for reading.
Hope you learned something!