DMArchiver is broken #83
Hi, indeed, it looks like that's the end of DMArchiver and its HTML parsing method. It seems there is now a difference between using the official API and using it "through the browser". If we were users of the official API, we would face its limitations and it would not be possible to retrieve all the DMs of a conversation. I just tried to naively scroll up in a conversation with thousands of messages and I was still able to retrieve everything. That does not mean there is no limitation, however. When inspecting a request in the browser's developer tools:
It looks like 900 requests of 20 messages each per 15 minutes. If the API is called exactly as the browser calls it, I think we could avoid Selenium.
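A quick sanity check of that throughput math (the 900-requests-per-15-minutes window and 20 messages per request are assumptions read from the browser inspector, not documented limits):

```python
# Assumed limits, read from the browser's network inspector -- not official.
REQUESTS_PER_WINDOW = 900
WINDOW_SECONDS = 15 * 60          # one 15-minute window = 900 seconds
MESSAGES_PER_REQUEST = 20

messages_per_window = REQUESTS_PER_WINDOW * MESSAGES_PER_REQUEST
requests_per_second = REQUESTS_PER_WINDOW / WINDOW_SECONDS
messages_per_second = messages_per_window / WINDOW_SECONDS

print(messages_per_window)   # 18000 messages per 15-minute window
print(requests_per_second)   # 1.0 request per second
print(messages_per_second)   # 20.0 messages per second
```

So a full scrape of a long conversation proceeds at roughly 20 messages per second at best.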
Aha... I see those GET requests now in the inspector -- they must just not be exposed via the normal API protocols. So maybe it's just a matter of passing the right authentication/cookies/headers? I will look into this tomorrow.
The basic flow is the following:
The requests must have the session cookies. This is not the hard part. The requests must also have a Bearer token that looks like this:

It's there to authorize the browser, as a client, to access the API. I found out that this Bearer token is returned in the response of one of the JavaScript files. Somewhere in the middle of the code:
I don't know how frequently this could change. From what I can find on Google, it looks like an almost hardcoded value that has been stable for some time now. Anyway, this would require a complete rewrite of DMArchiver, since parsing the JSON is completely different from parsing the HTML. Maybe there are already libraries that do this kind of thing. Text could be OK, but attachments (images, videos, tweets, links) require more work.
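As an illustration of pulling that Bearer value out of the JavaScript, here is a hedged sketch; the regex, the minimum token length, and the bundle URL in the comment are all assumptions based on inspecting network traffic, since none of this is documented:

```python
import re

# Hedged sketch: the Bearer token the browser uses appears as a long
# "AAAA..."-style string literal inside one of Twitter's bundled JS files
# (e.g. something like https://abs.twimg.com/responsive-web/.../main.js --
# the exact URL changes per deploy). Pattern and length are assumptions.
def extract_bearer(js_source: str):
    match = re.search(r'"(AAAA[A-Za-z0-9%]{20,})"', js_source)
    return match.group(1) if match else None

# Usage sketch (not run here):
#   js = requests.get(main_js_url).text
#   token = extract_bearer(js)
#   headers = {"Authorization": f"Bearer {token}"}
```

If the value really is near-hardcoded, caching it and only re-fetching the bundle when requests start failing would also work.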
I confirm that you may now be unable to read your own private messages in the browser. The response will look like this:
I tried setting up an app on the Twitter developer site and getting a bearer token using oauthlib, but I would get access denied with every OAuth2 token except the public one, no matter what permissions I set. For now, GETting that .js file and parsing the token out of it works. Anyway, using your workflow as an outline, I was eventually able to set the headers properly and iterate through some GET calls to a conversation endpoint. We can randomly generate the x-csrf-token and the corresponding cookie value. I don't mind working through some of the parser details; I think the general data delivered by the API is quite similar, it just needs to be parsed as JSON rather than through lxml.html, which should make things much simpler in the end. With some fiddling I was also able to log in with 2FA enabled, so I can likely address #26 too. Careful attention should be paid to rate limiting, for sure... the rate limit works out to one request per second, but as you note, the worst thing that happens from the user's perspective is that they're locked out of their DMs for 15 minutes, which isn't so bad. I definitely think this is worth pursuing, and I'm hopeful that I'll have a branch to share with you late this week or next weekend.
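A minimal sketch of the x-csrf-token trick mentioned above, assuming (as the discussion suggests) that the header only has to match the `ct0` cookie and can be any random hex string:

```python
import secrets

def make_csrf_token(length: int = 32) -> str:
    """Generate a random hex string to use as the CSRF token.
    That this is accepted by the endpoint is an assumption from testing."""
    return secrets.token_hex(length // 2)

token = make_csrf_token()
headers = {"x-csrf-token": token}   # sent as a request header...
cookies = {"ct0": token}            # ...and mirrored in the ct0 cookie
```

The key invariant is simply that the header and the cookie carry the same value.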
We can't go through the standard OAuth2 flow for DMs, I think. It's still best to simulate a user in the browser. The downside is having to enter the login and password. A slightly better solution would be to manually extract and enter the session cookies. I checked the rate limits as well. And sure, dropping lxml is just good news; I think it was a poor decision for a multi-platform tool. Using the API may also prevent random parsing breaks due to HTML updates. My "calculation" was completely wrong, also. You're right, that's just 20 messages per second in the end. To prevent lockouts, I think we need to use something like this with a default rate slightly below the maximum, like 850 per 15 minutes. It should do the trick for the majority of users. Anyway, a warning will be required.
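The client-side throttle could look roughly like this -- a sketch only, with the 850-per-15-minutes default taken from the suggestion above:

```python
import time

class RateLimiter:
    """Minimal client-side throttle: stay slightly below the observed cap
    (850 calls per 15-minute window by default) to leave headroom in case
    the user also browsed Twitter recently. Numbers are assumptions."""

    def __init__(self, max_calls: int = 850, window: float = 900.0):
        self.min_interval = window / max_calls   # seconds between calls
        self.last_call = 0.0

    def wait(self) -> None:
        """Block until it is safe to issue the next API call."""
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()
```

Calling `limiter.wait()` before each request spaces calls out evenly instead of bursting up to the cap and then stalling for 15 minutes.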
Well. Shoot. Accidentally locked myself out running the tool, got the account back only to have the tool broken. Thanks, Twitter hackers. Realistically, how long are we looking at for a working rewrite?
With Mincka#83, we need a new approach to allow this program to function. This is the first attempt. Some features have been broken or removed, and likely cannot be added back. Cards, embedded tweets, etc. have been dropped, and it's possible that stickers are broken too (do they even exist anymore?). I can't promise that this works robustly, but it was tested in Python 3.7 with a saved session; I'm also not sure authentication isn't totally broken as I tried to implement a 2FA fix, too, but locked my account out from too many login attempts before I managed to get it working.
Sorry for the delay here, real life gets in the way sometimes. As you'll read in the commit message above, I have a very rough draft of the new interpreter, which hasn't really been extensively tested, but it does work flawlessly on a <24h old group DM with a few hundred messages. I also tried to implement the 2FA process that I've been using, but my account has been locked for several hours now... hoping to be able to try again tomorrow. Some things that are gone:
The presentation for these, the way the interpreter used to parse them, was done by Twitter on-site, so these are things that are fundamentally missing unless we manage to decode that API through their .js calls too; I've replaced them with normal longform links in the text output. Listing conversations/grabbing from all conversations is probably still broken? Might be able to figure that one out now. It's much, much easier to identify the conversation ID on the modern Twitter web interface, though. Some other various comments... I don't know yet how well the rate-limiter works; I will need to scrape a larger chat that would take several hours to grab to see how well it behaves; right now it's set to the max (900 calls in 900 seconds). That would also test whether all of it holds up over a long session. Open to feedback on whatever; this is definitely not a finished product.
Turns out that tweets are an embedded type, though only when the tweeter hasn't blocked you -- interesting! Also figured out what I was missing on the latest-ID thing. My branch should have more complete functionality now, as far as I can tell. Still needs more testing and cleanup, likely.
One of the issues I'm encountering is that the image download links (…)
I have been facing the same issue for a few months now. Is there any working version, or a work in progress that could be checked out?
Not sure what kind of testing @Mincka would want to do, but I've been running my forked branch (json_overhaul) within an existing larger framework (processing of the txt output into db + web front end) successfully for over a month now. I don't believe the hack I described for the image API rate limiting is in, but I can include it tomorrow. Think it needs some care before it would see an official release, though.
@cajuncooks thank you. Is it publicly available on GitHub?
Thanks for the overhaul @cajuncooks. 👍 Can you confirm that you did not implement the retrieval of all conversations?
With a conversation ID specified, it worked for about 20,000 processed tweets and then stopped without saving, with:
I need to check how you handle the rate limiting. I think that's because the API_LIMIT is the best-case scenario at 900: if you browsed Twitter a bit before, or at the same time, the counter is lower than 900 and the error is hit before the local throttling kicks in. You implemented the silent wait on 429 for images only, from what I see; the logic should be the same for all the API calls. Maybe it would be simpler to drop the local rate limiting and only use the 429 response code to wait when necessary. Finally, I have two types of parsing error that I need to investigate. The first looks related to the handling of stickers.
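The "just wait on 429" idea could look roughly like this (a sketch only: `session_get` stands in for whatever callable performs the request, and the retry count and wait time are arbitrary choices):

```python
import time

def get_with_backoff(session_get, url, max_retries=5, wait_seconds=60):
    """Retry a GET whenever the API answers 429 (rate limited), instead of
    throttling locally. `session_get` is any callable taking a URL and
    returning an object with a `status_code` attribute."""
    for attempt in range(max_retries):
        response = session_get(url)
        if response.status_code != 429:
            return response
        # Rate limited: wait out (part of) the window and try again.
        time.sleep(wait_seconds)
    raise RuntimeError(f"still rate-limited after {max_retries} attempts")
```

This makes the server the single source of truth about the remaining quota, which sidesteps the "user browsed Twitter first" problem entirely.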
The second one looks related to a parsing error when the name of the conversation is updated.
Still, it's great work that brings new hope for DMArchiver! 🎉
Do something, guys, you are champs!!
Did DMArchiver die?
First of all, thanks @Mincka for all your work on this, plus the other contributors. We used dmarchiver for a while to export DMs for analysis in a research project. The other option for us is exporting the messages via Twitter's export tool, but that can take a few days to get the email. Now, unfortunately, it looks like Twitter has disabled crawling/scraping by requiring JavaScript to do anything, even to get the authenticity_token. I'm not sure if there's a way to get around that without some major rework, so even the recent update by @cajuncooks is broken now. I'm going to look into other options, but it seems like exporting messages older than 30 days (older than the API allows) might be tricky. I wonder if anyone has tried using a headless Chrome browser for something like this.
Hi @jeffhuang, did you try changing the user agent? Maybe it could help in this case. Thank you for the heads-up anyway. Indeed, it looks like a headless browser is the next hack.
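For anyone who wants to experiment, the user-agent swap amounts to something like this; whether Twitter actually serves a JavaScript-free page to this string is untested here, and spoofing Googlebot may be against the terms of service:

```python
# The Googlebot user-agent string is public; that Twitter treats it
# differently is an assumption to be verified, not an established fact.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def build_headers(user_agent: str = GOOGLEBOT_UA) -> dict:
    """Build the headers to merge into each scraping request."""
    return {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml",
    }
```

These headers would then be passed to whatever HTTP client the scraper uses.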
That's an interesting finding, @Mincka, and thank you for the suggestion. I'll look into it, but I have to be cautious since our project is federally funded research, so we might not be so comfortable with mimicking the Googlebot user agent. But if I try it, I'll post an update here.
Jeff, could you possibly at least do a proof of concept on this and let other users decide if this falls within the proper-use boundaries of their programs? I mean no disrespect and fully understand where you're coming from, but this has uses for reporters and others in very specific use cases that supersede the stigma attached to "spoofing" Googlebot, which isn't illegal or even unethical, in my opinion. Cheers, and thank you for your time.
Has anyone found a way to fix this? I am not a coder, but I am trying to learn. I need to pull my own deleted DMs, and I think this would really help if it still works. I will need help with this, though; anyone willing to help a lady out?
I don't think so. This used to work on the old Twitter front end. It has changed a lot since then, so one would have to write a totally new scraper. Though it definitely cannot archive deleted DMs. For that, I guess you could download your Twitter archive and probably parse XML stuff, from what I remember.
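For what it's worth, recent Twitter archives appear to store DMs not as XML but as `data/direct-messages.js`, a JavaScript assignment wrapping a JSON array; that layout is an assumption based on archives I have seen, so treat this as a sketch:

```python
import json

def parse_archive_dms(js_text: str) -> list:
    """Parse the DM file from a Twitter archive export. Assumed layout:
    a line like `window.YTD.direct_messages.part0 = [...]` where everything
    from the first '[' onward is plain JSON. Older archives may differ."""
    start = js_text.index("[")          # skip the JS assignment prefix
    return json.loads(js_text[start:])
```

Note that the archive only contains messages that still existed at export time, so this does not recover deleted DMs either.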
Hi, how do you parse XML stuff? See, I am totally confused about all of this. I have downloaded my archive several times, but the deleted DMs don't come over. Are you saying I could use the archive to "parse it" and it may pull the deleted DMs? How would I do this? Do you have any sample code you could share with me? Or do you think I could hire you to pull this information for me? Sorry, but I am desperate to pull these deleted DMs. Thanks for the response.
I'm really just kind of hoping to open a dialogue here; I have no idea if there's anything we can reasonably do to solve this issue, but I mostly just want to hear that this is actually affecting somebody else. Earlier this week, this started happening to me:
The last line there is the JSON parser failing, because the POST request doesn't give a valid response. This happens whether or not you specify a conversation ID; it seems that all the message URLs that were in use now fail. Authentication still works, but nothing else related to DMs does, as far as I can tell. I tried several different changes to the headers passed into the request, but nothing produced fruitful results.
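For context, the failure mode is the JSON parser choking on a non-JSON body (likely an HTML error page). A small defensive wrapper (hypothetical, not part of DMArchiver) makes that error more legible when debugging:

```python
import json

def parse_api_response(raw: str):
    """Parse an API response body, failing with context instead of a bare
    JSONDecodeError when the server returns something other than JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as err:
        # Surface the first bytes of the body so the real problem is visible.
        raise RuntimeError(f"endpoint did not return JSON: {raw[:80]!r}") from err
```

With this in place, the traceback would show the offending HTML snippet rather than just a parse error.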
I went through some of #79 seeking an alternate solution, but the API endpoints mentioned in there seem to not exist anymore, or are locked behind some kind of additional authentication, despite my API application having permissions for DM access.
This is not specific to twurl, either... as I noted over in bear/python-twitter#665 (a PR which updates a deprecated DM endpoint in `python-twitter`), I get useless output there: an empty `events` array (when, according to Twitter's own documentation, "[i]n rare cases the events array may be empty"). This matches my experience using twurl as suggested in that documentation, too.
I'm hypothesizing that this all has something to do with the breach Twitter experienced last month and their development for the v2.0 API, but the gut punch is that API access to DMs is listed under "Nesting" (the least-developed column, it seems) on the roadmap, which means that we may be months from a solution if the methods used in this application are no longer viable. I'd love to contribute to a solution here that doesn't involve an always-running selenium webdriver or some other related nonsense, but I'm not sure how to approach it.